nccsdata

nccsdata provides tools to download, filter, and analyze nonprofit organization data from the National Center for Charitable Statistics (NCCS). It reads IRS Business Master File (BMF) data stored as parquet files in a public S3 bucket, with support for predicate-pushdown filtering by state, county, NTEE subsector, and exempt organization type.

Note: This is version 2.0.0, a ground-up rewrite of the package. The v1 API (get_data(), preview_sample(), parse_ntee()) has been replaced. See the migration section below.

Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("UrbanInstitute/nccsdata")

Usage

Reading BMF data

nccs_read() downloads BMF data from S3 with optional filters. Filtering happens at the Arrow level via predicate pushdown, so only matching rows are read into memory.

library(nccsdata)

# All Pennsylvania nonprofits (default columns)
pa <- nccs_read(state = "PA")

# Arts nonprofits in New York
ny_arts <- nccs_read(state = "NY", ntee_subsector = "ART")

# Select specific columns
pa_slim <- nccs_read(
  state = "PA",
  columns = c("ein", "org_name_display", "geo_county", "income_amount")
)

# Lazy query for custom dplyr pipelines
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()
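
To make the lazy-query pattern above more concrete, here is a minimal sketch of predicate pushdown using arrow and dplyr directly on a small local parquet file. It is not the package's implementation: the toy rows, the temporary file, and the `geo_state` column name are assumptions made for illustration only.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a BMF extract; values are illustrative, not real records.
# `geo_state` is an assumed column name used only for this sketch.
toy <- data.frame(
  ein = c("00-0000001", "00-0000002", "00-0000003"),
  geo_state = c("PA", "PA", "NY"),
  geo_county = c("Lackawanna County", "Erie County", "Kings County"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# open_dataset() reads only metadata; no rows are loaded yet.
query <- open_dataset(path) |>
  filter(geo_state == "PA", geo_county == "Lackawanna County") |>
  select(ein, geo_county)

# collect() runs the scan and materializes only the matching rows.
result <- collect(query)
```

The same idea applies to a query returned with collect = FALSE: filters added before dplyr::collect() are evaluated by Arrow rather than in R.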

Summarizing data

nccs_summary() produces grouped count summaries from a collected data frame.

pa <- nccs_read(state = "PA")

# Total count
nccs_summary(pa)

# Count by county
nccs_summary(pa, group_by = "geo_county")

# Count by county and subsector, export to CSV
nccs_summary(pa, group_by = c("geo_county", "nteev2_subsector"),
             output_csv = "pa_counts.csv")
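
For readers who want to see what a grouped count summary looks like in plain dplyr, the following is a rough equivalent on a toy data frame. It is a sketch under the assumption that the summary is a simple row count per group; the actual output columns of nccs_summary() may differ.

```r
library(dplyr)

# Toy data using the column names from the example above; rows are made up.
toy <- data.frame(
  geo_county = c("Erie County", "Erie County", "Lackawanna County"),
  nteev2_subsector = c("ART", "EDU", "ART"),
  stringsAsFactors = FALSE
)

# One row per county/subsector combination, with the number of organizations.
counts <- toy |>
  count(geo_county, nteev2_subsector, name = "n")

# Write the summary to a CSV in the session's temporary directory.
write.csv(counts, file.path(tempdir(), "toy_counts.csv"), row.names = FALSE)
```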

Discovering valid filter values

nccs_catalog() lists valid values for nccs_read() filters without any network calls.

nccs_catalog("state")
nccs_catalog("ntee_subsector")
nccs_catalog("exempt_org_type")

# Pass `labels = TRUE` for a code + description tibble, sourced from the
# bundled BMF lookup tables.
nccs_catalog("ntee_subsector", labels = TRUE)
nccs_catalog("foundation_code", labels = TRUE)

Cleaning external data

The BMF returned by nccs_read() is already normalized upstream, but two helpers are exposed for users joining external CSVs or API responses against it:

# Coerce EINs in any format to canonical XX-XXXXXXX
nccs_normalize_ein(c("123456789", "12-3456789", 1234567))
#> [1] "12-3456789" "12-3456789" "00-1234567"

# Coerce IRS binary-indicator columns to logical
nccs_as_indicator(c("Y", "N", "1", "2"))
#> [1]  TRUE FALSE  TRUE FALSE

# e-file indicator accepts E/P (2015, 2018+) and Y/N (2016-2017)
nccs_as_indicator(c("E", "P", "Y", "N"), scheme = "efile")
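
As an illustration of the kind of EIN cleanup involved before a join, here is a base-R sketch. `normalize_ein_sketch()` is a hypothetical helper written for this example, not the package's nccs_normalize_ein(); it assumes an EIN is at most nine digits and should be left-padded with zeros.

```r
# Hypothetical helper: strip non-digits, left-pad to nine digits, add a hyphen.
normalize_ein_sketch <- function(x) {
  digits <- gsub("[^0-9]", "", as.character(x))
  n <- nchar(digits)
  valid <- !is.na(digits) & n >= 1 & n <= 9
  out <- rep(NA_character_, length(digits))  # invalid inputs stay NA
  padded <- paste0(strrep("0", 9L - n[valid]), digits[valid])
  out[valid] <- paste0(substr(padded, 1, 2), "-", substr(padded, 3, 9))
  out
}

# External roster with inconsistently formatted EINs (made-up values).
roster <- data.frame(
  ein = c("123456789", "98-7654321", "1234567"),
  member = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
roster$ein <- normalize_ein_sketch(roster$ein)

# Toy stand-in for rows returned by nccs_read().
bmf_toy <- data.frame(
  ein = c("12-3456789", "00-1234567"),
  org_name_display = c("Toy Org A", "Toy Org B"),
  stringsAsFactors = FALSE
)

# Inner join on the canonical EIN; only roster rows with a BMF match remain.
joined <- merge(roster, bmf_toy, by = "ein")
```

Normalizing the external key first means the join does not silently drop rows that differ only in punctuation or leading zeros.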

Browsing the data dictionary

nccs_dictionary() returns a tibble describing all BMF columns, with optional pattern filtering.

# All columns
nccs_dictionary()

# Find geocoding-related columns
nccs_dictionary("geo")

# Find NTEE-related columns
nccs_dictionary("ntee")

Scope and design

nccsdata is intentionally a lean reader. A few principles shape what is, and is not, in the package:

  • No re-cleaning of upstream data. The BMF and CORE Series parquet files are cleaned by the sibling ETL pipelines (nccs-data-bmf, nccs-data-core). EIN normalization, NTEE decoding, geocoding, and subsection labeling are done before publication. We don’t re-implement them here.
  • The two exceptions are helpers for external data. nccs_normalize_ein() and nccs_as_indicator() exist so you can bring your own CSVs (member rosters, survey extracts, donor lists) into the same shape as the package’s output before joining.
  • One opinionated analytic helper: inflation adjustment. nccs_deflate() and the bundled annual cpi_u series are included because real-dollar conversion needs a reference table the user otherwise has to fetch themselves, and the conversion itself is mechanical and uncontroversial.
  • No canonical financial ratios. Operating margin, program-expense ratio, fundraising efficiency, months of operating reserves, and similar measures are deliberately not bundled. Their definitions vary by analyst (which numerator, which denominator, which exclusions), and shipping one canonical version would make the package take editorial sides. They’re also one-line mutate() calls on the columns CORE already provides.
  • Lean dependencies. Hard imports are arrow, dplyr, utils. Anything heavier (sf, tigris, ggplot2, data.table) belongs in vignettes that show how to combine nccsdata with those packages, not as a dependency.
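
To show why the inflation adjustment mentioned above is mechanical, here is a base-R sketch of the arithmetic. `deflate_sketch()` and the index values are hypothetical: the numbers are placeholders rather than actual CPI-U figures, and the signature of nccs_deflate() is not assumed here.

```r
# Convert nominal amounts to base-year dollars: amount * CPI(base) / CPI(year).
deflate_sketch <- function(amount, year, cpi, base_year) {
  idx <- cpi$index[match(year, cpi$year)]        # index for each amount's year
  base <- cpi$index[match(base_year, cpi$year)]  # index for the target year
  amount * base / idx
}

# Placeholder index table, NOT real CPI-U values.
toy_cpi <- data.frame(year = c(2000, 2010), index = c(100, 125))

real_2010 <- deflate_sketch(1000, 2000, toy_cpi, base_year = 2010)
real_2010
#> [1] 1250
```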

If you want to build analytic functionality on top of this package, the right pattern is a downstream package or notebook that imports nccsdata and adds your team’s preferred ratio definitions.
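
A downstream notebook following this pattern might define a ratio in a single mutate() call, as in the sketch below. The column names `total_revenue` and `total_expenses` are hypothetical placeholders, and the formula is only one of several operating-margin definitions an analyst might choose.

```r
library(dplyr)

# Toy financial rows; column names and values are illustrative assumptions.
toy_core <- data.frame(
  ein = c("00-0000001", "00-0000002"),
  total_revenue = c(200, 500),
  total_expenses = c(150, 550),
  stringsAsFactors = FALSE
)

# One possible definition: (revenue - expenses) / revenue.
with_margin <- toy_core |>
  mutate(operating_margin = (total_revenue - total_expenses) / total_revenue)
```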

Migrating from v1

| v1 function                   | v2 replacement                 |
|-------------------------------|--------------------------------|
| get_data()                    | nccs_read()                    |
| preview_sample()              | nccs_summary()                 |
| ntee_preview() / parse_ntee() | nccs_catalog("ntee_subsector") |

Key changes:

  • Data source moved from legacy Core/BMF CSVs to geocoded BMF parquet files on S3.
  • Filtering now uses Arrow predicate pushdown instead of downloading full files.
  • Dependencies reduced from 12 packages to 3 (arrow, dplyr, utils).

Documentation

Full documentation is available at https://urbaninstitute.github.io/nccsdata/.

Getting help
