nccsdata provides tools to download, filter, and analyze nonprofit organization data from the National Center for Charitable Statistics (NCCS). It reads IRS Business Master File (BMF) data stored as parquet files in a public S3 bucket, with support for predicate-pushdown filtering by state, county, NTEE subsector, and exempt organization type.
Note: This is version 2.0.0, a ground-up rewrite of the package. The v1 API (get_data(), preview_sample(), parse_ntee()) has been replaced. See the migration section below.
Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("UrbanInstitute/nccsdata")

nccs_read() downloads BMF data from S3 with optional filters.
Filtering happens at the Arrow level via predicate pushdown, so only
matching rows are read into memory.
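
As a rough illustration of what predicate pushdown means, the sketch below uses arrow and dplyr directly on a small temporary parquet file. The toy data frame, its column names, and the file path are made up for the example. This is not how nccs_read() is implemented internally, only the general pattern it is described as relying on.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a BMF extract (hypothetical values, not real NCCS data)
toy <- data.frame(
  ein = c("00-0000001", "00-0000002", "00-0000003"),
  state = c("PA", "NY", "PA"),
  stringsAsFactors = FALSE
)

# Write it to a temporary parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# open_dataset() reads the schema; no rows are loaded yet
ds <- open_dataset(path)

# The filter is recorded lazily and evaluated by Arrow's scanner,
# so only matching rows are materialized by collect()
pa_rows <- ds |>
  filter(state == "PA") |>
  collect()

nrow(pa_rows)
```

The same lazy-then-collect shape is what the `collect = FALSE` option further down exposes for custom pipelines.
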
library(nccsdata)
# All Pennsylvania nonprofits (default columns)
pa <- nccs_read(state = "PA")
# Arts nonprofits in New York
ny_arts <- nccs_read(state = "NY", ntee_subsector = "ART")
# Select specific columns
pa_slim <- nccs_read(
  state = "PA",
  columns = c("ein", "org_name_display", "geo_county", "income_amount")
)
# Lazy query for custom dplyr pipelines
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()

nccs_summary() produces grouped count summaries from a collected data
frame.
pa <- nccs_read(state = "PA")
# Total count
nccs_summary(pa)
# Count by county
nccs_summary(pa, group_by = "geo_county")
# Count by county and subsector, export to CSV
nccs_summary(pa, group_by = c("geo_county", "nteev2_subsector"),
             output_csv = "pa_counts.csv")

nccs_catalog() lists valid values for nccs_read() filters without
any network calls.
nccs_catalog("state")
nccs_catalog("ntee_subsector")
nccs_catalog("exempt_org_type")
# Pass `labels = TRUE` for a code + description tibble, sourced from the
# bundled BMF lookup tables.
nccs_catalog("ntee_subsector", labels = TRUE)
nccs_catalog("foundation_code", labels = TRUE)

The BMF returned by nccs_read() is already normalized upstream, but
two helpers are exposed for users joining external CSVs or API responses
against it:
# Coerce EINs in any format to canonical XX-XXXXXXX
nccs_normalize_ein(c("123456789", "12-3456789", 1234567))
#> [1] "12-3456789" "12-3456789" "00-1234567"
# Coerce IRS binary-indicator columns to logical
nccs_as_indicator(c("Y", "N", "1", "2"))
#> [1] TRUE FALSE TRUE FALSE
# e-file indicator accepts E/P (2015, 2018+) and Y/N (2016-2017)
nccs_as_indicator(c("E", "P", "Y", "N"), scheme = "efile")

nccs_dictionary() returns a tibble describing all BMF columns, with
optional pattern filtering.
# All columns
nccs_dictionary()
# Find geocoding-related columns
nccs_dictionary("geo")
# Find NTEE-related columns
nccs_dictionary("ntee")

nccsdata is intentionally a lean reader. A few principles that shape
what is, and is not, in the package:
- No re-cleaning of upstream data. The BMF and CORE Series parquet
  files are cleaned by the sibling ETL pipelines (nccs-data-bmf,
  nccs-data-core). EIN normalization, NTEE decoding, geocoding, and
  subsection labeling are done before publication. We don’t
  re-implement them here.
- The two exceptions are helpers for external data.
  nccs_normalize_ein() and nccs_as_indicator() exist so you can bring
  your own CSVs (member rosters, survey extracts, donor lists) into the
  same shape as the package’s output before joining.
- One opinionated analytic helper: inflation adjustment.
  nccs_deflate() and the bundled annual cpi_u series are included
  because real-dollar conversion needs a reference table the user
  otherwise has to fetch themselves, and the conversion itself is
  mechanical and uncontroversial.
- No canonical financial ratios. Operating margin, program-expense
  ratio, fundraising efficiency, months of operating reserves, and
  similar measures are deliberately not bundled. Their definitions
  vary by analyst (which numerator, which denominator, which
  exclusions), and shipping one canonical version would make the package
  take editorial sides. They’re also one-line mutate() calls on the
  columns CORE already provides.
- Lean dependencies. Hard imports are arrow, dplyr, utils. Anything
  heavier (sf, tigris, ggplot2, data.table) belongs in vignettes that
  show how to combine nccsdata with those packages, not as a dependency.
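
The real-dollar conversion mentioned in the list above is simple arithmetic: multiply a nominal amount by the ratio of the base-year price index to the index for the year the amount was reported. A minimal sketch in base R follows. The function name deflate_sketch(), its arguments, and the index values are placeholders invented for illustration; they do not reflect the signature of nccs_deflate() or actual CPI-U figures.

```r
# Hypothetical price index table (placeholder values, NOT real CPI-U data)
cpi <- data.frame(
  year  = c(2019, 2020, 2021),
  index = c(100, 110, 121)
)

# Convert nominal dollars to base-year dollars:
#   real = nominal * index[base_year] / index[year]
deflate_sketch <- function(amount, year, base_year, cpi) {
  idx_year <- cpi$index[match(year, cpi$year)]
  idx_base <- cpi$index[match(base_year, cpi$year)]
  amount * idx_base / idx_year
}

nominal <- c(1000, 1100, 1210)
years   <- c(2019, 2020, 2021)

# Express every amount in (hypothetical) 2021 dollars
real_2021 <- deflate_sketch(nominal, years, base_year = 2021, cpi = cpi)
real_2021
```

With these placeholder indices, amounts that grew exactly in step with prices come out equal once expressed in base-year dollars.
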
If you want to build analytic functionality on top of this package, the
right pattern is a downstream package or notebook that imports
nccsdata and adds your team’s preferred ratio definitions.
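
As a sketch of that downstream pattern, a team-specific ratio can live in a few lines of dplyr. The data frame and the column names total_revenue and total_expenses below are hypothetical stand-ins, not actual CORE column names, and the operating-margin definition shown is just one of several an analyst might choose.

```r
library(dplyr)

# Hypothetical financial extract; column names are placeholders
fin <- data.frame(
  ein            = c("00-0000001", "00-0000002"),
  total_revenue  = c(200000, 50000),
  total_expenses = c(150000, 60000),
  stringsAsFactors = FALSE
)

# One possible definition: (revenue - expenses) / revenue
fin_ratios <- fin |>
  mutate(operating_margin = (total_revenue - total_expenses) / total_revenue)

fin_ratios
```

Keeping the definition in your own code makes the choice of numerator, denominator, and exclusions explicit and easy to change.
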
| v1 function | v2 replacement |
|---|---|
| get_data() | nccs_read() |
| preview_sample() | nccs_summary() |
| ntee_preview() / parse_ntee() | nccs_catalog("ntee_subsector") |

Key changes:
- Data source moved from legacy Core/BMF CSVs to geocoded BMF parquet files on S3.
- Filtering now uses Arrow predicate pushdown instead of downloading full files.
- Dependencies reduced from 12 packages to 3 (arrow, dplyr, utils).

Full documentation is available at https://urbaninstitute.github.io/nccsdata/.
- Browse the getting started vignette
- Open an issue on GitHub
- Contact the maintainer at tpoongundranar@urban.org