pressR

Automating the collection of U.S. Congress members' press releases.

pressR scrapes press releases from U.S. House (*.house.gov) and Senate (*.senate.gov) members' websites over a date range and returns a tidy data frame. Instead of guessing among dozens of CSS/XPath selectors per site, it detects the content-management system behind each site and routes to a dedicated extractor. Coverage is ~99% of the House and ~97% of the Senate.

Why CMS detection

Member sites cluster into a few vendor families. The House skews to the official Drupal template; the Senate skews to WordPress (often with a custom press-release post type). Approximate House shares:

| CMS | Share | How it's scraped |
| --- | --- | --- |
| Drupal (official House template) | ~54% | `/media/press-releases` listing, `?page=N` pagination |
| ASP.NET ("DocumentID") | ~20% | `documentquery.aspx` listing, `documentsingle.aspx` items |
| WordPress | ~12% (most of the Senate) | `wp-json` REST API: the `congress_press_release` category, a press-release custom post type, or an Elementor HTML fallback when REST is blocked |
| press-releases vendor | small | `/press-releases` (or `/media-center`) listing with `?ID=<GUID>` (year-less dates inferred) or `/YYYY/M/slug` item links; pager param auto-probed |
| headless WordPress (Next.js SPA) | small | public WPGraphQL endpoint (`/graphql`); releases at `/posts/<slug>` |
| Other | ~14% | generic heuristic extractor (heading-link or long-text titles + nearest date, incl. `/YYYY/M/slug` fallback) |

Five vendor extractors plus one generic fallback cover both chambers. Pages are fetched statically by default; a headless-browser fallback (render_html(), via the suggested chromote package) is available for the rare JS-rendered site.
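To make the detect-then-route idea concrete, here is a minimal base-R sketch. It is not pressR's detect_cms(): the names guess_cms(), extractors and route() are hypothetical, the fingerprints are only the ones mentioned in the table above, and a real implementation would more likely use an HTML parser than regular expressions.

```r
# Hypothetical sketch: classify a homepage's HTML, then dispatch to an extractor.
# None of these names are pressR internals.

guess_cms <- function(html) {
  # Read <meta name="generator" content="...">, assuming name precedes content.
  m <- regmatches(
    html,
    regexec('<meta[^>]*name=["\']generator["\'][^>]*content=["\']([^"\']*)["\']',
            html, ignore.case = TRUE)
  )[[1]]
  generator <- if (length(m) == 2) tolower(m[2]) else ""

  if (grepl("drupal", generator, fixed = TRUE)) return("drupal")
  if (grepl("wordpress", generator, fixed = TRUE)) return("wordpress")

  # No generator tag: fall back to markup fingerprints.
  if (grepl("documentquery.aspx", html, ignore.case = TRUE)) return("aspnet")
  if (grepl("/wp-content/|/wp-json/", html)) return("wordpress")
  "generic"
}

# Stub extractors: each just describes what a real one would do.
extractors <- list(
  drupal    = function(site) sprintf("%s: walk /media/press-releases?page=N", site),
  aspnet    = function(site) sprintf("%s: walk documentquery.aspx", site),
  wordpress = function(site) sprintf("%s: query wp-json", site),
  generic   = function(site) sprintf("%s: heuristic extractor", site)
)

route <- function(site, html) {
  cms <- guess_cms(html)
  extractors[[cms]](site)
}

home <- '<html><head><meta name="generator" content="Drupal 10 (https://www.drupal.org)"></head></html>'
route("example.house.gov", home)
```

Keeping the classifier separate from the extractor table means a new vendor family only needs one more fingerprint and one more list entry.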

Installation

# install.packages("pak")
pak::pak("zaynesember/pressR")

Usage

library(pressR)

# Current members and their sites
members <- list_members()
#> # A tibble: 437 × 7
#>   name         state   district party committee           url                          chamber
#>   <chr>        <chr>   <chr>    <chr> <chr>               <chr>                        <chr>
#> 1 Moore, Barry Alabama 1st      R     Agriculture;Judici… https://barrymoore.house.gov house
#> ...

# One member
moore <- scrape_member("barrymoore.house.gov", from = "2026-01-01")

# A set of members (metadata carried through)
res <- scrape_pressers(
  members[1:10, ],
  from = "2026-01-01",
  to   = Sys.Date(),
  log_fails = TRUE          # writes fails.csv
)

# The whole House (optionally capped for a quick sample)
all <- scrape_house(from = "2026-01-01", max_members = 50)

# The Senate works the same way (same extractors, plus a `chamber` column)
senators <- list_senators()
sen <- scrape_senate(from = "2026-01-01", max_members = 50)

Every function that returns releases yields the columns date, title, body, tags, url, and cms (plus any member metadata). scrape_pressers() and scrape_house() also attach a failures table as an attribute:

attr(res, "failures")   # tibble(url, stage, message)
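As a rough illustration of working with that attribute, the sketch below builds a stand-in result by hand and summarises its failures. The URLs, stage labels and messages are invented for the example; only the url, stage and message columns come from the description above.

```r
# Stand-in for a scrape result; real runs return a tibble from scrape_pressers().
res <- data.frame(url = character(), title = character())
attr(res, "failures") <- data.frame(
  url     = c("https://a.house.gov", "https://b.house.gov", "https://c.house.gov"),
  stage   = c("detect", "listing", "listing"),   # hypothetical stage labels
  message = c("timeout", "HTTP 403", "no listing found")
)

fails <- attr(res, "failures")

# How many sites failed at each stage?
by_stage <- table(fails$stage)

# Sites worth retrying after a listing failure (for example with a browser fallback).
retry_urls <- unique(fails$url[fails$stage == "listing"])
retry_urls
```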

How it works

1. list_members() parses https://www.house.gov/representatives.
2. detect_cms() classifies a homepage via its <meta generator> tag and markup fingerprints.
3. The matching extractor finds the press-release listing, walks its pages newest-first until the window's start, and pulls each release's body.
4. scrape_member() ties these together for one site; scrape_pressers() runs many, isolating per-site failures.
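The newest-first walk in the extractor step can be sketched as follows. This is an illustration under assumptions, not the package's code: walk_listing() and fake_fetch() are hypothetical names, the pager is assumed to be zero-based, and the fetcher is a local stand-in so the example runs offline.

```r
# Hypothetical sketch of walking a paginated, newest-first listing.
# fetch_page(n) stands in for fetching and parsing `?page=n`; it must return
# a data frame with `date` (Date) and `url` columns.

walk_listing <- function(fetch_page, from, to = Sys.Date(), max_pages = 50) {
  from <- as.Date(from)
  to   <- as.Date(to)
  kept <- list()
  for (n in seq_len(max_pages) - 1L) {           # assumed zero-based pager
    page <- fetch_page(n)
    if (nrow(page) == 0) break                   # ran off the end of the listing
    kept[[length(kept) + 1L]] <- page[page$date >= from & page$date <= to, ]
    if (min(page$date) < from) break             # older pages are out of window
  }
  if (length(kept) == 0) {
    return(data.frame(date = as.Date(character()), url = character()))
  }
  do.call(rbind, kept)
}

# Offline stand-in: ten items, newest first, ten days apart, four per page.
all_items <- data.frame(
  date = as.Date("2026-03-10") - seq(0, 90, by = 10),
  url  = sprintf("https://example.house.gov/media/press-releases/item-%02d", 1:10)
)
fake_fetch <- function(n, per_page = 4) {
  idx <- seq_len(nrow(all_items))
  all_items[idx > n * per_page & idx <= (n + 1) * per_page, ]
}

hits <- walk_listing(fake_fetch, from = "2026-02-01", to = "2026-03-31")
nrow(hits)
```

Stopping as soon as a page contains an item older than the window's start avoids fetching the rest of the archive, which matters when requests are throttled.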

Politeness & configuration

Requests carry an identifying user agent and are throttled and retried via httr2. The request rate and the HTTP cache are tunable via options:

options(pressR.throttle = 20)            # requests/minute (default 20)
options(pressR.cache_dir = "~/.cache/pressR")  # enable on-disk HTTP cache
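For intuition about what the throttle value means, the sketch below resolves the two options with their defaults and converts requests per minute into a per-second rate and an average gap between requests. scraper_config() is a hypothetical helper, not a pressR function; how the package actually consumes these options is not shown in this README.

```r
# Hypothetical helper showing how option-driven settings might be resolved.
scraper_config <- function() {
  per_minute <- getOption("pressR.throttle", default = 20)
  cache_dir  <- getOption("pressR.cache_dir", default = NULL)
  list(
    # httr2::req_throttle() accepts a `rate` expressed in requests per second.
    rate_per_second = per_minute / 60,
    # Average spacing between requests implied by the rate.
    min_gap_seconds = 60 / per_minute,
    # NULL means no on-disk cache.
    cache_dir = if (is.null(cache_dir)) NULL else path.expand(cache_dir)
  )
}

old <- options(pressR.throttle = 30, pressR.cache_dir = NULL)
cfg <- scraper_config()
cfg$min_gap_seconds
options(old)   # restore the previous settings
```

At the default of 20 requests per minute, the same arithmetic gives one request every 3 seconds on average.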

Archiving

Scrapes return in-memory tibbles; to build historical coverage, append runs to a local, year-partitioned, de-duplicated store (xz-compressed RDS, keyed on url so re-scrapes refresh rather than duplicate):

res <- scrape_house(from = "2026-01-01")
archive_releases(res)                       # -> tools::R_user_dir("pressR","data")
read_archive(from = "2026-01-01", to = Sys.Date())

Set options(pressR.archive_dir = "~/pressR-archive") to choose the location. Because the corpus grows (~33 MB/year compressed, body text included), it lives on disk rather than in the package. Prebuilt snapshots are published as GitHub release assets; download_archive() fetches them without scraping, and maintainers with write access can publish new ones with publish_archive(). Both functions need the suggested piggyback package.
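The year-partitioned, url-keyed store described in this section can be approximated in base R as below. This is a sketch under assumptions rather than the package's implementation: archive_sketch() and read_sketch() are hypothetical names, the releases-YYYY.rds file naming is invented, and the example data are made up.

```r
# Hypothetical sketch of a year-partitioned, url-keyed store; not pressR's code.
archive_sketch <- function(x, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (yr in unique(format(x$date, "%Y"))) {
    path <- file.path(dir, paste0("releases-", yr, ".rds"))
    new  <- x[format(x$date, "%Y") == yr, ]
    if (file.exists(path)) {
      # New rows go first so duplicated() keeps the fresh copy of each url.
      new <- rbind(new, readRDS(path))
    }
    new <- new[!duplicated(new$url), ]
    saveRDS(new[order(new$date), ], path, compress = "xz")
  }
  invisible(dir)
}

read_sketch <- function(dir, from, to) {
  files <- list.files(dir, pattern = "^releases-[0-9]{4}\\.rds$", full.names = TRUE)
  all <- do.call(rbind, lapply(files, readRDS))
  all[all$date >= as.Date(from) & all$date <= as.Date(to), ]
}

store <- file.path(tempdir(), "pressR-archive-sketch")
unlink(store, recursive = TRUE)   # start from an empty store

run1 <- data.frame(
  date  = as.Date(c("2025-12-30", "2026-01-05")),
  title = c("Old title", "January release"),
  url   = c("https://example.house.gov/a", "https://example.house.gov/b")
)
run2 <- data.frame(
  date  = as.Date(c("2025-12-30", "2026-02-01")),
  title = c("Corrected title", "February release"),
  url   = c("https://example.house.gov/a", "https://example.house.gov/c")
)

archive_sketch(run1, store)
archive_sketch(run2, store)   # re-scrape refreshes /a instead of duplicating it
out <- read_sketch(store, from = "2025-01-01", to = "2026-12-31")
out$title
```

One limitation of this simplified version: de-duplication happens within each year file, so a release whose date moves across a year boundary between scrapes would appear in both partitions.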

Development

devtools::load_all()
devtools::test()     # offline unit tests run against saved HTML fixtures
devtools::check()

The original grad-school notebook implementation is preserved under legacy/ for reference.
