Add CDCWonder_NNDSS_Infectious_Weekly scripts and schema mappings #1973
Open: abhishekjaisw wants to merge 11 commits into datacommonsorg:master from abhishekjaisw:statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly
Commits (11)
134cb5b Add NNDSS Infectious Weekly scripts and schema mappings (abhishekjaisw)
ee7192f Add NNDSS Infectious Weekly scripts and schema mappings v1 (abhishekjaisw)
4345dfd Merge branch 'master' into statvar_imports/cdc/CDCWonder_NNDSS_Infect… (abhishekjaisw)
2cfa6c0 Merge branch 'master' into statvar_imports/cdc/CDCWonder_NNDSS_Infect… (abhishekjaisw)
9f6fdfd Merge branch 'master' into statvar_imports/cdc/CDCWonder_NNDSS_Infect… (abhishekjaisw)
b433d40 Fix: Remove 'python' prefix from manifest script execution and implem… (abhishekjaisw)
f3e9e48 Merge branch 'master' into statvar_imports/cdc/CDCWonder_NNDSS_Infect… (abhishekjaisw)
f5155bf Fix: Remove 'python' prefix from manifest script execution and implem… (abhishekjaisw)
b8f4966 Merge branch 'statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly' o… (abhishekjaisw)
dab0b39 Merge branch 'master' of https://github.com/datacommonsorg/data into … (abhishekjaisw)
58a44fa Address PR comments: add bounds check, replace print with logging, ad… (abhishekjaisw)
39 changes: 39 additions & 0 deletions
statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| # CDCWonder_NNDSS_Infectious_Weekly | ||
|
|
||
| ## Overview | ||
| Notifiable Infectious Diseases Data: weekly tables from CDC WONDER that contain incident counts of infectious diseases over the previous 52 weeks, as reported by the 50 states, New York City, the District of Columbia, and the U.S. territories. | ||
|
|
||
| ## Data Source | ||
| **Source URL:** | ||
| `https://data.cdc.gov/api/views/x9gk-5huc/rows.csv?accessType=DOWNLOAD&api_foundry=true` | ||
|
|
||
| ## How To Download Input Data | ||
| To download and preprocess the data, run the provided script, `preprocess.py`. The script creates an `input_files` folder, downloads the source CSV into it, adds an `observationDate` column (the start date of each MMWR week) after the `MMWR WEEK` column, and saves the result as `NNDSS_Weekly_Data.csv`. | ||
|
|
||
| statvars: Infectious Diseases | ||
|
|
||
| ## Download the data: | ||
| To download and preprocess the source data, run: | ||
| ```python3 preprocess.py``` | ||
|
|
||
| ## Processing Instructions | ||
| To process data and generate statistical variables, use the following command from the "data" directory: | ||
|
|
||
| **For Test Data Run** | ||
| ``` | ||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/testdata/NNDSS_Weekly_Data.csv \ | ||
| --pv_map=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_pvmap.csv \ | ||
| --config_file=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_metadata.csv \ | ||
| --output_path=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/testdata/nndss_weekly_output | ||
| ``` | ||
|
|
||
| **For Main Data Run** | ||
| ```bash | ||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/input_files/NNDSS_Weekly_Data.csv \ | ||
| --pv_map=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_pvmap.csv \ | ||
| --config_file=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_metadata.csv \ | ||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \ | ||
| --output_path=statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/output/nndss_weekly_output | ||
| ``` |
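Both README commands pass the same four flags and differ only in the input and output locations (the main run adds `--existing_statvar_mcf`). A minimal sketch of assembling the test-data argument list in Python, assuming it is run from the `data` directory; the helper name `build_processor_args` is hypothetical and not part of the repository:

```python
from pathlib import PurePosixPath

# Paths copied from the README in this diff; the helper itself is illustrative only.
PROCESSOR = "tools/statvar_importer/stat_var_processor.py"
IMPORT_DIR = PurePosixPath("statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly")


def build_processor_args(input_csv: str, output_prefix: str) -> list[str]:
    """Return the argv list for one stat_var_processor.py run."""
    return [
        "python3",
        PROCESSOR,
        f"--input_data={IMPORT_DIR / input_csv}",
        f"--pv_map={IMPORT_DIR / 'nndss_weekly_pvmap.csv'}",
        f"--config_file={IMPORT_DIR / 'nndss_weekly_metadata.csv'}",
        f"--output_path={IMPORT_DIR / output_prefix}",
    ]


test_args = build_processor_args("testdata/NNDSS_Weekly_Data.csv",
                                 "testdata/nndss_weekly_output")
# Print one flag per line, in the same backslash-continued style as the README.
print(" \\\n  ".join(test_args))
```

The resulting list could be passed to `subprocess.run(test_args, check=True)`; building the flags from one `IMPORT_DIR` constant avoids repeating the long path four times.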
36 changes: 36 additions & 0 deletions
statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/manifest.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "import_specifications": [ | ||
| { | ||
| "import_name": "CDCWonder_NNDSS_Infectious_Weekly", | ||
| "curator_emails": [ | ||
| "support@datacommons.org" | ||
| ], | ||
| "provenance_url": "https://data.cdc.gov/api/views/x9gk-5huc/rows.csv?accessType=DOWNLOAD&api_foundry=true", | ||
| "provenance_description": "Notifiable Infectious Diseases Data: Weekly tables from CDC WONDER which has the incident counts of different infectious diseases per week that are reported by the 50 states, New York City, the District of Columbia, and the U.S. territories.", | ||
| "scripts": [ | ||
| "preprocess.py", | ||
| "../../../tools/statvar_importer/stat_var_processor.py --input_data=input_files/NNDSS_Weekly_Data.csv --pv_map='nndss_weekly_pvmap.csv' --config_file=nndss_weekly_metadata.csv --output_path=output/nndss_weekly_output" | ||
| ], | ||
| "import_inputs": [ | ||
| { | ||
| "template_mcf": "output/nndss_weekly_output.tmcf", | ||
| "cleaned_csv": "output/nndss_weekly_output.csv", | ||
| "node_mcf": "output/*.mcf" | ||
| } | ||
| ], | ||
| "source_files": [ | ||
| "input_files/NNDSS_Weekly_Data.csv" | ||
| ], | ||
| "cron_schedule": "00 11 1,15 * *", | ||
| "resource_limits": {"cpu": 8, "memory": 32, "disk": 100} | ||
| } | ||
| ], | ||
| "config_override": { | ||
| "invoke_import_validation": true, | ||
| "invoke_import_tool": true, | ||
| "invoke_differ_tool": true, | ||
| "skip_input_upload": false, | ||
| "skip_gcs_upload": false, | ||
| "cleanup_gcs_volume_mount": false | ||
| } | ||
| } |
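The `cron_schedule` value `00 11 1,15 * *` reads as minute 0, hour 11, on days 1 and 15 of every month. A small sketch that lists upcoming run times for exactly this schedule, using only the standard library. The function name `next_runs` is hypothetical, it handles only this fixed pattern rather than general cron syntax, and the timezone the scheduler applies is not stated in the manifest:

```python
import datetime

RUN_DAYS = (1, 15)  # day-of-month field "1,15"
RUN_HOUR = 11       # hour field "11"; the minute field is "00"


def next_runs(after: datetime.datetime, count: int) -> list[datetime.datetime]:
    """Return the next `count` run times strictly after `after`."""
    runs = []
    day = after.date()
    while len(runs) < count:
        candidate = datetime.datetime(day.year, day.month, day.day, RUN_HOUR, 0)
        if day.day in RUN_DAYS and candidate > after:
            runs.append(candidate)
        day += datetime.timedelta(days=1)
    return runs


for run in next_runs(datetime.datetime(2025, 1, 10, 9, 30), 3):
    print(run.isoformat())
```

Starting from 10 January 2025, the next three runs fall on 15 January, 1 February, and 15 February, each at 11:00.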
7 changes: 7 additions & 0 deletions
statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_metadata.csv
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| parameter,val | ||
| mapped_rows,1 | ||
| mapped_columns,5 | ||
| header_rows,1 | ||
| #places_resolved_csv, | ||
| input_columns,8 | ||
| #input_rows,1000 |
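The metadata file is a two-column `parameter,val` table, and two of its rows are prefixed with `#`. A minimal sketch of reading it into a dictionary with the standard library, assuming that `#`-prefixed rows are meant as disabled settings (the diff does not show how `stat_var_processor.py` treats them); `load_config` is a hypothetical helper:

```python
import csv
import io

# Contents copied from nndss_weekly_metadata.csv in this diff.
METADATA_CSV = """parameter,val
mapped_rows,1
mapped_columns,5
header_rows,1
#places_resolved_csv,
input_columns,8
#input_rows,1000
"""


def load_config(text: str) -> dict[str, str]:
    """Parse parameter,val rows, skipping rows whose parameter starts with '#'."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["parameter"]: row["val"]
        for row in reader
        if not row["parameter"].startswith("#")
    }


config = load_config(METADATA_CSV)
print(config)
```

Under that assumption, four settings remain active: `mapped_rows`, `mapped_columns`, `header_rows`, and `input_columns`. Values stay as strings here; any type conversion is left to the consumer.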
367 changes: 367 additions & 0 deletions
statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/nndss_weekly_pvmap.csv
Large diffs are not rendered by default.
158 changes: 158 additions & 0 deletions
statvar_imports/cdc/CDCWonder_NNDSS_InfectiousWeekly/preprocess.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,158 @@ | ||
| # Copyright 2025 Google LLC | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
|
|
||
| import os, sys | ||
| import pandas as pd | ||
| from absl import app, logging | ||
| from pathlib import Path | ||
| import datetime | ||
| import importlib.util | ||
| import shutil | ||
|
|
||
| script_dir = os.path.dirname(os.path.abspath(__file__)) | ||
| util_script_path = os.path.abspath(os.path.join(script_dir, '../../../util/download_util_script.py')) | ||
| spec = importlib.util.spec_from_file_location('download_util_script', util_script_path) | ||
| if spec is None or spec.loader is None: | ||
| raise ImportError(f'Could not load download_util_script from {util_script_path}') | ||
| download_util_script = importlib.util.module_from_spec(spec) | ||
| spec.loader.exec_module(download_util_script) | ||
| download_file = download_util_script.download_file | ||
| INPUT_DIR = os.path.join(script_dir, "input_files") | ||
| Path(INPUT_DIR).mkdir(parents=True, exist_ok=True) | ||
| INPUT_FILE = os.path.join(INPUT_DIR, "rows.csv") | ||
| NEW_FILE = os.path.join(INPUT_DIR, "NNDSS_Weekly_Data.csv") | ||
| SOURCE_URL = "https://data.cdc.gov/api/views/x9gk-5huc/rows.csv?accessType=DOWNLOAD&api_foundry=true" | ||
|
|
||
| def _start_date_of_year(year: int) -> datetime.date: | ||
| """Return the first day of the first MMWR week for a given year. | ||
|
|
||
| The first MMWR week starts on the Sunday of the week containing Jan 4. | ||
| """ | ||
| jan_one = datetime.date(year, 1, 1) | ||
| diff = 7 * (jan_one.isoweekday() > 3) - jan_one.isoweekday() | ||
| return jan_one + datetime.timedelta(days=diff) | ||
|
|
||
| def get_mmwr_week_start_date(year, week) -> datetime.date: | ||
| """Compute the start date for a given MMWR year and week. | ||
|
|
||
| Args: | ||
| year: The MMWR year value from the CDC dataset. | ||
| week: The MMWR week value from the CDC dataset. | ||
|
|
||
| Returns: | ||
| A datetime.date object for the first day of the specified week, or None if invalid. | ||
| """ | ||
| try: | ||
| year = int(year) | ||
| week = int(week) | ||
| except (ValueError, TypeError): | ||
| return None | ||
|
|
||
| if not (1 <= week <= 53): | ||
| logging.warning(f"Invalid MMWR WEEK found: {week}. Skipping date calculation.") | ||
| return None | ||
|
|
||
| day_one = _start_date_of_year(year) | ||
| diff = 7 * (week - 1) | ||
| return day_one + datetime.timedelta(days=diff) | ||
|
|
||
| def preprocess_data(filepath: str): | ||
| """Read a CDC CSV in chunks, add observation dates, and save safely. | ||
|
|
||
| Args: | ||
| filepath: Path to the downloaded CDC CSV file. | ||
| """ | ||
| temp_filepath = filepath + ".tmp" | ||
| chunk_size = 100000 | ||
| first_chunk = True | ||
| chunk_count = 0 | ||
|
|
||
| try: | ||
| logging.info(f"Opening pandas reader on {filepath}...") | ||
|
|
||
| # Added safety flags: low_memory=False and on_bad_lines='skip' | ||
| # to prevent C-level SIGABRT crashes on bad rows. | ||
| reader = pd.read_csv(filepath, chunksize=chunk_size, low_memory=False, on_bad_lines='skip') | ||
|
|
||
| for chunk in reader: | ||
| chunk_count += 1 | ||
| logging.info(f"Processing chunk {chunk_count}...") | ||
|
|
||
| if first_chunk: | ||
| required_cols = ['Current MMWR Year', 'MMWR WEEK'] | ||
| if not all(col in chunk.columns for col in required_cols): | ||
| raise KeyError(f"The file must contain the columns: {required_cols}.") | ||
|
|
||
| chunk['observationDate'] = chunk.apply( | ||
| lambda row: get_mmwr_week_start_date(row['Current MMWR Year'], row['MMWR WEEK']), | ||
|
|
||
| axis=1 | ||
| ) | ||
|
|
||
| cols = list(chunk.columns) | ||
| cols.remove('observationDate') | ||
| mmwr_week_index = cols.index('MMWR WEEK') | ||
| cols.insert(mmwr_week_index + 1, 'observationDate') | ||
| chunk = chunk[cols] | ||
|
|
||
| chunk.to_csv(temp_filepath, mode='a' if not first_chunk else 'w', | ||
| header=first_chunk, index=False) | ||
| first_chunk = False | ||
|
|
||
| logging.info("All chunks processed. Moving temp file...") | ||
| shutil.move(temp_filepath, filepath) | ||
| logging.info(f"Success: File '{filepath}' updated safely.") | ||
|
|
||
| except Exception as e: | ||
| if os.path.exists(temp_filepath): os.remove(temp_filepath) | ||
| logging.error(f"Error during Pandas processing: {e}") | ||
| # absl's logging.fatal aborts the process, so log at error level above and let the exception propagate. | ||
| raise RuntimeError(f"Import job failed. An unexpected error occurred: {e}") from e | ||
|
|
||
| def main(argv): | ||
| """Download CDC data, validate it, preprocess it, and rename the output.""" | ||
| logging.info("Starting download phase...") | ||
| try: | ||
| download_file(url=SOURCE_URL, | ||
| output_folder=INPUT_DIR, | ||
| unzip=False, | ||
| headers= None, | ||
| tries= 3, | ||
| delay= 5, | ||
| backoff= 2) | ||
| logging.info("Download function completed.") | ||
| except Exception as e: | ||
| logging.error(f"Failed during download: {e}") | ||
| raise RuntimeError(f"Failed to download NNDSS weekly data file: {e}") from e | ||
|
|
||
| # Check if file actually downloaded and check its size | ||
| if not os.path.exists(INPUT_FILE): | ||
| logging.error("The file 'rows.csv' was never downloaded.") | ||
| sys.exit(1) | ||
|
|
||
| file_size_mb = os.path.getsize(INPUT_FILE) / (1024 * 1024) | ||
| logging.info(f"Downloaded file size is {file_size_mb:.2f} MB.") | ||
|
|
||
| # Prevent Pandas from processing tiny error files | ||
| if file_size_mb < 0.1: | ||
| logging.error("File is suspiciously small! CDC likely returned an HTML error page.") | ||
| with open(INPUT_FILE, 'r') as f: | ||
| logging.error(f"Preview of bad file:\n{f.read(500)}") | ||
| sys.exit(1) | ||
|
|
||
| logging.info("Handing off to Pandas chunker...") | ||
| preprocess_data(INPUT_FILE) | ||
|
|
||
| logging.info("Renaming final file...") | ||
| try: | ||
| if os.path.exists(INPUT_FILE): | ||
| if os.path.exists(NEW_FILE): | ||
| os.remove(NEW_FILE) | ||
| os.rename(INPUT_FILE, NEW_FILE) | ||
| logging.info("Successfully renamed file.") | ||
| except Exception as e: | ||
| logging.error(f"Failed to rename file: {e}") | ||
| sys.exit(1) | ||
|
|
||
| if __name__ == "__main__": | ||
| app.run(main) | ||
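The date arithmetic in `_start_date_of_year` can be checked by hand: 1 January 2025 is a Wednesday (`isoweekday() == 3`), so the offset is `7 * 0 - 3 = -3` days and MMWR week 1 of 2025 starts on Sunday, 29 December 2024. The sketch below restates the two functions from the diff in standalone form, without the logging and input validation, and prints a few start dates; the names `start_date_of_year` and `mmwr_week_start` are used here for illustration only:

```python
import datetime


def start_date_of_year(year: int) -> datetime.date:
    """Sunday that begins MMWR week 1 (the Sunday-to-Saturday week containing Jan 4)."""
    jan_one = datetime.date(year, 1, 1)
    # isoweekday(): Monday=1 ... Sunday=7. Monday to Wednesday step back to the
    # previous Sunday; Thursday to Saturday step forward to the next Sunday;
    # a Sunday stays where it is.
    diff = 7 * (jan_one.isoweekday() > 3) - jan_one.isoweekday()
    return jan_one + datetime.timedelta(days=diff)


def mmwr_week_start(year: int, week: int) -> datetime.date:
    """Start date (a Sunday) of the given MMWR week."""
    return start_date_of_year(year) + datetime.timedelta(weeks=week - 1)


for year, week in [(2024, 1), (2025, 1), (2025, 10), (2026, 1)]:
    print(year, week, mmwr_week_start(year, week).isoformat())
```

When 1 January falls on a Thursday, as in 2026, the first MMWR week starts in the new year (4 January 2026) rather than in the last days of December.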