📊 This project is Part 1 of a two-part portfolio.
See Healthcare Executive Dashboard for the Tableau visualization layer built on these findings.
Structured SQL analysis on 169,635 real Medicare records to answer the cost and quality questions hospital executives and health insurers actually act on.
This project loads three publicly available CMS Medicare datasets — covering 3,000+ US hospitals, 6 tracked diagnoses, and one full fiscal year of readmission and cost data — into a local SQLite database and applies 20 layered SQL queries to surface operational and financial insights.
The analysis is organized the way a real analytics engagement would be: starting with data quality and exploration, moving through aggregations and segmentation, and ending with five executive-ready business question answers backed by specific numbers. The findings below are drawn directly from the query outputs — not approximations.
| # | Finding | Key Number |
|---|---|---|
| 1 | Septicemia is the #1 national cost driver — DRG 871 alone generates more Medicare spend than the next two conditions combined | $10.3B estimated total payments |
| 2 | Heart Failure has the largest readmission burden — nearly 1 in 5 patients returns within 30 days | 19.4% 30-day readmission rate across 309,000 discharges |
| 3 | CABG is the worst cost-quality combination — the most expensive service line still carries above-benchmark readmission rates | $49,121 avg cost per discharge, ERR > 1.0 |
| 4 | AdventHealth Orlando is the highest national penalty risk — penalised on 5 of 6 tracked diagnoses simultaneously | ~$911K estimated annual CMS penalty exposure |
| 5 | Massachusetts is structurally failing on readmissions — worst-performing state in 4 of 6 HRRP diagnoses despite high hospital density | ERR > 1.05 across AMI, COPD, HF, Pneumonia |
| 6 | Critical-tier hospitals identified in Alaska and Alabama — excess readmission ratios above 1.20, flagged across multiple diagnoses | ERR up to 1.23 (Central Peninsula General, AK) |
All data is sourced from data.cms.gov — public domain, no license or account required.
| Dataset | What It Contains |
|---|---|
| Medicare Inpatient Hospitals by Provider and Service | Average charges, total and Medicare payments by DRG code and hospital |
| Hospital Readmissions Reduction Program (HRRP) FY2024 | Excess readmission ratios, predicted vs. expected rates, patient volumes by facility and diagnosis |
| Hospital General Information | Facility type, ownership, state, CMS star rating — one row per hospital |
Raw CSVs are excluded from this repo via
.gitignore. To run immediately without downloading data, usepython scripts/create_sample_db.pyto generate a sample database — all 20 queries will execute against it.
| Category | Techniques |
|---|---|
| Exploration | Row counts, NULL profiling, distinct value checks, schema validation |
| Aggregations | GROUP BY, HAVING, conditional aggregation with CASE WHEN inside SUM |
| Segmentation | Cost quartiles with NTILE(), diagnosis performance tiers, facility risk classification |
| JOINs | INNER JOIN and LEFT JOIN across 3 tables; scalar subqueries in SELECT clause |
| Window Functions | RANK() and PARTITION BY for within-state ranking, PERCENT_RANK() for national percentile, LAG() for period comparison, NTILE() for distribution buckets |
| CTEs | Multi-step CTEs (4+ stages) for readmission risk scoring and cost efficiency indexing |
| Business Questions | Five end-to-end queries combining all of the above into executive-ready outputs |
BQ16: Patient Segments by Readmission Risk

BQ17: Top Cost Drivers by Service Line

BQ18: Cost vs Quality Analysis

BQ19: Worst Performing Hospitals by State

BQ20: Executive Summary Dashboard

-
Clone the repo
git clone https://github.com/barvepriyamvada/SQL-Analytics.git cd SQl-Analytics -
Install dependencies
pip install -r requirements.txt
-
Generate the sample database (no data download needed)
python scripts/create_sample_db.py
-
Run any query against the database
sqlite3 data/healthcare_analytics.db < queries/07_business_questions/16_bq_highest_readmission_segments.sqlOr open
data/healthcare_analytics.dbin DB Browser for SQLite and run any.sqlfile interactively. -
Optional — load real CMS data for full results Download the three CSVs from data.cms.gov, place them in
data/raw/, then:python scripts/load_data.py
├── data/
│ ├── raw/ # CMS CSVs — not committed to git
│ └── healthcare_analytics.db # SQLite database (generated)
│
├── queries/
│ ├── 01_exploration/ # Row counts, data quality checks
│ ├── 02_aggregations/ # National averages, cost summaries
│ ├── 03_segmentation/ # Facility tiers, cost quartiles
│ ├── 04_joins/ # Multi-table analysis
│ ├── 05_window_functions/ # RANK, LAG, NTILE, PERCENT_RANK
│ ├── 06_cte_analysis/ # Multi-step CTEs, composite scoring
│ └── 07_business_questions/ # Executive-ready answers to BQs 1–5
│
├── outputs/ # Query result CSVs and screenshots
├── scripts/
│ ├── create_sample_db.py # Generates sample database instantly
│ └── load_data.py # Full data loading and cleaning pipeline
└── README.md
Built as part of a data analytics portfolio. Data sourced from CMS.gov — public domain.