🏥 Predictive Modeling: U.S. Health Insurance Cost Analysis

STAT 311 | Advanced Regression & Analytics | Fall 2025

📌 Executive Summary

U.S. medical insurance charges are strongly right-skewed and shaped by interactions between lifestyle factors. This project develops a Log-Transformed Interaction Model to forecast cost from demographic data. Log-transforming the charges stabilized the residual variance and reduced non-normality, and the final model reached a CV of 4.77%.

🧠 The "Gap" Analysis: Why simple models failed

Standard linear models often fail on healthcare data because costs are not distributed normally. My analysis identified two critical "blind spots" in basic modeling:

The Variance Gap: Insurance charges are highly skewed; a Box-Cox transformation was required to normalize residuals.
The BMI-Smoker Interaction: BMI does not affect everyone equally. Its effect on cost is far larger for smokers than for non-smokers.
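As a rough illustration of how a Box-Cox search can point toward a log transform, here is a minimal sketch in plain Python. It is not the JMP procedure used in the project: it assumes an intercept-only model instead of the full regression, the `charges` values are invented toy numbers, and the function names are hypothetical.

```python
import math

def boxcox_transform(values, lam):
    """Box-Cox transform: log(y) when lam is 0, else (y**lam - 1) / lam."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def boxcox_log_likelihood(values, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(values)
    transformed = boxcox_transform(values, lam)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n  # MLE variance
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + log_jacobian

# Toy right-skewed charges (illustration only, not the project data).
# Their logs are evenly spaced, so the search is expected to favor lambda = 0.
charges = [math.exp(z) for z in (8.0, 8.5, 9.0, 9.5, 10.0)]
lambda_grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_lambda = max(lambda_grid, key=lambda lam: boxcox_log_likelihood(charges, lam))
print(best_lambda)
```

A best value at or near 0 corresponds to the natural log, which is the transformation applied in the final model.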

🛠️ Technical Methodology

I used JMP Statistical Software to build and validate three successive models:

Model 1 (Baseline): Main effects only (Age, BMI, Smoking status).
Model 2 (Interaction): Added a $\text{BMI} \times \text{Smoker}$ interaction term so that the effect of BMI on cost can differ between smokers and non-smokers.
Model 3 (Selected Final): Applied a natural log transformation, $\ln(\text{charges})$, to the response to address heteroscedasticity and stabilize variance.
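The final specification can be sketched outside JMP as an ordinary least squares fit on a design matrix that includes the interaction column. The example below is an assumption-laden illustration, not the project's fit: the coefficients in `TRUE_COEFS` are made up to generate noise-free toy data, the term list follows the Model 3 specification given under How to Reproduce, and all helper names are hypothetical.

```python
from itertools import product

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_ols(design, response):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, response)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def design_row(age, bmi, children, smoker):
    """Model 3 terms: intercept, Age, BMI, Children, Smoker, BMI x Smoker."""
    return [1.0, age, bmi, children, smoker, bmi * smoker]

# Made-up coefficients on the ln(charges) scale, used only to build toy data.
TRUE_COEFS = [7.0, 0.03, 0.01, 0.10, 0.20, 0.04]

# Toy grid of (age, bmi, children, smoker) combinations; smoker is coded 0/1.
grid = list(product([20, 35, 50, 64], [20.0, 27.5, 35.0], [0, 2], [0, 1]))
design = [design_row(*combo) for combo in grid]
log_charges = [sum(b * x for b, x in zip(TRUE_COEFS, row)) for row in design]

coefs = fit_ols(design, log_charges)
bmi_slope_nonsmoker = coefs[2]           # BMI effect when smoker = 0
bmi_slope_smoker = coefs[2] + coefs[5]   # BMI effect when smoker = 1
print(bmi_slope_nonsmoker, bmi_slope_smoker)
```

With the interaction term in place, the BMI slope for smokers is the BMI coefficient plus the interaction coefficient. Because the response is on the log scale, each slope acts multiplicatively on charges. Dropping the last column of `design_row` and fitting raw charges would give a Model 1 style main-effects fit.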

Model Performance Metrics

| Metric | Model 1 | Model 3 (Final) |
| --- | --- | --- |
| Model Type | Linear Main Effects | Log-Transformed Interaction |
| Predictive Accuracy | Baseline | CV = 4.77% |
| Variance Stability | Poor (Heteroscedastic) | Stable (Homoscedastic) |
| Key Insight | Smoking is a predictor | Smoking + BMI interaction drives 80%+ of cost |
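For readers unfamiliar with the CV metric, the sketch below shows one common definition. It assumes that CV here means the coefficient of variation (root mean squared error as a percentage of the mean response), which this README does not state explicitly. The data values and the function name are hypothetical.

```python
import math

def coefficient_of_variation(observed, fitted, n_params):
    """Return 100 * RMSE / mean(observed), with RMSE = sqrt(SSE / (n - n_params))."""
    n = len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    rmse = math.sqrt(sse / (n - n_params))
    mean_response = sum(observed) / n
    return 100.0 * rmse / mean_response

# Toy ln(charges) values and fitted values (illustration only, not project output).
observed = [9.0, 9.5, 8.5, 9.0, 9.0]
fitted = [9.1, 9.4, 8.6, 8.9, 9.0]
cv_percent = coefficient_of_variation(observed, fitted, n_params=1)
print(round(cv_percent, 2))
```

Under this definition, a CV computed on $\ln(\text{charges})$ is relative to the mean of the logged response, so it is not directly comparable to a CV computed on raw dollar charges.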

🗂️ Repository Structure

STAT311_Final_Report.pdf: Comprehensive 20+ page analysis including variable screening (Stepwise), diagnostics, and influence/leverage plots.
JMP_Files/: Complete reproducible environment including .jmpjournal files and model fit scripts.
insurance.csv: Cleaned dataset (1,335 observations).

🚀 How to Reproduce

Open insurance.jmp in JMP.
Run the saved scripts for Model 3.
Model Specifications: $Y = \ln(\text{charges})$; $X = \text{Age}, \text{BMI}, \text{Children}, \text{Smoker}, \text{BMI} \times \text{Smoker}$.
Diagnostics used: VIF for multicollinearity, Box-Cox for transformation, and Cook's D for influence.
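To show what the influence check measures, here is a small sketch of Cook's distance. It is simplified to a one-predictor regression so the leverage has a closed form, whereas the project computes Cook's D in JMP on the full model. The function name and the toy data are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression y ~ x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    n_params = 2  # intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - n_params)
    # Leverage of each point in simple linear regression.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (n_params * mse)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]

# Toy data: the last point sits far from the others in x and off the trend.
x_values = [1.0, 2.0, 3.0, 4.0, 10.0]
y_values = [1.0, 2.0, 3.0, 4.0, 0.0]
distances = cooks_distances(x_values, y_values)
most_influential = max(range(len(distances)), key=lambda i: distances[i])
print(most_influential, distances[most_influential])
```

A point combining high leverage with a sizeable residual gets a large distance. A common rule of thumb treats values above 1 as worth a closer look.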

Course: STAT 311 – Regression Analysis
Instructor: Dr. Iresha Premarathna
Author: Feifei Li (GPA: 3.9)

