A small data-processing project for practicing pandas and numpy on synthetic banking transactions.
The pipeline:
- Loads transaction data from `data/fake_banking_transactions.csv`
- Cleans the data (removes duplicates, fills missing values)
- Adds a feature flag (`is_large`) for high-value transactions
- Applies a simple fraud rule (`fraud_flag`)
- Saves processed output to `data/processed/clean_transactions.csv`
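Assuming the steps above, the whole pipeline can be sketched end to end as follows. The function name and the median fill strategy are illustrative; the project's actual helpers live in `src/` and may differ:

```python
import pandas as pd

def run_pipeline(in_path: str, out_path: str) -> pd.DataFrame:
    """Sketch of the pipeline described above (illustrative, not the real code)."""
    df = pd.read_csv(in_path, parse_dates=["date"])

    # Cleaning: drop duplicate rows, fill missing amounts (median is an assumption)
    df = df.drop_duplicates()
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Feature flag: is_large = 1 when amount > 2000
    df["is_large"] = (df["amount"] > 2000).astype(int)

    # Fraud rule: amount > 5000 AND category == "crypto"
    df["fraud_flag"] = (
        (df["amount"] > 5000) & (df["category"] == "crypto")
    ).astype(int)

    df.to_csv(out_path, index=False)
    return df
```

Each step is vectorized over the whole frame, so no Python-level row loop is needed.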
- `main.py` - Entry point for the pipeline
- `src/clean.py` - Data cleaning helpers
- `src/features.py` - Feature engineering (`is_large`)
- `src/fraud_rules.py` - Rule-based fraud tagging
- `data/fake_banking_transactions.csv` - Raw synthetic dataset
- `data/processed/clean_transactions.csv` - Processed output
- `notebooks/analysis.ipynb` - Notebook for exploratory analysis
Input CSV columns:
- `customer_id` (int)
- `amount` (float)
- `merchant` (str)
- `category` (str)
- `date` (YYYY-MM-DD)
The synthetic data intentionally includes:
- Missing values
- Duplicate rows
- Outliers (very large and some negative transaction amounts)
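One way a cleaning helper could handle these quirks, sketched below. The function name, the median fill, and the choice to drop negative amounts are assumptions; `src/clean.py` may take a different approach:

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning sketch for duplicates, missing values, and outliers."""
    out = df.drop_duplicates().copy()

    # Missing amounts: fill with the median (an assumption; mean or drop also work)
    out["amount"] = out["amount"].fillna(out["amount"].median())

    # Rows with no category are unusable for the fraud rule; drop them
    out = out.dropna(subset=["category"])

    # Negative amounts are treated here as data-entry errors and removed
    out = out[out["amount"] >= 0]

    return out.reset_index(drop=True)
```

Very large positive amounts are deliberately kept, since the `is_large` and fraud rules depend on them.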
Current rule in `src/fraud_rules.py`:
- Mark a transaction as fraud (`fraud_flag = 1`) when `amount > 5000` and `category == "crypto"`
- Otherwise `fraud_flag = 0`
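The rule translates to a single vectorized expression. The function name below is illustrative; only the threshold and category come from the rule stated above:

```python
import numpy as np
import pandas as pd

def tag_fraud(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the rule: fraud_flag = 1 iff amount > 5000 and category == "crypto"."""
    out = df.copy()
    out["fraud_flag"] = np.where(
        (out["amount"] > 5000) & (out["category"] == "crypto"), 1, 0
    )
    return out
```

Working on a copy keeps the caller's frame unmodified, which makes the helper easier to test in isolation.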
Feature in `src/features.py`:
- `is_large = 1` when `amount > 2000`, else `0`
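A minimal sketch of this feature, with the threshold exposed as a parameter (the parameter itself is an assumption; the stated default of 2000 comes from the rule above):

```python
import pandas as pd

def add_is_large(df: pd.DataFrame, threshold: float = 2000) -> pd.DataFrame:
    """is_large = 1 when amount strictly exceeds the threshold, else 0."""
    out = df.copy()
    out["is_large"] = (out["amount"] > threshold).astype(int)
    return out
```

Note the comparison is strict: an amount of exactly 2000 is not flagged as large.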
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
.\.venv\Scripts\python.exe main.py
```

You should see:
- Preview rows in the terminal
- A fraud count summary
- Output written to `data/processed/clean_transactions.csv`
- Replace the rule-based logic with an anomaly detection / ML baseline
- Add tests for cleaning and fraud rules
- Add configuration (thresholds, input/output paths) via `.env` or CLI arguments
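For the configuration item, a hedged sketch of what CLI arguments could look like with `argparse`. All flag names and defaults are illustrative (the defaults just mirror the paths and thresholds mentioned in this README):

```python
import argparse

def parse_args(argv=None):
    """Illustrative CLI for the pipeline; flag names are assumptions."""
    p = argparse.ArgumentParser(description="Synthetic banking transactions pipeline")
    p.add_argument("--input", default="data/fake_banking_transactions.csv",
                   help="Path to the raw CSV")
    p.add_argument("--output", default="data/processed/clean_transactions.csv",
                   help="Path for the processed CSV")
    p.add_argument("--large-threshold", type=float, default=2000,
                   help="Amount above which is_large = 1")
    p.add_argument("--fraud-threshold", type=float, default=5000,
                   help="Amount threshold in the fraud rule")
    return p.parse_args(argv)
```

Passing `argv=None` makes `argparse` read `sys.argv` in production while tests can inject an explicit list.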