I'm a data analyst with 3+ years of experience building ETL workflows, Power BI dashboards, and SQL pipelines in production environments. In my day job, I work with messy, real-world business data — normalizing CRM exports, flagging anomalies in transactional data, and building pipelines that business teams actually rely on.
I built this project to go deeper on the data engineering side — specifically, to demonstrate that I can design a warehouse architecture from scratch, not just query one that already exists. Every design decision here is intentional, and I've documented my reasoning below.
The project follows Medallion Architecture with Bronze, Silver, and Gold layers:
| Layer | Purpose | What I did |
|---|---|---|
| Bronze | Raw ingestion | Loaded ERP and CRM CSV files as-is using BULK INSERT — no transformations, preserving source fidelity |
| Silver | Cleansing & standardization | Resolved mismatched customer IDs between ERP and CRM, standardized date formats, handled NULLs and duplicates |
| Gold | Business-ready analytics | Modeled a star schema with fact and dimension tables optimized for reporting queries |
This end-to-end project covers:
- Data Architecture — Medallion Architecture (Bronze / Silver / Gold)
- ETL Pipelines — Extract from CSV sources, transform with T-SQL, load into SQL Server
- Data Modeling — Star schema with fact and dimension tables
- Analytics & Reporting — SQL-based insights on customer behavior, product performance, and sales trends
This is the section I wish more portfolio projects had — the why behind the choices, not just the what.
1. Why Medallion Architecture instead of a single-layer approach? In my work at Elde Info Solution, I've seen what happens when raw data and business logic live in the same place — every schema change breaks downstream reports. Separating Bronze (raw), Silver (clean), and Gold (business logic) means each layer has one job and failures are isolated. If a source system changes its date format, I fix it in Silver; Gold is untouched.
2. Why store raw data in Bronze without transforming it? The Bronze layer is an audit trail. If a data quality issue surfaces in Silver, I can always trace back to what the source actually sent. This is a pattern I use in my ETL work professionally — never throw away raw data.
3. The CRM / ERP integration challenge The two source systems used inconsistent customer identifiers — the ERP used numeric IDs while the CRM had a mixed alphanumeric format. I resolved this in the Silver transformation layer by standardizing to a canonical integer key before joining, rather than trying to fix it downstream in Gold. This keeps the business logic clean.
4. Star schema over 3NF for the Gold layer For analytical workloads, a normalized 3NF schema creates too many joins. A star schema — one fact table, clean dimension tables — makes queries faster and easier for stakeholders to write. This mirrors what I implement when building Power BI data models at work.
5. What I'd do differently at scale
For a production environment, I'd replace manual BULK INSERT with Azure Data Factory for automated, scheduled ingestion. I'd also add proper logging and alerting to the ETL stored procedures. This project is intentionally scoped to SQL Server to focus on core data modeling principles.
SELECT
c.customer_key,
c.customer_name,
c.country,
SUM(f.sales_amount) AS total_revenue,
COUNT(DISTINCT f.order_number) AS total_orders
FROM gold.fact_sales f
JOIN gold.dim_customers c ON f.customer_key = c.customer_key
GROUP BY c.customer_key, c.customer_name, c.country
ORDER BY total_revenue DESC;| Customer Name | Country | Total Revenue | Total Orders |
|---|---|---|---|
| Jordan Turner | Australia | $51,200 | 18 |
| Hailey Reed | United States | $47,850 | 22 |
| Nathan Foster | Germany | $43,100 | 15 |
| Abigail Price | United Kingdom | $39,600 | 19 |
| Madison Rivera | France | $37,250 | 14 |
SELECT
p.category,
SUM(f.sales_amount) AS total_revenue,
SUM(f.quantity) AS total_units_sold,
ROUND(SUM(f.sales_amount) / NULLIF(SUM(f.quantity), 0), 2) AS avg_selling_price
FROM gold.fact_sales f
JOIN gold.dim_products p ON f.product_key = p.product_key
GROUP BY p.category
ORDER BY total_revenue DESC;| Category | Total Revenue | Units Sold | Avg Selling Price |
|---|---|---|---|
| Bikes | $28,318,144 | 13,930 | $2,032.45 |
| Accessories | $700,760 | 36,092 | $19.41 |
| Clothing | $339,773 | 11,461 | $29.64 |
Insight: Bikes drive ~97% of total revenue but Accessories have significantly higher unit volume — a pricing or upsell opportunity.
SELECT
YEAR(f.order_date) AS order_year,
MONTH(f.order_date) AS order_month,
SUM(f.sales_amount) AS monthly_revenue
FROM gold.fact_sales f
GROUP BY YEAR(f.order_date), MONTH(f.order_date)
ORDER BY order_year, order_month;| Year | Month | Monthly Revenue |
|---|---|---|
| 2021 | 1 | $1,042,350 |
| 2021 | 2 | $988,200 |
| 2021 | 3 | $1,231,800 |
| 2021 | 6 | $1,876,500 |
| 2021 | 12 | $2,105,400 |
Insight: Revenue peaks in mid-year (June) and December, suggesting seasonality worth accounting for in inventory planning.
| Tool | Purpose |
|---|---|
| SQL Server Express | Database engine |
| T-SQL / SSMS | ETL scripting, stored procedures, queries |
| Draw.io | Architecture and data flow diagrams |
| Git / GitHub | Version control |
| CSV (ERP + CRM) | Source data |
sql-data-warehouse-project/
│
├── datasets/ # Raw source data (ERP and CRM CSV files)
│
├── docs/ # Architecture and design documentation
│ ├── data_architecture.png # Medallion Architecture diagram
│ ├── data_architecture.drawio
│ ├── data_catalog.md # Field descriptions and metadata for all tables
│ ├── data_flow.drawio # End-to-end data flow diagram
│ ├── data_models.drawio # Star schema diagram (fact + dimensions)
│ ├── etl.drawio # ETL technique overview
│ └── naming-conventions.md # Naming standards for tables, columns, files
│
├── scripts/ # All T-SQL scripts
│ ├── bronze/ # Raw data ingestion (BULK INSERT, DDL)
│ ├── silver/ # Data cleansing and transformation
│ └── gold/ # Star schema views and analytical models
│
├── tests/ # Data quality checks
│ └── quality_checks.sql # NULL checks, duplicate detection, referential integrity
│
├── README.md
└── LICENSE
- Install SQL Server Express and SSMS
- Clone this repository
- Execute scripts in order:
scripts/bronze/— creates and loads raw tablesscripts/silver/— cleanses and transforms datascripts/gold/— builds star schema views
- Run
tests/quality_checks.sqlto validate data integrity - Query the Gold layer views for analytics
This project is licensed under the MIT License.
I'm Disha Jain, an Analytics & BI Professional with 3+ years of experience in data modeling, ETL development, and stakeholder reporting. I've built production Power BI dashboards, engineered ETL workflows for jewelry and CRM data, and automated reporting pipelines using SQL and Power Automate.
I'm currently seeking Data Analyst and BI Developer roles where I can apply both my hands-on SQL/data engineering skills and my experience translating data into decisions for business teams.
📌 Skills: SQL · Power BI · Python · Tableau · Alteryx · Azure · DAX · ETL · Data Modeling · Star Schema
| Project | What it covers |
|---|---|
| This repo | Data warehouse: ETL, Medallion Architecture, star schema |
| SQL Analytics Project | Advanced SQL analysis on the warehouse data |
| Power BI Dashboard | End-to-end BI report: exec dashboard, AI visuals, DAX |
