Open-source data engineering demo project using dbt, DuckDB, dlt, Dagster and Metabase. Two storage modes for the Delta tables are supported: local and Microsoft Fabric OneLake.
Updated Jun 2, 2026 - Python
SCD2 implementation using PySpark.
A modern banking data pipeline built with Dagster and dbt.
An end-to-end data pipeline system built as part of the Coursera open-source Data Engineering program. It unifies diverse data sources, implements SCD2 historical tracking, and orchestrates workflows using industry-standard tools.
P&C insurance claims lakehouse: Azure ADLS + Databricks (PySpark/Delta) + Snowflake + dbt, real-time FNOL fraud signals via Kafka, Airflow-orchestrated, Terraform-provisioned, OIDC-secured, with data contracts, lineage, and ADRs throughout.
Advanced Healthcare Claims Pipeline using Snowflake, Snowpipe, Streams, Tasks, SCD Type 2, and AWS S3. Automates ingestion, CDC, dimensional modeling, and data quality checks for healthcare patient and claims data.
Fortune-500-grade banking analytics platform: OLTP -> medallion lakehouse -> Kimball star schema -> semantic layer -> 9-tab executive dashboard + 5 ML models (churn, fraud, segmentation, forecasting). Production-ready, governed, fully tested.
Production-grade parameterized ETL pipeline implementing SCD Type 2 for travel booking data using Databricks, Delta Lake, and ADLS — includes data quality checks, incremental fact table build, Z-Order optimization, and SQL reporting.
End-to-end Medicare data engineering pipeline: API ingestion, PostgreSQL 17, dbt, dimensional modeling (Kimball/SCD2), Apache Airflow orchestration, and Evidence.dev dashboard. Built on a QEMU/KVM Rocky Linux VM.
Production-grade CDC pipeline: MySQL → Debezium → Kinesis → S3 → AWS Glue (PySpark) → Redshift + Postgres + OpenSearch. Multi-sink fanout with SCD2, idempotency tracking, and 13 modular Terraform modules.
Batch retail data lakehouse on Databricks: Delta Live Tables (bronze → silver → gold), Unity Catalog, synthetic data generator, and an executive analytics dashboard.
Modern data stack reference: dbt + BigQuery + Airflow (Cloud Composer) with medallion layering, SCD2 snapshots, exposures, freshness SLAs, and 45× cost reduction via partition + cluster + incremental tuning.
This is a data engineering pipeline built on Databricks + Delta Lake + PySpark that ingests travel booking and customer master data, applies SCD Type 2 logic, and delivers analytics-ready tables. It includes data quality enforcement, dimension versioning, fact aggregation, and performance tuning.
Reference Snowflake ingestion patterns: streams and tasks, and dynamic tables with SCD2 and deduplication. Provisioned with Terraform, plus a dbt sandbox.
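
The projects listed above all apply SCD Type 2 (SCD2) versioning: when a tracked attribute changes, the current dimension row is closed and a new current row is opened, so history is preserved. The following is a minimal in-memory sketch of that idea in plain Python, not the implementation of any listed repository. The names `DimRow`, `apply_scd2`, `valid_from`, `valid_to`, and `is_current` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical layout)."""
    key: str                          # business key
    attrs: tuple                      # tracked attributes as sorted (name, value) pairs
    valid_from: date
    valid_to: Optional[date] = None   # None means the version is still open
    is_current: bool = True


def apply_scd2(dim, incoming, as_of):
    """Merge `incoming` ({key: {attr: value}}) into `dim` and return a new list.

    Changed keys get their current row closed and a new current row opened;
    unchanged keys are left alone; unseen keys are inserted as current rows.
    """
    result = []
    matched = set()
    for row in dim:
        # Historical rows and keys absent from the batch pass through untouched.
        if not row.is_current or row.key not in incoming:
            result.append(row)
            continue
        matched.add(row.key)
        new_attrs = tuple(sorted(incoming[row.key].items()))
        if new_attrs == row.attrs:
            result.append(row)  # no change, keep the current version
        else:
            # Close the old version, then open a new current version.
            result.append(replace(row, valid_to=as_of, is_current=False))
            result.append(DimRow(row.key, new_attrs, as_of))
    # Keys with no current row in the dimension become new current rows.
    for key, attrs in incoming.items():
        if key not in matched:
            result.append(DimRow(key, tuple(sorted(attrs.items())), as_of))
    return result


dim = apply_scd2([], {"c1": {"city": "Oslo"}}, date(2024, 1, 1))
dim = apply_scd2(
    dim,
    {"c1": {"city": "Bergen"}, "c2": {"city": "Rome"}},
    date(2024, 6, 1),
)
for row in dim:
    print(row)
```

After the second call the dimension holds three rows: the closed `c1` version, the new current `c1` version, and a current `c2` row. Warehouse implementations typically express the same close-and-insert step as a `MERGE` statement or a dbt snapshot rather than a Python loop.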