Skip to content

drealchux/analytics_engine_sn

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Airbnb Analytics Engine — dbt + Snowflake

A multi-layer dbt project that models Airbnb booking, listing, and host data on Snowflake. The pipeline moves raw data through bronze (ingestion), silver (cleaning), and gold (analytics-ready) layers, producing both a denormalized One Big Table (OBT) and a star-schema fact table backed by SCD2 dimension snapshots.

Pipeline overview: S3 to Snowflake bronze, silver, gold with dbt snapshots


Table of Contents


Architecture Overview

Snowflake (AIRBNB database)
│
├── staging schema          ← raw source tables (pre-existing, not managed by dbt)
│   ├── listings
│   ├── hosts
│   └── bookings
│
├── BRONZE schema           ← incremental ingestion, no transformation
│   ├── bronze_listings
│   ├── bronze_hosts
│   └── bronze_bookings
│
├── SILVER schema           ← cleaned, typed, business-logic enriched
│   ├── silver_listings
│   ├── silver_hosts
│   └── silver_bookings
│
└── GOLD schema             ← analytics-ready outputs
    ├── obt                 ← One Big Table (denormalized, all entities joined)
    ├── fact                ← Fact table (joins obt with SCD2 dimensions)
    ├── dim_listings        ← SCD2 snapshot (created by dbt snapshot)
    ├── dim_hosts           ← SCD2 snapshot (created by dbt snapshot)
    └── dim_bookings        ← SCD2 snapshot (created by dbt snapshot)

Modeling approach

The gold layer is a hybrid design:

  • OBT-centric path: silver_* → obt — a single wide table joining all three entities, suitable for flat BI tool consumption (Tableau, Looker, Power BI).
  • Star schema path: silver_* → snapshots (dim_*) → fact — a proper dimensional model with SCD2 history on listings and hosts, suitable for time-accurate historical analysis.

Both outputs coexist. Use obt for current-state reporting and fact when you need point-in-time accuracy (e.g. what was the listing price at the time of booking).


Data Sources

Defined in models/sources/sources.yml.

Source name Database Schema Tables
staging AIRBNB staging listings, hosts, bookings

These tables are owned by an upstream process and are not managed by this dbt project. dbt reads from them via {{ source('staging', '<table>') }}.


Project Structure

aws_dbt_snowflake_project/
├── analyses/
│   ├── exploration.sql         # Ad-hoc OBT query (not compiled into runs)
│   ├── if_else.sql             # Jinja control flow reference
│   └── loop.sql                # Jinja loop reference
├── macros/
│   ├── generate_schema_name.sql  # Overrides dbt default schema naming
│   ├── multiply.sql              # Rounds the product of two columns
│   ├── tag.sql                   # Bucketing macro (low / medium / high)
│   └── trimmer.sql               # Trim + uppercase utility
├── models/
│   ├── sources/
│   │   └── sources.yml
│   ├── bronze/
│   │   ├── bronze_bookings.sql
│   │   ├── bronze_hosts.sql
│   │   └── bronze_listings.sql
│   ├── silver/
│   │   ├── silver_bookings.sql
│   │   ├── silver_hosts.sql
│   │   └── silver_listings.sql
│   └── gold/
│       ├── obt.sql
│       ├── fact.sql
│       └── ephemeral/
│           ├── bookings.sql
│           ├── hosts.sql
│           └── listings.sql
├── snapshots/
│   ├── dim_bookings.yml
│   ├── dim_hosts.yml
│   └── dim_listings.yml
├── tests/
│   └── source_tests.sql        # Warns when booking_amount < 200
├── dbt_project.yml
└── profiles.yml

Layer Reference

Bronze

Schema: AIRBNB.BRONZE | Materialization: incremental

Thin ingestion layer. Each model is a SELECT * from its corresponding staging source with an incremental filter on CREATED_AT. No business logic is applied here — the purpose is to land raw data into the warehouse under dbt management so that upstream source changes are isolated.

Model Source table Incremental key
bronze_bookings staging.bookings CREATED_AT
bronze_hosts staging.hosts CREATED_AT
bronze_listings staging.listings CREATED_AT

Silver

Schema: AIRBNB.SILVER | Materialization: incremental (upsert via unique_key)

Cleaning and enrichment layer. Each model reads from its bronze counterpart via ref() and applies column selection, renaming, type casting, and derived fields.

silver_bookings

Column Notes
BOOKING_ID Unique key
LISTING_ID FK to listings
BOOKING_DATE
TOTAL_AMOUNT round(NIGHTS_BOOKED * BOOKING_AMOUNT, 2) via multiply macro
SERVICE_FEE
CLEANING_FEE
BOOKING_STATUS
CREATED_AT

silver_hosts

Column Notes
HOST_ID Unique key
HOST_NAME Spaces replaced with underscores
HOST_SINCE
IS_SUPERHOST
RESPONSE_RATE Raw numeric rate
RESPONSE_RATE_QUALITY Derived: VERY GOOD / GOOD / FAIR / POOR
CREATED_AT

silver_listings

Column Notes
LISTING_ID Unique key
HOST_ID FK to hosts
PROPERTY_TYPE
ROOM_TYPE
CITY
COUNTRY
ACCOMMODATES
BEDROOMS
BATHROOMS
PRICE_PER_NIGHT
PRICE_PER_NIGHT_TAG Derived: low / medium / high via tag macro
CREATED_AT

Gold

Schema: AIRBNB.GOLD | Materialization: table

obt — One Big Table

A fully denormalized join of all three silver entities at booking grain. One row per booking, enriched with all listing and host attributes. Built with a Jinja loop over a configuration dict for dynamic FROM / JOIN clause generation.

Grain: one row per BOOKING_ID

Dependencies: silver_bookings → joined to silver_listings → joined to silver_hosts

Use when: you need a flat, current-state dataset for BI tools or ad-hoc analysis.

fact — Fact Table

Joins obt with the SCD2 dimension tables (dim_listings, dim_hosts). Structured for time-accurate reporting — dimension attributes reflect their state at a point in time rather than their current state.

Grain: one row per BOOKING_ID

Dependencies: obt, dim_listings (snapshot), dim_hosts (snapshot)

Use when: you need historical accuracy, e.g. the listing price or host superhost status at the time of booking.

fact requires the dimension snapshots to exist in Snowflake. Run dbt snapshot before dbt run --select fact. See Running the Project.


Snapshots

Schema: AIRBNB.GOLD | Strategy: timestamp SCD2

Snapshots capture slowly changing dimension history on listings, hosts, and bookings. Each snapshot reads from its corresponding silver table and writes rows with DBT_VALID_FROM / DBT_VALID_TO columns. DBT_VALID_TO is set to 9999-12-31 for currently active records.

Snapshot Source Unique key updated_at
dim_listings silver_listings LISTING_ID CREATED_AT
dim_hosts silver_hosts HOST_ID CREATED_AT
dim_bookings silver_bookings BOOKING_ID CREATED_AT

Snapshots are not built by dbt run. They must be run explicitly:

dbt snapshot

Ephemeral Models

Located in models/gold/ephemeral/. These compile to CTEs — they have no physical existence in the database and are inlined into any model that references them via ref().

Model Selects from Purpose
bookings obt Booking-scoped subset of the OBT
hosts obt Host-scoped subset of the OBT
listings obt Listing-scoped subset of the OBT

Ephemeral models cannot be snapshotted, queried directly in Snowflake, or used as snapshot sources.


Macros

generate_schema_name

Overrides dbt's default schema naming convention. By default, dbt prefixes custom schemas with the target schema name (e.g. dev_bronze). This macro strips the prefix so schemas resolve to their bare names (bronze, silver, gold) across all environments.

{{ generate_schema_name(custom_schema_name, node) }}
-- returns: custom_schema_name if set, else target.schema

Because all environments resolve to the same schema names, running dbt in a dev target writes to the same BRONZE/SILVER/GOLD schemas as production. Add target.name logic to the macro if dev/prod isolation is needed.

multiply

Multiplies two column expressions and rounds the result to a given decimal precision.

{{ multiply('NIGHTS_BOOKED', 'BOOKING_AMOUNT', 2) }}
-- compiles to: round(NIGHTS_BOOKED * BOOKING_AMOUNT, 2)

tag

Buckets a numeric column expression into low, medium, or high string labels.

{{ tag('CAST(PRICE_PER_NIGHT AS INT)') }}
-- compiles to:
--   CASE
--     WHEN <expr> < 100 THEN 'low'
--     WHEN <expr> < 200 THEN 'medium'
--     ELSE 'high'
--   END

trimmer

Trims whitespace and uppercases a column value.

{{ trimmer('host_name') }}
-- compiles to: host_name trimmed and uppercased

Tests

File: tests/source_tests.sql

A singular test that returns rows where BOOKING_AMOUNT < 200 from the staging bookings source. Configured with severity: warn — the run will not fail, but dbt will log a warning for any matching rows.

dbt test

Analyses

Ad-hoc SQL files in analyses/ are compiled by dbt but never executed as part of a run. Useful for exploratory queries and Jinja reference examples.

File Purpose
exploration.sql Full OBT select — useful for sanity-checking the gold layer
if_else.sql Jinja if/else control flow reference
loop.sql Jinja for loop reference

Compile an analysis without running it:

dbt compile --select exploration
# output written to: target/compiled/.../analyses/exploration.sql

Prerequisites

  • Python 3.8+
  • dbt-snowflake 1.11+
  • A Snowflake account with:
    • Database: AIRBNB
    • Schema staging containing listings, hosts, and bookings tables
    • A warehouse and role with READ on staging and CREATE TABLE on BRONZE, SILVER, GOLD

Setup

1. Clone the repository

git clone <repo-url>
cd aws_dbt_snowflake_project

2. Create a virtual environment and install dbt

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install dbt-snowflake

3. Configure your connection

Edit profiles.yml with your Snowflake credentials:

aws_dbt_snowflake_project:
  outputs:
    dev:
      type: snowflake
      account: <your-account-identifier>
      user: <your-username>
      password: <your-password>
      role: <your-role>
      database: AIRBNB
      warehouse: <your-warehouse>
      schema: dbt_schema
      threads: 1
  target: dev

Use an environment variable for the password rather than a plaintext value:

password: "{{ env_var('DBT_SNOWFLAKE_PASSWORD') }}"

Then set it in your shell:

export DBT_SNOWFLAKE_PASSWORD=your_password   # Windows: $env:DBT_SNOWFLAKE_PASSWORD="your_password"

4. Verify the connection

dbt debug

Running the Project

First-time full build

# Step 1 — build all models (bronze → silver → gold)
dbt run

# Step 2 — build SCD2 dimension snapshots (creates DIM_LISTINGS, DIM_HOSTS, DIM_BOOKINGS)
dbt snapshot

# Step 3 — rebuild fact now that the dimension tables exist
dbt run --select fact

# Step 4 — run tests
dbt test

Day-to-day incremental refresh

# Incrementally refresh bronze and silver, rebuild gold tables
dbt run

# Update snapshots with any changed dimension records
dbt snapshot

Selective builds

# Build a single model and all upstream dependencies
dbt run --select +obt

# Build only silver models
dbt run --select silver.*

# Build only gold models
dbt run --select gold.*

# Compile to inspect generated SQL without executing
dbt compile --select fact
# → target/compiled/aws_dbt_snowflake_project/models/gold/fact.sql

Development Guidelines

  • Always use ref() for model-to-model references. Never hardcode DATABASE.SCHEMA.TABLE paths. ref() wires the dependency graph, enforces build order, and respects generate_schema_name.
  • Always use source() for staging table references. Defined in models/sources/sources.yml.
  • Snapshots require dbt snapshot. They are not built by dbt run. After any schema change to a silver model that feeds a snapshot, run dbt snapshot to propagate.
  • Ephemeral models cannot be snapshotted. They have no database existence. If you need a snapshot source, it must be a table or view materialization.
  • generate_schema_name strips environment prefixes. All targets write to bronze, silver, gold. Update the macro if you need isolated dev schemas.
  • Bronze models override project-level materialization. dbt_project.yml declares bronze as table, but each model overrides this with incremental. The model-level config takes precedence.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages