
Building a Data Lakehouse with DuckDB and dbt

15 min read
Data Engineering, dbt, DuckDB, Tutorial

In this tutorial, I'll walk you through building a modern data lakehouse from scratch using DuckDB and dbt. We'll implement the medallion architecture (Bronze → Silver → Gold) to process NYC Taxi Trip Records, a real-world dataset with millions of trips.

By the end, you'll have a production-ready data pipeline with incremental processing, data quality testing, and CI/CD automation. The best part? Zero infrastructure required. DuckDB runs as a single file on your machine.

Why DuckDB + dbt?

Traditional data warehouses like Snowflake or BigQuery are powerful but come with infrastructure overhead and costs. For learning, prototyping, or smaller-scale analytics, DuckDB offers a compelling alternative:

  • Zero infrastructure: Single-file database, no server needed
  • Blazing fast: Columnar storage optimized for analytics
  • Native Parquet support: Read directly from raw files
  • Full SQL support: Window functions, CTEs, and aggregations all work

Combined with dbt (data build tool), you get version-controlled transformations, documentation, and testing: all the core practices of modern data engineering.

Project Setup

Prerequisites

  • Python 3.11+ (the version the CI pipeline uses)
  • UV (recommended) or pip for dependency management
  • git
  • Roughly 10 GB of free disk space for the full dataset

Clone and Install

bash
# Clone the repository
git clone https://github.com/AlharbiAbdullah/data-lakehouse
cd data-lakehouse

# Install dependencies with UV
uv sync

# Or with pip
pip install -r requirements.txt

Project Structure

The project follows a clean separation between data ingestion (Python scripts) and transformation (dbt):

text
data-lakehouse/
|-- data/
|   |-- raw/                    # Parquet files (gitignored)
|   +-- warehouse/              # DuckDB database
|
|-- dbt_project/
|   |-- models/
|   |   |-- staging/            # Bronze layer
|   |   |-- intermediate/       # Silver layer
|   |   +-- marts/              # Gold layer
|   |-- macros/                 # Reusable SQL
|   +-- seeds/                  # Reference data
|
|-- scripts/
|   |-- download_data.py        # Fetch NYC Taxi data
|   |-- setup_warehouse.py      # Initialize DuckDB
|   +-- run_pipeline.py         # Orchestrate everything
|
+-- .github/workflows/
    +-- dbt_ci.yml              # CI/CD pipeline

Architecture: The Medallion Pattern

The medallion architecture organizes data into three layers, each with increasing quality and structure:

text
+==============================================================================+
|                            MEDALLION LAYERS                                  |
+==============================================================================+
|                                                                              |
|    +--------------+       +------------------+       +--------------+        |
|    |    BRONZE    |       |      SILVER      |       |     GOLD     |        |
|    |   (staging)  | ----> |  (intermediate)  | ----> |    (marts)   |        |
|    +--------------+       +------------------+       +--------------+        |
|                                                                              |
|     Raw ingestion          Cleaned &                  Analytics-ready        |
|     Type casting           Validated                  Aggregated             |
|     Hash-based IDs         Zone enrichment            Incremental            |
|                            Calculated fields          Daily metrics          |
|                                                                              |
+==============================================================================+

Bronze (Staging): Raw data with minimal transformation. We cast types, standardize column names, and generate deterministic IDs.

Silver (Intermediate): Cleaned and enriched data. We join with reference tables, calculate derived fields, and filter out invalid records.

Gold (Marts): Analytics-ready aggregations. Pre-computed metrics optimized for BI tools and dashboards.

Bronze Layer: Raw Ingestion

Data Sources

We're using NYC Taxi & Limousine Commission (TLC) trip data for January–March 2024:

  • Yellow Taxi: ~3-4M trips/month (Manhattan-centric)
  • Green Taxi: ~300-500K trips/month (outer boroughs)
  • FHV: ~5-6M trips/month (rideshare services)

The download_data.py script fetches these files asynchronously:

bash
# Download all data
uv run python scripts/download_data.py

# Set up DuckDB warehouse
uv run python scripts/setup_warehouse.py

Staging Models

Each trip type gets its own staging model. Here's the yellow taxi staging model:

stg_yellow_trips.sql
{{
    config(
        materialized='view'
    )
}}

with source as (
    select * from {{ source('raw', 'yellow_tripdata') }}
),

with_base_hash as (
    select
        -- Deterministic ID from key fields
        {{ generate_trip_id(
            'tpep_pickup_datetime',
            'tpep_dropoff_datetime',
            'PULocationID',
            'DOLocationID',
            'fare_amount',
            'trip_distance',
            'passenger_count'
        ) }} as base_hash,

        -- Trip type identifier
        'yellow' as trip_type,

        -- Timestamps
        tpep_pickup_datetime as pickup_datetime,
        tpep_dropoff_datetime as dropoff_datetime,

        -- Locations
        PULocationID as pickup_zone_id,
        DOLocationID as dropoff_zone_id,

        -- Trip details
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as double) as trip_distance,

        -- Fare components
        cast(fare_amount as double) as fare_amount,
        cast(tip_amount as double) as tip_amount,
        cast(total_amount as double) as total_amount,

        -- Metadata
        current_timestamp as loaded_at

    from source
    where tpep_pickup_datetime is not null
      and tpep_dropoff_datetime is not null
),

staged as (
    select
        -- Unique trip ID: base_hash + row number for duplicates
        base_hash || '_' || cast(row_number() over (
            partition by base_hash order by pickup_datetime
        ) as varchar) as trip_id,
        * exclude (base_hash)
    from with_base_hash
)

select * from staged

Key Pattern: Deterministic IDs

One challenge with trip data is generating unique IDs that are reproducible across runs. We use MD5 hashing of key fields:

macros/generate_trip_id.sql
{% macro generate_trip_id(pickup_datetime, dropoff_datetime, pickup_zone,
                          dropoff_zone, extra_field_1, extra_field_2, extra_field_3) %}
    md5(
        coalesce(cast({{ pickup_datetime }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ dropoff_datetime }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ pickup_zone }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ dropoff_zone }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ extra_field_1 }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ extra_field_2 }} as varchar), '') ||
        '|' ||
        coalesce(cast({{ extra_field_3 }} as varchar), '')
    )
{% endmacro %}

This ensures the same trip always gets the same ID, enabling incremental processing and data reconciliation.
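For intuition, the macro's hashing logic can be mirrored in plain Python. This is an illustrative sketch, not part of the project; the function name and the null-to-empty-string handling follow the macro above:

```python
import hashlib

def generate_trip_id(*fields) -> str:
    """Mirror of the dbt macro: MD5 over pipe-joined, null-safe fields."""
    joined = "|".join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The same inputs always produce the same ID...
a = generate_trip_id("2024-01-01 08:00", "2024-01-01 08:15", 161, 237, 12.5, 1.8, 1)
b = generate_trip_id("2024-01-01 08:00", "2024-01-01 08:15", 161, 237, 12.5, 1.8, 1)
assert a == b

# ...and changing any one field changes the ID.
assert a != generate_trip_id("2024-01-01 08:00", "2024-01-01 08:15", 161, 237, 12.5, 1.8, 2)
```

The pipe separator matters: without it, the field values ("ab", "c") and ("a", "bc") would hash identically.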

Silver Layer: Cleaning & Enrichment

The silver layer transforms raw data into a clean, analysis-ready format:

Unioning Trip Types

First, we union all three trip types into a single dataset with int_trips_unioned.sql.

Zone Enrichment

Then we join with reference data and calculate derived fields:

int_trips_enriched.sql
{{
    config(
        materialized='view'
    )
}}

with trips as (
    select * from {{ ref('int_trips_unioned') }}
),

zones as (
    select * from {{ ref('stg_taxi_zones') }}
),

enriched as (
    select
        t.trip_id,
        t.trip_type,
        t.pickup_datetime,
        t.dropoff_datetime,
        t.pickup_zone_id,
        t.dropoff_zone_id,
        t.passenger_count,
        t.trip_distance,
        t.fare_amount,
        t.tip_amount,
        t.total_amount,
        t.loaded_at,

        -- Pickup zone info
        pz.zone_name as pickup_zone_name,
        pz.borough as pickup_borough,

        -- Dropoff zone info
        dz.zone_name as dropoff_zone_name,
        dz.borough as dropoff_borough,

        -- Calculated fields
        datediff('minute', t.pickup_datetime, t.dropoff_datetime)
            as trip_duration_minutes,

        -- Average speed (mph)
        case
            when t.trip_distance > 0
                 and datediff('minute', t.pickup_datetime, t.dropoff_datetime) > 0
            then t.trip_distance / (datediff('minute', t.pickup_datetime,
                                             t.dropoff_datetime) / 60.0)
            else null
        end as avg_speed_mph,

        -- Tip percentage
        case
            when t.fare_amount > 0 and t.tip_amount is not null
            then (t.tip_amount / t.fare_amount) * 100
            else null
        end as tip_percentage

    from trips t
    left join zones pz on t.pickup_zone_id = pz.zone_id
    left join zones dz on t.dropoff_zone_id = dz.zone_id
)

select * from enriched
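The derived-field math in the model above can be sketched in plain Python. This is illustrative only; the function name is mine, and Python's fractional minutes differ slightly from SQL's datediff, which counts whole-minute boundaries:

```python
from datetime import datetime

def derive_trip_fields(pickup: datetime, dropoff: datetime,
                       distance_miles: float, fare: float, tip: float) -> dict:
    """Sketch of the enrichment model's calculated fields."""
    minutes = (dropoff - pickup).total_seconds() / 60.0
    # Average speed: miles divided by hours, guarded against zero/negative inputs
    avg_speed_mph = (
        distance_miles / (minutes / 60.0)
        if distance_miles > 0 and minutes > 0 else None
    )
    # Tip percentage: tip relative to the base fare, guarded against zero fares
    tip_percentage = (tip / fare) * 100 if fare > 0 and tip is not None else None
    return {
        "trip_duration_minutes": minutes,
        "avg_speed_mph": avg_speed_mph,
        "tip_percentage": tip_percentage,
    }
```

The guards mirror the SQL CASE expressions: a zero-distance or zero-duration trip yields a NULL speed rather than a division error.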

Data Validation

The int_trips_validated.sql model filters out outliers:

  • Trip duration between 1–180 minutes
  • Positive fares (for yellow/green)
  • Valid zone IDs
  • Non-negative distances
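Expressed as a single predicate, the filters look roughly like this (a sketch; field names are assumed from the enriched model, and thresholds come from the list above):

```python
def is_valid_trip(trip: dict) -> bool:
    """Sketch of the int_trips_validated filters."""
    # Trip duration between 1 and 180 minutes
    duration = trip.get("trip_duration_minutes")
    if duration is None or not (1 <= duration <= 180):
        return False
    # Positive fares, but only for yellow/green (per the rules above)
    if trip.get("trip_type") in ("yellow", "green"):
        fare = trip.get("fare_amount")
        if fare is None or fare <= 0:
            return False
    # Valid zone IDs on both ends
    if trip.get("pickup_zone_id") is None or trip.get("dropoff_zone_id") is None:
        return False
    # Non-negative distances
    distance = trip.get("trip_distance")
    if distance is None or distance < 0:
        return False
    return True
```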

Gold Layer: Analytics-Ready

The gold layer contains pre-aggregated metrics for fast querying. The key pattern here is incremental processing.

fct_daily_trips.sql
{{
    config(
        materialized='incremental',
        unique_key=['trip_date', 'pickup_borough', 'trip_type'],
        incremental_strategy='merge'
    )
}}

with trips as (
    select * from {{ ref('int_trips_validated') }}

    {% if is_incremental() %}
    -- 3-day lookback for late-arriving data
    where pickup_datetime >= (
        select dateadd('day', -3, max(trip_date))
        from {{ this }}
    )
    {% endif %}
),

daily_aggregates as (
    select
        cast(date_trunc('day', pickup_datetime) as date) as trip_date,
        pickup_borough,
        trip_type,

        -- Trip counts
        count(*) as total_trips,
        sum(coalesce(passenger_count, 0)) as total_passengers,

        -- Distance metrics
        sum(coalesce(trip_distance, 0)) as total_distance_miles,
        avg(trip_distance) as avg_distance_miles,

        -- Fare metrics
        sum(coalesce(fare_amount, 0)) as total_fare,
        avg(fare_amount) as avg_fare,
        sum(coalesce(tip_amount, 0)) as total_tips,
        avg(tip_percentage) as avg_tip_percentage,

        -- Duration metrics
        avg(trip_duration_minutes) as avg_duration_minutes,

        -- Metadata
        current_timestamp as updated_at

    from trips
    where pickup_borough is not null
    group by 1, 2, 3
)

select * from daily_aggregates

Key Pattern: 3-Day Lookback

Notice the is_incremental() block. On subsequent runs, we only reprocess the last 3 days of data. This handles late-arriving records while keeping processing time minimal.

Without this pattern, you'd either miss late data or reprocess everything. Both are costly.
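The cutoff arithmetic inside the is_incremental() block is simple; here it is made explicit with a hypothetical helper:

```python
from datetime import date, timedelta

def incremental_cutoff(max_loaded_date: date, lookback_days: int = 3) -> date:
    """Cutoff used by the incremental filter: reprocess every trip on or
    after max(trip_date) - lookback_days; the merge strategy dedupes the
    overlap via the model's unique_key."""
    return max_loaded_date - timedelta(days=lookback_days)

# If the warehouse already holds data through March 31, only trips
# from March 28 onward are scanned on the next run.
assert incremental_cutoff(date(2024, 3, 31)) == date(2024, 3, 28)
```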

Data Quality Testing

dbt's testing framework validates data at each layer. Tests are defined in YAML schema files:

_staging.yml
version: 2

models:
  - name: stg_yellow_trips
    columns:
      - name: trip_id
        tests:
          - unique
          - not_null
      - name: pickup_datetime
        tests:
          - not_null
      - name: dropoff_datetime
        tests:
          - not_null
      - name: trip_type
        tests:
          - accepted_values:
              values: ['yellow']

Run tests with:

bash
cd dbt_project
uv run dbt test

Tests catch issues like duplicate IDs, null timestamps, invalid trip types, and out-of-range values before they propagate downstream.
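For intuition, the `unique` and `not_null` checks boil down to the following (hypothetical helpers, not dbt internals; dbt compiles each test to SQL and fails when any rows come back):

```python
from collections import Counter

def failed_unique(rows: list[dict], column: str) -> list:
    """Values that appear more than once: what dbt's `unique` test flags."""
    counts = Counter(r[column] for r in rows)
    return [value for value, n in counts.items() if n > 1]

def failed_not_null(rows: list[dict], column: str) -> list[dict]:
    """Rows where the column is NULL: what dbt's `not_null` test flags."""
    return [r for r in rows if r[column] is None]
```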

CI/CD with GitHub Actions

Every push to main and every pull request triggers the full pipeline, ensuring data quality is validated automatically:

.github/workflows/dbt_ci.yml
name: dbt CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install UV
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install dependencies
        run: uv sync

      - name: Download January data (CI subset)
        run: uv run python scripts/download_data.py --months 1

      - name: Setup warehouse
        run: uv run python scripts/setup_warehouse.py

      - name: Run dbt
        run: |
          cd dbt_project
          uv run dbt deps
          uv run dbt seed
          uv run dbt run
          uv run dbt test

Running the Pipeline

Execute the complete pipeline with a single command:

bash
# Full pipeline
uv run python scripts/run_pipeline.py

# Or step by step:
uv run python scripts/download_data.py   # ~10GB download
uv run python scripts/setup_warehouse.py

cd dbt_project
uv run dbt deps     # Install dbt packages
uv run dbt seed     # Load reference data
uv run dbt run      # Execute transformations
uv run dbt test     # Validate data quality

Sample Queries

Once complete, query the gold layer:

sql
-- Daily trip summary by borough
SELECT
    trip_date,
    pickup_borough,
    trip_type,
    total_trips,
    avg_fare,
    avg_duration_minutes
FROM marts.fct_daily_trips
WHERE trip_date >= '2024-01-01'
ORDER BY trip_date, total_trips DESC;

-- Busiest zones
SELECT
    z.zone_name,
    z.borough,
    SUM(m.pickups) as total_pickups
FROM marts.fct_zone_metrics m
JOIN marts.dim_zones z ON m.zone_id = z.zone_id
GROUP BY z.zone_name, z.borough
ORDER BY total_pickups DESC
LIMIT 10;

Conclusion

In this tutorial, we built a complete data lakehouse featuring:

  • Medallion architecture: Bronze → Silver → Gold layers with clear data contracts
  • Deterministic IDs: MD5 hashing for reproducible processing
  • Incremental processing: 3-day lookback for efficient updates
  • Data quality testing: Automated validation at every layer
  • CI/CD: GitHub Actions for continuous validation

The beauty of DuckDB + dbt is that this entire pipeline runs locally with zero infrastructure. You can experiment, iterate, and learn modern data engineering patterns without cloud costs.

Next Steps

  • Connect a BI tool like Evidence or Metabase to the gold layer
  • Add more data sources (weather, events) for richer analysis
  • Deploy to a cloud warehouse when scale demands it

Check out the full source code on GitHub.