Building a Data Lakehouse with DuckDB and dbt
In this tutorial, I'll walk you through building a modern data lakehouse from scratch using DuckDB and dbt. We'll implement the medallion architecture (Bronze → Silver → Gold) to process NYC Taxi Trip Records, a real-world dataset with millions of trips.
By the end, you'll have a production-ready data pipeline with incremental processing, data quality testing, and CI/CD automation. The best part? Zero infrastructure required. DuckDB runs in-process, and the entire warehouse lives in a single file on your machine.
Why DuckDB + dbt?
Traditional data warehouses like Snowflake or BigQuery are powerful but come with infrastructure overhead and costs. For learning, prototyping, or smaller-scale analytics, DuckDB offers a compelling alternative:
- Zero infrastructure: Single-file database, no server needed
- Blazing fast: Columnar storage optimized for analytics
- Native Parquet support: Read directly from raw files
- Full SQL support: Window functions, CTEs, and complex aggregations all work
Combined with dbt (data build tool), you get version-controlled transformations, documentation, and testing: the core practices of modern data engineering.
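To make the zero-infrastructure point concrete, DuckDB can query raw Parquet files in place, with no load step. A quick illustration (the path follows this project's layout; the exact file name depends on which months you download):
-- Query a raw Parquet file directly; no CREATE TABLE or COPY needed
select
    passenger_count,
    count(*) as trips,
    round(avg(trip_distance), 2) as avg_distance_miles
from read_parquet('data/raw/yellow_tripdata_2024-01.parquet')
group by passenger_count
order by trips desc;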
Project Setup
Prerequisites
- Python 3.11+
- UV package manager (or pip)
- ~10GB disk space for raw data
Clone and Install
# Clone the repository
git clone https://github.com/AlharbiAbdullah/data-lakehouse
cd data-lakehouse
# Install dependencies with UV
uv sync
# Or with pip
pip install -r requirements.txt
Project Structure
The project follows a clean separation between data ingestion (Python scripts) and transformation (dbt):
data-lakehouse/
|-- data/
| |-- raw/ # Parquet files (gitignored)
| +-- warehouse/ # DuckDB database
|
|-- dbt_project/
| |-- models/
| | |-- staging/ # Bronze layer
| | |-- intermediate/ # Silver layer
| | +-- marts/ # Gold layer
| |-- macros/ # Reusable SQL
| +-- seeds/ # Reference data
|
|-- scripts/
| |-- download_data.py # Fetch NYC Taxi data
| |-- setup_warehouse.py # Initialize DuckDB
| +-- run_pipeline.py # Orchestrate everything
|
+-- .github/workflows/
    +-- dbt_ci.yml           # CI/CD pipeline
Architecture: The Medallion Pattern
The medallion architecture organizes data into three layers, each with increasing quality and structure:
+==============================================================================+
| MEDALLION LAYERS |
+==============================================================================+
| |
| +--------------+ +------------------+ +--------------+ |
| | BRONZE | | SILVER | | GOLD | |
| | (staging) | ----> | (intermediate) | ----> | (marts) | |
| +--------------+ +------------------+ +--------------+ |
| |
| Raw ingestion Cleaned & Analytics-ready |
| Type casting Validated Aggregated |
| Hash-based IDs Zone enrichment Incremental |
| Calculated fields Daily metrics |
| |
+==============================================================================+
Bronze (Staging): Raw data with minimal transformation. We cast types, standardize column names, and generate deterministic IDs.
Silver (Intermediate): Cleaned and enriched data. We join with reference tables, calculate derived fields, and filter out invalid records.
Gold (Marts): Analytics-ready aggregations. Pre-computed metrics optimized for BI tools and dashboards.
Bronze Layer: Raw Ingestion
Data Sources
We're using NYC Taxi & Limousine Commission (TLC) trip data for January–March 2024:
- Yellow Taxi: ~3-4M trips/month (Manhattan-centric)
- Green Taxi: ~300-500K trips/month (outer boroughs)
- FHV: ~5-6M trips/month (rideshare services)
The download_data.py script fetches these files asynchronously:
# Download all data
uv run python scripts/download_data.py
# Set up DuckDB warehouse
uv run python scripts/setup_warehouse.py
Staging Models
Each trip type gets its own staging model. Here's the yellow taxi staging model:
{{
config(
materialized='view'
)
}}
with source as (
select * from {{ source('raw', 'yellow_tripdata') }}
),
with_base_hash as (
select
-- Deterministic ID from key fields
{{ generate_trip_id(
'tpep_pickup_datetime',
'tpep_dropoff_datetime',
'PULocationID',
'DOLocationID',
'fare_amount',
'trip_distance',
'passenger_count'
) }} as base_hash,
-- Trip type identifier
'yellow' as trip_type,
-- Timestamps
tpep_pickup_datetime as pickup_datetime,
tpep_dropoff_datetime as dropoff_datetime,
-- Locations
PULocationID as pickup_zone_id,
DOLocationID as dropoff_zone_id,
-- Trip details
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as double) as trip_distance,
-- Fare components
cast(fare_amount as double) as fare_amount,
cast(tip_amount as double) as tip_amount,
cast(total_amount as double) as total_amount,
-- Metadata
current_timestamp as loaded_at
from source
where tpep_pickup_datetime is not null
and tpep_dropoff_datetime is not null
),
staged as (
select
-- Unique trip ID: base_hash + row number for duplicates
base_hash || '_' || cast(row_number() over (
partition by base_hash order by pickup_datetime
) as varchar) as trip_id,
* exclude (base_hash)
from with_base_hash
)
select * from staged
Key Pattern: Deterministic IDs
One challenge with trip data is generating unique IDs that are reproducible across runs. We use MD5 hashing of key fields:
{% macro generate_trip_id(pickup_datetime, dropoff_datetime, pickup_zone,
dropoff_zone, extra_field_1, extra_field_2, extra_field_3) %}
md5(
coalesce(cast({{ pickup_datetime }} as varchar), '') ||
'|' ||
coalesce(cast({{ dropoff_datetime }} as varchar), '') ||
'|' ||
coalesce(cast({{ pickup_zone }} as varchar), '') ||
'|' ||
coalesce(cast({{ dropoff_zone }} as varchar), '') ||
'|' ||
coalesce(cast({{ extra_field_1 }} as varchar), '') ||
'|' ||
coalesce(cast({{ extra_field_2 }} as varchar), '') ||
'|' ||
coalesce(cast({{ extra_field_3 }} as varchar), '')
)
{% endmacro %}
This ensures the same trip always gets the same ID, enabling incremental processing and data reconciliation.
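You can check the determinism directly in DuckDB (the concatenated values below are made-up examples, not real trip data):
-- Hashing the same key twice always yields the same ID
select md5('2024-01-15 08:30:00|2024-01-15 08:45:00|142|236|14.5|2.1|1')
     = md5('2024-01-15 08:30:00|2024-01-15 08:45:00|142|236|14.5|2.1|1');
-- returns true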
Silver Layer: Cleaning & Enrichment
The silver layer transforms raw data into a clean, analysis-ready format:
Unioning Trip Types
First, we union all three trip types into a single dataset with int_trips_unioned.sql.
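The model itself is essentially a stack of union alls. Here's a minimal sketch, assuming the green and FHV staging models are named stg_green_trips and stg_fhv_trips and expose the same columns as stg_yellow_trips:
{{
    config(
        materialized='view'
    )
}}
select * from {{ ref('stg_yellow_trips') }}
union all
select * from {{ ref('stg_green_trips') }}
union all
select * from {{ ref('stg_fhv_trips') }}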
Zone Enrichment
Then we join with reference data and calculate derived fields:
{{
config(
materialized='view'
)
}}
with trips as (
select * from {{ ref('int_trips_unioned') }}
),
zones as (
select * from {{ ref('stg_taxi_zones') }}
),
enriched as (
select
t.trip_id,
t.trip_type,
t.pickup_datetime,
t.dropoff_datetime,
t.pickup_zone_id,
t.dropoff_zone_id,
t.passenger_count,
t.trip_distance,
t.fare_amount,
t.tip_amount,
t.total_amount,
t.loaded_at,
-- Pickup zone info
pz.zone_name as pickup_zone_name,
pz.borough as pickup_borough,
-- Dropoff zone info
dz.zone_name as dropoff_zone_name,
dz.borough as dropoff_borough,
-- Calculated fields
datediff('minute', t.pickup_datetime, t.dropoff_datetime)
as trip_duration_minutes,
-- Average speed (mph)
case
when t.trip_distance > 0
and datediff('minute', t.pickup_datetime, t.dropoff_datetime) > 0
then t.trip_distance / (datediff('minute', t.pickup_datetime,
t.dropoff_datetime) / 60.0)
else null
end as avg_speed_mph,
-- Tip percentage
case
when t.fare_amount > 0 and t.tip_amount is not null
then (t.tip_amount / t.fare_amount) * 100
else null
end as tip_percentage
from trips t
left join zones pz on t.pickup_zone_id = pz.zone_id
left join zones dz on t.dropoff_zone_id = dz.zone_id
)
select * from enriched
Data Validation
The int_trips_validated.sql model filters out outliers (a sketch of the model follows this list):
- Trip duration between 1 and 180 minutes
- Positive fares (for yellow/green trips)
- Valid zone IDs
- Non-negative distances
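Here's a minimal sketch of those filters, assuming the enrichment model above is named int_trips_enriched (the actual model name may differ):
{{
    config(
        materialized='view'
    )
}}
select *
from {{ ref('int_trips_enriched') }}
where trip_duration_minutes between 1 and 180
  and trip_distance >= 0
  and pickup_zone_id is not null
  and dropoff_zone_id is not null
  -- FHV records often lack fare data, so apply the
  -- positive-fare check to yellow and green trips only
  and (trip_type = 'fhv' or fare_amount > 0)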
Gold Layer: Analytics-Ready
The gold layer contains pre-aggregated metrics for fast querying. The key pattern here is incremental processing.
{{
config(
materialized='incremental',
unique_key=['trip_date', 'pickup_borough', 'trip_type'],
incremental_strategy='merge'
)
}}
with trips as (
select * from {{ ref('int_trips_validated') }}
{% if is_incremental() %}
-- 3-day lookback for late-arriving data
where pickup_datetime >= (
select dateadd('day', -3, max(trip_date))
from {{ this }}
)
{% endif %}
),
daily_aggregates as (
select
cast(date_trunc('day', pickup_datetime) as date) as trip_date,
pickup_borough,
trip_type,
-- Trip counts
count(*) as total_trips,
sum(coalesce(passenger_count, 0)) as total_passengers,
-- Distance metrics
sum(coalesce(trip_distance, 0)) as total_distance_miles,
avg(trip_distance) as avg_distance_miles,
-- Fare metrics
sum(coalesce(fare_amount, 0)) as total_fare,
avg(fare_amount) as avg_fare,
sum(coalesce(tip_amount, 0)) as total_tips,
avg(tip_percentage) as avg_tip_percentage,
-- Duration metrics
avg(trip_duration_minutes) as avg_duration_minutes,
-- Metadata
current_timestamp as updated_at
from trips
where pickup_borough is not null
group by 1, 2, 3
)
select * from daily_aggregates
Key Pattern: 3-Day Lookback
Notice the is_incremental() block. On the first run, the model builds from all available history; on subsequent runs, we only reprocess the last 3 days of data. If the latest trip_date in the table is 2024-03-15, for example, the next run touches only trips picked up on or after 2024-03-12. This handles late-arriving records while keeping processing time minimal.
Without this pattern, you'd either miss late data or reprocess everything, and both options are costly.
Data Quality Testing
dbt's testing framework validates data at each layer. Tests are defined in YAML schema files:
version: 2
models:
- name: stg_yellow_trips
columns:
- name: trip_id
tests:
- unique
- not_null
- name: pickup_datetime
tests:
- not_null
- name: dropoff_datetime
tests:
- not_null
- name: trip_type
tests:
- accepted_values:
          values: ['yellow']
Run tests with:
cd dbt_project
uv run dbt test
Tests catch issues like duplicate IDs, null timestamps, invalid trip types, and out-of-range values before they propagate downstream.
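Beyond these generic tests, dbt also supports singular tests: any SQL file placed in the tests/ directory fails the test run if it returns rows. A hypothetical example guarding the silver layer:
-- tests/assert_no_negative_totals.sql (hypothetical)
-- Returns rows (and therefore fails) if any validated
-- trip has a negative total amount
select trip_id, total_amount
from {{ ref('int_trips_validated') }}
where total_amount < 0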
CI/CD with GitHub Actions
Every push to main and every pull request triggers the full pipeline, ensuring data quality is validated automatically:
name: dbt CI
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
dbt-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install UV
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install dependencies
run: uv sync
- name: Download January data (CI subset)
run: uv run python scripts/download_data.py --months 1
- name: Setup warehouse
run: uv run python scripts/setup_warehouse.py
- name: Run dbt
run: |
cd dbt_project
uv run dbt deps
uv run dbt seed
uv run dbt run
uv run dbt test
Running the Pipeline
Execute the complete pipeline with a single command:
# Full pipeline
uv run python scripts/run_pipeline.py
# Or step by step:
uv run python scripts/download_data.py # ~10GB download
uv run python scripts/setup_warehouse.py
cd dbt_project
uv run dbt deps # Install dbt packages
uv run dbt seed # Load reference data
uv run dbt run # Execute transformations
uv run dbt test                        # Validate data quality
Sample Queries
Once complete, query the gold layer:
-- Daily trip summary by borough
SELECT
trip_date,
pickup_borough,
trip_type,
total_trips,
avg_fare,
avg_duration_minutes
FROM marts.fct_daily_trips
WHERE trip_date >= '2024-01-01'
ORDER BY trip_date, total_trips DESC;
-- Busiest zones
SELECT
z.zone_name,
z.borough,
SUM(m.pickups) as total_pickups
FROM marts.fct_zone_metrics m
JOIN marts.dim_zones z ON m.zone_id = z.zone_id
GROUP BY z.zone_name, z.borough
ORDER BY total_pickups DESC
LIMIT 10;
Conclusion
In this tutorial, we built a complete data lakehouse featuring:
- Medallion architecture: Bronze → Silver → Gold layers with clear data contracts
- Deterministic IDs: MD5 hashing for reproducible processing
- Incremental processing: 3-day lookback for efficient updates
- Data quality testing: Automated validation at every layer
- CI/CD: GitHub Actions for continuous validation
The beauty of DuckDB + dbt is that this entire pipeline runs locally with zero infrastructure. You can experiment, iterate, and learn modern data engineering patterns without cloud costs.
Next Steps
- Connect a BI tool like Evidence or Metabase to the gold layer
- Add more data sources (weather, events) for richer analysis
- Deploy to a cloud warehouse when scale demands it
Check out the full source code on GitHub.