Open Data Stack

December 15, 2025

Data Engineering, Kafka, Spark, Airflow, DuckDB

Overview

Open Data Stack is an open-source data engineering demonstration that implements batch and streaming pipelines over real financial market data from the Yahoo Finance API. It showcases production-ready ETL patterns in a dual-path architecture: a batch path (Airflow into DuckDB) and a streaming path (Kafka into Spark), both feeding Superset dashboards.

Architecture

text

+==============================================================================+
|                           OPEN DATA STACK                                    |
+==============================================================================+
|                                                                              |
|    +-------------+                      +-------------+                      |
|    |  yfinance   |                      |  yfinance   |                      |
|    | (Stock API) |                      | (Stock API) |                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           v                                    v                             |
|    +-------------+                      +-------------+                      |
|    |   Airflow   |                      |    Kafka    |                      |
|    |   (Batch)   |                      | (Streaming) |                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           v                                    v                             |
|    +-------------+                      +-------------+                      |
|    |   DuckDB    |                      |    Spark    |                      |
|    | (Warehouse) |                      | (Processing)|                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           +----------------+-------------------+                             |
|                            |                                                 |
|                            v                                                 |
|                     +-------------+                                          |
|                     |  Superset   |                                          |
|                     | (Dashboards)|                                          |
|                     +-------------+                                          |
|                                                                              |
+==============================================================================+
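
On the streaming path in the diagram, quotes travel through Kafka before Spark processes them. Kafka carries opaque bytes, so the producer and the consumer have to agree on a record format. The following is a minimal sketch of one way to do that, assuming JSON-encoded records; the `Tick` class, its fields, and both helper functions are hypothetical and are not taken from the repository.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Tick:
    """One quote for one symbol (hypothetical schema, for illustration only)."""
    symbol: str
    price: float
    volume: int
    ts: str  # ISO-8601 timestamp, e.g. "2025-12-15T14:30:00+00:00"


def encode_tick(tick: Tick) -> bytes:
    """Serialize a tick to UTF-8 JSON bytes, the form a producer would send."""
    return json.dumps(asdict(tick), sort_keys=True).encode("utf-8")


def decode_tick(raw: bytes) -> Tick:
    """Parse bytes read from a topic back into a Tick."""
    return Tick(**json.loads(raw.decode("utf-8")))
```

A client such as kafka-python could take `encode_tick` as its `value_serializer`; which client and schema the project actually uses is not stated here.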

Key Features

Dual-Path Architecture - Supports both batch (data warehouse) and streaming pipelines
Real-Time Data - Integrates with the Yahoo Finance API for live stock data (AAPL, GOOGL, MSFT, AMZN, META)
Complete Observability - Pre-built Superset dashboards for monitoring
Single-Command Deploy - Full stack via Docker Compose
Production Patterns - Comprehensive testing with 73 tests
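
One ETL pattern a test suite like this might exercise is row validation before loading. The sketch below splits incoming rows into accepted and rejected sets; the ticker list comes from the feature list above, while `validate_rows` and the `symbol` / `close` field names are assumptions made for illustration, not the project's actual code.

```python
from __future__ import annotations

from typing import Iterable

# Symbols named in the feature list above.
TRACKED = {"AAPL", "GOOGL", "MSFT", "AMZN", "META"}


def validate_rows(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into (valid, rejected) before they reach the warehouse.

    A row is valid when its symbol is tracked and its close price is a
    positive number. Field names are hypothetical.
    """
    valid: list[dict] = []
    rejected: list[dict] = []
    for row in rows:
        close = row.get("close")
        ok = (
            row.get("symbol") in TRACKED
            and isinstance(close, (int, float))
            and close > 0
        )
        (valid if ok else rejected).append(row)
    return valid, rejected
```

Keeping rejected rows instead of dropping them silently makes it possible to count or log them, which is usually what a data-quality test asserts on.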

Tech Stack

Data Source - yfinance (Yahoo Finance API)
Message Queue - Apache Kafka
Stream Processing - Apache Spark
Orchestration - Apache Airflow
Data Warehouse - DuckDB
Visualization - Apache Superset
Processing - Pandas, PySpark

Quick Start

bash

# Clone and start
git clone https://github.com/AlharbiAbdullah/open_data_stack
cd open_data_stack
docker-compose up --build -d

# Access services
# Airflow:   http://localhost:8080
# Superset:  http://localhost:8088
# Kafka UI:  http://localhost:8082
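
After the containers start, it can take a while for every web UI to respond. A small script can poll the three URLs listed above; this is a sketch using only the Python standard library, and the function names (`http_probe`, `check_services`) and the 3-second timeout are assumptions, not part of the repository.

```python
from __future__ import annotations

import http.client
import urllib.error
import urllib.request
from typing import Callable, Mapping

# Service URLs from the Quick Start section.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Superset": "http://localhost:8088",
    "Kafka UI": "http://localhost:8082",
}


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with a 4xx/5xx status.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False


def check_services(
    services: Mapping[str, str] = SERVICES,
    probe: Callable[[str], bool] = http_probe,
) -> dict[str, bool]:
    """Probe each service and map its name to whether it responded."""
    return {name: probe(url) for name, url in services.items()}
```

Calling `check_services()` returns a name-to-boolean mapping. The `probe` parameter is injectable so the logic can be tested without a running stack.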