Open Data Stack

Tags: Data Engineering, Kafka, Spark, Airflow, DuckDB

Overview

Open Data Stack is an open-source data engineering demonstration that implements batch and streaming pipelines on real financial market data from the Yahoo Finance API. It showcases production-style ETL patterns in a dual-path architecture: a batch path (Airflow into DuckDB) and a streaming path (Kafka into Spark), both feeding Superset dashboards.

Architecture

+==============================================================================+
|                           OPEN DATA STACK                                    |
+==============================================================================+
|                                                                              |
|    +-------------+                      +-------------+                      |
|    |  yfinance   |                      |  yfinance   |                      |
|    | (Stock API) |                      | (Stock API) |                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           v                                    v                             |
|    +-------------+                      +-------------+                      |
|    |   Airflow   |                      |    Kafka    |                      |
|    |   (Batch)   |                      | (Streaming) |                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           v                                    v                             |
|    +-------------+                      +-------------+                      |
|    |   DuckDB    |                      |    Spark    |                      |
|    | (Warehouse) |                      | (Processing)|                      |
|    +------+------+                      +------+------+                      |
|           |                                    |                             |
|           +----------------+-------------------+                             |
|                            |                                                 |
|                            v                                                 |
|                     +-------------+                                          |
|                     |  Superset   |                                          |
|                     | (Dashboards)|                                          |
|                     +-------------+                                          |
|                                                                              |
+==============================================================================+
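As a rough illustration of the diagram above, the two paths can share one record shape and differ only in how records are delivered. The sketch below is an assumption, not the project's code: `StockTick`, `batch_load`, `stream_encode`, and `stream_decode` are hypothetical names, a plain Python list stands in for a DuckDB table, and JSON-encoded bytes stand in for a Kafka message payload.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class StockTick:
    """Hypothetical record shape shared by both paths."""
    symbol: str
    ts: str        # ISO-8601 timestamp
    close: float
    volume: int


def batch_load(ticks: Iterable[StockTick], table: List[dict]) -> int:
    """Batch path: append a whole set of rows at once.

    `table` is a stand-in for a warehouse table; returns the row count loaded.
    """
    rows = [asdict(t) for t in ticks]
    table.extend(rows)
    return len(rows)


def stream_encode(tick: StockTick) -> Tuple[bytes, bytes]:
    """Streaming path: one (key, value) message per tick, keyed by symbol."""
    key = tick.symbol.encode("utf-8")
    value = json.dumps(asdict(tick), sort_keys=True).encode("utf-8")
    return key, value


def stream_decode(value: bytes) -> StockTick:
    """Consumer side: rebuild the record from a message payload."""
    return StockTick(**json.loads(value))
```

Keying messages by symbol is a common choice because it keeps each ticker's events ordered within a single partition; whether this project does so is not stated in the source.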

Key Features

  • Dual-Path Architecture - Supports both batch (data warehouse) and streaming pipelines
  • Real-Time Data - Integrates with the Yahoo Finance API (via yfinance) for live stock data (AAPL, GOOGL, MSFT, AMZN, META)
  • Complete Observability - Pre-built Superset dashboards for monitoring
  • Single-Command Deploy - Full stack via Docker Compose
  • Production Patterns - Backed by a suite of 73 tests
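
To give a feel for the kind of check a pipeline test suite might contain, here is a minimal sketch using the standard `unittest` module. It is not taken from the project's tests: `validate_tick` and the row fields (`symbol`, `close`, `volume`) are hypothetical, and only the ticker list comes from the features above.

```python
import unittest

# Tickers listed in the feature list above.
SYMBOLS = ("AAPL", "GOOGL", "MSFT", "AMZN", "META")


def validate_tick(row: dict) -> bool:
    """Hypothetical data-quality check: known symbol, positive price,
    non-negative integer volume."""
    return (
        row.get("symbol") in SYMBOLS
        and isinstance(row.get("close"), (int, float))
        and row["close"] > 0
        and isinstance(row.get("volume"), int)
        and row["volume"] >= 0
    )


class ValidateTickTest(unittest.TestCase):
    def test_accepts_well_formed_row(self):
        self.assertTrue(validate_tick({"symbol": "AAPL", "close": 1.0, "volume": 0}))

    def test_rejects_unknown_symbol(self):
        self.assertFalse(validate_tick({"symbol": "XXXX", "close": 1.0, "volume": 0}))

    def test_rejects_non_positive_price(self):
        self.assertFalse(validate_tick({"symbol": "MSFT", "close": 0.0, "volume": 10}))

    def test_rejects_missing_fields(self):
        self.assertFalse(validate_tick({"symbol": "META"}))
```

Saved as a module, these cases would run with `python -m unittest`.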

Tech Stack

  • Data Source - yfinance (Yahoo Finance API)
  • Message Queue - Apache Kafka
  • Stream Processing - Apache Spark
  • Orchestration - Apache Airflow
  • Data Warehouse - DuckDB
  • Visualization - Apache Superset
  • Processing - Pandas, PySpark

Quick Start

# Clone and start
git clone https://github.com/AlharbiAbdullah/open_data_stack
cd open_data_stack
docker-compose up --build -d

# Access services
# Airflow:   http://localhost:8080
# Superset:  http://localhost:8088
# Kafka UI:  http://localhost:8082
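
Once the containers are up, a quick way to confirm that the three web UIs listed above are reachable is a small standard-library script. This is a sketch under assumptions: `is_up` and `check_services` are hypothetical helpers, not part of the repository, and "up" here only means the port answers HTTP, not that the service is fully initialized.

```python
import http.client
import urllib.error
import urllib.request

# URLs from the Quick Start section above.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Superset": "http://localhost:8088",
    "Kafka UI": "http://localhost:8082",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (e.g. 401 or 404),
        # so something is listening on that port.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


def check_services(services: dict = SERVICES) -> dict:
    """Map each service name to whether its URL responded."""
    return {name: is_up(url) for name, url in services.items()}
```

Calling `check_services()` returns a dictionary of service name to boolean; services such as Airflow can take a while to start, so a `False` shortly after `docker-compose up` may just mean the container is still booting.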