A distributed real-time transaction processing pipeline
TrStream is a distributed data pipeline designed to simulate and process high-throughput financial transaction streams in real time.
The project explores the architectural patterns behind modern fintech and data engineering systems, including:
- event-driven ingestion
- scalable stream processing
- lakehouse-style storage
- analytical querying
It was developed as a personal project to apply concepts from Designing Data-Intensive Applications (Martin Kleppmann) in a realistic and reproducible environment.
Modern financial platforms ingest millions of events per day and must:
- process data reliably
- retain raw events for auditing
- optimize data layout for analytics
- remain horizontally scalable
TrStream models this workflow locally using open-source technologies, focusing on system design and data flow.
At a high level, the pipeline:
- Generates transaction events from multiple producers (an example event is shown below)
- Streams the events through Kafka
- Stores raw data in a data lake (Parquet on S3-compatible storage)
- Reorganizes and compacts data for analytical workloads
- Exposes a SQL query layer on optimized data
The system is fully containerized and can be scaled horizontally via Docker Compose.
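To make the data concrete, a simulated transaction event might look like the sketch below; the field names and values are illustrative assumptions, not the exact schema used by the producers:

```python
# Illustrative shape of a simulated transaction event.
# Field names and values are assumptions for this sketch, not the exact schema.
example_event = {
    "transaction_id": "c0a8012e-4b6f-4e6a-9d2e-7f3b9a1c5d42",
    "timestamp": "2024-05-01T12:34:56.789+00:00",  # event time, UTC
    "type": "card_payment",                        # later used for partitioning
    "amount": 42.50,
    "currency": "EUR",
    "account_id": "acc_0001",
}
```

Each component plays a distinct role: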
- Producers simulate transaction events with configurable values, distributions and event rates (see the producer sketch after this list)
- Kafka ensures decoupling, buffering and fault tolerance
- Consumers persist immutable raw data in Parquet format
- The partitioner organizes data by date and transaction type
- The compacter merges small files to improve query performance
- The query layer enables SQL queries directly on optimized Parquet data
- A Streamlit SQL editor is provided to conveniently query data
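A minimal producer sketch, assuming the `kafka-python` client, a local broker at `localhost:9092` and a `transactions` topic; the repository may use a different client, topic name or event schema:

```python
# Minimal producer sketch (kafka-python); broker address, topic name and
# event fields are assumptions for illustration.
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

TRANSACTION_TYPES = ["card_payment", "transfer", "withdrawal"]

while True:
    event = {
        "transaction_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": random.choice(TRANSACTION_TYPES),
        "amount": round(random.expovariate(1 / 50), 2),  # skewed amounts
        "currency": "EUR",
    }
    # Keying by transaction type keeps events of one type in one partition.
    producer.send("transactions", key=event["type"].encode(), value=event)
    time.sleep(0.01)  # roughly 100 events/s; the real rate is configurable
```

Taken together, these components give the pipeline the following properties: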
- Real-time ingestion with Kafka-based buffering and backpressure
- Horizontally scalable producers and consumers via Kafka partitioning
- End-to-end observability of message flow through Kafka UI
- Immutable Parquet storage designed for downstream analytics and auditing (see the storage sketch after this list)
- Explicit data lifecycle stages: ingestion, partitioning and compaction
- SQL querying on lake data via DuckDB (Athena-like experience)
- Lightweight SQL editor implemented with Streamlit
- Fully containerized local environment with Docker Compose orchestration and explicit health checks
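On the consumer side, a raw-storage sketch using PyArrow and Boto3 (both listed in the tech stack below); the bucket name, key layout and credentials are placeholders:

```python
# Sketch: persist one consumed batch of events as an immutable Parquet object
# in MinIO. Bucket name, key layout and credentials are placeholders.
import io
import uuid
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO endpoint
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

def flush_batch(events: list[dict]) -> None:
    """Write one batch of events to the raw zone as a single Parquet file."""
    table = pa.Table.from_pylist(events)
    buf = io.BytesIO()
    pq.write_table(table, buf)

    today = datetime.now(timezone.utc).date().isoformat()
    key = f"raw/date={today}/{uuid.uuid4()}.parquet"
    s3.put_object(Bucket="transactions", Key=key, Body=buf.getvalue())
```

The compacter can follow the same pattern in reverse: list the small files under one date/type prefix, merge them with `pa.concat_tables`, and rewrite the result as a single larger Parquet file. The full tech stack is summarized below.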
| Component | Technology |
|---|---|
| Messaging | Kafka (Bitnami legacy image) |
| Storage | MinIO (S3-compatible) |
| Processing | Python, PyArrow, Boto3 |
| Orchestration | Docker Compose |
| Monitoring | Kafka UI (Provectus Labs) |
| Querying | DuckDB, FastAPI |
| Visualization | Streamlit |
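To illustrate the query layer's core idea, here is a sketch of DuckDB reading Parquet files in place from MinIO via the `httpfs` extension; the endpoint, credentials and object paths are placeholders:

```python
# Sketch: SQL directly over Parquet objects in MinIO via DuckDB's httpfs extension.
# Endpoint, credentials and object paths are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='localhost:9000';")
con.execute("SET s3_access_key_id='minioadmin';")
con.execute("SET s3_secret_access_key='minioadmin';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_url_style='path';")  # MinIO uses path-style URLs

result = con.execute("""
    SELECT type, COUNT(*) AS events, SUM(amount) AS total_amount
    FROM read_parquet('s3://transactions/compacted/*/*.parquet')
    GROUP BY type
    ORDER BY total_amount DESC
""").fetchall()
print(result)
```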
Helper scripts are provided to simplify common workflows.
- Build all images: `bash scripts-cli/build.sh`
- Start the core pipeline: `bash scripts-cli/run.sh`
- Start all services (including query layer and dashboards): `bash scripts-cli/run_all.sh`
- Optional scaling: `bash scripts-cli/run.sh producer=3 consumer=4`

See scripts-cli/README.md for more details.

Once the stack is running, the following endpoints are available:
- Kafka UI: http://localhost:8080
- MinIO Console: http://localhost:9001
- SQL Querier API: http://localhost:8000
- Streamlit Editor: http://localhost:8501
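From Python, the querier can also be exercised over HTTP; the endpoint path and payload shape below are hypothetical assumptions and should be checked against the auto-generated FastAPI docs the service exposes (typically at http://localhost:8000/docs):

```python
# Hypothetical call to the SQL querier service.
# The endpoint path and request body are assumptions; check the service's
# auto-generated FastAPI docs (typically http://localhost:8000/docs).
import requests

response = requests.post(
    "http://localhost:8000/query",  # hypothetical endpoint
    json={"sql": "SELECT type, COUNT(*) FROM transactions GROUP BY type"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```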
Planned extensions focus on real-world relevance rather than feature completeness:
- Integration with real transaction APIs (e.g. Stripe, Revolut)
- Metrics and observability improvements
- Fraud detection and analytics use cases