An end-to-end data engineering pipeline that collects, processes, and analyzes match results, standings, weather data, and Reddit data for the top 5 European football leagues, and summarizes matchdays using Gemini. Data sources include the football-data.org API, the Open-Meteo API, PRAW (Reddit API), Maps...
This project demonstrates a complete data pipeline for football (soccer) results, from data extraction to visualization. It applies several data engineering practices, including a data lake, transformation layers, and Infrastructure as Code (IaC) with Terraform.
- Automated Data Collection: Scheduled data fetching from multiple APIs using Google Cloud Functions
- Multi-layer Data Architecture: Raw data stored in GCS, processed data in BigQuery, and user-facing data in Firestore
- Weather Integration: Match statistics enriched with weather data at match time
- Social Media (Reddit) Data: Reddit comments for fan sentiment
- Infrastructure as Code: Cloud Functions and Pub/Sub subscriptions and topics defined and deployed with Terraform
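As an illustration of how a Pub/Sub-triggered Cloud Function might read its trigger message, here is a minimal sketch. It assumes the event dict shape used by 1st-gen background functions, where the message body arrives base64-encoded under the `data` key. The function names and the `league`/`matchday` payload fields are hypothetical and not taken from this repo.

```python
import base64
import json
from typing import Any


def parse_pubsub_event(event: dict[str, Any]) -> dict[str, Any]:
    """Decode the JSON payload of a Pub/Sub-triggered function event.

    Assumes the message body is base64-encoded JSON under the "data" key.
    Returns an empty dict when the event carries no data.
    """
    raw = event.get("data")
    if not raw:
        return {}
    return json.loads(base64.b64decode(raw).decode("utf-8"))


def build_trigger_message(league: str, matchday: int) -> dict[str, str]:
    """Hypothetical helper: encode a payload the way it would be delivered."""
    payload = json.dumps({"league": league, "matchday": matchday})
    return {"data": base64.b64encode(payload.encode("utf-8")).decode("ascii")}


if __name__ == "__main__":
    event = build_trigger_message("PL", 12)
    print(parse_pubsub_event(event))
```

Keeping the decoding in a small pure function makes it easy to cover with pytest without touching any GCP service.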
The pipeline has the following architecture:
- Data Ingestion: Cloud Functions trigger on schedule to fetch data
- Storage Layers: Raw data (JSON) → External BQ tables (Parquet) → Processed data in BQ → Firestore
- Validation: Basic validation and data quality checks with Dataplex
- Summarization: Creation of short summaries in Markdown with Gemini 2.0 Flash
- Visualization: Web app for insights
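To make the step from raw data to processed rows more concrete, the sketch below flattens a nested raw match record into a flat row and sets aside incomplete records. This is a simple pre-load check written for illustration, not the Dataplex validation the project uses. The field names (`id`, `utcDate`, `homeTeam`, `awayTeam`, `score.fullTime`) are assumptions about the raw JSON shape, and the function names are hypothetical.

```python
from typing import Any, Optional

# Assumed top-level keys of one raw match record.
REQUIRED_KEYS = ("id", "utcDate", "homeTeam", "awayTeam", "score")


def flatten_match(raw: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a flat row for one raw match record, or None if keys are missing."""
    if any(key not in raw for key in REQUIRED_KEYS):
        return None
    full_time = raw["score"].get("fullTime", {})
    return {
        "match_id": raw["id"],
        "utc_date": raw["utcDate"],
        "home_team": raw["homeTeam"].get("name"),
        "away_team": raw["awayTeam"].get("name"),
        "home_goals": full_time.get("home"),
        "away_goals": full_time.get("away"),
    }


def split_valid(records: list[dict[str, Any]]) -> tuple[list[dict], list[dict]]:
    """Separate records into flattened rows and rejected raw records."""
    rows, rejected = [], []
    for record in records:
        row = flatten_match(record)
        if row is None:
            rejected.append(record)
        else:
            rows.append(row)
    return rows, rejected
```

Returning the rejected records instead of silently dropping them leaves room to log them or send a notification.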
- Football-data.org: Match data, team data, and standings
- Open-Meteo API: Historical weather data
- Reddit (via PRAW): Fan comments and sentiment
- Maps SDK: Location of stadiums
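For the weather source, a request for historical data and the lookup of the reading at kickoff could look roughly like this. The sketch assumes the Open-Meteo archive endpoint with hourly variables, hourly timestamps formatted as `YYYY-MM-DDTHH:MM` in GMT, and a kickoff time given in UTC. The choice of variables and the function names are hypothetical, and no network call is made here.

```python
from datetime import datetime
from typing import Any, Optional
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
# Assumed selection of hourly variables; the project may request others.
HOURLY_VARS = ("temperature_2m", "precipitation")


def build_weather_url(lat: float, lon: float, match_date: str) -> str:
    """Build an archive request URL for a single match day (YYYY-MM-DD)."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": match_date,
        "end_date": match_date,
        "hourly": ",".join(HOURLY_VARS),
    }
    return f"{ARCHIVE_URL}?{urlencode(params, safe=',')}"


def weather_at_kickoff(
    response: dict[str, Any], kickoff: datetime
) -> Optional[dict[str, Any]]:
    """Pick the hourly reading for the hour in which the match kicked off."""
    hourly = response.get("hourly", {})
    # Floor the kickoff time to the full hour to match the hourly timestamps.
    key = kickoff.replace(minute=0, second=0, microsecond=0).strftime("%Y-%m-%dT%H:%M")
    try:
        idx = hourly.get("time", []).index(key)
    except ValueError:
        return None
    return {var: hourly[var][idx] for var in HOURLY_VARS}
```

The returned dict can then be joined to the match row by match ID, so that each match carries the conditions at match time.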
Category | Technologies |
---|---|
Cloud Platform | Google Cloud Platform (GCP) |
Infrastructure as Code | Terraform |
Programming Languages | Python, TypeScript (Svelte) |
Data Storage | Cloud Storage, BigQuery, Firestore |
Data Quality | Dataplex |
Data Transformation | Dataform |
Serverless Computing | Cloud Functions |
Event-Driven Architecture | Pub/Sub |
API Consumption | Football-data.org, Open-Meteo, Reddit API, Google Maps |
CI/CD | GitHub Actions |
Package Management | uv, pyproject.toml |
Code Quality | Ruff, Bandit, Mypy |
Testing | pytest |
Web Framework | Svelte, ShadCN UI Components |
Hosting | Firebase App Hosting |
LLM | Google Gemini 2.0 Flash |
soccer-tracker-DE-project/
├── README.md
├── .gitignore
├── pyproject.toml
├── .github/workflows/ # CI/CD with GitHub Actions
│ ├── cd.yml
│ └── ci.yml
├── terraform/ # IaC definitions
│ ├── main.tf
│ ├── variables.tf
│ ├── pubsub.tf
│ └── cloud_functions.tf
├── cloud_functions/
│ ├── league_data/ # League and Teams data extraction and load
│ ├── discord_utils/ # Package for sending Discord notifications using webhooks
│ ├── match_data/ # Match data extraction and load
│ ├── weather_data/ # Weather data extraction and load
│ ├── reddit_data/ # Reddit data extraction and load
│ ├── standings_data/ # Standings data extraction and load for each matchday
│ ├── data_validation/ # Data validation using Dataplex
│ ├── serving_layer/ # Load data to Firestore
│ └── generate_summaries/ # Generate match summaries with Gemini
├── soccer_tracker_ui/ # Svelte web app in Firebase
│ ├── src/
│ │ ├── lib/ # Reusable components
│ │ │ ├── components/ # UI components from [shadcn](https://next.shadcn-svelte.com/)
│ │ │ ├── firebase.ts # Firebase/Firestore connection
│ │ │ └── stores/ # Svelte stores for state management
│ │ └── routes/ # Page components
│ ├── package.json # Dependencies and scripts
│ ├── svelte.config.js # Svelte configuration
│ └── vite.config.js # Vite bundler config
└── tests/ # Test suite for Cloud Functions with Pytest
The project includes a Svelte web app for visualizing match results, weather data, and match summaries.
The app includes:
- Match Results
- Match summaries using an LLM (Gemini 2.0 Flash)
- Weather data during matches
- Comments from Reddit
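The match summaries need a prompt assembled from the stored results before anything is sent to the LLM. Below is a minimal sketch of such a prompt builder. The prompt wording, the function name, and the result field names (`home`, `away`, `home_goals`, `away_goals`) are hypothetical, and the actual call to Gemini is deliberately left out.

```python
def build_summary_prompt(league: str, matchday: int, results: list[dict]) -> str:
    """Assemble a plain-text prompt asking for a short Markdown summary."""
    result_lines = [
        f"- {r['home']} {r['home_goals']}-{r['away_goals']} {r['away']}"
        for r in results
    ]
    return "\n".join(
        [
            f"Summarize matchday {matchday} of the {league} in short Markdown.",
            "Use only the results listed below; do not invent events.",
            "",
            "Results:",
            *result_lines,
        ]
    )


if __name__ == "__main__":
    sample = [{"home": "Arsenal", "away": "Chelsea", "home_goals": 2, "away_goals": 1}]
    print(build_summary_prompt("Premier League", 5, sample))
```

Building the prompt from structured rows, rather than from free text, keeps the summary grounded in the data already stored in the pipeline.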
The idea for this project came from a repo by digitalghost-dev.