A complete data engineering project for collecting, processing, and analyzing English Premier League data using Python, Airflow, PostgreSQL, BigQuery, and Docker.
This project builds an end-to-end football data pipeline. It scrapes data from BBC Sport and worldfootball.net, stores it in PostgreSQL and BigQuery, automates ETL with Airflow, and supports analysis via SQL.
- Select data sources (BBC & worldfootball.net)
- Scrape raw data using Python + BeautifulSoup (functions in `scrape.py`)
- Preview and verify the data structure in Jupyter Notebook
- Set up BigQuery & manually create partitioned tables
- Load transformed data to PostgreSQL and BigQuery (append mode with `ingestion_time`)
- Use Docker Compose to manage containers (Airflow, Postgres, Jupyter, etc.)
- Schedule daily/weekly scraping jobs in Airflow DAGs
- Analyze data directly in BigQuery using SQL
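The scraping step comes down to turning an HTML table into a list of row dictionaries. The project does this with BeautifulSoup in `scrape.py`; the sketch below is not that code. It swaps in the standard-library `html.parser` so it runs without extra installs, and `TableParser` and `parse_table` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, cells)) for cells in body]
```

All values come back as strings, so numeric columns would still need casting before loading into PostgreSQL or BigQuery.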
Source | Data | Frequency |
---|---|---|
BBC Sport | League table & top scorers | Daily |
worldfootball.net | Goal data, player info, history stats | Weekly/Seasonal |
- Python
- Airflow
- PostgreSQL
- Google BigQuery
- Docker
- Jupyter Notebook
Airflow Dags/
├── init_full_load.py
├── scrape_daily_dag.py
└── scrape_weekly_dag.py
scrape.py
docker-compose.yaml
README.md
DAG | Script | Frequency | Description |
---|---|---|---|
Init Load | `init_full_load.py` | Manual | One-time historical load |
Daily Scrape | `scrape_daily_dag.py` | Daily at 06:00 | League table & top scorers |
Weekly Scrape | `scrape_weekly_dag.py` | Sunday | Historical/player data |
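The frequencies above can be written as five-field cron strings of the kind Airflow accepts for a DAG schedule. The sketch below is illustrative only: the table does not give a time for the Sunday run, so 06:00 is an assumption, and `next_run` is a hypothetical helper that shows when each job would fire next rather than anything taken from the DAG files.

```python
from datetime import datetime, timedelta

# Standard cron syntax: minute hour day-of-month month day-of-week.
DAILY_CRON = "0 6 * * *"   # every day at 06:00
WEEKLY_CRON = "0 6 * * 0"  # Sundays at 06:00 (the hour is an assumption)


def next_run(after, hour=6, weekday=None):
    """Return the first scheduled run strictly after `after`.

    `weekday` follows datetime.weekday() numbering (Monday=0, Sunday=6),
    which differs from cron, where Sunday is 0. None means "every day".
    """
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(days=1)
    # Step forward one day at a time until the weekday matches.
    while weekday is not None and candidate.weekday() != weekday:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 3, 7, 30)        # a Wednesday morning
daily = next_run(now)                    # next daily scrape
weekly = next_run(now, weekday=6)        # next Sunday scrape
```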
Each table includes an `ingestion_time` timestamp column for partitioning.
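A minimal sketch of how scraped rows might be stamped before an append load. `stamp_rows` and `partition_day` are hypothetical helpers, and using UTC and a daily partition granularity are assumptions; the project's actual loading code may differ.

```python
from datetime import datetime, timezone


def stamp_rows(rows, now=None):
    """Return copies of `rows` with one shared UTC `ingestion_time` added.

    Sharing a single timestamp across the batch keeps every row from the
    same scrape in the same partition.
    """
    ts = now or datetime.now(timezone.utc)
    return [{**row, "ingestion_time": ts} for row in rows]


def partition_day(row):
    """Return the calendar day a row would fall into under daily partitioning."""
    return row["ingestion_time"].date()


scraped = [{"Name": "Player A", "Club": "Club X"}]
stamped = stamp_rows(scraped)
```

Because the function returns new dictionaries, the original scraped rows stay unchanged and can be reused for the PostgreSQL and BigQuery loads alike.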
git clone https://github.com/yourusername/Premier-League-Data-Engineering-Project.git
cd Premier-League-Data-Engineering-Project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
docker-compose up -d
Open the Airflow web UI at http://localhost:8080
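-- Note: COUNT(*) counts rows. If each scrape appends one row per player,
-- this is the number of snapshots in the window, not goals scored.
SELECT Name, Club, COUNT(*) AS goals
FROM `project.dataset.top_scorers`
WHERE ingestion_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY Name, Club
ORDER BY goals DESC
LIMIT 5;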
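Because loads run in append mode, the same player can appear once per scrape, so analyses often need the most recent row per player rather than every row. The sketch below shows that "latest snapshot" step on in-memory rows. `latest_snapshot` is a hypothetical helper, and the `Goals` column and sample values are made up for illustration; only `Name`, `Club`, and `ingestion_time` appear in the project's own query.

```python
from datetime import datetime


def latest_snapshot(rows, key=("Name", "Club")):
    """Keep only the most recently ingested row for each key combination."""
    latest = {}
    for row in rows:
        k = tuple(row[col] for col in key)
        if k not in latest or row["ingestion_time"] > latest[k]["ingestion_time"]:
            latest[k] = row
    return list(latest.values())


# Illustrative rows: two daily snapshots for one player, one for another.
rows = [
    {"Name": "Player A", "Club": "Club X", "Goals": 3,
     "ingestion_time": datetime(2024, 1, 1, 6, 0)},
    {"Name": "Player A", "Club": "Club X", "Goals": 5,
     "ingestion_time": datetime(2024, 1, 2, 6, 0)},
    {"Name": "Player B", "Club": "Club Y", "Goals": 4,
     "ingestion_time": datetime(2024, 1, 2, 6, 0)},
]

# Rank players by their latest goal tally instead of by row count.
top = sorted(latest_snapshot(rows), key=lambda r: r["Goals"], reverse=True)
```

The same idea can be expressed in SQL with a window function that ranks rows per player by `ingestion_time` and keeps the first one.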
Pull requests welcome. Submit issues or suggestions.
ZhenXIN
Data Engineer & Football Enthusiast ⚽
MIT License