development
This guide provides instructions for setting up and working with the Semantic Medallion Data Platform in a development environment.
Before you begin, ensure you have Git, Python with Poetry, and Docker with Docker Compose installed, then set up the project:
- Clone the repository:
  git clone https://github.com/yourusername/semantic-medallion-data-platform.git
  cd semantic-medallion-data-platform
- Install dependencies:
  poetry install
- Set up pre-commit hooks:
  poetry run pre-commit install
The project includes a Docker Compose configuration that sets up a local development environment with:
- PostgreSQL database
- Metabase (data visualization and reporting tool)
Metabase provides a user-friendly interface for creating reports and dashboards based on the data in the PostgreSQL database. It can be accessed at http://localhost:3000 after starting the Docker environment.
To start the local environment:
cd docker
docker-compose up -d
To verify that all services are running:
docker-compose ps
To view logs from a specific service:
docker-compose logs -f <service-name>
Create a .env file in the project root with the following variables:
# Local Development
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=medallion
# API Keys
NEWSAPI_KEY=your_newsapi_key_here # Get your key from https://newsapi.org/
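As a minimal sketch of how application code might consume these variables, the helper below builds a PostgreSQL connection URL from the environment. The name `build_postgres_url` is hypothetical and not part of the project, and the fallback defaults simply mirror the values listed above. It assumes the variables have already been loaded into the process environment (for example by your shell or a dotenv loader).

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=None):
    """Build a PostgreSQL URL from POSTGRES_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PASSWORD", "postgres")
    database = env.get("POSTGRES_DB", "medallion")
    # Percent-encode credentials so special characters do not break the URL.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"postgresql://{credentials}@{host}:{port}/{database}"
```

Calling `build_postgres_url()` with no arguments reads from `os.environ`; passing a plain dict makes the helper easy to exercise in tests.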
To run the test suite:
poetry run pytest
To run tests with coverage:
poetry run pytest --cov=semantic_medallion_data_platform
The project uses several tools to maintain code quality:
- Black: Code formatter
  poetry run black .
- isort: Import sorter
  poetry run isort .
- Flake8: Linter
  poetry run flake8
- mypy: Type checker
  poetry run mypy semantic_medallion_data_platform
The project includes scripts for ingesting data into the Bronze layer:
- Extract News Articles from NewsAPI:
  python -m semantic_medallion_data_platform.bronze.brz_01_extract_newsapi --days_back 7
  This script:
  - Fetches known entities from the database
  - Queries NewsAPI for articles mentioning each entity
  - Stores the articles in the bronze.newsapi table
- Extract Known Entities:
  python -m semantic_medallion_data_platform.bronze.brz_01_extract_known_entities --raw_data_filepath data/known_entities/
  This script:
  - Reads entity data from CSV files in the specified directory
  - Processes and transforms the data
  - Stores the entities in the bronze.known_entities table
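To illustrate the "read CSV files from a directory, then clean them" step of the known-entities script, here is a rough sketch using only the standard library. It is not the project's implementation: `clean_row`, `read_entity_rows`, and the added `source_file` field are hypothetical, and no particular column names are assumed because this guide does not list them.

```python
import csv
from pathlib import Path


def clean_row(row):
    """Strip surrounding whitespace from every key and value in a CSV row."""
    return {
        key.strip(): (value or "").strip()
        for key, value in row.items()
        if key is not None  # csv.DictReader stores surplus fields under None
    }


def read_entity_rows(directory):
    """Yield cleaned rows from every *.csv file in `directory`, in filename order."""
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                cleaned = clean_row(row)
                # Record the source file so each row can be traced back (assumption).
                cleaned["source_file"] = csv_path.name
                yield cleaned
```

For example, `list(read_entity_rows("data/known_entities/"))` would collect the cleaned rows from the directory passed to `--raw_data_filepath` above; writing them to the bronze.known_entities table is left out of the sketch.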
To ingest data into the Bronze layer programmatically:
from semantic_medallion_data_platform.bronze import ingest
# Ingest data from a source
ingest.from_csv("path/to/file.csv", "destination_table")
The Silver layer includes a script for transforming the ingested data:
- Transform and Extract Entities from NewsAPI Articles:
  python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_newsapi
  This script:
  - Reads news articles from the bronze.newsapi table
  - Copies the raw articles to the silver.newsapi table
  - Uses spaCy NLP to extract named entities (locations, organizations, persons) from article text (an academic study in the academic_study directory demonstrated spaCy's strong NER performance with an overall F1-score of 0.91)
  - Normalizes entity types (e.g., converts 'GPE' to 'LOC')
  - Removes duplicate entities
  - Stores extracted entities in the silver.newsapi_entities table
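The normalization and de-duplication steps listed above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: `LABEL_MAP` and `normalize_entities` are hypothetical names, and only the GPE to LOC rule is taken from this guide. The input is assumed to be (text, label) pairs such as those spaCy exposes through `ent.text` and `ent.label_`.

```python
# Assumed mapping: only GPE -> LOC comes from the guide. The remaining entries
# are spaCy's standard labels for locations, organizations, and persons.
LABEL_MAP = {"GPE": "LOC", "LOC": "LOC", "ORG": "ORG", "PERSON": "PERSON"}


def normalize_entities(raw_entities):
    """Map labels to a reduced set and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for text, label in raw_entities:
        normalized_label = LABEL_MAP.get(label)
        if normalized_label is None:
            continue  # Skip entity types outside the three kept categories.
        key = (text.strip(), normalized_label)
        if key[0] and key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

With a loaded spaCy pipeline, the pairs could be produced as `[(ent.text, ent.label_) for ent in doc.ents]` and passed straight to `normalize_entities`.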
To process data from Bronze to Silver:
from semantic_medallion_data_platform.silver import transform
# Transform data from Bronze to Silver
transform.bronze_to_silver("source_table", "destination_table")
To aggregate data from Silver to Gold:
from semantic_medallion_data_platform.gold import aggregate
# Aggregate data from Silver to Gold
aggregate.silver_to_gold("source_table", "destination_table")
- Docker services not starting:
  - Check Docker logs:
    docker-compose logs
  - Ensure ports are not already in use
- Poetry dependency issues:
  - Update Poetry:
    poetry self update
  - Clear cache:
    poetry cache clear pypi --all
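For the port check mentioned under Docker troubleshooting, a small standard-library probe can tell whether something is already listening locally. The helper name `port_in_use` is hypothetical. The ports come from this guide (5432 from POSTGRES_PORT, 3000 from the Metabase URL), but the sketch assumes your Docker Compose file publishes them unchanged.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in {"PostgreSQL": 5432, "Metabase": 3000}.items():
        state = "in use" if port_in_use(port) else "free"
        print(f"{name} port {port}: {state}")
```

Run before `docker-compose up -d`, an "in use" result points to a conflicting process; run afterwards, it suggests the corresponding service is listening.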
© 2025 ByteMeDirk • Last updated: June 10, 2025