Architecture
The medallion architecture is a data organization framework that structures data into three distinct layers, each with its own purpose and characteristics:
- Bronze Layer (Raw Data)
- Silver Layer (Validated Data)
- Gold Layer (Business Data)
This architecture provides a clear separation of concerns and enables efficient data processing and governance.
The Bronze layer contains raw data ingested from various sources with minimal or no transformations.
- Raw, unprocessed data
- Exact copy of source data
- Immutable
- Append-only
- Full history preserved
- Local filesystem: data/bronze
- PostgreSQL schema: bronze
- NewsAPI articles
- Known entities (organizations, locations, persons)
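The append-only and immutable properties of the Bronze layer can be illustrated with a minimal sketch. It is not this project's ingestion code: the function name `write_bronze`, the envelope fields `source` and `ingested_at`, and the per-source, per-day directory layout are all assumptions made for illustration. Only the `data/bronze` root comes from the text above.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_bronze(payload: dict, source: str, root: Path = Path("data/bronze")) -> Path:
    """Persist one raw payload as a new file; existing files are never modified."""
    ingested_at = datetime.now(timezone.utc)
    # Hypothetical layout: <root>/<source>/<YYYY-MM-DD>/<time>_<uuid>.json
    target_dir = root / source / ingested_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    # A unique name per write keeps the layer append-only.
    target = target_dir / f"{ingested_at.strftime('%H%M%S')}_{uuid.uuid4().hex}.json"

    record = {
        "source": source,
        "ingested_at": ingested_at.isoformat(),
        "payload": payload,  # stored exactly as received, no transformation
    }
    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with target.open("x", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False)
    return target
```

Writing the same payload twice produces two files, so the full history stays available for later reprocessing.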
The Silver layer contains cleansed, validated, and transformed data that is ready for analysis.
- Validated and cleansed data
- Standardized schemas
- Data quality checks applied
- Duplicate records removed
- Type conversions applied
- Business keys established
- Natural Language Processing (NLP) transformations applied
- Local filesystem: data/silver
- PostgreSQL schema: silver
- Entity extraction from news articles using spaCy (an academic study in the academic_study directory demonstrated spaCy's strong NER performance with an overall F1-score of 0.91)
- Entity normalization and deduplication
- Entity linking with known entities
- Entity linking with known entities
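The Silver-layer steps listed above (quality checks, deduplication, type conversion, entity normalization, and linking) can be sketched with the standard library alone. This is an illustrative sketch, not the project's implementation. The function names are hypothetical, the article fields `url`, `title`, and `publishedAt` are assumed to follow the NewsAPI response shape, and using the URL as the business key is an assumption. The spaCy extraction step is left out: entity mentions are assumed to arrive as plain strings.

```python
from datetime import datetime
from typing import Optional


def to_silver(raw_articles: list) -> list:
    """Validate, type-convert, and deduplicate raw article records."""
    seen_urls = set()
    cleaned = []
    for article in raw_articles:
        url = (article.get("url") or "").strip()
        title = (article.get("title") or "").strip()
        published = article.get("publishedAt")
        # Data quality check: skip records missing a required field.
        if not url or not title or not published:
            continue
        # Deduplicate on the URL, used here as the (assumed) business key.
        if url in seen_urls:
            continue
        seen_urls.add(url)
        cleaned.append({
            "url": url,
            "title": title,
            # Type conversion: ISO 8601 string to a timezone-aware datetime.
            "published_at": datetime.fromisoformat(published.replace("Z", "+00:00")),
        })
    return cleaned


def normalize_entity(name: str) -> str:
    """Collapse whitespace and case so spelling variants compare equal."""
    return " ".join(name.split()).casefold()


def link_entities(mentions: list, known_entities: dict) -> dict:
    """Map each distinct normalized mention to a known entity ID, or None."""
    index = {normalize_entity(name): entity_id
             for name, entity_id in known_entities.items()}
    linked: dict = {}
    for mention in mentions:
        key = normalize_entity(mention)
        if key not in linked:  # deduplicate mentions after normalization
            entity_id: Optional[str] = index.get(key)
            linked[key] = entity_id
    return linked
```

Mentions that match no known entity map to `None` rather than being dropped, so unlinked entities remain visible for review.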
The Gold layer contains business-level aggregates and metrics that are ready for consumption by end-users and applications.
- Business-level aggregates
- Denormalized for query performance
- Optimized for specific use cases
- Ready for consumption
- Often includes dimensional models
- Local filesystem: data/gold
- PostgreSQL schema: gold
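As one example of a business-level aggregate, the sketch below counts entity mentions per day from Silver-level rows. The metric itself, the function name `daily_entity_mentions`, and the row fields `published_on` and `entity_id` are assumptions chosen for illustration. The source does not specify which aggregates the Gold layer holds.

```python
from collections import Counter
from datetime import date


def daily_entity_mentions(rows: list) -> list:
    """Aggregate silver-level mention rows into per-day, per-entity counts."""
    counts = Counter((row["published_on"], row["entity_id"]) for row in rows)
    # Emit flat, denormalized rows, sorted by day then entity for stable output.
    return [
        {"day": day, "entity_id": entity_id, "mentions": n}
        for (day, entity_id), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = [
        {"published_on": date(2025, 6, 10), "entity_id": "org-1"},
        {"published_on": date(2025, 6, 10), "entity_id": "org-1"},
        {"published_on": date(2025, 6, 11), "entity_id": "loc-7"},
    ]
    for row in daily_entity_mentions(sample):
        print(row)
```

Each output row carries everything a consumer needs, so a dashboard or report can read it without joining back to Silver tables.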
The typical data flow in the medallion architecture is:
- Ingestion: Data is ingested from source systems into the Bronze layer
- Validation: Data is validated, cleansed, and transformed from Bronze to Silver
- Aggregation: Data is aggregated and modeled from Silver to Gold
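The three steps above can be sketched as a chain of functions, one per layer. This is a toy, in-memory illustration with hypothetical function names and row fields (`id`, `category`). A real pipeline would read from and write to the storage locations of each layer.

```python
from collections import Counter


def ingest(source_rows: list) -> list:
    """Bronze: copy source rows unchanged."""
    return [dict(row) for row in source_rows]


def validate(bronze_rows: list) -> list:
    """Silver: drop rows without a category and remove duplicate IDs."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row.get("category") and row.get("id") not in seen:
            seen.add(row["id"])
            silver.append(row)
    return silver


def aggregate(silver_rows: list) -> dict:
    """Gold: count rows per category."""
    return dict(Counter(row["category"] for row in silver_rows))


def run_flow(source_rows: list) -> tuple:
    """Run Bronze -> Silver -> Gold in order and return all three results."""
    bronze = ingest(source_rows)
    silver = validate(bronze)
    gold = aggregate(silver)
    return bronze, silver, gold
```

Because the Bronze output is kept unchanged, `validate` and `aggregate` can be rerun on it at any time to rebuild the downstream layers.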
- Separation of Concerns: Each layer has a specific purpose and set of operations
- Data Quality: Progressive improvement of data quality as it moves through layers
- Auditability: Full lineage and history of data transformations
- Reprocessing: Ability to reprocess data from raw sources if needed
- Performance: Optimized storage and query performance at each layer
© 2025 ByteMeDirk • Last updated: June 10, 2025