This study explores how incorporating Natural Language Processing (NLP) into contemporary data processing pipelines can improve data lineage, attribute resolution, and contextual understanding across heterogeneous data sources. Traditional Extract, Transform, Load (ETL) pipelines rely mainly on constraints and keys for attribute resolution, techniques that are insufficient for increasingly complex and varied data environments. This study addresses that limitation by designing and implementing intelligent, NLP-enabled data processing pipelines within a medallion architecture framework.
As a practical implementation, the research uses spaCy for named entity recognition (NER) to extract individuals, organisations, and locations from real-time news data acquired through NewsAPI. An accompanying academic study evaluated spaCy's NER performance and demonstrated strong results, with an overall F1-score of 0.91 (detailed results are available in the academic_study directory). To replicate cloud-native solutions, the system architecture combines Hugging Face NLP models with Apache Spark processing capabilities, deployed in containerised environments. PostgreSQL databases store the processing results, and Metabase provides the reporting and visualisation features that demonstrate the pipeline's effectiveness.
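For context, the kind of entity extraction evaluated above can be reproduced in a few lines of spaCy. This is a minimal sketch: the `en_core_web_sm` model and the example sentence are assumptions for illustration, not the exact configuration used in the pipeline.

```python
# Minimal spaCy NER sketch. Assumes the small English model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
# The pipeline's actual model choice may differ.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new data centre in Galway, Ireland.")

for ent in doc.ents:
    # Keep the entity types the pipeline targets:
    # PERSON (individuals), ORG (organisations), GPE/LOC (locations).
    if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
        print(ent.text, ent.label_)
```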
The approach focuses on building NLP-enhanced ETL pipelines that perform keyword extraction for topic affiliation, entity extraction to create intelligent data lineage across sources, and sentiment analysis on unstructured blob data. Performance evaluation contrasts the NLP-integrated pipelines with conventional data processing techniques, measuring improvements in attribute resolution accuracy, processing efficiency, and data lineage completeness.
Key findings show that NLP integration greatly improves automated attribute resolution while retaining scalable processing performance. The study demonstrates quantifiable gains in contextual comprehension and data categorisation accuracy, with entity extraction mechanisms creating more intelligent data lineages than traditional techniques. However, the computational overhead of NLP processing must be carefully considered when integrating it into production ETL environments.
By offering concrete evidence of the advantages of integrating NLP into enterprise data processing systems, this work contributes to the expanding field of AI-enhanced data engineering. It bridges the gap between theoretical AI capabilities and practical data engineering challenges, providing a repeatable framework for organisations looking to add intelligent, context-aware capabilities to their data processing infrastructure.
This project implements a medallion architecture for data lakes, which organizes data into three layers:
- Bronze Layer (Raw): Raw data ingested from various sources
- Silver Layer (Validated): Cleansed, validated, and transformed data
- Gold Layer (Business): Business-level aggregates and metrics ready for consumption
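As a minimal illustration of the layering (the pipeline diagram below shows the project's actual flow), a sketch under assumed table names, paths, and credentials might look like this:

```python
# Illustrative medallion flow with PySpark: each layer is a PostgreSQL
# schema (bronze/silver/gold) written via Spark's JDBC writer.
# All names, paths, and credentials here are hypothetical, and the
# PostgreSQL JDBC driver must be on the Spark classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()
jdbc_url = "jdbc:postgresql://localhost:5432/semantic_data_platform"
props = {"user": "semantic_app_user", "password": "...", "driver": "org.postgresql.Driver"}

# Bronze: raw data exactly as ingested
raw = spark.read.json("landing/articles.json")
raw.write.jdbc(jdbc_url, "bronze.articles", mode="append", properties=props)

# Silver: deduplicated and validated
clean = raw.dropDuplicates(["url"]).filter(F.col("title").isNotNull())
clean.write.jdbc(jdbc_url, "silver.articles", mode="overwrite", properties=props)

# Gold: business-level aggregate ready for reporting
daily = clean.groupBy(F.to_date("published_at").alias("day")).count()
daily.write.jdbc(jdbc_url, "gold.daily_article_counts", mode="overwrite", properties=props)
```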
```mermaid
graph TD
    %% Bronze Layer
    subgraph "Bronze Layer"
        B1[brz_01_extract_newsapi.py]
        B2[brz_01_extract_known_entities.py]
    end

    %% Silver Layer
    subgraph "Silver Layer"
        S1[slv_02_transform_nlp_known_entities.py]
        S2[slv_02_transform_nlp_newsapi.py]
        S3[slv_03_transform_entity_to_entity_mapping.py]
        S4[slv_02_transform_sentiment_newsapi.py]
    end

    %% Gold Layer
    subgraph "Gold Layer"
        G1[gld_04_load_entities.py]
        G2[gld_04_load_newsapi.py]
        G3[gld_04_load_newsapi_sentiment.py]
    end

    %% Data Sources
    NewsAPI[NewsAPI] --> B1
    KnownEntitiesCSV[Known Entities CSV] --> B2

    %% Bronze to Silver
    B1 -->|bronze.newsapi| S2
    B2 -->|bronze.known_entities| S1

    %% Silver Processing
    S1 -->|silver.known_entities_entities| S3
    S2 -->|silver.newsapi_entities| S3

    %% Silver to Gold
    S1 -->|silver.known_entities| G1
    S1 -->|silver.known_entities| G2
    S1 -->|silver.known_entities| G3
    S3 -->|silver.entity_to_entity_mapping| G1
    S3 -->|silver.entity_to_source_mapping| G2
    S3 -->|silver.entity_to_source_mapping| G3
    S2 -->|silver.newsapi| G2
    S2 -->|silver.newsapi| G3
    S2 -->|silver.newsapi| S4
    S4 -->|silver.newsapi_sentiment| G3

    %% Gold to Reporting
    G1 -->|gold.entity_affiliations_complete| Metabase[Metabase Dashboards]
    G2 -->|gold.entity_to_newsapi| Metabase
    G3 -->|gold.entity_to_newsapi_sentiment| Metabase
```
```mermaid
graph LR
    %% External Systems
    NewsAPI[NewsAPI]
    CSVFiles[CSV Files]

    %% Data Processing
    PySpark[PySpark Processing]

    %% Storage
    PostgreSQL[PostgreSQL Database]

    %% Visualization
    Metabase[Metabase]

    %% Flow
    NewsAPI --> PySpark
    CSVFiles --> PySpark
    PySpark --> PostgreSQL
    PostgreSQL --> Metabase

    %% Layers within PostgreSQL
    subgraph PostgreSQL
        Bronze[Bronze Schema]
        Silver[Silver Schema]
        Gold[Gold Schema]

        Bronze --> Silver
        Silver --> Gold
    end
```
- Data Processing & Transformation: PySpark
- Database: PostgreSQL
- NLP & Sentiment Analysis: spaCy, Hugging Face Transformers
- Reporting & Visualization: Metabase
- Local Development: Docker, Poetry
- External APIs: NewsAPI
- Infrastructure as Code: Terraform
```
semantic-medallion-data-platform/
├── .github/                          # GitHub Actions workflows
├── data/                             # Data files
│   └── known_entities/               # Known entities data files
├── docs/                             # Documentation
├── infrastructure/                   # Infrastructure as Code
│   └── terraform/                    # Terraform configuration for Digital Ocean
│       ├── main.tf                   # Main Terraform configuration
│       ├── variables.tf              # Variable definitions
│       ├── outputs.tf                # Output definitions
│       ├── terraform.tfvars.example  # Example variables file
│       └── setup.sh                  # Setup script for Terraform
├── semantic_medallion_data_platform/ # Main package
│   ├── bronze/                       # Bronze layer processing
│   │   ├── brz_01_extract_newsapi.py        # Extract news articles from NewsAPI
│   │   └── brz_01_extract_known_entities.py # Extract known entities from CSV files
│   ├── silver/                       # Silver layer processing
│   │   ├── slv_02_transform_nlp_known_entities.py       # Extract entities from known entities descriptions
│   │   ├── slv_02_transform_nlp_newsapi.py              # Extract entities from news articles
│   │   ├── slv_02_transform_sentiment_newsapi.py        # Sentiment analysis on news articles
│   │   └── slv_03_transform_entity_to_entity_mapping.py # Create entity mappings
│   ├── gold/                         # Gold layer processing
│   ├── common/                       # Shared utilities
│   └── config/                       # Configuration
├── tests/                            # Unit and integration tests
├── docker/                           # Docker configurations
├── .pre-commit-config.yaml           # Pre-commit hooks
├── pyproject.toml                    # Poetry configuration
└── README.md                         # This file
```
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/semantic-medallion-data-platform.git
   cd semantic-medallion-data-platform
   ```

2. Install dependencies:

   ```bash
   poetry install
   ```

3. Set up pre-commit hooks:

   ```bash
   poetry run pre-commit install
   ```

4. Create a `.env` file from the template:

   ```bash
   cp .env.example .env
   ```

   Edit the `.env` file to set your database credentials and other environment variables. Make sure to set your NewsAPI key if you plan to use the news article extraction functionality:

   ```
   NEWSAPI_KEY=your_newsapi_key_here
   ```

   You can obtain a NewsAPI key by signing up at https://newsapi.org/.
5. Start the local development environment:

   ```bash
   cd docker
   docker-compose up -d
   ```

   This will start:
   - a local PostgreSQL database
   - Metabase (a data visualization and reporting tool), accessible at http://localhost:3000
You can run the entire ETL pipeline using the provided shell script:

```bash
./local_run.sh
```
This script will execute all the necessary steps in the correct order, from Bronze to Gold layer.
Alternatively, you can run each step individually as described in the sections below.
Run the test suite:

```bash
poetry run pytest
```
To extract news articles for known entities:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.bronze.brz_01_extract_newsapi --days_back 7
```
This will:
- Fetch known entities from the database
- Query NewsAPI for articles mentioning each entity
- Store the articles in the bronze.newsapi table
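Under the hood this amounts to one NewsAPI request per entity. A hedged sketch of that call, following the public NewsAPI documentation (the script's actual parameters, pagination, and error handling may differ):

```python
# Hypothetical sketch of querying NewsAPI's /v2/everything endpoint
# for a single entity; in the real script the entity names come from
# bronze.known_entities and results are written to bronze.newsapi.
import os
from datetime import date, timedelta

import requests

api_key = os.environ["NEWSAPI_KEY"]
entity = "Example Corp"  # placeholder entity name
from_date = (date.today() - timedelta(days=7)).isoformat()  # mirrors --days_back 7

resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": entity, "from": from_date, "language": "en", "apiKey": api_key},
    timeout=30,
)
resp.raise_for_status()
for article in resp.json().get("articles", []):
    print(article["publishedAt"], article["title"])
```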
To load known entities from CSV files:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.bronze.brz_01_extract_known_entities --raw_data_filepath data/known_entities/
```
This will:
- Read entity data from CSV files in the specified directory
- Process and transform the data
- Store the entities in the bronze.known_entities table
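A rough sketch of this step with PySpark's CSV reader and JDBC writer (connection details and CSV layout are assumptions):

```python
# Hypothetical sketch: read all known-entity CSVs from a directory and
# load them into the bronze schema. Credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("brz-known-entities-sketch").getOrCreate()

entities = spark.read.csv("data/known_entities/", header=True, inferSchema=True)

entities.write.jdbc(
    url="jdbc:postgresql://localhost:5432/semantic_data_platform",
    table="bronze.known_entities",
    mode="overwrite",
    properties={"user": "semantic_app_user", "password": "...", "driver": "org.postgresql.Driver"},
)
```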
To extract entities from known entities descriptions:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_known_entities
```
This will:
- Copy known entities from bronze.known_entities to silver.known_entities
- Extract entities (locations, organizations, persons) from entity descriptions using NLP
- Store the extracted entities in the silver.known_entities_entities table
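One plausible way to run spaCy inside a PySpark job is a UDF over the description column. This is a sketch under assumed column names, not the script's actual implementation:

```python
# Hypothetical sketch: spaCy NER applied row-by-row via a PySpark UDF.
# Column names (entity_id, description) are illustrative. In cluster
# mode the model should be loaded once per executor; loading it on the
# driver, as here, only suits local runs.
import spacy
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.appName("slv-nlp-sketch").getOrCreate()
nlp = spacy.load("en_core_web_sm")

schema = ArrayType(StructType([
    StructField("text", StringType()),
    StructField("label", StringType()),
]))

@F.udf(returnType=schema)
def extract_entities(text):
    if not text:
        return []
    return [(e.text, e.label_) for e in nlp(text).ents
            if e.label_ in {"PERSON", "ORG", "GPE", "LOC"}]

known = spark.createDataFrame(
    [("e1", "Acme Ltd is a manufacturer based in Bristol, England.")],
    ["entity_id", "description"],
)
known.select("entity_id", F.explode(extract_entities("description")).alias("entity")) \
     .show(truncate=False)
```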
To extract entities from news articles:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_newsapi
```
This will:
- Copy news articles from bronze.newsapi to silver.newsapi
- Extract entities from article title, description, and content using NLP
- Store the extracted entities in the silver.newsapi_entities table
To create entity-to-entity and entity-to-source mappings:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.silver.slv_03_transform_entity_to_entity_mapping
```
This will:
- Create entity-to-source mappings between known_entities_entities and newsapi_entities
- Create entity-to-entity mappings within known_entities_entities using fuzzy matching
- Store the mappings in silver.entity_to_source_mapping and silver.entity_to_entity_mapping tables
The entity mapping process uses fuzzy matching with RapidFuzz to identify similar entities across different data sources. This enables semantic connections between entities even when there are slight variations in naming or formatting.
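For example, RapidFuzz scores string similarity on a 0-100 scale, and a threshold decides whether two names refer to the same entity. The scorer and cut-off below are assumptions, not necessarily what the mapping job uses:

```python
# RapidFuzz fuzzy-matching sketch; scorer and threshold are hypothetical.
from rapidfuzz import fuzz

pairs = [
    ("Apple Inc.", "Apple Incorporated"),
    ("Apple Inc.", "Alphabet Inc."),
]
THRESHOLD = 85  # assumed cut-off for treating two names as the same entity

for a, b in pairs:
    # token_sort_ratio compares tokens regardless of word order
    score = fuzz.token_sort_ratio(a, b)
    verdict = "match" if score >= THRESHOLD else "no match"
    print(f"{a!r} vs {b!r}: {score:.1f} -> {verdict}")
```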
To analyze sentiment in news articles:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.silver.slv_02_transform_sentiment_newsapi
```
This will:
- Read news articles from silver.newsapi
- Apply sentiment analysis to the content of each article using Hugging Face Transformers
- Store the sentiment scores and labels in the silver.newsapi_sentiment table
The sentiment analysis process uses a pre-trained BERT model to classify the sentiment of each article as positive, negative, or neutral, along with a confidence score.
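With the Transformers `pipeline` API this is a short script. The model named below is an assumption: the project uses a pre-trained BERT model, and this publicly available RoBERTa-based stand-in is shown because it emits positive/neutral/negative labels:

```python
# Hugging Face sentiment sketch. The model is an assumption, chosen
# for its positive/neutral/negative label set.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

articles = [
    "The company reported record growth and strong earnings.",
    "Regulators opened an investigation into the firm's practices.",
]
for text, result in zip(articles, sentiment(articles)):
    print(f"{result['label']:<8} {result['score']:.3f}  {text}")
```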
To create a comprehensive entity affiliations table for reporting:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.gold.gld_04_load_entities
```
This will:
- Join entity-to-entity mappings with known entities data
- Create a wide table with entity information and their fuzzy match affiliations
- Create a bidirectional relationship table for complete entity affiliation analysis
- Store the results in gold.entity_affiliations and gold.entity_affiliations_complete tables
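A simplified PySpark sketch of the kind of join this step performs (table and column names are hypothetical):

```python
# Hypothetical gold-layer sketch: enrich entity-to-entity mappings with
# entity attributes, then union both directions so the relationship
# table is bidirectional. All names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gld-affiliations-sketch").getOrCreate()

entities = spark.createDataFrame(
    [("e1", "Acme Ltd", "ORG"), ("e2", "Acme Limited", "ORG")],
    ["entity_id", "name", "type"],
)
mappings = spark.createDataFrame(
    [("e1", "e2", 91.0)], ["from_id", "to_id", "match_score"]
)

affiliations = (
    mappings
    .join(entities.alias("a"), F.col("from_id") == F.col("a.entity_id"))
    .join(entities.alias("b"), F.col("to_id") == F.col("b.entity_id"))
    .select(
        F.col("a.name").alias("entity"),
        F.col("b.name").alias("affiliated_entity"),
        "match_score",
    )
)

# Add the reverse direction so every entity sees all of its affiliations.
complete = affiliations.unionByName(
    affiliations.selectExpr(
        "affiliated_entity AS entity", "entity AS affiliated_entity", "match_score"
    )
)
complete.show()
```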
To create a table mapping entities to news sources:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.gold.gld_04_load_newsapi
```
This will:
- Join entity-to-source mappings with known entities and NewsAPI data
- Create a wide table with entity information and their mentions in news sources
- Store the results in gold.entity_to_newsapi table
To create a table with sentiment analysis of news articles:
```bash
cd semantic-medallion-data-platform
python -m semantic_medallion_data_platform.gold.gld_04_load_newsapi_sentiment
```
This will:
- Join news articles from silver.newsapi with sentiment analysis from silver.newsapi_sentiment
- Create a table with article information and their sentiment scores and labels
- Store the results in gold.entity_to_newsapi_sentiment table
These gold layer tables provide the foundation for the analytics and visualizations in Metabase dashboards, enabling comprehensive entity analysis, news source insights, and sentiment analysis.
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project uses Terraform to manage infrastructure on Digital Ocean, including a PostgreSQL database. Follow these steps to set up the infrastructure:
- Terraform (version 1.0.0 or later)
- Digital Ocean account
- Digital Ocean API token
1. Navigate to the Terraform directory:

   ```bash
   cd infrastructure/terraform
   ```

2. Create a `terraform.tfvars` file from the example:

   ```bash
   cp terraform.tfvars.example terraform.tfvars
   ```

3. Edit the `terraform.tfvars` file to add your Digital Ocean API token:

   ```bash
   # Open with your favorite editor
   nano terraform.tfvars
   ```

4. Initialize Terraform:

   ```bash
   terraform init
   ```

5. Plan the infrastructure changes:

   ```bash
   terraform plan -out=tfplan
   ```

6. Apply the infrastructure changes:

   ```bash
   terraform apply tfplan
   ```

7. After successful application, Terraform will output connection details for your PostgreSQL database:
   - Database host
   - Database port
   - Database name
   - Database user
   - Database password (sensitive)
   - Database URI (sensitive)
The Terraform configuration creates the following resources on Digital Ocean:
- PostgreSQL Database Cluster:
  - Version: PostgreSQL 15
  - Size: db-s-1vcpu-1gb (1 vCPU, 1GB RAM)
  - Region: Configurable (default: London - lon1)
  - Node Count: 1
- Database:
  - Name: semantic_data_platform
- Database User:
  - Name: semantic_app_user

To update the infrastructure after making changes to the Terraform files:

```bash
terraform plan -out=tfplan    # Preview changes
terraform apply tfplan        # Apply changes
```

To destroy the infrastructure when it is no longer needed:

```bash
terraform destroy
```
For more detailed information about the infrastructure setup, see INFRASTRUCTURE.md.
For deployment instructions, see DEPLOYMENT.md.
The Semantic Medallion Data Platform includes built-in visualization and reporting capabilities using Metabase. The gold layer tables are designed to be easily consumed by Metabase for creating dashboards and reports.
The platform includes pre-configured Metabase dashboards for entity analysis and insights:

- One dashboard provides comprehensive analysis of entities and their relationships.
- A second dashboard focuses on insights derived from news articles mentioning entities.
The platform provides a rich set of analytics capabilities through SQL queries that can be used in Metabase dashboards:
- Entity Information: Basic information about entities
- Entity Affiliations: Affiliated entities based on fuzzy matching
- Entity Affiliations by Type: Count of affiliated entities by type
- Entity Affiliations Over Time: Distribution of entity affiliations over time
- Entity News Sources: News sources mentioning specific entities
- Top Entities by News Mentions: Entities with the most news mentions
- Entity Relationship Network: Network visualization of entity relationships
- Entity Match Quality Distribution: Distribution of match quality scores
- Entity News Source Timeline: Timeline of news sources mentioning entities
- News Source Distribution by Entity Type: Distribution of entity types across news sources
- Entity Affiliation Network Metrics: Network metrics for entity affiliations
- Entity Mention Frequency Over Time: Frequency of entity mentions over time
- Entity Similarity Analysis: Identification of similar entities based on match scores
- News Source Influence: Influence of news sources based on entity coverage
- Entity Affiliation Comparison: Comparison of affiliations between entities
- Entity Mention Context Analysis: Analysis of the context of entity mentions
- Entity Type Correlation: Correlation between different entity types
These queries are stored in the `data/metabase_questions/` directory and can be imported into Metabase for creating custom dashboards and reports.
The PostgreSQL database can also be accessed directly using a database client:
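For example, with psycopg2 and the credentials from your `.env` file (the environment variable names and the queried column are assumptions):

```python
# Hypothetical direct-access sketch using psycopg2. Environment variable
# names and the sentiment label column are assumptions.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ.get("DB_HOST", "localhost"),
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ.get("DB_NAME", "semantic_data_platform"),
    user=os.environ.get("DB_USER", "semantic_app_user"),
    password=os.environ["DB_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT label, COUNT(*) FROM gold.entity_to_newsapi_sentiment GROUP BY label;"
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```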
This project is licensed under the MIT License - see the LICENSE file for details.