Commit 2ecd1cf

Add NLP utilities and entity transformation scripts for Silver layer processing
1 parent f2a5345 commit 2ecd1cf

10 files changed, +668 -36 lines changed

README.md

Lines changed: 47 additions & 0 deletions
@@ -34,6 +34,9 @@ semantic-medallion-data-platform/
 │ │ ├── brz_01_extract_newsapi.py # Extract news articles from NewsAPI
 │ │ └── brz_01_extract_known_entities.py # Extract known entities from CSV files
 │ ├── silver/ # Silver layer processing
+│ │ ├── slv_02_transform_nlp_known_entities.py # Extract entities from known entities descriptions
+│ │ ├── slv_02_transform_nlp_newsapi.py # Extract entities from news articles
+│ │ └── slv_03_transform_entity_to_entity_mapping.py # Create entity mappings
 │ ├── gold/ # Gold layer processing
 │ ├── common/ # Shared utilities
 │ └── config/ # Configuration
@@ -131,6 +134,50 @@ This will:
 2. Process and transform the data
 3. Store the entities in the bronze.known_entities table

+### Running Silver Layer Processes
+
+#### Processing Known Entities with NLP
+
+To extract entities from known entities descriptions:
+
+```bash
+cd semantic-medallion-data-platform
+python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_known_entities
+```
+
+This will:
+1. Copy known entities from bronze.known_entities to silver.known_entities
+2. Extract entities (locations, organizations, persons) from entity descriptions using NLP
+3. Store the extracted entities in the silver.known_entities_entities table
+
+#### Processing News Articles with NLP
+
+To extract entities from news articles:
+
+```bash
+cd semantic-medallion-data-platform
+python -m semantic_medallion_data_platform.silver.slv_02_transform_nlp_newsapi
+```
+
+This will:
+1. Copy news articles from bronze.newsapi to silver.newsapi
+2. Extract entities from article title, description, and content using NLP
+3. Store the extracted entities in the silver.newsapi_entities table
+
+#### Creating Entity Mappings
+
+To create entity-to-entity and entity-to-source mappings:
+
+```bash
+cd semantic-medallion-data-platform
+python -m semantic_medallion_data_platform.silver.slv_03_transform_entity_to_entity_mapping
+```
+
+This will:
+1. Create entity-to-source mappings between known_entities_entities and newsapi_entities
+2. Create entity-to-entity mappings within known_entities_entities
+3. Store the mappings in silver.entity_to_source_mapping and silver.entity_to_entity_mapping tables
+

 ## Contributing
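
The README changes above describe the entity-mapping step only at the table level, and slv_03_transform_entity_to_entity_mapping.py is not among the files rendered here. Because this commit also adds rapidfuzz as a dependency, fuzzy name matching is a plausible ingredient, but that is an assumption. The sketch below shows the general idea using the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz; the function names, the same-type rule, and the 85-point threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 case-insensitive similarity score between two names."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_entities(known, mentions, threshold=85.0):
    """Pair known entities with mentions of the same type whose names are similar.

    The same-type requirement and the threshold are illustrative assumptions,
    not behavior confirmed by the commit.
    """
    mappings = []
    for k in known:
        for m in mentions:
            if k["type"] != m["type"]:
                continue  # assumed rule: only compare entities of the same type
            score = similarity(k["text"], m["text"])
            if score >= threshold:
                mappings.append(
                    {
                        "entity": k["text"],
                        "source_entity": m["text"],
                        "score": round(score, 1),
                    }
                )
    return mappings


# Entity dicts use the same "text"/"type" keys as ENTITY_STRUCT in this commit.
known = [{"text": "OpenAI", "type": "ORG"}, {"text": "Paris", "type": "GPE"}]
mentions = [
    {"text": "Open AI", "type": "ORG"},
    {"text": "Paris", "type": "PERSON"},
    {"text": "Berlin", "type": "GPE"},
]
result = map_entities(known, mentions)
print(result)
```

With this sample data only the "OpenAI" / "Open AI" pair is kept: the two "Paris" entries differ in type, and "Paris" / "Berlin" score far below the threshold. Swapping in rapidfuzz would change the scoring function but not the overall shape of the loop.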

local_run.sh

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# A local script to run the entire ETL pipeline
+
+# Bronze Layer
+#python semantic_medallion_data_platform/bronze/brz_01_extract_known_entities.py --raw_data_filepath "./data/known_entities/*.csv"
+#python semantic_medallion_data_platform/bronze/brz_01_extract_newsapi.py --days_back "1"
+
+# Silver Layer
+python semantic_medallion_data_platform/silver/slv_02_transform_nlp_newsapi.py
+python semantic_medallion_data_platform/silver/slv_02_transform_nlp_known_entities.py
+python semantic_medallion_data_platform/silver/slv_03_transform_entity_to_entity_mapping.py

poetry.lock

Lines changed: 108 additions & 1 deletion
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ dotenv = "^0.9.9"
 spacy = "^3.8.7"
 textblob = "^0.19.0"
 newsapi-python = "^0.2.7"
+rapidfuzz = "^3.13.0"

 [tool.poetry.group.dev.dependencies]
 pytest = "^7.3.1"
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+"""
+Common NLP utilities for the Semantic Medallion Data Platform.
+
+This module provides reusable NLP functions for entity extraction and other text processing tasks.
+"""
+import spacy
+from pyspark.sql.types import ArrayType, StringType, StructField, StructType
+
+# Load spaCy NLP model
+NLP = spacy.load("en_core_web_lg")
+
+# Define schema for entity extraction
+ENTITY_STRUCT = StructType(
+    [StructField("text", StringType(), True), StructField("type", StringType(), True)]
+)
+
+ENTITIES_SCHEMA = ArrayType(ENTITY_STRUCT)
+
+
+def extract_entities(text: str) -> list:
+    """
+    Extract location, organization, and person entities from text using spaCy.
+
+    Args:
+        text: The input text to process
+
+    Returns:
+        A list of dictionaries with entity text and type
+    """
+    if not text:
+        return []
+
+    doc = NLP(text)
+    entities = [
+        {"text": ent.text, "type": ent.label_}
+        for ent in doc.ents
+        if ent.label_ in ("LOC", "GPE", "ORG", "PERSON")
+    ]
+    return entities
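
The label filter in `extract_entities` can be exercised without loading `en_core_web_lg`. The sketch below applies the same four-label filter to hypothetical stand-in objects that mimic only the `text` and `label_` attributes of spaCy entity spans; `FakeSpan` and `filter_entities` are illustrative names and are not part of the commit.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span: only the two attributes
# that the filter in extract_entities actually reads.
FakeSpan = namedtuple("FakeSpan", ["text", "label_"])

# The four labels kept by extract_entities in this commit.
KEPT_LABELS = ("LOC", "GPE", "ORG", "PERSON")


def filter_entities(spans):
    """Keep location, organization, and person spans as text/type dicts."""
    return [
        {"text": span.text, "type": span.label_}
        for span in spans
        if span.label_ in KEPT_LABELS
    ]


spans = [
    FakeSpan("Apple", "ORG"),
    FakeSpan("Tim Cook", "PERSON"),
    FakeSpan("Cupertino", "GPE"),
    FakeSpan("$3 billion", "MONEY"),  # dropped: not one of the kept labels
    FakeSpan("last Tuesday", "DATE"),  # dropped: not one of the kept labels
]
entities = filter_entities(spans)
print(entities)
```

Each returned dict has the `text` and `type` keys declared in `ENTITY_STRUCT`, so a list like this is the shape that `ENTITIES_SCHEMA` describes.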
