
Commit 432d6c4

Add NewsAPI integration and update environment configuration
- Introduced scripts for extracting news articles related to known entities using NewsAPI.
- Updated .env.example to include NEWSAPI_KEY.
- Modified env.py to include newsapi_key in the database configuration.
- Adjusted README.md to reflect local processing and new extraction scripts.
- Removed unused Docker services from docker-compose.yml.
1 parent d940b18 commit 432d6c4

17 files changed: +1627 -508 lines changed

.env.example

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@ DB_PORT=5432
 DB_NAME=postgres
 DB_USER=postgres
 DB_PASSWORD=postgres
+# API configuration
+NEWSAPI_KEY=your_newsapi_key_here
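
The commit message says `env.py` now exposes `newsapi_key` alongside the database settings, but that file is not shown here. The following is only a minimal sketch of how such a loader might look; the `Settings` class and `load_settings` function are hypothetical names, and the defaults simply mirror the values in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Placeholder value shipped in .env.example; treated as "not configured".
PLACEHOLDER_KEY = "your_newsapi_key_here"


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container (not the project's actual env.py)."""

    db_name: str
    db_user: str
    db_password: str
    newsapi_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read database and API settings from an environment-like mapping."""
    raw_key = env.get("NEWSAPI_KEY", "").strip()
    # An empty value or the untouched placeholder means no usable key.
    newsapi_key = None if raw_key in ("", PLACEHOLDER_KEY) else raw_key
    return Settings(
        db_name=env.get("DB_NAME", "postgres"),
        db_user=env.get("DB_USER", "postgres"),
        db_password=env.get("DB_PASSWORD", "postgres"),
        newsapi_key=newsapi_key,
    )
```

Returning `None` for the placeholder lets an extraction script fail early with a clear message instead of sending an invalid key to the API.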

README.md

Lines changed: 42 additions & 21 deletions
@@ -1,6 +1,6 @@
 # Semantic Medallion Data Platform

-A modern data platform implementing the medallion architecture on Google Cloud Platform.
+A modern data platform implementing the medallion architecture with local processing.

 ## Architecture Overview

@@ -13,24 +13,25 @@ This project implements a medallion architecture for data lakes, which organizes
 ## Tech Stack

 - **Data Processing**: PySpark, Delta Lake
-- **Cloud Infrastructure**: Google Cloud Platform (GCS, BigQuery)
+- **Database**: PostgreSQL
 - **Orchestration**: Prefect
 - **Transformation**: dbt
 - **Data Quality**: Great Expectations
 - **Local Development**: Docker, Poetry
-- **Infrastructure as Code**: Terraform
+- **External APIs**: NewsAPI

 ## Project Structure

 ```
 semantic-medallion-data-platform/
 ├── .github/ # GitHub Actions workflows
+├── data/ # Data files
+│ └── known_entities/ # Known entities data files
 ├── docs/ # Documentation
-├── infrastructure/ # Terraform configurations
-│ ├── environments/ # Environment-specific configurations
-│ └── modules/ # Reusable Terraform modules
 ├── semantic_medallion_data_platform/ # Main package
 │ ├── bronze/ # Bronze layer processing
+│ │ ├── brz_01_extract_newsapi.py # Extract news articles from NewsAPI
+│ │ └── brz_01_extract_known_entities.py # Extract known entities from CSV files
 │ ├── silver/ # Silver layer processing
 │ ├── gold/ # Gold layer processing
 │ ├── common/ # Shared utilities
@@ -49,8 +50,6 @@ semantic-medallion-data-platform/
 - Python 3.9+
 - [Poetry](https://python-poetry.org/docs/#installation)
 - [Docker](https://docs.docker.com/get-docker/)
-- [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli)
-- Google Cloud account with appropriate permissions

 ### Installation

@@ -75,39 +74,61 @@ semantic-medallion-data-platform/
 cp .env.example .env
 ```

-Edit the `.env` file to set your database credentials and other environment variables.
+Edit the `.env` file to set your database credentials and other environment variables. Make sure to set your NewsAPI key if you plan to use the news article extraction functionality:
+```
+NEWSAPI_KEY=your_newsapi_key_here
+```
+
+You can obtain a NewsAPI key by signing up at [https://newsapi.org/](https://newsapi.org/).

 ### Local Development

 Start the local development environment:

 ```bash
+cd docker
 docker-compose up -d
 ```

 This will start:
 - Local PostgreSQL database
-- Local GCS emulator
-- Other required services

 ### Running Tests

 ```bash
 poetry run pytest
 ```

-### Deploying to GCP
+### Running Bronze Layer Processes

-1. Initialize Terraform:
-```bash
-cd infrastructure/environments/dev
-terraform init
-```
+#### Extracting News Articles from NewsAPI
+
+To extract news articles for known entities:
+
+```bash
+cd semantic-medallion-data-platform
+python -m semantic_medallion_data_platform.bronze.brz_01_extract_newsapi --days_back 7
+```
+
+This will:
+1. Fetch known entities from the database
+2. Query NewsAPI for articles mentioning each entity
+3. Store the articles in the bronze.newsapi table
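
The three NewsAPI steps listed above can be illustrated with a small sketch. The script `brz_01_extract_newsapi.py` itself is not shown in this diff, so this is not its implementation: the helper names, the choice of NewsAPI's `/v2/everything` endpoint, and the row fields are assumptions, and the real columns of `bronze.newsapi` may differ. The sketch only builds the request URL for one entity and flattens one article from the response; it performs no network or database access.

```python
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlencode

# NewsAPI's article search endpoint (assumed to be the one the script uses).
NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"


def build_request_url(
    entity_name: str,
    days_back: int,
    api_key: str,
    today: Optional[date] = None,
) -> str:
    """Build a search URL for articles mentioning one entity."""
    today = today or date.today()
    params = {
        "q": f'"{entity_name}"',  # quotes request an exact phrase match
        "from": (today - timedelta(days=days_back)).isoformat(),
        "to": today.isoformat(),
        "sortBy": "publishedAt",
        "apiKey": api_key,
    }
    return f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"


def article_to_row(entity_name: str, article: dict) -> dict:
    """Flatten one article from the JSON response into a row.

    The field selection is hypothetical, not the actual table schema.
    """
    return {
        "entity_name": entity_name,
        "title": article.get("title"),
        "url": article.get("url"),
        "published_at": article.get("publishedAt"),
        "source_name": (article.get("source") or {}).get("name"),
    }
```

Passing `today` explicitly keeps the date window reproducible, which is convenient when re-running an extraction for a past period.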
+
+#### Extracting Known Entities
+
+To load known entities from CSV files:
+
+```bash
+cd semantic-medallion-data-platform
+python -m semantic_medallion_data_platform.bronze.brz_01_extract_known_entities --raw_data_filepath data/known_entities/
+```
+
+This will:
+1. Read entity data from CSV files in the specified directory
+2. Process and transform the data
+3. Store the entities in the bronze.known_entities table
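
As a rough illustration of the first two known-entity steps above, the sketch below reads every CSV file in a directory with the standard library. It is not the code in `brz_01_extract_known_entities.py`, which this diff does not show; `read_known_entities` is a hypothetical helper, and the expected header is taken from the CSV files added in this commit.

```python
import csv
from pathlib import Path
from typing import Dict, List

# Header shared by the known-entity CSV files added in this commit.
EXPECTED_COLUMNS = ["entity_name", "entity_type", "entity_description"]


def read_known_entities(raw_data_dir: str) -> List[Dict[str, str]]:
    """Read all *.csv files in a directory into a list of entity rows."""
    rows: List[Dict[str, str]] = []
    for csv_path in sorted(Path(raw_data_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if reader.fieldnames != EXPECTED_COLUMNS:
                raise ValueError(
                    f"{csv_path.name}: unexpected header {reader.fieldnames}"
                )
            for row in reader:
                # Strip stray whitespace; a missing field becomes "".
                rows.append(
                    {name: (row[name] or "").strip() for name in EXPECTED_COLUMNS}
                )
    return rows
```

Using `csv.DictReader` rather than splitting on commas matters here because the descriptions are quoted and contain commas themselves.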

-2. Apply Terraform configuration:
-```bash
-terraform apply
-```

 ## Contributing

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+entity_name,entity_type,entity_description
+Microsoft,ORG,"Multinational technology corporation co-founded by Bill Gates, specializing in software, cloud computing, and consumer electronics."
+Apple,ORG,"Multinational technology company co-founded by Steve Jobs, known for consumer electronics, software, and digital services."
+Tesla,ORG,"Electric vehicle and clean energy company founded and led by Elon Musk, pioneering sustainable transportation."
+SpaceX,ORG,"Private space exploration company founded by Elon Musk, developing spacecraft and satellite internet services."
+Meta,ORG,"Social media and technology conglomerate founded by Mark Zuckerberg, formerly known as Facebook."
+Facebook,ORG,"Social networking platform founded by Mark Zuckerberg, now part of Meta's ecosystem."
+Google,ORG,"Search engine and technology company co-founded by Larry Page and Sergey Brin, now part of Alphabet."
+Alphabet,ORG,"Parent company of Google led by Sundar Pichai, encompassing various technology ventures and moonshot projects."
+Amazon,ORG,"E-commerce and cloud computing giant founded by Jeff Bezos, dominating online retail and AWS services."
+Dell Technologies,ORG,"Computer technology company founded by Michael Dell, specializing in personal computers and enterprise solutions."
+Oracle,ORG,"Database software corporation co-founded by Larry Ellison, leading enterprise software and cloud computing."
+Nvidia,ORG,"Graphics processing and AI chip company led by Jensen Huang, powering modern AI and gaming."
+Salesforce,ORG,"Cloud-based customer relationship management platform founded by Marc Benioff."
+Netflix,ORG,"Streaming entertainment service co-founded by Reed Hastings, revolutionizing media consumption."
+Twitter,ORG,"Social media platform co-founded by Jack Dorsey, now known as X under Elon Musk's ownership."
+X,ORG,"Social media platform formerly known as Twitter, acquired and rebranded by Elon Musk."
+Uber,ORG,"Ride-sharing and mobility platform co-founded by Travis Kalanick, transforming transportation services."
+Airbnb,ORG,"Home-sharing marketplace co-founded by Brian Chesky, disrupting the hospitality industry."
+Spotify,ORG,"Music streaming platform co-founded by Daniel Ek, leading digital music distribution."
+Snapchat,ORG,"Multimedia messaging app co-founded by Evan Spiegel, pioneering ephemeral content sharing."
+Dropbox,ORG,"Cloud storage service co-founded by Drew Houston, simplifying file sharing and collaboration."
+Stripe,ORG,"Online payment processing company co-founded by Patrick and John Collison, powering internet commerce."
+LinkedIn,ORG,"Professional networking platform co-founded by Reid Hoffman, connecting business professionals worldwide."
+PayPal,ORG,"Digital payment system co-founded by Peter Thiel and Elon Musk, enabling online financial transactions."
+Palantir,ORG,"Data analytics company co-founded by Peter Thiel, specializing in big data analysis for enterprises."
+Asana,ORG,"Work management platform co-founded by Dustin Moskovitz, helping teams organize and track projects."
+Instagram,ORG,"Photo and video sharing platform co-founded by Kevin Systrom, acquired by Facebook/Meta."
+Neuralink,ORG,"Neurotechnology company founded by Elon Musk, developing brain-computer interface technology."
+The Boring Company,ORG,"Tunnel construction company founded by Elon Musk, creating underground transportation systems."
+YouTube,ORG,"Video sharing platform formerly led by Susan Wojcicki, owned by Google/Alphabet."
+Square,ORG,"Financial services company co-founded by Jack Dorsey, providing payment solutions for businesses."
+Pixar,ORG,"Animation studio acquired by Steve Jobs, creating computer-animated films before Disney acquisition."
+NeXT,ORG,"Computer company founded by Steve Jobs after leaving Apple, later acquired by Apple."
+Beats Electronics,ORG,"Audio products company acquired by Apple under Tim Cook's leadership for $3 billion."
+WhatsApp,ORG,"Messaging application acquired by Facebook/Meta, enabling global instant communication."
+Threads,ORG,"Social networking platform launched by Meta under Mark Zuckerberg's leadership."
+Cascade Investment,ORG,"Private investment firm founded and controlled by Bill Gates, managing his wealth and investments."
+Breakthrough Energy Ventures,ORG,"Clean energy investment fund founded by Bill Gates, backing climate change solutions."
+TerraPower,ORG,"Nuclear reactor design company co-founded by Bill Gates, developing
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+entity_name,entity_type,entity_description
+Medina,LOC,"Exclusive city in Washington state near Lake Washington, home to tech billionaires including Bill Gates."
+Seattle,LOC,"Major city in Washington state, birthplace of Bill Gates and headquarters region for Microsoft."
+Los Altos,LOC,"City in California's Silicon Valley, location of Steve Jobs' garage where Apple was founded."
+Cupertino,LOC,"City in California, headquarters of Apple Inc. and center of consumer electronics innovation."
+Palo Alto,LOC,"City in California's Silicon Valley, hub for tech startups and venture capital firms."
+Menlo Park,LOC,"City in California, headquarters of Meta/Facebook and numerous venture capital firms."
+Mountain View,LOC,"City in California, headquarters of Google/Alphabet and center of search technology."
+Redmond,LOC,"City in Washington state, headquarters of Microsoft Corporation."
+Austin,LOC,"Capital city of Texas, headquarters of Dell Technologies and Oracle Corporation."
+Round Rock,LOC,"City in Texas, location of Dell's primary headquarters and manufacturing facilities."
+Los Altos Hills,LOC,"Affluent city in California, home to tech executives including Sundar Pichai and Jensen Huang."
+Atherton,LOC,"Wealthy city in California, residence of tech executives including Sheryl Sandberg."
+San Francisco,LOC,"Major city in California, headquarters of numerous tech companies including Salesforce and Twitter/X."
+Fremont,LOC,"City in California, location of Tesla's main manufacturing facility."
+Hawthorne,LOC,"City in California, headquarters of SpaceX and location of rocket development."
+Boca Chica,LOC,"Location in Texas, site of SpaceX's Starship development and launch facility."
+Bastrop,LOC,"City in Texas, location of Elon Musk's corporate campus including X headquarters."
+Santa Clara,LOC,"City in California, birthplace of Susan Wojcicki and center of semiconductor industry."
+Kauai,LOC,"Hawaiian island, location of Mark Zuckerberg's extensive private estate and compound."
+Indian Creek,LOC,"Exclusive island in Miami, Florida, known as 'Billionaire Bunker' and home to Jeff Bezos."
+Manalapan,LOC,"Wealthy town in Florida, location of Larry Ellison's $173 million estate."
+Lanai,LOC,"Hawaiian island owned almost entirely by Larry Ellison as his primary residence."
+Dolores Heights,LOC,"Neighborhood in San Francisco, residence of Airbnb founder Brian Chesky."
+Stanford,LOC,"University and surrounding area in California, where Google founders Larry Page and Sergey Brin met."
+Cambridge,LOC,"City in Massachusetts, location of Harvard University where Mark Zuckerberg founded Facebook."
+Hyderabad,LOC,"City in India, birthplace of Microsoft CEO Satya Nadella."
+Moscow,LOC,"Capital of Russia, birthplace of Google co-founder Sergey Brin."
+Lansing,LOC,"City in Michigan, birthplace of Google co-founder Larry Page."
+Mobile,LOC,"City in Alabama, birthplace of Apple CEO Tim Cook."
+Robertsdale,LOC,"City in Alabama, hometown where Tim Cook grew up."
+Eden,LOC,"City in Utah, location associated with Netflix founder Reed Hastings."
+Costa Rica,LOC,"Central American country, frequent vacation destination of Twitter co-founder Jack Dorsey."
+Silicon Valley,LOC,"Technology hub region in California, center of global tech innovation and venture capital."
+San Jose,LOC,"Major city in California's Silicon Valley, center of technology and innovation."
+Sunnyvale,LOC,"City in California's Silicon Valley, location of numerous tech companies."
+Santa Monica,LOC,"City in California, location of Snapchat headquarters."
+Beverly Hills,LOC,"Affluent city in California, residence area for many tech executives."
+Malibu,LOC,"Coastal city in California, residence area for tech billionaires."
+Newport Beach,LOC,"Wealthy coastal city in California, location of tech executive residences."
+Lake Tahoe,LOC,"Mountain lake region, popular retreat location for Silicon Valley executives."
+Napa Valley,LOC,"Wine region in California, popular investment and residence area for tech billionaires."
+Washington State,LOC,"U.S. state, home to Microsoft headquarters and Bill Gates' residence."
+California,LOC,"U.S. state, center of global technology industry and Silicon Valley."
+Texas,LOC,"U.S. state, emerging tech hub with headquarters of Dell, Oracle, and Tesla operations."
+Florida,LOC,"U.S. state, location of tech billionaire estates and emerging tech scene."
+Hawaii,LOC,"U.S. state, location of tech billionaire private estates and retreats."
+New York City,LOC,"Major U.S. city, location of tech company offices and executive residences."
+Boston,LOC,"Major U.S. city, technology and education hub with MIT and Harvard nearby."
+Chicago,LOC,"Major U.S. city, location where Larry Ellison attended university."
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+entity_name,entity_type,entity_description
+Bill Gates,PER,"Co-founder of Microsoft, philanthropist, and pioneer in the software industry."
+Steve Jobs,PER,"Co-founder of Apple Inc., visionary in personal computing and consumer electronics."
+Elon Musk,PER,"CEO of Tesla, SpaceX, Neuralink, and co-founder of PayPal, known for innovation in electric vehicles and space exploration."
+Mark Zuckerberg,PER,"Co-founder and CEO of Meta (Facebook), influential in social media and virtual reality."
+Sundar Pichai,PER,"CEO of Alphabet and Google, key figure in AI and cloud computing."
+Tim Cook,PER,"CEO of Apple Inc., leading innovation in consumer electronics and services."
+Satya Nadella,PER,"CEO of Microsoft, driving cloud computing and AI advancements."
+Michael Dell,PER,"Founder and CEO of Dell Technologies, pioneer in direct sales and customized computing."
+Larry Page,PER,"Co-founder of Google, instrumental in search engine technology and AI development."
+Sergey Brin,PER,"Co-founder of Google, key contributor to search technology and innovation."
+Jeff Bezos,PER,"Founder of Amazon, innovator in e-commerce and cloud computing."
+Sheryl Sandberg,PER,"Former COO of Facebook, influential in social media business strategy."
+Susan Wojcicki,PER,"Former CEO of YouTube, key figure in online video and digital advertising."
+Reed Hastings,PER,"Co-founder and CEO of Netflix, pioneer in streaming media."
+Jack Dorsey,PER,"Co-founder of Twitter and Square, influential in social media and fintech."
+Larry Ellison,PER,"Co-founder and CEO of Oracle Corporation, major figure in database technology."
+Jensen Huang,PER,"CEO of Nvidia, leader in graphics processing and AI chip development."
+Marc Benioff,PER,"CEO of Salesforce, pioneer in cloud-based customer relationship management."
+Travis Kalanick,PER,"Co-founder of Uber, revolutionary in ride-sharing and gig economy."
+Brian Chesky,PER,"Co-founder and CEO of Airbnb, innovator in sharing economy and hospitality."
+Daniel Ek,PER,"Co-founder and CEO of Spotify, leader in music streaming technology."
+Evan Spiegel,PER,"Co-founder and CEO of Snapchat, innovator in ephemeral messaging and AR."
+Drew Houston,PER,"Co-founder and CEO of Dropbox, pioneer in cloud storage solutions."
+Patrick Collison,PER,"Co-founder and CEO of Stripe, leader in online payment processing."
+John Collison,PER,"Co-founder and President of Stripe, innovator in fintech infrastructure."
+Reid Hoffman,PER,"Co-founder of LinkedIn, influential in professional networking and venture capital."
+Peter Thiel,PER,"Co-founder of PayPal and Palantir, prominent venture capitalist and entrepreneur."
+Dustin Moskovitz,PER,"Co-founder of Facebook and Asana, influential in social media and productivity software."
+Bobby Murphy,PER,"Co-founder and CTO of Snapchat, key figure in mobile messaging innovation."
+Kevin Systrom,PER,"Co-founder of Instagram, pioneer in photo-sharing and social media."

data/sources.md

Lines changed: 0 additions & 4 deletions
This file was deleted.

docker/docker-compose.yml

Lines changed: 0 additions & 53 deletions
@@ -18,58 +18,5 @@ services:
       timeout: 5s
       retries: 5

-  gcs-emulator:
-    image: fsouza/fake-gcs-server
-    container_name: medallion-gcs
-    ports:
-      - "4443:4443"
-    command: ["-scheme", "http", "-port", "4443", "-public-host", "localhost:4443"]
-    volumes:
-      - gcs-data:/storage
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:4443"]
-      interval: 10s
-      timeout: 5s
-      retries: 5
-
-  spark-master:
-    image: bitnami/spark:3.4
-    container_name: medallion-spark-master
-    environment:
-      - SPARK_MODE=master
-      - SPARK_RPC_AUTHENTICATION_ENABLED=no
-      - SPARK_RPC_ENCRYPTION_ENABLED=no
-      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
-      - SPARK_SSL_ENABLED=no
-    ports:
-      - "8080:8080"
-      - "7077:7077"
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:8080"]
-      interval: 10s
-      timeout: 5s
-      retries: 5
-
-  spark-worker:
-    image: bitnami/spark:3.4
-    container_name: medallion-spark-worker
-    environment:
-      - SPARK_MODE=worker
-      - SPARK_MASTER_URL=spark://spark-master:7077
-      - SPARK_WORKER_MEMORY=1G
-      - SPARK_WORKER_CORES=1
-      - SPARK_RPC_AUTHENTICATION_ENABLED=no
-      - SPARK_RPC_ENCRYPTION_ENABLED=no
-      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
-      - SPARK_SSL_ENABLED=no
-    depends_on:
-      - spark-master
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:8081"]
-      interval: 10s
-      timeout: 5s
-      retries: 5
-
 volumes:
   postgres-data:
-  gcs-data:

docs/architecture.md

Lines changed: 16 additions & 8 deletions
@@ -2,7 +2,8 @@

 ## Overview

-The medallion architecture is a data organization framework that structures data into three distinct layers, each with its own purpose and characteristics:
+The medallion architecture is a data organization framework that structures data into three distinct layers, each with
+its own purpose and characteristics:

 1. **Bronze Layer (Raw Data)**
 2. **Silver Layer (Validated Data)**
@@ -15,21 +16,24 @@ This architecture provides a clear separation of concerns and enables efficient
 The Bronze layer contains raw data ingested from various sources with minimal or no transformations.

 ### Characteristics:
+
 - Raw, unprocessed data
 - Exact copy of source data
 - Immutable
 - Append-only
 - Full history preserved

 ### Storage:
-- GCS bucket: `{project_id}-bronze`
-- BigQuery dataset: `bronze`
+
+- Local filesystem: `data/bronze`
+- PostgreSQL schema: `bronze`

 ## Silver Layer

 The Silver layer contains cleansed, validated, and transformed data that is ready for analysis.

 ### Characteristics:
+
 - Validated and cleansed data
 - Standardized schemas
 - Data quality checks applied
@@ -38,23 +42,27 @@ The Silver layer contains cleansed, validated, and transformed data that is read
 - Business keys established

 ### Storage:
-- GCS bucket: `{project_id}-silver`
-- BigQuery dataset: `silver`
+
+- Local filesystem: `data/silver`
+- PostgreSQL schema: `silver`

 ## Gold Layer

-The Gold layer contains business-level aggregates and metrics that are ready for consumption by end-users and applications.
+The Gold layer contains business-level aggregates and metrics that are ready for consumption by end-users and
+applications.

 ### Characteristics:
+
 - Business-level aggregates
 - Denormalized for query performance
 - Optimized for specific use cases
 - Ready for consumption
 - Often includes dimensional models

 ### Storage:
-- GCS bucket: `{project_id}-gold`
-- BigQuery dataset: `gold`
+
+- Local filesystem: `data/gold`
+- PostgreSQL schema: `gold`
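
The storage layout this change documents is the same for each layer: a `data/<layer>` directory plus a PostgreSQL schema named after the layer. A minimal sketch of that convention follows; `LayerStorage` and `storage_for` are hypothetical names used only for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass
from pathlib import Path

# The three medallion layers named in the architecture document.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class LayerStorage:
    """Where one layer keeps its files and its tables (illustrative only)."""

    layer: str
    directory: Path
    schema: str

    def qualified_table(self, table: str) -> str:
        """Return a schema-qualified table name such as bronze.newsapi."""
        return f"{self.schema}.{table}"


def storage_for(layer: str, data_root: str = "data") -> LayerStorage:
    """Map a layer name to its local directory and PostgreSQL schema."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return LayerStorage(layer=layer, directory=Path(data_root) / layer, schema=layer)
```

For example, `storage_for("bronze").qualified_table("newsapi")` yields the `bronze.newsapi` name that the README mentions.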

 ## Data Flow
