About: A data engineering portfolio project that provides insights into career opportunities in the data industry. Utilizes a comprehensive dataset sourced from LinkedIn.
✅ Complete: This project has been successfully developed and deployed. All planned features have been implemented, and the pipeline is fully operational.
- ETL Pipeline: Cleans, enriches, and transforms raw data into analytics-ready gold-layer tables.
- Skill & Salary Extraction: Extracts skills, experience, and salary information using advanced NLP and heuristics.
- MotherDuck Integration: Loads gold-layer data into MotherDuck for scalable analytics.
- Interactive Dashboard: Visualizes trends, skills, and salary insights.
- Workflow Orchestration: Orchestrates data pipelines using Airflow and Make.
The Data Career Navigator is a data engineering project that provides insights into career opportunities in the data industry. It leverages a comprehensive dataset sourced from LinkedIn, which includes details such as job title, company, location, date posted, and job description. The project applies data engineering techniques to extract and analyze key insights from this dataset, automating data ingestion, enrichment, and gold-layer analytics, with MotherDuck providing scalable analytics and Airflow or Make handling workflow orchestration.
Project Title: Data Career Navigator – Exploring Data-Driven Career Opportunities
Problem Statement: Despite the rapid growth of the data industry, job seekers and professionals often lack clear, data-driven insights into the evolving landscape of data-related careers. Information about required skills, salary benchmarks, experience levels, and industry trends is fragmented across various sources, making it challenging to make informed career decisions. There is a need for a unified platform that aggregates, cleans, and analyzes job market data to provide actionable insights for individuals navigating data careers.
Project Description: An interactive dashboard providing deep insights into career opportunities for data-related roles, utilizing a comprehensive dataset sourced from LinkedIn. Features include analysis of experience levels, salaries, key skills, job locations, and industry trends, aiding job seekers and professionals in exploring and identifying optimal career paths.
Project Structure:
data-career-navigator/
│
├── data/
│ ├── bronze/ # Raw, unaltered ingestion (monthly dumps from Kaggle)
│ │ ├── clean_jobs.csv
│ │ └── clean_jobs_latest.csv
│ ├── silver/ # Cleaned & partially enriched (after ETL)
│ │ └── enriched_jobs.csv
│ └── gold/ # Final aggregated datasets (ready for reporting/dashboard)
│ ├── job_postings.parquet
│ ├── skills.parquet
│ ├── job_skills.parquet
│ ├── companies.parquet
│ ├── country_skill_counts.parquet
│ ├── experience_skill_counts.parquet
│ └── salary_skill_stats.parquet
│
├── notebooks/
│ ├── 01-data-cleaning.ipynb # Ingest from data/bronze, produce silver outputs
│ ├── 02-eda-enriched-jobs.ipynb # Explore silver tables (salary, experience, skills)
│ └── 03-visualizations.ipynb # Prototype charts, maps, etc. using silver/gold
│
├── src/
│ ├── extractors/
│ │ ├── __init__.py
│ │ ├── salary_extractor.py # SalaryExtractor & SalaryETL
│ │ ├── experience_extractor.py # categorize_experience(...)
│ │ ├── skills_extractor.py # extract_skills(job_description)
│ │ └── job_type_extractor.py # extract_job_type(work_type, employment_type)
│ ├── webscrape.py # (If you scrape additional data into bronze/)
│ ├── data_ingestion.py # Ingest clean_jobs.csv from Kaggle on monthly basis
│ ├── data_processing.py # Helper functions: clean location, parse skills, etc.
│ └── etl.py # Reads from data/bronze, writes to data/silver & data/gold
│
├── airflow_dags/
│ └── etl_workflow.py # Airflow DAG for orchestrating ETL workflow
│
├── app/
│ ├── dashboard.py # Streamlit main dashboard page
│ ├── utils.py # Shared helper functions for dashboard
│ └── pages/
│ ├── Status.py # Streamlit "Status" page
│ └── About.py # Streamlit "About" page
│
├── Makefile # Makefile for local workflow orchestration
├── requirements.txt # All Python dependencies (pandas, streamlit, airflow, etc.)
├── README.md # Project description, how to run ETL & app, directory conventions
├── .gitignore # Exclude /venv, __pycache__, data/bronze/*, etc.
└── LICENSE # (Apache 2.0)
Data Dictionary
A dataset of job postings for data-related roles (such as Data Analyst, Data Scientist, Data Engineer) sourced from LinkedIn. Includes key details for each job such as title, company, location, date posted, and job description, providing insights into current opportunities in the data job market.
| Name in Dataset | Variable | Definition |
| --- | --- | --- |
| id (String) | Job ID | Unique identifier for each job posting |
| title (String) | Job Title | The title of the job position as listed on LinkedIn |
| company (String) | Company | Name of the company offering the job |
| location (String) | Location | Location of the job (may include city, state, or country) |
| link (String) | Job Posting URL | Direct URL to the job posting on LinkedIn |
| source (String) | Source | Platform or website where the job was sourced (e.g., LinkedIn) |
| date_posted (Date) | Date Posted | Date when the job was posted (format: YYYY-MM-DD) |
| work_type (String) | Work Type | Specifies if the job is Remote, On-site, or Hybrid |
| employment_type (String) | Employment Type | Nature of employment (e.g., Full-time, Part-time, Contract, Internship) |
| description (String) | Job Description | Full text description of the job, including responsibilities and requirements |
Last updated: 15 May 2025, 09:00
Update frequency: Monthly
Source: LinkedIn Job Postings
This dataset is made available for public use under the Kaggle Datasets Terms of Service, intended for educational and research purposes. Please refer to Kaggle Terms of Service for further information.
ExchangeRate-API Daily Exchange Rates
Data Dictionary
A daily-updated dataset providing foreign exchange rates for major and minor world currencies. Rates are fetched automatically once a day using GitHub Actions, making the dataset suitable for financial analysis, currency conversion, and economic research. In our case, it is used for salary conversion during the extraction process via `salary_extractor.py`.
| Name in Dataset | Variable | Definition |
| --- | --- | --- |
| base_code (String) | Base Currency Code | The 3-letter ISO currency code used as the base for all exchange rates (e.g., 'USD') |
| target_code (String) | Target Currency Code | The 3-letter ISO currency code for the currency being compared to the base (e.g., 'EUR') |
| rate (Float) | Exchange Rate | The conversion rate from the base currency to the target currency |
| date (Date) | Date | The date the exchange rates apply to, in YYYY-MM-DD format |
| time_last_update_utc (String) | Last Update Time (UTC) | The date and time when the rates were last updated, in UTC (e.g., '2025-06-05T00:00:00Z') |
Last updated: 6 Jun 2025, 00:00 UTC
Update frequency: Daily via GitHub Actions, which is more than sufficient for the purposes of this project
- ExchangeRate-API (https://www.exchangerate-api.com/)
- JSON link on GitHub raw file here
Use of this dataset is subject to the ExchangeRate-API Terms of Service.
- Redistribution, storage, or commercial use of this data is not permitted without explicit written consent from ExchangeRate-API.
- Attribution to ExchangeRate-API is required if their data is displayed.
- For more details, refer to the full Terms of Service.
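As a rough illustration of how these daily rates feed the salary conversion step in `salary_extractor.py`, the sketch below normalizes an extracted salary into USD. The `load_rates` helper, the `data/exchange_rates.json` path, and the payload shape are assumptions for illustration, not the actual extractor code.

```python
import json
from pathlib import Path


def load_rates(path: str = "data/exchange_rates.json") -> dict:
    """Load USD-based conversion rates from the daily JSON dump.

    The file location and payload shape are assumptions for this sketch.
    """
    payload = json.loads(Path(path).read_text())
    return payload["conversion_rates"]  # e.g. {"EUR": 0.92, "GBP": 0.79, ...}


def to_usd(amount: float, currency: str, rates: dict) -> float:
    """Convert an amount quoted in `currency` into USD."""
    if currency == "USD":
        return amount
    # Rates map USD -> target currency, so divide to convert back to USD.
    return amount / rates[currency]


if __name__ == "__main__":
    rates = load_rates()
    print(round(to_usd(85_000, "EUR", rates)))  # a EUR salary normalized to USD
```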
The Data Career Navigator project adopts a modern, layered data modeling strategy inspired by medallion architecture (Bronze, Silver, Gold) to ensure data quality, traceability, and analytics-readiness at every stage of the pipeline.
1. Bronze Layer (Raw Ingestion):
- Purpose: Store raw, unaltered data as ingested from external sources (e.g., Kaggle LinkedIn dataset).
- Contents: Files such as `clean_jobs.csv` and `clean_jobs_latest.csv` in `data/bronze/`.
- Characteristics: Immutable, auditable, and used as the single source of truth for all downstream processing.
2. Silver Layer (Cleaned & Enriched):
- Purpose: Hold cleaned, validated, and partially enriched data.
- Contents: `enriched_jobs.csv` in `data/silver/`, which includes standardized fields, deduplicated records, and extracted features (skills, salaries, experience, etc.).
- Transformations:
- Data cleaning (handling missing values, standardizing formats)
- Feature extraction (NLP-based skill and salary extraction, experience categorization)
- Location normalization and currency conversion using exchange rates
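For illustration, a minimal sketch of the keyword-based skill extraction performed in this step; the skill vocabulary and matching rules are simplified stand-ins, not the full `skills_extractor.py` logic.

```python
import re

# Illustrative vocabulary; the real extractor uses a much larger skill list.
SKILLS = ["python", "sql", "spark", "airflow", "aws", "tableau", "docker"]
PATTERNS = {skill: re.compile(rf"\b{re.escape(skill)}\b", re.IGNORECASE) for skill in SKILLS}


def extract_skills(job_description: str) -> list[str]:
    """Return the subset of known skills mentioned in a job description."""
    text = job_description or ""
    return [skill for skill, pattern in PATTERNS.items() if pattern.search(text)]


print(extract_skills("We need strong Python and SQL; Airflow is a plus."))
# ['python', 'sql', 'airflow']
```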
3. Gold Layer (Analytics-Ready):
- Purpose: Provide high-quality, aggregated datasets optimized for analytics, reporting, and dashboarding.
- Contents: Parquet files in `data/gold/` such as:
  - `job_postings.parquet`
  - `skills.parquet`
  - `job_skills.parquet`
  - `companies.parquet`
  - `country_skill_counts.parquet`
  - `experience_skill_counts.parquet`
  - `salary_skill_stats.parquet`
- Transformations:
- Aggregations and joins to create fact and dimension tables
- Calculation of skill frequencies, salary statistics, and experience distributions
- Data is loaded into MotherDuck/DuckDB for scalable analytics
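As a rough sketch of the kind of aggregation behind a gold table such as `country_skill_counts.parquet` (not the actual `etl.py` code), assuming illustrative `country` and `skills` columns in the silver data:

```python
import pandas as pd

# Assumed silver-layer input after enrichment; column names are illustrative.
enriched = pd.read_csv("data/silver/enriched_jobs.csv")

# Explode comma-separated skills into one row per job-skill pair, then count per country.
exploded = (
    enriched.assign(skill=enriched["skills"].str.split(","))
    .explode("skill")
    .assign(skill=lambda df: df["skill"].str.strip().str.lower())
)
country_skill_counts = (
    exploded.groupby(["country", "skill"], as_index=False)
    .size()
    .rename(columns={"size": "job_count"})
)
country_skill_counts.to_parquet("data/gold/country_skill_counts.parquet", index=False)
```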
Entity-Relationship Model:
- The ERD (see diagram below) defines the relationships between core entities:
- Job Postings: Central fact table, linked to companies, skills, and locations
- Skills: Dimension table, linked via a many-to-many relationship with job postings
- Companies: Dimension table, providing company-level analytics
- Aggregated Tables: Precomputed statistics for country, experience, and salary insights
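To make the many-to-many relationship concrete, here is a hedged DuckDB sketch of the core tables and a typical query; the exact columns of the real gold tables may differ.

```python
import duckdb

# In-memory connection for illustration; swap for a MotherDuck connection in production.
con = duckdb.connect()

con.execute("""CREATE TABLE job_postings (
    job_id VARCHAR PRIMARY KEY, title VARCHAR, company_id VARCHAR,
    location VARCHAR, date_posted DATE)""")
con.execute("CREATE TABLE skills (skill_id INTEGER PRIMARY KEY, skill_name VARCHAR)")
# Bridge table resolving the many-to-many relationship between postings and skills.
con.execute("CREATE TABLE job_skills (job_id VARCHAR, skill_id INTEGER)")

# Example query: the ten most frequently requested skills across all postings.
top_skills = con.execute("""
    SELECT s.skill_name, COUNT(*) AS postings
    FROM job_skills js
    JOIN skills s USING (skill_id)
    GROUP BY s.skill_name
    ORDER BY postings DESC
    LIMIT 10
""").df()
print(top_skills)
```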
Best Practices:
- All transformations are performed using reproducible ETL scripts and orchestrated via Airflow or Make.
- Data lineage is maintained from raw ingestion to gold outputs.
- The model is designed for extensibility, allowing new data sources or features to be integrated with minimal disruption.
The Entity-Relationship Diagram (ERD):
The Data Career Navigator leverages a modern, modular data engineering stack to ensure reliability, scalability, and maintainability across the entire analytics workflow.
Data Pipeline:
- Batch ETL: Automated, scheduled batch jobs ingest, clean, enrich, and transform job market data from external sources into analytics-ready datasets.
Orchestration:
- Apache Airflow & Make: Workflow orchestration is managed using Airflow DAGs for production-grade scheduling, monitoring, and dependency management, with Makefiles supporting local development and ad-hoc runs.
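A minimal sketch in the spirit of `airflow_dags/etl_workflow.py`, assuming Airflow 2.x; the task breakdown is illustrative rather than the project's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Monthly schedule mirrors the Kaggle dataset refresh cadence.
with DAG(
    dag_id="data_career_navigator_etl",
    start_date=datetime(2025, 1, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="python src/data_ingestion.py")
    transform = BashOperator(task_id="etl", bash_command="python src/etl.py")

    ingest >> transform
```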
Data Storage & Lakehouse:
- Layered Data Lake:
- Bronze Layer: Raw data ingested from Kaggle and other sources, stored as immutable CSVs.
- Silver Layer: Cleaned and enriched data, with standardized fields and extracted features.
- Gold Layer: Aggregated, analytics-ready Parquet files optimized for reporting and dashboarding.
- MotherDuck/DuckDB: Gold-layer data is loaded into MotherDuck (cloud DuckDB) for scalable, serverless analytics and fast SQL querying.
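A hedged sketch of loading a gold-layer Parquet file into MotherDuck with the `duckdb` client; the database name and token setup are assumptions.

```python
import duckdb

# The "md:" prefix targets MotherDuck; the client reads the MOTHERDUCK_TOKEN
# environment variable for authentication (assumed setup for this sketch).
con = duckdb.connect("md:data_career_navigator")

con.execute("""
    CREATE OR REPLACE TABLE job_postings AS
    SELECT * FROM read_parquet('data/gold/job_postings.parquet')
""")
```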
Data Transformation & Enrichment:
- Python ETL Scripts: All data cleaning, enrichment (NLP-based skill and salary extraction, experience categorization), and transformation logic is implemented in modular Python scripts.
- Pandas, NumPy, Scikit-learn: Used for data wrangling, feature engineering, and statistical analysis.
Analytics & Visualization:
- Streamlit: Interactive dashboards are built with Streamlit, providing real-time exploration of trends, skills, salaries, and company insights.
- Plotly: Advanced visualizations (heatmaps, time series, skill distributions) are rendered using Plotly for rich, interactive analytics.
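A condensed sketch of how a dashboard page can combine Streamlit and Plotly on top of a gold table; the file path and column names (`country`, `skill`, `job_count`) are illustrative, not the actual `app/dashboard.py` code.

```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Most In-Demand Skills")

# Gold-layer table read directly for this sketch; the real app queries MotherDuck/DuckDB.
skills = pd.read_parquet("data/gold/country_skill_counts.parquet")

country = st.selectbox("Country", sorted(skills["country"].unique()))
top = skills[skills["country"] == country].nlargest(15, "job_count")

st.plotly_chart(px.bar(top, x="skill", y="job_count"), use_container_width=True)
```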
Supporting Tools:
- Kaggle API: Automated data downloads and updates from the LinkedIn dataset.
- ExchangeRate-API: Daily currency rates for salary normalization.
- Jupyter Notebooks: Used for exploratory data analysis (EDA), prototyping, and documentation.
Infrastructure & DevOps:
- Reproducibility: All dependencies are managed via requirements.txt, and workflows are version-controlled in Git.
- Scalability: The architecture is designed to scale from local development to cloud-based analytics with minimal changes.
- Documentation: Architecture and ERD diagrams are included for transparency and onboarding.
Architecture Diagram:
- The following diagrams illustrate the end-to-end data flow, from ingestion to dashboarding, and the relationships between core entities and processing layers.
Built with Streamlit, the dashboard enables users to explore real job market data, skills, salaries, and trends for data-related careers worldwide, powered by MotherDuck and DuckDB for fast, scalable analytics.
Key Features:
- Job Market Overview: Instantly see the total number of job postings, with breakdowns by role, company, and posting date.
- Dynamic Filtering: Filter job postings by skill, company, work type (remote, on-site, hybrid), and employment type (full-time, part-time, contract, internship).
- Gold-Layer Table Previews: Select and preview any gold-layer analytics table (e.g., job postings, skills, salary stats) directly in the dashboard.
- Skills & Demand: Visualize the most in-demand skills, their frequency, and how they relate to job titles and companies.
- Salary Insights: Analyze salary distributions, compare compensation across roles, and view salary trends by skill or geography.
- Geography & Companies: Explore hiring trends by country, city, and company, with interactive maps and company leaderboards.
- Responsive UI: Built for usability, with a modern dark theme and intuitive navigation.
Live Dashboard: Access the live dashboard here: data-career-navigator.streamlit.app
Dashboard Preview:
The dashboard empowers job seekers, professionals, and analysts to:
- Identify high-demand skills and roles in the data industry
- Benchmark salaries and experience requirements
- Discover top hiring companies and locations
- Make data-driven career decisions with confidence
All analytics are powered by the gold-layer datasets generated by the ETL pipeline, ensuring up-to-date and reliable insights.
Follow these steps to set up, run, and explore the Data Career Navigator project on your local machine or cloud environment.
- Python 3.10+ (recommended)
- pip (Python package manager)
- Git (for cloning the repository)
- Kaggle account (for dataset access)
- ExchangeRate-API key (for currency conversion)
git clone https://github.com/pizofreude/data-career-navigator.git
cd data-career-navigator
It is recommended to use a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
- Create a Kaggle account and generate an API token from your Kaggle account settings.
- Place the downloaded `kaggle.json` file in `~/.kaggle/` (Linux/Mac) or `%USERPROFILE%\.kaggle\` (Windows).
- Ensure the file has correct permissions (readable only by you).
Run the data ingestion script to fetch the latest LinkedIn job postings dataset:
python src/data_ingestion.py
This will download and update `data/bronze/clean_jobs_latest.csv`.
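For reference, a hedged sketch of what the ingestion step does with the Kaggle API; the dataset slug below is a placeholder, not the actual source identifier.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Placeholder dataset slug: substitute the actual LinkedIn job postings dataset.
api.dataset_download_files(
    "some-owner/linkedin-data-jobs",
    path="data/bronze",
    unzip=True,
)
```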
- Register at exchangerate-api.com to obtain a free API key.
- Add your API key as a GitHub secret or environment variable if you wish to automate daily updates (see `src/extractors/README.md` for GitHub Actions setup).
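A minimal sketch of the daily fetch, assuming an `EXCHANGERATE_API_KEY` environment variable and ExchangeRate-API's documented v6 endpoint; the output path is an assumption.

```python
import json
import os

import requests

API_KEY = os.environ["EXCHANGERATE_API_KEY"]  # assumed environment variable name
URL = f"https://v6.exchangerate-api.com/v6/{API_KEY}/latest/USD"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Persist the USD-based conversion rates for the salary extractor to consume.
with open("data/exchange_rates.json", "w") as fh:
    json.dump(response.json(), fh)
```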
Transform and enrich the data from bronze to gold layers:
python src/etl.py
This will generate cleaned and analytics-ready datasets in `data/silver/` and `data/gold/`.
Start the Streamlit dashboard locally:
streamlit run app/Dashboard.py
- The dashboard will be available at http://localhost:8501
- Explore job market trends, skills, salaries, and more interactively.
- To run the ETL pipeline on a schedule, set up Apache Airflow and use the provided DAG in `airflow_dags/etl_workflow.py`.
- For local development, you can also use the `Makefile` for common tasks.
The project includes a `Makefile` to streamline and automate common ETL and data update tasks. This is especially useful for local development, testing, and running the pipeline end-to-end. Below are the available Make targets and the recommended order of execution:
Fetch the latest code and (optionally) update exchange rate data:
make update-exchange-rate
This will pull the latest changes from the repository. (You may also update exchange rates if configured.)
Download the latest LinkedIn job postings dataset from Kaggle:
make run-data-ingestion
This will run the data ingestion script and update `data/bronze/clean_jobs_latest.csv`.
Some job postings may require manual header scraping due to LinkedIn restrictions:
make scrape-header
This will prompt you to run the Selenium-based script manually. Follow the instructions in the terminal (manual LinkedIn login required).
Transform and enrich the data from bronze to gold layers locally:
make run-etl-local
This will generate cleaned and analytics-ready datasets in `data/silver/` and `data/gold/`.
Upload the gold-layer Parquet files to MotherDuck for scalable analytics:
make run-etl-motherduck
To run all steps (except the manual header scraping) in sequence:
make full
This will execute all the above steps in the correct order, pausing for manual header scraping if needed.
Tip:
- You can always inspect or modify the `Makefile` to customize workflow steps for your environment.
- For production or scheduled runs, consider using Airflow with the provided DAGs.
- Jupyter notebooks in `notebooks/` provide step-by-step EDA, data cleaning, and advanced analytics.
- Launch with: `jupyter lab`
- Visit the hosted dashboard at: data-career-navigator.streamlit.app
For more details:
- See the README.md and code comments for further documentation.
For troubleshooting or contributing, open an issue or pull request on GitHub.
We welcome contributions to the Data Career Navigator project. Here's how you can help:
- Search existing issues before creating a new one
- Open a new issue with a clear title and detailed description
- Include steps to reproduce bugs and expected behavior
- Tag issues appropriately (bug, enhancement, documentation, etc.)
- Fork the repository
- Create a new branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests and linting
- Commit with clear messages (`git commit -m 'Add amazing feature'`)
- Push to your branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guide for Python code
- Add tests for new features
- Update documentation as needed
- Keep commits atomic and well-described
- Join our Discussions
- Ask questions and share ideas
- Provide feedback on features and documentation
Please read our Code of Conduct before contributing.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.