About: A data engineering portfolio project that provides insights into career opportunities in the data industry. Utilizes a comprehensive dataset sourced from LinkedIn.
✅ Complete: This project has been successfully developed and deployed. All planned features have been implemented, and the pipeline is fully operational.
- ETL Pipeline: Cleans, enriches, and transforms raw data into analytics-ready gold-layer tables.
- Skill & Salary Extraction: Extracts skills, experience, and salary information using advanced NLP and heuristics.
- MotherDuck Integration: Loads gold-layer data into MotherDuck for scalable analytics.
- Interactive Dashboard: Visualizes trends, skills, and salary insights.
- Workflow Orchestration: Orchestrates data pipelines using Airflow and Make.
The Data Career Navigator is a data engineering project that provides insights into career opportunities in the data industry. It leverages a comprehensive dataset sourced from LinkedIn, which includes details such as job title, company, location, date posted, and job description. The project applies data engineering techniques to extract and analyze key insights from this dataset, automating data ingestion, enrichment, and gold-layer analytics, with MotherDuck providing scalable analytics and Airflow or Make handling workflow orchestration.
Project Title: Data Career Navigator – Exploring Data-Driven Career Opportunities
Problem Statement: Despite the rapid growth of the data industry, job seekers and professionals often lack clear, data-driven insights into the evolving landscape of data-related careers. Information about required skills, salary benchmarks, experience levels, and industry trends is fragmented across various sources, making it challenging to make informed career decisions. There is a need for a unified platform that aggregates, cleans, and analyzes job market data to provide actionable insights for individuals navigating data careers.
Project Description: An interactive dashboard providing deep insights into career opportunities for data-related roles, utilizing a comprehensive dataset sourced from LinkedIn. Features include analysis of experience levels, salaries, key skills, job locations, and industry trends, aiding job seekers and professionals in exploring and identifying optimal career paths.
Project Structure:
data-career-navigator/
│
├── data/
│ ├── bronze/ # Raw, unaltered ingestion (monthly dumps from Kaggle)
│ │ ├── clean_jobs.csv
│ │ └── clean_jobs_latest.csv
│ ├── silver/ # Cleaned & partially enriched (after ETL)
│ │ └── enriched_jobs.csv
│ └── gold/ # Final aggregated datasets (ready for reporting/dashboard)
│ ├── job_postings.parquet
│ ├── skills.parquet
│ ├── job_skills.parquet
│ ├── companies.parquet
│ ├── country_skill_counts.parquet
│ ├── experience_skill_counts.parquet
│ └── salary_skill_stats.parquet
│
├── notebooks/
│ ├── 01-data-cleaning.ipynb # Ingest from data/bronze, produce silver outputs
│ ├── 02-eda-enriched-jobs.ipynb # Explore silver tables (salary, experience, skills)
│ └── 03-visualizations.ipynb # Prototype charts, maps, etc. using silver/gold
│
├── src/
│ ├── extractors/
│ │ ├── __init__.py
│ │ ├── salary_extractor.py # SalaryExtractor & SalaryETL
│ │ ├── experience_extractor.py # categorize_experience(...)
│ │ ├── skills_extractor.py # extract_skills(job_description)
│ │ └── job_type_extractor.py # extract_job_type(work_type, employment_type)
│ ├── webscrape.py # (If you scrape additional data into bronze/)
│ ├── data_ingestion.py # Ingest clean_jobs.csv from Kaggle on monthly basis
│ ├── data_processing.py # Helper functions: clean location, parse skills, etc.
│ └── etl.py # Reads from data/bronze, writes to data/silver & data/gold
│
├── airflow_dags/
│ └── etl_workflow.py # Airflow DAG for orchestrating ETL workflow
│
├── app/
│ ├── dashboard.py # Streamlit main dashboard page
│ ├── utils.py # Shared helper functions for dashboard
│ └── pages/
│ ├── Status.py # Streamlit "Status" page
│ └── About.py # Streamlit "About" page
│
├── Makefile # Makefile for local workflow orchestration
├── requirements.txt # All Python dependencies (pandas, streamlit, airflow, etc.)
├── README.md # Project description, how to run ETL & app, directory conventions
├── .gitignore # Exclude /venv, __pycache__, data/bronze/*, etc.
└── LICENSE # (Apache 2.0)
Data Dictionary
A dataset of job postings for data-related roles (such as Data Analyst, Data Scientist, Data Engineer) sourced from LinkedIn. Includes key details for each job such as title, company, location, date posted, and job description, providing insights into current opportunities in the data job market.
| Name in Dataset | Variable | Definition |
| --- | --- | --- |
| id (String) | Job ID | Unique identifier for each job posting |
| title (String) | Job Title | The title of the job position as listed on LinkedIn |
| company (String) | Company | Name of the company offering the job |
| location (String) | Location | Location of the job (may include city, state, or country) |
| link (String) | Job Posting URL | Direct URL to the job posting on LinkedIn |
| source (String) | Source | Platform or website where the job was sourced (e.g., LinkedIn) |
| date_posted (Date) | Date Posted | Date when the job was posted (format: YYYY-MM-DD) |
| work_type (String) | Work Type | Specifies if the job is Remote, On-site, or Hybrid |
| employment_type (String) | Employment Type | Nature of employment (e.g., Full-time, Part-time, Contract, Internship) |
| description (String) | Job Description | Full text description of the job, including responsibilities and requirements |
Last updated: 15 May 2025, 09:00
Update frequency: Monthly
Source: LinkedIn Job Postings
This dataset is made available for public use under the Kaggle Datasets Terms of Service, intended for educational and research purposes. Please refer to Kaggle Terms of Service for further information.
ExchangeRate-API Daily Exchange Rates
Data Dictionary
A daily-updated dataset providing foreign exchange rates for major and minor world currencies. Rates are fetched automatically once a day using GitHub Actions, making the dataset suitable for financial analysis, currency conversion, and economic research. In our case, it is used for salary conversion during the extraction process via `salary_extractor.py`.
| Name in Dataset | Variable | Definition |
| --- | --- | --- |
| base_code (String) | Base Currency Code | The 3-letter ISO currency code used as the base for all exchange rates (e.g., 'USD') |
| target_code (String) | Target Currency Code | The 3-letter ISO currency code for the currency being compared to the base (e.g., 'EUR') |
| rate (Float) | Exchange Rate | The conversion rate from the base currency to the target currency |
| date (Date) | Date | The date the exchange rates apply to, in YYYY-MM-DD format |
| time_last_update_utc (String) | Last Update Time (UTC) | The date and time when the rates were last updated, in UTC (e.g., '2025-06-05T00:00:00Z') |
Last updated: 6 Jun 2025, 00:00 UTC
Update frequency: Daily via GitHub Actions, which is more than sufficient for the purposes of this project
- ExchangeRate-API (https://www.exchangerate-api.com/)
- JSON link on GitHub raw file here
Use of this dataset is subject to the ExchangeRate-API Terms of Service.
- Redistribution, storage, or commercial use of this data is not permitted without explicit written consent from ExchangeRate-API.
- Attribution to ExchangeRate-API is required if their data is displayed.
- For more details, refer to the full Terms of Service.
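As a rough illustration of how these daily rates feed the salary conversion step in `salary_extractor.py`, the sketch below normalizes an extracted salary into USD. The `load_rates` helper, the `data/exchange_rates.json` path, and the payload shape are assumptions for illustration, not the actual extractor code.

```python
import json
from pathlib import Path


def load_rates(path: str = "data/exchange_rates.json") -> dict:
    """Load USD-based conversion rates from the daily JSON dump.

    The file location and payload shape are assumptions for this sketch.
    """
    payload = json.loads(Path(path).read_text())
    return payload["conversion_rates"]  # e.g. {"EUR": 0.92, "GBP": 0.79, ...}


def to_usd(amount: float, currency: str, rates: dict) -> float:
    """Convert an amount quoted in `currency` into USD."""
    if currency == "USD":
        return amount
    # Rates map USD -> target currency, so divide to convert back to USD.
    return amount / rates[currency]


if __name__ == "__main__":
    rates = load_rates()
    print(round(to_usd(85_000, "EUR", rates)))  # a EUR salary normalized to USD
```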
The Data Career Navigator project adopts a modern, layered data modeling strategy inspired by medallion architecture (Bronze, Silver, Gold) to ensure data quality, traceability, and analytics-readiness at every stage of the pipeline.
1. Bronze Layer (Raw Ingestion):
- Purpose: Store raw, unaltered data as ingested from external sources (e.g., Kaggle LinkedIn dataset).
- Contents: Files such as `clean_jobs.csv` and `clean_jobs_latest.csv` in `data/bronze/`.
- Characteristics: Immutable, auditable, and used as the single source of truth for all downstream processing.
2. Silver Layer (Cleaned & Enriched):
- Purpose: Hold cleaned, validated, and partially enriched data.
- Contents: `enriched_jobs.csv` in `data/silver/`, which includes standardized fields, deduplicated records, and extracted features (skills, salaries, experience, etc.).
- Transformations:
- Data cleaning (handling missing values, standardizing formats)
- Feature extraction (NLP-based skill and salary extraction, experience categorization)
- Location normalization and currency conversion using exchange rates
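For illustration, a minimal sketch of the keyword-based skill extraction performed in this step; the skill vocabulary and matching rules are simplified stand-ins, not the full `skills_extractor.py` logic.

```python
import re

# Illustrative vocabulary; the real extractor uses a much larger skill list.
SKILLS = ["python", "sql", "spark", "airflow", "aws", "tableau", "docker"]
PATTERNS = {skill: re.compile(rf"\b{re.escape(skill)}\b", re.IGNORECASE) for skill in SKILLS}


def extract_skills(job_description: str) -> list[str]:
    """Return the subset of known skills mentioned in a job description."""
    text = job_description or ""
    return [skill for skill, pattern in PATTERNS.items() if pattern.search(text)]


print(extract_skills("We need strong Python and SQL; Airflow is a plus."))
# ['python', 'sql', 'airflow']
```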
3. Gold Layer (Analytics-Ready):
- Purpose: Provide high-quality, aggregated datasets optimized for analytics, reporting, and dashboarding.
- Contents: Parquet files in `data/gold/` such as:
  - `job_postings.parquet`
  - `skills.parquet`
  - `job_skills.parquet`
  - `companies.parquet`
  - `country_skill_counts.parquet`
  - `experience_skill_counts.parquet`
  - `salary_skill_stats.parquet`
- Transformations:
- Aggregations and joins to create fact and dimension tables
- Calculation of skill frequencies, salary statistics, and experience distributions
- Data is loaded into MotherDuck/DuckDB for scalable analytics
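As a rough sketch of the kind of aggregation behind a gold table such as `country_skill_counts.parquet` (not the actual `etl.py` code), assuming illustrative `country` and `skills` columns in the silver data:

```python
import pandas as pd

# Assumed silver-layer input after enrichment; column names are illustrative.
enriched = pd.read_csv("data/silver/enriched_jobs.csv")

# Explode comma-separated skills into one row per job-skill pair, then count per country.
exploded = (
    enriched.assign(skill=enriched["skills"].str.split(","))
    .explode("skill")
    .assign(skill=lambda df: df["skill"].str.strip().str.lower())
)
country_skill_counts = (
    exploded.groupby(["country", "skill"], as_index=False)
    .size()
    .rename(columns={"size": "job_count"})
)
country_skill_counts.to_parquet("data/gold/country_skill_counts.parquet", index=False)
```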
Entity-Relationship Model:
- The ERD (see diagram below) defines the relationships between core entities:
- Job Postings: Central fact table, linked to companies, skills, and locations
- Skills: Dimension table, linked via a many-to-many relationship with job postings
- Companies: Dimension table, providing company-level analytics
- Aggregated Tables: Precomputed statistics for country, experience, and salary insights
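To make the many-to-many relationship concrete, here is a hedged DuckDB sketch of the core tables and a typical query; the exact columns of the real gold tables may differ.

```python
import duckdb

# In-memory connection for illustration; swap for a MotherDuck connection in production.
con = duckdb.connect()

con.execute("""CREATE TABLE job_postings (
    job_id VARCHAR PRIMARY KEY, title VARCHAR, company_id VARCHAR,
    location VARCHAR, date_posted DATE)""")
con.execute("CREATE TABLE skills (skill_id INTEGER PRIMARY KEY, skill_name VARCHAR)")
# Bridge table resolving the many-to-many relationship between postings and skills.
con.execute("CREATE TABLE job_skills (job_id VARCHAR, skill_id INTEGER)")

# Example query: the ten most frequently requested skills across all postings.
top_skills = con.execute("""
    SELECT s.skill_name, COUNT(*) AS postings
    FROM job_skills js
    JOIN skills s USING (skill_id)
    GROUP BY s.skill_name
    ORDER BY postings DESC
    LIMIT 10
""").df()
print(top_skills)
```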
Best Practices:
- All transformations are performed using reproducible ETL scripts and orchestrated via Airflow or Make.
- Data lineage is maintained from raw ingestion to gold outputs.
- The model is designed for extensibility, allowing new data sources or features to be integrated with minimal disruption.
The Entity-Relationship Diagram (ERD):
The Data Career Navigator leverages a modern, modular data engineering stack to ensure reliability, scalability, and maintainability across the entire analytics workflow.
Data Pipeline:
- Batch ETL: Automated, scheduled batch jobs ingest, clean, enrich, and transform job market data from external sources into analytics-ready datasets.
Orchestration:
- Apache Airflow & Make: Workflow orchestration is managed using Airflow DAGs for production-grade scheduling, monitoring, and dependency management, with Makefiles supporting local development and ad-hoc runs.
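A minimal sketch in the spirit of `airflow_dags/etl_workflow.py`, assuming Airflow 2.x; the task breakdown is illustrative rather than the project's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Monthly schedule mirrors the Kaggle dataset refresh cadence.
with DAG(
    dag_id="data_career_navigator_etl",
    start_date=datetime(2025, 1, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="python src/data_ingestion.py")
    transform = BashOperator(task_id="etl", bash_command="python src/etl.py")

    ingest >> transform
```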
Data Storage & Lakehouse:
- Layered Data Lake:
- Bronze Layer: Raw data ingested from Kaggle and other sources, stored as immutable CSVs.
- Silver Layer: Cleaned and enriched data, with standardized fields and extracted features.
- Gold Layer: Aggregated, analytics-ready Parquet files optimized for reporting and dashboarding.
- MotherDuck/DuckDB: Gold-layer data is loaded into MotherDuck (cloud DuckDB) for scalable, serverless analytics and fast SQL querying.
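A hedged sketch of loading a gold-layer Parquet file into MotherDuck with the `duckdb` client; the database name and token setup are assumptions.

```python
import duckdb

# The "md:" prefix targets MotherDuck; the client reads the MOTHERDUCK_TOKEN
# environment variable for authentication (assumed setup for this sketch).
con = duckdb.connect("md:data_career_navigator")

con.execute("""
    CREATE OR REPLACE TABLE job_postings AS
    SELECT * FROM read_parquet('data/gold/job_postings.parquet')
""")
```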
Data Transformation & Enrichment:
- Python ETL Scripts: All data cleaning, enrichment (NLP-based skill and salary extraction, experience categorization), and transformation logic is implemented in modular Python scripts.
- Pandas, NumPy, Scikit-learn: Used for data wrangling, feature engineering, and statistical analysis.
Analytics & Visualization:
- Streamlit: Interactive dashboards are built with Streamlit, providing real-time exploration of trends, skills, salaries, and company insights.
- Plotly: Advanced visualizations (heatmaps, time series, skill distributions) are rendered using Plotly for rich, interactive analytics.
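A condensed sketch of how a dashboard page can combine Streamlit and Plotly on top of a gold table; the file path and column names (`country`, `skill`, `job_count`) are illustrative, not the actual `app/dashboard.py` code.

```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Most In-Demand Skills")

# Gold-layer table read directly for this sketch; the real app queries MotherDuck/DuckDB.
skills = pd.read_parquet("data/gold/country_skill_counts.parquet")

country = st.selectbox("Country", sorted(skills["country"].unique()))
top = skills[skills["country"] == country].nlargest(15, "job_count")

st.plotly_chart(px.bar(top, x="skill", y="job_count"), use_container_width=True)
```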
Supporting Tools:
- Kaggle API: Automated data downloads and updates from the LinkedIn dataset.
- ExchangeRate-API: Daily currency rates for salary normalization.
- Jupyter Notebooks: Used for exploratory data analysis (EDA), prototyping, and documentation.
Infrastructure & DevOps:
- Reproducibility: All dependencies are managed via requirements.txt, and workflows are version-controlled in Git.
- Scalability: The architecture is designed to scale from local development to cloud-based analytics with minimal changes.
- Documentation: Architecture and ERD diagrams are included for transparency and onboarding.
Architecture Diagram:
- The following diagrams illustrate the end-to-end data flow, from ingestion to dashboarding, and the relationships between core entities and processing layers.
Built with Streamlit, the dashboard enables users to explore real job market data, skills, salaries, and trends for data-related careers worldwide, powered by MotherDuck and DuckDB for fast, scalable analytics.
Key Features:
- Job Market Overview: Instantly see the total number of job postings, with breakdowns by role, company, and posting date.
- Dynamic Filtering: Filter job postings by skill, company, work type (remote, on-site, hybrid), and employment type (full-time, part-time, contract, internship).
- Gold-Layer Table Previews: Select and preview any gold-layer analytics table (e.g., job postings, skills, salary stats) directly in the dashboard.
- Skills & Demand: Visualize the most in-demand skills, their frequency, and how they relate to job titles and companies.
- Salary Insights: Analyze salary distributions, compare compensation across roles, and view salary trends by skill or geography.
- Geography & Companies: Explore hiring trends by country, city, and company, with interactive maps and company leaderboards.
- Responsive UI: Built for usability, with a modern dark theme and intuitive navigation.
Live Dashboard: Access the live dashboard here: data-career-navigator.streamlit.app
Dashboard Preview:
The dashboard empowers job seekers, professionals, and analysts to:
- Identify high-demand skills and roles in the data industry
- Benchmark salaries and experience requirements
- Discover top hiring companies and locations
- Make data-driven career decisions with confidence
All analytics are powered by the gold-layer datasets generated by the ETL pipeline, ensuring up-to-date and reliable insights.
Follow these steps to set up, run, and explore the Data Career Navigator project on your local machine or cloud environment.
- Python 3.10+ (recommended)
- pip (Python package manager)
- Git (for cloning the repository)
- Kaggle account (for dataset access)
- ExchangeRate-API key (for currency conversion)
git clone https://github.com/pizofreude/data-career-navigator.git
cd data-career-navigator
It is recommended to use a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
- Create a Kaggle account and generate an API token from your Kaggle account settings.
- Place the downloaded `kaggle.json` file in `~/.kaggle/` (Linux/Mac) or `%USERPROFILE%\.kaggle\` (Windows).
- Ensure the file has correct permissions (readable only by you).
Run the data ingestion script to fetch the latest LinkedIn job postings dataset:
python src/data_ingestion.py
This will download and update `data/bronze/clean_jobs_latest.csv`.
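For reference, a hedged sketch of what the ingestion step does with the Kaggle API; the dataset slug below is a placeholder, not the actual source identifier.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Placeholder dataset slug: substitute the actual LinkedIn job postings dataset.
api.dataset_download_files(
    "some-owner/linkedin-data-jobs",
    path="data/bronze",
    unzip=True,
)
```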
- Register at exchangerate-api.com to obtain a free API key.
- Add your API key as a GitHub secret or environment variable if you wish to automate daily updates (see `src/extractors/README.md` for GitHub Actions setup).
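A minimal sketch of the daily fetch, assuming an `EXCHANGERATE_API_KEY` environment variable and ExchangeRate-API's documented v6 endpoint; the output path is an assumption.

```python
import json
import os

import requests

API_KEY = os.environ["EXCHANGERATE_API_KEY"]  # assumed environment variable name
URL = f"https://v6.exchangerate-api.com/v6/{API_KEY}/latest/USD"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Persist the USD-based conversion rates for the salary extractor to consume.
with open("data/exchange_rates.json", "w") as fh:
    json.dump(response.json(), fh)
```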
Transform and enrich the data from bronze to gold layers:
python src/etl.py
This will generate cleaned and analytics-ready datasets in `data/silver/` and `data/gold/`.
Start the Streamlit dashboard locally:
streamlit run app/Dashboard.py
- The dashboard will be available at http://localhost:8501
- Explore job market trends, skills, salaries, and more interactively.
- To run the ETL pipeline on a schedule, set up Apache Airflow and use the provided DAG in `airflow_dags/etl_workflow.py`.
- For local development, you can also use the `Makefile` for common tasks.
The project includes a `Makefile` to streamline and automate common ETL and data update tasks. This is especially useful for local development, testing, and running the pipeline end-to-end. Below are the available Make targets and the recommended order of execution:
Fetch the latest code and (optionally) update exchange rate data:
make update-exchange-rate
This will pull the latest changes from the repository. (You may also update exchange rates if configured.)
Download the latest LinkedIn job postings dataset from Kaggle:
make run-data-ingestion
This will run the data ingestion script and update `data/bronze/clean_jobs_latest.csv`.
Some job postings may require manual header scraping due to LinkedIn restrictions:
make scrape-header
This will prompt you to run the Selenium-based script manually. Follow the instructions in the terminal (manual LinkedIn login required).
Transform and enrich the data from bronze to gold layers locally:
make run-etl-local
This will generate cleaned and analytics-ready datasets in `data/silver/` and `data/gold/`.
Upload the gold-layer Parquet files to MotherDuck for scalable analytics:
make run-etl-motherduck
To run all steps (except the manual header scraping) in sequence:
make full
This will execute all the above steps in the correct order, pausing for manual header scraping if needed.
Tip:
- You can always inspect or modify the `Makefile` to customize workflow steps for your environment.
- For production or scheduled runs, consider using Airflow with the provided DAGs.
- Jupyter notebooks in `notebooks/` provide step-by-step EDA, data cleaning, and advanced analytics.
- Launch with: `jupyter lab`
- Visit the hosted dashboard at: data-career-navigator.streamlit.app
For more details:
- See the README.md and code comments for further documentation.
For troubleshooting or contributing, open an issue or pull request on GitHub.
We welcome contributions to the Data Career Navigator project. Here's how you can help:
- Search existing issues before creating a new one
- Open a new issue with a clear title and detailed description
- Include steps to reproduce bugs and expected behavior
- Tag issues appropriately (bug, enhancement, documentation, etc.)
- Fork the repository
- Create a new branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests and linting
- Commit with clear messages (`git commit -m 'Add amazing feature'`)
- Push to your branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guide for Python code
- Add tests for new features
- Update documentation as needed
- Keep commits atomic and well-described
- Join our Discussions
- Ask questions and share ideas
- Provide feedback on features and documentation
Please read our Code of Conduct before contributing.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.