This project was developed as part of the Applications in Data Analytics module during my master's program at TU Dresden, Germany. It demonstrates an end-to-end data science workflow in R — from data cleaning and exploration to forecasting freight rail transport performance using machine learning models.
Data Sources: Public datasets from the German Federal Statistical Office. The cleaned datasets I prepared are available here
- Goal: Explore patterns in Germany's rail freight transport volume over 13 years and forecast rail freight transport performance.
- Tools: R, mlr3 ecosystem, and libraries (`tidyverse`, `ggplot2`, `plotly`, `mlr3`, `caret`, `xgboost`, `iml`, ...). View all project code here
- Data Extraction, Transformation, and Loading (ETL)
- Exploratory Data Analysis (EDA)
- Spatial & Temporal Visualization
- Machine Learning Modeling
- Model Interpretation & Evaluation
- Load and clean different datasets
- Model data features
- Final dataset: 11,726 observations with 11 features (columns)
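The load, clean, and feature-modeling steps above can be sketched roughly as follows. This is a minimal illustration in base R, not the project's actual pipeline: the two small tables, their column names (`region_id`, `goods_type`, `tonnes`, `region_name`), and their values are all hypothetical placeholders. The same steps could be written with `dplyr` verbs from the tidyverse.

```r
# Hypothetical raw extracts; all names and values are illustrative only
freight_raw <- data.frame(
  year       = c(2022, 2022, 2023, 2023, 2023),
  region_id  = c("R1", "R2", "R1", "R2", "R2"),
  goods_type = c("Coal", "Metals", "Coal", "Metals", "Metals"),
  tonnes     = c("1200", "850", NA, "900", "900"),  # stored as text, with a gap
  stringsAsFactors = FALSE
)
regions <- data.frame(
  region_id   = c("R1", "R2"),
  region_name = c("Region A", "Region B"),
  stringsAsFactors = FALSE
)

# Clean: fix column types, drop rows with missing values, remove duplicates
freight_clean <- freight_raw
freight_clean$tonnes <- as.numeric(freight_clean$tonnes)
freight_clean <- freight_clean[!is.na(freight_clean$tonnes), ]
freight_clean <- unique(freight_clean)

# Combine datasets: attach readable region names from the lookup table
freight <- merge(freight_clean, regions, by = "region_id", all.x = TRUE)

# Model features: encode the categorical column as a factor
freight$goods_type <- factor(freight$goods_type)

str(freight)
```

Dropping incomplete rows is only one option; whether to drop or impute depends on how much data is missing and why.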
- Understand summary statistics, variable distributions, and outliers
- Examine correlations and multicollinearity
Visualizes spread and outliers in data features
Highlights strong correlations between features
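As a rough sketch of these EDA checks, the base R snippet below computes summary statistics, counts outliers with the 1.5 × IQR rule, builds a correlation matrix, and derives variance inflation factors (VIF) as a multicollinearity check. The data is simulated and the column names (`distance_km`, `tonnes`, `tkm`) are assumptions made for illustration, not the project's real features.

```r
set.seed(42)

# Simulated numeric features; names and distributions are placeholders
n <- 200
distance_km <- runif(n, 50, 800)
tonnes <- rlnorm(n, meanlog = 7, sdlog = 0.8)
eda <- data.frame(
  distance_km = distance_km,
  tonnes      = tonnes,
  tkm         = distance_km * tonnes
)

# Summary statistics for every column
summary(eda)

# Flag values outside the 1.5 * IQR fences
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
outlier_counts <- colSums(sapply(eda, is_outlier))

# Pairwise Pearson correlations between features
cor_matrix <- cor(eda)

# VIF of a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors
vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}
vif_values <- vif(eda, c("distance_km", "tonnes"))

outlier_counts
round(cor_matrix, 2)
round(vif_values, 2)
```

A VIF close to 1 suggests a predictor is nearly uncorrelated with the others; large values point to multicollinearity.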
- Explore changes in transport patterns over time and space
Shows yearly transport performance of top 10 goods
Visualizes regional transport intensity in 2023 (e.g. Düsseldorf, Braunschweig)
Highlights routes with consistent yearly activity over 13 years
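One way to prepare the data behind a "yearly performance of top goods" chart is sketched below in base R. The toy table, its column names (`year`, `goods_type`, `tkm`), and the choice of the top 3 instead of the top 10 are assumptions made to keep the example small.

```r
# Toy long-format data: one row per year and goods type (values are made up)
transport <- data.frame(
  year       = rep(2021:2023, each = 4),
  goods_type = rep(c("Coal", "Metals", "Chemicals", "Timber"), times = 3),
  tkm        = c(50, 80, 30, 10,
                 45, 85, 35, 12,
                 40, 90, 38, 11)
)

# Rank goods by total transport performance over the whole period
top_n_goods <- 3
totals <- aggregate(tkm ~ goods_type, data = transport, FUN = sum)
top_goods <- head(totals[order(-totals$tkm), "goods_type"], top_n_goods)

# Yearly transport performance for the top goods only
yearly <- aggregate(
  tkm ~ year + goods_type,
  data = transport[transport$goods_type %in% top_goods, ],
  FUN  = sum
)

yearly[order(yearly$goods_type, yearly$year), ]
```

The resulting long table maps directly onto a line chart, for example with `year` on the x-axis and one line per `goods_type`.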
- Frame as a regression problem
- Train and tune different models: Linear Regression, Random Forest, XGBoost
- Perform nested resampling for unbiased validation
- Perform feature selection and hyperparameter tuning to improve predictive performance
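The nested resampling idea can be sketched with `mlr3` and `mlr3tuning` as shown below. This is an illustration under stated assumptions, not the project's configuration: the data is synthetic, the feature names are hypothetical, and a decision tree (`regr.rpart`, which ships with core `mlr3`) is swapped in for Random Forest and XGBoost so that no extra learner packages are needed. The search ranges, fold counts, and evaluation budget are arbitrary, and feature selection is not shown.

```r
library(mlr3)
library(mlr3tuning)
library(paradox)  # provides to_tune()

set.seed(1)

# Synthetic stand-in for the project data; feature names are hypothetical
n <- 300
toy <- data.frame(
  distance_km = runif(n, 50, 800),
  tonnes      = runif(n, 100, 5000)
)
toy$tkm <- toy$distance_km * toy$tonnes + rnorm(n, sd = 1e4)

# Regression task with transport performance as the target
task <- as_task_regr(toy, target = "tkm")

# Learner with two hyperparameters marked for tuning (ranges are arbitrary)
learner <- lrn("regr.rpart",
  cp       = to_tune(1e-4, 1e-1, logscale = TRUE),
  minsplit = to_tune(5, 50)
)

# Inner loop: random search scored by 3-fold CV on each outer training set
at <- auto_tuner(
  tuner      = tnr("random_search"),
  learner    = learner,
  resampling = rsmp("cv", folds = 3),
  measure    = msr("regr.rmse"),
  term_evals = 10
)

# Outer loop: 5-fold CV estimates the performance of the tuned learner
rr <- resample(task, at, rsmp("cv", folds = 5))
scores <- rr$aggregate(msrs(c("regr.rmse", "regr.mae", "regr.rsq")))
scores
```

Because tuning happens only inside each outer training fold, the outer test folds never influence the chosen hyperparameters, which is what makes the estimate less optimistic than tuning and evaluating on the same splits.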
| Model | RMSE (tkm) | MAE (tkm) | MAPE (%) | R² |
|---|---|---|---|---|
| RF (Tuned) | 10.18M | 3.36M | 1.03 | 0.969 |
| XGB (Tuned) | 7.91M | 2.64M | 6.42 | 0.981 |
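For reference, the four metrics in the table can be computed from actual and predicted values as in the base R sketch below. The helper name `regression_metrics` and the four toy values are hypothetical; they only illustrate the standard definitions and are unrelated to the project's results.

```r
# Standard regression metrics from paired actual and predicted values
regression_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  err <- actual - predicted
  c(
    rmse = sqrt(mean(err^2)),                # root mean squared error
    mae  = mean(abs(err)),                   # mean absolute error
    mape = 100 * mean(abs(err / actual)),    # in percent; undefined if any actual is 0
    r2   = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

# Toy values for illustration only
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 380)

metrics <- regression_metrics(actual, predicted)
round(metrics, 3)
```

RMSE and MAE are expressed in the unit of the target (here tkm, tonne-kilometres), while MAPE and R² are unit-free. This is why MAPE can rank models differently from RMSE: it weights errors relative to the size of each observation.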
- Examine the influence of the top features to check that the model's predictions are plausible
- Evaluate model behavior and possible biases
Visual comparison of predicted vs actual transport performance
Partial dependence and individual conditional effects for rail volume
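To show what partial dependence (PDP) and individual conditional expectation (ICE) curves compute, the sketch below builds them by hand in base R. It is a simplified illustration: the data is simulated, the feature names `rail_volume` and `distance_km` are assumptions, and a linear model stands in for the tuned learners. In practice a package such as `iml` can produce the same kind of plot (for example via its `FeatureEffect` class).

```r
set.seed(7)

# Simulated data; feature names and the data-generating process are assumptions
n <- 150
pd_data <- data.frame(
  rail_volume = runif(n, 100, 5000),
  distance_km = runif(n, 50, 800)
)
pd_data$tkm <- pd_data$rail_volume * pd_data$distance_km + rnorm(n, sd = 5e4)

# Stand-in model; any model with a predict() method would work the same way
model <- lm(tkm ~ rail_volume * distance_km, data = pd_data)

# Grid of values for the feature of interest
grid <- seq(min(pd_data$rail_volume), max(pd_data$rail_volume), length.out = 20)

# ICE: set the feature to each grid value for every row and predict.
# Result is a matrix with one row per observation, one column per grid value.
ice <- sapply(grid, function(v) {
  modified <- pd_data
  modified$rail_volume <- v
  predict(model, newdata = modified)
})

# PDP: the average of the ICE curves at each grid value
pdp <- colMeans(ice)

# Grey lines are individual ICE curves, the thick line is the PDP
matplot(grid, t(ice), type = "l", lty = 1, col = "grey80",
        xlab = "rail_volume", ylab = "Predicted tkm")
lines(grid, pdp, lwd = 3)
```

ICE curves that fan out with different slopes, as they do here, indicate an interaction with another feature that the averaged PDP alone would hide.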
This project was conducted by Thu Thuy Nguyen, MSc Transport Economics, TU Dresden, Germany.
If using this project for academic or educational purposes, please cite the report or credit the author. To read the detailed report, click here
This project reflects both my technical skills in R and my curiosity about applying data science to real-world problems. From wrangling messy data to generating insights through modeling and visualization, this was both a learning process and an exploratory one.
I know this isn’t perfect — and that’s the point. It’s a step forward in a longer journey toward mastery. I plan to keep building on this, refining both the code and the analytical depth over time.
If you have suggestions, questions, or ideas, I’d love to hear them. Collaboration and continuous improvement are at the heart of what I do.
Thanks for checking it out!