This repository houses an end-to-end data processing and modeling pipeline for bike sharing data. The project follows the medallion architecture, refining raw bike trip and weather data through bronze, silver, and gold quality tiers. The pipeline covers Extract, Transform, Load (ETL), Exploratory Data Analysis (EDA), modeling and MLOps, and application deployment.
The ETL pipeline processes historical trip and weather data with a Spark streaming job. It ingests newly arriving data incrementally and materializes the bronze, silver, and gold tables, applying Delta table optimizations such as partitioning and Z-ordering.
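A minimal sketch of what the bronze ingest could look like, assuming Structured Streaming over CSV drops and a Delta sink. The paths, schema fields, and partition column below are illustrative assumptions, not the repository's actual configuration:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema -- real trip feeds carry more fields.
trip_schema = StructType([
    StructField("ride_id", StringType()),
    StructField("started_at", TimestampType()),
    StructField("ended_at", TimestampType()),
    StructField("start_station_name", StringType()),
    StructField("end_station_name", StringType()),
])

# Hypothetical mount points -- substitute the paths used in your workspace.
raw_path = "dbfs:/mnt/citibike/raw/trips/"
bronze_path = "dbfs:/mnt/citibike/bronze/trips/"

(spark.readStream
    .format("csv")
    .option("header", "true")
    .schema(trip_schema)
    .load(raw_path)
    .withColumn("ingest_ts", F.current_timestamp())   # track when rows arrived
    .writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/mnt/citibike/checkpoints/bronze_trips/")
    .partitionBy("start_station_name")                # partition column is an assumption
    .start(bronze_path))

# Periodic batch maintenance: co-locate rows on a frequently filtered column.
spark.sql(f"OPTIMIZE delta.`{bronze_path}` ZORDER BY (started_at)")
```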
The three medallion layers:

- Bronze: raw bike and weather data, preserved as ingested so the original records stay intact. This is the first layer of the refinement process.
- Silver: data processed to meet the requirements of training features and runtime inference; an intermediate step toward the gold level.
- Gold: application and monitoring data. This is the highest quality of refined data, ready for deployment and real-time monitoring.
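As a sketch of the bronze-to-silver step (reusing the illustrative paths and columns from the bronze example; the cleaning rules are assumptions, not the project's exact logic):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.format("delta").load("dbfs:/mnt/citibike/bronze/trips/")

silver = (bronze
    .dropDuplicates(["ride_id"])
    .filter(F.col("ended_at") > F.col("started_at"))   # drop malformed trips
    .withColumn("trip_minutes",
                (F.col("ended_at").cast("long") - F.col("started_at").cast("long")) / 60)
    .withColumn("start_hour", F.date_trunc("hour", F.col("started_at"))))

silver.write.format("delta").mode("overwrite").save("dbfs:/mnt/citibike/silver/trips/")
```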
The Spark streaming job script defines the schemas and data transformations applied to the streaming data, keeping data processing standardized and consistent across sources.
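For example, declaring a schema up front keeps the stream's types stable. The weather fields below are illustrative (loosely modeled on OpenWeather-style feeds) rather than the script's actual schema:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

spark = SparkSession.builder.getOrCreate()

weather_schema = StructType([
    StructField("dt", LongType()),           # observation time, unix seconds
    StructField("temp", DoubleType()),
    StructField("humidity", DoubleType()),
    StructField("wind_speed", DoubleType()),
    StructField("main", StringType()),       # e.g. "Rain", "Clear"
])

weather_stream = (spark.readStream
    .format("csv")
    .option("header", "true")
    .schema(weather_schema)
    .load("dbfs:/mnt/citibike/raw/weather/")
    # standardize timestamps and units as part of the streaming transform
    .withColumn("observed_at", F.col("dt").cast("timestamp"))  # unix seconds -> timestamp
    .withColumn("temp_c", F.col("temp") - 273.15))             # assumes Kelvin input
```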
The EDA scripts surface insights from the bike sharing data, covering monthly and daily trip trends, the impact of holidays on system use, and the influence of weather on daily and hourly patterns.
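A minimal example of one such view (monthly trip volume), assuming the illustrative silver table from above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
trips = spark.read.format("delta").load("dbfs:/mnt/citibike/silver/trips/")

# Monthly trip counts -- one of several aggregations the EDA covers.
(trips
    .withColumn("month", F.date_trunc("month", F.col("started_at")))
    .groupBy("month")
    .count()
    .orderBy("month")
    .show())
```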
The modeling pipeline builds a forecasting model that predicts the net change in bikes hour by hour. Trained on historical data, the model aims to provide accurate forecasts of system usage.
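One way to construct that target, sketched at the station level (the per-station grain and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
trips = spark.read.format("delta").load("dbfs:/mnt/citibike/silver/trips/")

# Net hourly change at a station = bikes arriving minus bikes departing.
departures = (trips
    .groupBy(F.date_trunc("hour", F.col("started_at")).alias("hour"),
             F.col("start_station_name").alias("station"))
    .agg(F.count("*").alias("bikes_out")))

arrivals = (trips
    .groupBy(F.date_trunc("hour", F.col("ended_at")).alias("hour"),
             F.col("end_station_name").alias("station"))
    .agg(F.count("*").alias("bikes_in")))

net_change = (arrivals
    .join(departures, ["hour", "station"], "outer")
    .fillna(0, subset=["bikes_in", "bikes_out"])
    .withColumn("net_change", F.col("bikes_in") - F.col("bikes_out")))
```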
Hyperparameter tuning is conducted through MLflow experiments, using Hyperopt with Spark trials to parallelize the search, so the model is fine-tuned for optimal performance.
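A sketch of that tuning loop, assuming Hyperopt's `fmin` with `SparkTrials`; the search space and the `train_and_score` helper are hypothetical stand-ins for the real training code:

```python
import mlflow
from hyperopt import fmin, tpe, hp, SparkTrials

# Hypothetical search space -- the actual tuned parameters depend on the model used.
search_space = {
    "changepoint_prior_scale": hp.loguniform("changepoint_prior_scale", -5, 0),
    "seasonality_prior_scale": hp.loguniform("seasonality_prior_scale", -5, 1),
}

def objective(params):
    # train_and_score() is a hypothetical helper: train with `params` and
    # return a validation error for Hyperopt to minimize.
    return train_and_score(params)

with mlflow.start_run(run_name="hyperopt_tuning"):
    best = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=32,
        trials=SparkTrials(parallelism=4),  # run trials in parallel on the cluster
    )
```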
The model is registered in the Databricks model registry, facilitating seamless transition between staging and production environments.
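With the MLflow client, registering and promoting a model could look like this (the model name and run ID are placeholders):

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<RUN_ID>"  # placeholder: the run that produced the best model
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "citibike-net-change-forecast")

# Promote the new version once it passes validation in Staging.
MlflowClient().transition_model_version_stage(
    name="citibike-net-change-forecast",
    version=mv.version,
    stage="Production",
)
```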
The gold data table is a crucial component, storing both inference and monitoring data. It acts as a centralized repository for high-quality, processed data.
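For instance, inference output might be appended to a gold Delta table with a scoring timestamp, so monitoring can later join predictions against actuals (the table and DataFrame names are assumptions):

```python
from pyspark.sql import functions as F

# `predictions` stands in for the DataFrame returned by the registered model.
(predictions
    .withColumn("scored_at", F.current_timestamp())
    .write.format("delta")
    .mode("append")
    .saveAsTable("citibike_gold.inference"))
```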
Real-time monitoring components within the application provide insights into system performance, enabling timely interventions if needed.
Real-time displays comparing actual vs. predicted values help verify the reliability and effectiveness of the models in both staging and production environments.
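A sketch of how that comparison could be computed from the gold tables (table and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Join predictions to realized values and compute the residual the display plots.
preds = spark.table("citibike_gold.inference")
actuals = spark.table("citibike_gold.actual_net_change")

monitor = (preds
    .join(actuals, ["hour", "station"])
    .withColumn("residual", F.col("net_change") - F.col("predicted_net_change")))

monitor.select("hour", "station", "net_change", "predicted_net_change", "residual").show()
```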
- Clone the repository:

  ```
  git clone https://github.com/shubhamtamhane/CitiBike-Forecasting-And-Analysis.git
  ```

- Configure your Spark environment and install the necessary dependencies as outlined in the documentation.
- Run the ETL pipeline, EDA, and modeling scripts in the specified order.
- Deploy the application components as required for monitoring and real-time display.
This project provides a robust end-to-end solution for processing and modeling bike sharing data. From raw data extraction to real-time monitoring, each component is meticulously designed to ensure data quality, model accuracy, and system reliability. The modular structure allows for easy customization and scalability, making it suitable for a variety of bike sharing scenarios.