A machine learning project that uses multivariate time series to forecast sales, built with Python and scikit-learn, and designed for Corporación Favorita and other retail companies aiming to forecast sales and optimize product inventory.
Corporación Favorita aims to optimize its inventory management by accurately forecasting the demand for various products across its stores in Ecuador. The goal is to ensure that each store has the right quantity of products in stock to meet customer demand while minimizing overstocking and stockouts.
The objective is to build machine learning models that accurately predict unit sales for different product families at Favorita stores. These models will help optimize inventory levels, improve sales forecasting accuracy, and ultimately enhance customer satisfaction by ensuring product availability.
This project follows the CRoss Industry Standard Process for Data Mining (CRISP-DM).
- Jupyter Notebook containing data analysis, visualizations, and interpretation.
- Detailed documentation outlining methodology, data sources, and analysis results.
- Interactive visualizations in Power BI and Dash showcasing sales trends and key insights.
- Anaconda
- Power BI
- Python
- Polars
- Dash
- Plotly
- plotly_resampler
- Jupyter Notebooks
- Git
- SciPy
- scikit-learn
- XGBoost
- LinearRegression
- HistGradientBoostingRegressor
- Prophet
- pyodbc
- re
- typing
- Kaleido
conda env create -f favorita_env.yml
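Once the environment is built, activate it before launching the notebooks, e.g. `conda activate favorita_env` (the environment name comes from the `name:` field inside `favorita_env.yml`; `favorita_env` is assumed here).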
```python
import polars as pl
from typing import List


def create_date_features(df: pl.DataFrame) -> pl.DataFrame:
    # Cast the 'date' column to a proper date type
    df = to_date(df, 'date')
    # Year
    # Extract the year from the date and add it to the df as a new column
    df = df.with_columns(pl.col('date').dt.year().alias('year'))
    # Extract the ordinal day of the year and add it to the df as a new column
    df = df.with_columns(pl.col('date').dt.ordinal_day().alias('day_of_year'))
    # Flag the first day of the year
    df = df.with_columns((pl.col('day_of_year') == 1).alias('is_year_start'))
    # Flag the last day of the year (31 December)
    df = df.with_columns(((pl.col('date').dt.month() == 12) & (pl.col('date').dt.day() == 31)).alias('is_year_end'))
    # Month
    # Extract the month from the date (1-12)
    df = df.with_columns(pl.col('date').dt.month().alias('month'))
    # Flag the first day of the month
    df = df.with_columns((pl.col('date') == pl.col('date').dt.month_start()).alias('is_month_start'))
    # Flag the last day of the month
    df = df.with_columns((pl.col('date') == pl.col('date').dt.month_end()).alias('is_month_end'))
    # Extract the month name
    df = df.with_columns(pl.col('date').dt.to_string('%B').alias('month_name'))
    # Day
    # Extract the day of the month
    df = df.with_columns(pl.col('date').dt.day().alias('day'))
    # Extract the day name
    df = df.with_columns(pl.col('date').dt.to_string('%A').alias('day_name'))
    # Quarter
    # Extract the quarter (1-4)
    df = df.with_columns(pl.col('date').dt.quarter().alias('quarter'))
    # Flag the first day of the quarter (first day of the quarter's first month)
    df = df.with_columns((((((pl.col('quarter') - 1) * 3) + 1) == pl.col('month')) & (pl.col('is_month_start'))).alias('is_quarter_start'))
    # Flag the last day of the quarter (last day of the quarter's last month)
    df = df.with_columns((((pl.col('quarter') * 3) == pl.col('month')) & (pl.col('is_month_end'))).alias('is_quarter_end'))
    # Week
    # Extract the ISO week number (1-53)
    df = df.with_columns(pl.col('date').dt.week().alias('week'))
    # Extract the weekday (1 = Monday, 7 = Sunday)
    df = df.with_columns(pl.col('date').dt.weekday().alias('week_day'))
    # Flag weekends (Saturday and Sunday)
    df = df.with_columns(pl.col('week_day').is_in([6, 7]).alias('is_week_end'))
    return df
```
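A minimal usage sketch for the function above on a toy frame. The repo's `to_date` helper is defined elsewhere; the stand-in below, which simply casts a string column to `pl.Date`, is an assumption for illustration only:

```python
import polars as pl

# Hypothetical stand-in for the repo's to_date helper: cast a string column to pl.Date
def to_date(df: pl.DataFrame, col: str) -> pl.DataFrame:
    return df.with_columns(pl.col(col).str.to_date())

sample = pl.DataFrame({
    'date': ['2016-12-30', '2016-12-31', '2017-01-01'],
    'sales': [100.0, 180.0, 0.0],
})

features = create_date_features(sample)
print(features.select(['date', 'year', 'quarter', 'is_year_start', 'is_year_end', 'is_week_end']))
```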
```python
def create_lag_features(df: pl.DataFrame, lag_days: List[int] = [1, 6, 7], rows_per_day: int = 1782, lag_factor: int = 91) -> pl.DataFrame:
    # Initialize a list to hold the lag columns
    lag_columns = []
    # Create each lag feature; fill nulls with zero since the earlier values are not known
    # (or are probably zero if the first date is the inception of the series)
    for i in lag_days:
        lag_col = df['sales'].shift(i * rows_per_day * lag_factor).alias(f'lag_{i}').fill_null(strategy='zero')
        lag_columns.append(lag_col)
    # Combine the original df with the lag columns. Drop rows with null values in the lag
    # columns in case any were not filled (probably redundant since they were filled)
    lagged_df = df.with_columns(lag_columns).drop_nulls([col.name for col in lag_columns])
    return lagged_df
```
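A tiny illustration of the lag mechanics, using made-up numbers: with `rows_per_day=1` and `lag_factor=1`, `lag_1` is simply the previous row's sales, and the first row is null-filled with zero.

```python
import polars as pl

toy = pl.DataFrame({'sales': [10.0, 20.0, 30.0, 40.0]})
print(create_lag_features(toy, lag_days=[1], rows_per_day=1, lag_factor=1))
# lag_1 comes out as [0.0, 10.0, 20.0, 30.0]
```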
```python
def create_all_features(df):
    # Handle array inputs coming from a scikit-learn pipeline by rebuilding the DataFrame
    if not isinstance(df, pl.DataFrame):
        schema = ['id', 'store_nbr', 'family', 'sales', 'onpromotion', 'city', 'state', 'store_type', 'cluster',
                  'promotion', 'holiday', 'year', 'day_of_year', 'is_year_start', 'is_year_end', 'month', 'is_month_start',
                  'is_month_end', 'month_name', 'day', 'day_name', 'quarter', 'is_quarter_start', 'is_quarter_end', 'week',
                  'week_day', 'is_week_end', 'lag_1', 'lag_6', 'lag_7', 'oil_price']
        df = pl.DataFrame(data=df, schema=schema)
    # Apply the feature-engineering steps in sequence, then drop columns that are no longer needed
    return column_dropper(
        create_oil_feature(
            create_lag_features(
                create_date_features(
                    create_holiday_feature(
                        create_promotion_feature(
                            create_store_features(fix_col_types(df))
                        )
                    )
                )
            )
        )
    )
```
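Because `create_all_features` rebuilds a `pl.DataFrame` when handed a plain array, it can be wrapped with scikit-learn's `FunctionTransformer` and used as a pipeline step. The snippet below is a sketch under that assumption, not the project's exact pipeline; `raw_df` stands in for the merged training frame:

```python
from sklearn.preprocessing import FunctionTransformer

# Hypothetical wrapper around the feature-engineering chain above
feature_engineering = FunctionTransformer(create_all_features)

# FunctionTransformer does not validate or convert its input by default,
# so a Polars DataFrame (here the assumed raw_df) is passed straight to create_all_features
engineered = feature_engineering.fit_transform(raw_df)
```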
- Fork the repository and clone it to your local machine.
- Explore the Jupyter Notebooks and documentation.
- Implement enhancements, fix bugs, or propose new features.
- Submit a pull request with your changes, ensuring clear descriptions and documentation.
- Participate in discussions, provide feedback, and collaborate with the community.
Feedback, suggestions, and contributions are welcome! Feel free to open an issue for bug reports, feature requests, or general inquiries. For additional support or questions, you can connect with me on LinkedIn.
Link to article on Medium: From Meeting Product Demands to transforming Inventory Management: Forecasting Sales in Ecuador's Leading Retailer, Corporación Favorita.
🕺🏻Gabriel Okundaye
- GitHub: GitHub Profile
- LinkedIn: LinkedIn Profile
If you like this project kindly show some love, give it a 🌟 STAR 🌟. Thank you!
This project is MIT licensed.