Skip to content

As a data scientist at Corporation Favorita, I'm tasked with ensuring that we always maintain the right quantity of products in stock. To achieve this, I've chosen to develop a series of machine learning models to forecast the demand for products across different locations. I adhered to the CRISP-DM Framework for Data Science projects.

License

Notifications You must be signed in to change notification settings

D0nG4667/favorita_products_demand_forecasting

Repository files navigation

Favorita Product Demand Forecasting

A machine learning project with multivariate time series to forecast sales, built with Python and scikit-learn, designed for corporation favorita and retail companies aiming to forecast sales and optimize product inventory.

Python Version Data Analysis Data Visualization Hypothesis Testing Machine Learning Medium Article

Overview

Corporation Favorita aims to optimize its inventory management by accurately forecasting the demand for various products across its stores in Ecuador. The goal is to ensure that each store has the right quantity of products in stock to meet customer demand while minimizing overstocking or stockouts.

Key Objectives

The objective is to build machine learning models that can predict unit sales for different product families at Favorita stores accurately. These models will help optimize inventory levels, improve sales forecasting accuracy, and ultimately enhance customer satisfaction by ensuring product availability.

Framework

The CRoss Industry Standard Process for Data Mining (CRISP-DM).

Features

  • Jupyter Notebook containing data analysis, visualizations, and interpretation.
  • Detailed documentation outlining methodology, data sources, and analysis results.
  • Interactive visualizations in Power BI and Dash showcasing sales trends and key insights.

Dash app

PowerBI Dashboard

Dashboard

Technologies Used

  • Anaconda
  • PowerBI
  • Python
  • Polars
  • Dash
  • Plotly
  • plotly_resampler
  • Jupyter Notebooks
  • Git
  • Scipy
  • Sklearn
  • Xgboost
  • LinearRegression
  • HistGradientBoostingRegressor
  • Prophet
  • Pyodbc
  • Re
  • Typing
  • Kaleido

Installation

Recommended install

conda env create -f favorita_env.yml

Code Snippet- used to create date features

def create_date_features(df: pl.DataFrame)-> pl.DataFrame:
    # To date
    df = to_date(df, 'date')
    
    # Year
    # Extracting the year from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('date').dt.year().alias('year'))
    
    # Extracting the ordinal day from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('date').dt.ordinal_day().alias('day_of_year'))
    
    # Extracting the year start bool from the date and adding it to the df as a new column
    df = df.with_columns((pl.col('day_of_year')==1).alias('is_year_start'))
    
    # Extracting the year end bool from the date and adding it to the df as a new column
    df = df.with_columns((pl.col('day_of_year')==31).alias('is_year_end'))
    
    # Month
    # Extracting the month from the date and adding it to the df as a new column (1-12)
    df = df.with_columns(pl.col('date').dt.month().alias('month'))
    
    # Extracting the month start bool from the date and adding it to the df as a new column
    df = df.with_columns((pl.col('date')==pl.col('date').dt.month_start()).alias('is_month_start'))
    
    # Extracting the month end bool from the date and adding it to the df as a new column
    df = df.with_columns((pl.col('date')==pl.col('date').dt.month_end()).alias('is_month_end'))
    
    # Extracting the month name from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('date').dt.to_string('%B').alias('month_name')) 
    
    # Day
    # Extracting the day from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('date').dt.day().alias('day'))           
    
    # Extracting the day name from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('date').dt.to_string('%A').alias('day_name'))
    
    # Quarter
    # Extracting the quarter from the date and adding it to the df as a new column (1-4)
    df = df.with_columns(pl.col('date').dt.quarter().alias('quarter'))
    
    # Extracting the quarter start bool from the date and adding it to the df as a new column
    df = df.with_columns((((((pl.col('quarter')-1)*3)+1)==pl.col('month')) & (pl.col('is_month_start'))).alias('is_quarter_start'))
    
    # Extracting the quarter end bool from the date and adding it to the df as a new column
    df = df.with_columns((((pl.col('quarter')*3)==pl.col('month')) & (pl.col('is_month_end'))).alias('is_quarter_end'))
    
    # Week
    # Extracting the week from the date and adding it to the df as a new column (1-52)
    df = df.with_columns(pl.col('date').dt.week().alias('week'))
    
    # Extracting the week day from the date and adding it to the df as a new column (1-7)
    df = df.with_columns(pl.col('date').dt.weekday().alias('week_day'))
    
    # Extracting the weekend bool from the date and adding it to the df as a new column
    df = df.with_columns(pl.col('week_day').is_in([6,7]).alias('is_week_end'))         
    
    return df

(back to top)

Code Snippet- used to create lag features

def create_lag_features(df: pl.DataFrame, lag_days: List[int] = [1, 6, 7], rows_per_day: int = 1782, lag_factor: int = 91) -> pl.DataFrame:
    # Initialize a list to hold the lag columns
    lag_columns = []

    # Iterate to create each lag feature, fill nulls with zero since they are not known or are probably zero if the first date is the inception
    for i in lag_days:
        lag_col = df['sales'].shift(i*rows_per_day*lag_factor).alias(f'lag_{i}').fill_null(strategy='zero')
        lag_columns.append(lag_col)

    # Combine the original df with the lag columns. Drop rows with null values in lag columns if they were not filled, probably redundant since they were filled
    lagged_df = df.with_columns(lag_columns).drop_nulls([col.name for col in lag_columns])

    return lagged_df

(back to top)

Code Snippet- used to create all features

def create_all_features(df):
    if not isinstance(df, pl.DataFrame): # Handle arrays inputs from pipeline
        schema = ['id', 'store_nbr', 'family', 'sales', 'onpromotion', 'city', 'state', 'store_type', 'cluster', 
              'promotion', 'holiday', 'year', 'day_of_year', 'is_year_start', 'is_year_end', 'month', 'is_month_start', 
              'is_month_end', 'month_name', 'day', 'day_name', 'quarter', 'is_quarter_start', 'is_quarter_end', 'week', 
              'week_day', 'is_week_end', 'lag_1', 'lag_6', 'lag_7', 'oil_price']
    
        df = pl.DataFrame(data=df, schema=schema)
        
    return column_dropper(
        create_oil_feature(
            create_lag_features(
                create_date_features(
                    create_holiday_feature(
                        create_promotion_feature(
                            create_store_features(fix_col_types(df))
                        )
                    )
                )
            )
        )
    )
    

(back to top)

Contributions

How to Contribute

  1. Fork the repository and clone it to your local machine.
  2. Explore the Jupyter Notebooks and documentation.
  3. Implement enhancements, fix bugs, or propose new features.
  4. Submit a pull request with your changes, ensuring clear descriptions and documentation.
  5. Participate in discussions, provide feedback, and collaborate with the community.

(back to top)

Feedback and Support

Feedback, suggestions, and contributions are welcome! Feel free to open an issue for bug reports, feature requests, or general inquiries. For additional support or questions, you can connect with me on LinkedIn.

Link to article on Medium: From Meeting Product Demands to transforming Inventory Management: Forecasting Sales in Ecuador's Leading Retailer, Corporación Favorita.

(back to top)

👥 Authors

🕺🏻Gabriel Okundaye

(back to top)

⭐️ Show your support

If you like this project kindly show some love, give it a 🌟 STAR 🌟. Thank you!

(back to top)

📝 License

This project is MIT licensed.

(back to top)

About

As a data scientist at Corporation Favorita, I'm tasked with ensuring that we always maintain the right quantity of products in stock. To achieve this, I've chosen to develop a series of machine learning models to forecast the demand for products across different locations. I adhered to the CRISP-DM Framework for Data Science projects.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published