intr3pid64/MLFoodPredictor


This project contains four main Python scripts that work together to clean survey data, train a machine learning ensemble, validate the model, and make predictions.

The dataset used in this project was collected through a food-themed survey and is stored in CSV format. Each row represents a user's response to a series of questions about a food item, including how complex they believe it is to prepare, how many ingredients it might contain, the setting in which it is typically served (such as weekday lunch or late-night snack), and the estimated cost. Respondents were also asked to name a movie and a drink they associate with the food, identify who it reminds them of (e.g., parents, friends), and indicate how much hot sauce they would add. The cleaned version of this dataset, stored as cleaned_data_advanced.csv, includes parsed numeric values and one-hot encoded features derived from these questions, along with a final label indicating the food type (e.g., Pizza, Shawarma, or Sushi) for supervised learning.
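To make the cleaned layout concrete, here is a minimal sketch of what a few rows of a file like cleaned_data_advanced.csv might look like. The column names and values are hypothetical illustrations; the real file contains many more parsed numeric and one-hot columns.

```python
import csv
import io

# Hypothetical miniature of the cleaned CSV: parsed numerics,
# one-hot setting/drink columns, and the supervised label.
sample = """complexity,ingredients,cost,setting_weekday_lunch,setting_late_night,drink_coke,label
3,8,7.5,1,0,1,Pizza
2,5,12.0,0,1,0,Shawarma
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["label"])  # Pizza
```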

The clean_data.py script is responsible for preprocessing and feature engineering. It converts raw survey responses, such as perceived food complexity, estimated cost, associated movies or drinks, and social context, into structured numeric features. This involves parsing free-text answers, applying logic to standardize drink and movie names, and encoding multi-select fields using one-hot encoding. The result is a clean, model-ready dataset saved as cleaned_data_advanced.csv.
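The parsing and encoding steps described above can be sketched with two hypothetical helpers: one that pulls a numeric cost out of a free-text answer (averaging when a range like "$5-10" is given), and one that one-hot encodes a categorical answer. These are illustrative stand-ins, not the actual functions in clean_data.py.

```python
import re

def parse_cost(text):
    """Extract a numeric cost from a free-text survey answer.

    Takes the mean when a range like "$5-10" is given, otherwise the
    single number found, or None if no number appears at all.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(text))]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

def one_hot(value, categories):
    """Encode a single categorical answer as a one-hot list over known categories."""
    return [1 if value == c else 0 for c in categories]

print(parse_cost("$5-10"))                               # 7.5
print(one_hot("Pizza", ["Pizza", "Shawarma", "Sushi"]))  # [1, 0, 0]
```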

The train_ensemble.py script takes the cleaned dataset and splits it into training and validation sets. It applies preprocessing steps including one-hot encoding, label encoding, and median imputation. It then trains three classifiers: Logistic Regression, Random Forest, and Gradient Boosting. Hyperparameters for each model are tuned using GridSearchCV. The best model parameters, along with feature metadata and imputation statistics, are saved into model_params.json for use during inference.
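The tuning-and-persist workflow can be sketched as follows. The synthetic data, the two small parameter grids, and the model names in the JSON are illustrative assumptions; the real script tunes all three classifiers over larger grids and also records feature metadata and imputation statistics.

```python
import json
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned survey features (three food classes)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = rng.integers(0, 3, size=120)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Small illustrative grids; the real script explores more hyperparameters
searches = {
    "logistic_regression": GridSearchCV(
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}, cv=3
    ),
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=0), {"n_estimators": [10, 50]}, cv=3
    ),
}

best_params = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    best_params[name] = search.best_params_

# Persist the winning hyperparameters for the inference script to load
with open("model_params.json", "w") as f:
    json.dump(best_params, f)
```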

The validate.py script is used to evaluate the performance of the trained models. It loads the same cleaned dataset and reconstructs the best-performing classifiers using fixed parameters. These models are combined using a soft-voting VotingClassifier, and the ensemble is evaluated using 5-fold cross-validation. This provides a robust estimate of model accuracy and consistency across data splits.
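The evaluation step above amounts to wrapping the three classifiers in a soft-voting ensemble and scoring it with 5-fold cross-validation. This is a minimal sketch on synthetic data; the fixed hyperparameters used here are placeholders, not the tuned values from model_params.json.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=25, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities across models
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(scores.mean())
```

Soft voting averages each model's predicted class probabilities rather than counting hard votes, which lets a confident model outweigh two uncertain ones.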

The pred.py script performs inference on new test data. It loads model_params.json and manually implements prediction logic for each of the three models without using scikit-learn at runtime. This includes decision tree traversal for Random Forest and Gradient Boosting, and softmax calculation for Logistic Regression. The predictions from all models are combined using soft voting to produce final output labels. This script can be executed from the command line with python pred.py path_to_cleaned_test_csv.csv.
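The two numeric pieces of that manual inference, softmax over logistic-regression scores and soft voting across models, can be sketched in pure Python with no scikit-learn dependency. The example logits and tree-derived probability vectors below are made-up inputs for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(prob_lists):
    """Average per-class probabilities from several models; return the argmax class index."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / n_models for i in range(n_classes)]
    return avg.index(max(avg))

labels = ["Pizza", "Shawarma", "Sushi"]
lr_probs = softmax([2.0, 0.5, 0.1])  # e.g. raw scores from logistic-regression weights
rf_probs = [0.6, 0.3, 0.1]           # e.g. averaged leaf distributions from tree traversal
gb_probs = [0.5, 0.4, 0.1]
print(labels[soft_vote([lr_probs, rf_probs, gb_probs])])  # Pizza
```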

About

An ML ensemble that takes students' survey responses about a food item and predicts which food it is.
