Dirty Comments, Clean Plates

Dirty comments and clean plates: using restaurant reviews to predict failed health inspections and detect fake reviews

Our task: use a corpus of text-based reviews to train a model that classifies whether a restaurant is likely to fail a health inspection and predicts whether a review was written by a human or generated by GPT-3.5 or GPT-4.
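
As a rough illustration of this framing, the sketch below turns review text into bag-of-words counts and fits a tiny logistic regression with plain stochastic gradient descent. It is not the code in models/: the toy reviews, the label convention (1 = failed inspection), and every function name are made up for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def build_vocab(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Return a bag-of-words count vector; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def score(x, w, b):
    """Predicted probability of the positive class (sigmoid of w.x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with per-example gradient steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(xi, w, b) - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Toy data, invented for this sketch (1 = failed inspection, 0 = passed).
reviews = [
    "Saw a roach near the kitchen, dirty tables",
    "Food made me sick, filthy bathroom",
    "Lovely clean dining room and friendly staff",
    "Fresh food, spotless kitchen, great service",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(reviews)
X = [vectorize(r, vocab) for r in reviews]
w, b = train(X, labels)
predictions = [int(score(x, w, b) >= 0.5) for x in X]
print(predictions)
```

The same shape (text in, binary label out) applies to the fake-review task; only the labels change.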

Our project has had the following phases:

attempting to use Chicago inspections and Yelp review data
pivoting to a Philadelphia-based approach when our scraping efforts were throttled
creating NN and transformer-based models
building a fake reviews dataset
applying our binary classification models to the labeled fake-reviews dataset
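
The last two phases amount to pairing human-written and generated reviews with labels and holding some out for evaluation. A minimal sketch of that bookkeeping follows, assuming a label convention of 0 = human and 1 = generated; the example texts and helper names are hypothetical and do not come from this repository.

```python
import random

HUMAN, GENERATED = 0, 1  # assumed label convention, not taken from the repo


def build_labeled_dataset(human_reviews, generated_reviews, seed=0):
    """Attach labels to both sources and shuffle them together."""
    rows = [(text, HUMAN) for text in human_reviews]
    rows += [(text, GENERATED) for text in generated_reviews]
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    return rows


def split(rows, test_fraction=0.25):
    """Hold out the first `test_fraction` of the shuffled rows for testing."""
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Placeholder texts, invented for this sketch.
human = ["great tacos", "slow service", "cozy spot", "would return"]
generated = ["a delightful culinary journey", "an unforgettable experience",
             "truly a hidden gem", "impeccable ambiance throughout"]

dataset = build_labeled_dataset(human, generated)
train_rows, test_rows = split(dataset)
print(len(train_rows), len(test_rows))
```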

This repository has the following directories, which include code for how we constructed our project:

collect/: contains two sub-directories that chronicle our data collection efforts for both the Chicago-based approach and the Philadelphia-based approach.
- chicago/datatypes.py: ReviewDataset class for reading in scraped Yelp data (Jack, 25 lines)
- chicago/get_inspections.py: Gets the inspection and restaurant JSON files, then cleans and merges them (Claire, 90 lines)
- chicago/get_restaurants_by_point.py: Pulls top 1000 restaurants given a list of lat/long coordinates (Claire, 85 lines)
- chicago/identify_points.py: Given Chicago geographic boundaries, identifies n random points (Claire, 59 lines)
- chicago/yelp_scrape.py: Scrapes Yelp reviews given a business id (Jack, 127 lines)
- philadelphia/merge.py: merges inspection and review datasets for Philadelphia (Jack & Claire, 145 lines)
- yelp/yelp_cleaning.py: subsets all Yelp reviews to only those by verified users (Raul, 80 lines)
data/: houses outputs from both data collection processes (no code here, just intermediate outputs)
eda/: contains a few notebooks that explore the merged Philadelphia dataset
- eda.py: overview of merged Philadelphia data (Benja, 150 lines)
- graphs_plots.py: plots for eda (Raul, 100 lines)
- yelp_data_eda.py: counts of the Yelp data, to get an idea of what is available (Claire, 30 lines)
models/: contains a few key custom classes/functions that build our modeling pipeline. The rest of the files are notebooks that use these custom packages to run our models and report results.
- dataloaders.py: custom dataset classes and vectorizers for text processing (Claire & Jack, 215 lines)
- features.py: data cleaning to make the feature variables into numeric arrays (Jack, 200 lines)
- shared_models.py: 5 developed models (Claire & Jack, 88 lines)
- helpers.py: helpers for training models and displaying results, adapted from class functions (Claire, 97 lines)
- run_models.ipynb: implements all Logistic/SVM models (Claire, 100 lines)
- all other ipynb notebooks: implements BERT and RNN models (Jack, ~100 lines per script)
chat_gpt/: contains our pipeline to read in "fake" reviews from ChatGPT 3.5 and 4.0.
- generate_reviews.py: Uses the OpenAI API to generate fake reviews (Benja, 200 lines)
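
To make the point-sampling step listed for chicago/identify_points.py more concrete, here is one way to draw n random points inside a boundary: sample uniformly in the bounding box and keep only points that pass a ray-casting point-in-polygon test. This is a sketch under stated assumptions, not the script's actual method; the polygon below is a made-up quadrilateral with coordinates roughly in the Chicago area, not the real city boundary, and the function names are hypothetical.

```python
import random


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count polygon edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def sample_points(polygon, n, seed=0):
    """Rejection-sample n (lon, lat) points that fall inside the polygon."""
    rng = random.Random(seed)
    lons = [p[0] for p in polygon]
    lats = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        lon = rng.uniform(min(lons), max(lons))
        lat = rng.uniform(min(lats), max(lats))
        if point_in_polygon(lon, lat, polygon):
            points.append((lon, lat))
    return points


# Illustrative (lon, lat) polygon only; NOT the actual Chicago boundary.
BOUNDARY = [(-87.94, 41.64), (-87.52, 41.64), (-87.66, 42.02), (-87.94, 42.02)]

points = sample_points(BOUNDARY, n=5, seed=42)
for lon, lat in points:
    print(f"{lat:.4f}, {lon:.4f}")
```

Rejection sampling keeps the points uniform over the polygon's area, at the cost of discarding draws that land outside it.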
