TheCocktailEDA

This project focuses on exploratory data analysis (EDA) and clustering of cocktail recipes, based on a dataset from TheCocktailDB. See CHANGELOG.md for the exact changes.

1. Introduction

This project performs EDA and clustering on a dataset of cocktails and their ingredients. The dataset is available in the data/ folder.

Project Structure

project-root/
├── configs/                             # Configuration files for analysis and preprocessing
│   ├── analysis_configs/                # Analysis-specific configuration files
│   │   ├── general_analysis_config.yaml
│   │   ├── ingredient_analysis_config.yaml
│   │   └── tag_analysis_config.yaml
│   ├── preprocessing_configs/           # Preprocessing-specific configuration files
│   │   ├── data_simplification_config.yaml
│   │   ├── tagging_config.yaml
│   │   └── global_configs.yaml
├── data/                                # Data directory
│   ├── processed/                       # Processed dataset
│   │   ├── processed_cocktail_dataset.json
│   │   └── clustered_cocktail_dataset.json
│   └── raw/                             # Raw dataset (add any initial datasets here)
├── notebooks/                           # Jupyter notebooks for data exploration and visualization
├── outputs/                             # Directory for any output files or results
├── src/                                 # Source code
│   ├── clustering/  
│   │   └── clustering.py
│   ├── analysis/                        # Analysis-related scripts
│   │   ├── general_analysis.py
│   │   ├── ingredients_analysis.py
│   │   └── tag_analysis.py
│   ├── preprocessing_scripts/           # Preprocessing scripts
│   │   ├── one_hot_encode_tags.py
│   │   ├── simplify_data.py
│   │   └── tagging_script.py
├── .gitignore                           # Git ignore file
├── CHANGELOG.md                         # Project changelog
├── environment.yaml                     # Conda environment setup file
├── pyproject.toml                       # Project configuration file
├── README.md                            # Project README with documentation
└── requirements.txt                     # Python dependencies

2. Installation

Create Conda Environment

Clone this repository:

git clone https://github.com/Marcelele-0/TheCocktailEDA
cd TheCocktailEDA

Create and activate the Conda environment:

conda env create -f environment.yaml
conda activate cocktail-clustering

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Make sure the global config is set to use the processed data.

To run an analysis, enable the functions you are interested in within the corresponding config file, then run the matching script:

  python src/analysis/general_analysis.py
  python src/analysis/ingredients_analysis.py
  python src/analysis/tag_analysis.py

To preprocess the data, run the scripts below. The configs are set up correctly by default; if you changed the global config during analysis, make sure it is set to processed data:

  python src/preprocessing_scripts/simplify_data.py
  python src/preprocessing_scripts/tagging_script.py
  python src/preprocessing_scripts/one_hot_encode_tags.py

Then cluster the data. The clustering config is not set up yet, so the script reports everything:
```
  python src/clustering/clustering.py 
```

3. Dataset

The dataset is stored in JSON format in the data/ folder.
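
As a minimal sketch of reading one of these files, assuming it holds a top-level JSON array of cocktail objects (the file layout is an assumption, and `load_cocktails` is a hypothetical helper, not part of the repository):

```python
import json
from pathlib import Path


def load_cocktails(path):
    """Load a JSON file and return its cocktail records as a list of dicts.

    Assumes the file holds a top-level JSON array; raises ValueError otherwise.
    """
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}, got {type(data).__name__}")
    return data
```

With the layout from the project structure above, a call could look like `load_cocktails("data/processed/processed_cocktail_dataset.json")`.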

4. EDA Conclusions

- Data schema

Cocktail

id (integer): Unique identifier of the cocktail
Example: 11000, 11001
name (string): Name of the cocktail
Example: "Mojito", "Old Fashioned"
tags (list): updated in tagging_script.py
Example: "tags": ["IBA", "Classic", "Alcoholic", "Expensive", "Savory"]
instructions (string, nullable): Instructions on how to prepare the cocktail
Example: "Muddle mint leaves with sugar and lime juice..."
alcoholic (boolean): Indicates whether the cocktail contains alcohol
Example: true, false
category (enum, nullable): Category of the cocktail
Example: "Cocktail", "Ordinary Drink", "Shot"
glass (enum, nullable): Type of glass used for serving
Example: "Highball glass", "Old-fashioned glass"
imageUrl (string, nullable): URL of the cocktail image; deleted in simplify_data.py
Example: https://cocktails.solvro.pl/images/cocktails/mojito.png
createdAt (string, nullable): Record creation date; deleted in simplify_data.py
Example: "2024-08-19 18:39:58"
updatedAt (string, nullable): Record update date; deleted in simplify_data.py
Example: "2024-08-21 10:12:58"
one_hot_tags (list): Presence of each tag in binary format; created in one_hot_encode_tags.py
Example: "oneHotTags": [1, 0, 1, ..., 0, 1]

Ingredients

id (integer): Unique identifier of the ingredient
Example: 10, 11
name (string): Name of the ingredient
Example: "Red wine", "Grapefruit juice"
description (string, nullable): Description of the ingredient
Example: "Red wine is a type of wine made from dark-colored grape varieties..."
alcohol (boolean, nullable): Indicates whether the ingredient contains alcohol
Example: true, false
type (enum, nullable): Type of ingredient
Example: "Vodka", "Gin", "Juice"
percentage (number, nullable): Alcohol percentage of the ingredient
Example: 40, null
imageUrl (string, nullable): URL of the ingredient image
Example: https://cocktails.solvro.pl/images/ingredients/rose.png
createdAt (string, nullable): Record creation date for the ingredient
Example: "2024-08-19 18:39:58"
updatedAt (string, nullable): Record update date for the ingredient
Example: "2024-08-21 10:12:58"
measure (string): Measurement or quantity of the ingredient used
Example: "1/2 oz"
Unique ingredients in the dataset:

['Soda water', 'Light Rum', 'Lime', 'Mint', 'Sugar', 'Water', 'Angostura Bitters', 'Bourbon', 'lemon', 'Vodka', 'Gin', 'Tequila', 'Coca-Cola', 'Sweet Vermouth', 'Campari', 'Powdered Sugar', 'Blended Whiskey', 'Cherry', 'Dry Vermouth', 'Olive', 'Lime Juice', 'Salt', 'Triple Sec', 'Ice', 'Maraschino Cherry', 'Orange Peel', 'Ginger Ale', 'Apricot Brandy', 'Lemon Juice', 'Amaretto', 'Sloe Gin', 'Southern Comfort', 'Lemon Peel', 'Orange Bitters', 'Yellow Chartreuse', 'Creme De Cacao', 'Light Cream', 'Nutmeg', 'Brandy', 'Lemon vodka', 'Pineapple Juice', 'Blackberry Brandy', 'Kummel', 'Dark Rum', 'Egg White', 'Kahlua', 'Club Soda', 'White Creme de Menthe', 'Tea', 'Whipped Cream', 'Apple Brandy', 'Applejack', 'Orange', 'Benedictine', 'Wine', 'Champagne', 'Green Creme de Menthe', 'Grand Marnier', 'Bitters', 'Scotch', 'Banana', 'Carbonated Water', 'Coffee Liqueur', 'Celery Salt', 'Tabasco Sauce', 'Tomato Juice', 'Worcestershire Sauce', 'Blue Curacao', 'Lemonade', 'Anejo Rum', 'Orange Juice', 'Tia Maria', 'Maraschino Liqueur', 'Grenadine', 'Egg', 'Cachaca', 'Egg Yolk', 'Cognac', 'Cherry Brandy', 'Port', 'Chocolate Ice-cream', 'Dubonnet Rouge', 'Sugar Syrup', 'Pineapple', 'Tonic Water', 'Orange spiral', 'Strawberries', 'Heavy cream', 'Galliano', 'Irish Whiskey', 'Peach brandy', 'Sweet and Sour', 'Green Chartreuse', 'Drambuie', 'Orgeat Syrup', 'Grapefruit Juice', 'Red Wine', 'Raspberry syrup', 'Sherry', 'Coffee Brandy', 'Lime vodka', 'Lemon-lime soda']
Unique tags in the dataset:

['IBA' 'ContemporaryClassic' 'Alcoholic' 'USA' 'Asia' 'Vegan' 'Citrus' 'Brunch' 'Hangover' 'Mild' 'Classic' 'Expensive' 'Savory' 'Strong' 'StrongFlavor' 'Vegetarian' 'Sour' 'Christmas' 'Beach' 'DinnerParty' 'Summer' 'Chilli' 'Dairy' 'Nutty' 'Cold' 'Fruity' 'Breakfast' 'NewEra']
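
The one_hot_tags field can be pictured as one 0/1 position per tag in a fixed vocabulary. The sketch below is illustrative only: `one_hot_encode` is a hypothetical function, the vocabulary is a shortened subset of the tags above, and the ordering actually used by one_hot_encode_tags.py is not shown in this README.

```python
def one_hot_encode(tags, vocabulary):
    """Return a 0/1 list with one position per vocabulary tag."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]


# Shortened, hypothetical vocabulary; the real one would be the full tag list.
vocabulary = ["IBA", "Classic", "Alcoholic", "Expensive", "Savory", "Sour"]

cocktail = {
    "name": "Example",
    "tags": ["IBA", "Classic", "Alcoholic", "Expensive", "Savory"],
}
cocktail["one_hot_tags"] = one_hot_encode(cocktail["tags"], vocabulary)
print(cocktail["one_hot_tags"])  # [1, 1, 1, 1, 1, 0]
```

Keeping the vocabulary order fixed across all cocktails is what makes the vectors comparable during clustering.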

Approach to Tagging and Ingredients

In our cocktail analysis project, we tag cocktails based on their ingredients using a structured and dynamic approach. This system organizes the data and makes the relationships between cocktails and their components easier to analyze. The cocktails are then clustered based on these tags.

Tagging System

Our tagging framework utilizes a set of predefined tags that categorize cocktails based on their ingredient composition. Each tag has specific criteria that must be met, allowing for a flexible and dynamic assignment of tags. The key components of our tagging system include:
- Tag Definitions: Tags are defined in a YAML configuration file (tagging_config.yaml). Each tag has associated ingredients and a threshold that determines how many of those ingredients must be present in a cocktail for the tag to be assigned. This makes it easy to modify or add tagging rules as our understanding of cocktails evolves.
- Ingredient Categorization: Ingredients are categorized into various groups, such as strong, new era, classic, and regional ingredients. This classification helps in understanding the characteristics of cocktails and their flavor profiles.
- Dynamic Assignment: The tagging mechanism dynamically assigns tags based on the ingredients present in each cocktail. This means that as we expand our ingredient database or modify our tagging criteria, the tagging process remains adaptable and robust.
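
The threshold rule described above can be sketched as follows. This is an illustration under assumptions, not the code in tagging_script.py: the rule keys (`ingredients`, `threshold`), the example rules, and the `assign_tags` function are all hypothetical, and the rules are written as a Python dict rather than loaded from tagging_config.yaml.

```python
# Hypothetical tag rules: each tag lists ingredients and the minimum number
# of them a cocktail must contain. The key names are assumptions.
TAG_RULES = {
    "Citrus": {
        "ingredients": ["Lime", "Lemon", "Lime Juice", "Lemon Juice", "Orange"],
        "threshold": 1,
    },
    "Dairy": {
        "ingredients": ["Light Cream", "Heavy cream", "Whipped Cream"],
        "threshold": 1,
    },
    "Strong": {
        "ingredients": ["Vodka", "Gin", "Tequila", "Bourbon", "Scotch"],
        "threshold": 2,
    },
}


def assign_tags(ingredient_names, rules):
    """Return the sorted tags whose ingredient threshold is met."""
    # Compare case-insensitively, since the dataset mixes cases (e.g. 'lemon').
    present = {name.lower() for name in ingredient_names}
    tags = []
    for tag, rule in rules.items():
        matches = sum(1 for ing in rule["ingredients"] if ing.lower() in present)
        if matches >= rule["threshold"]:
            tags.append(tag)
    return sorted(tags)


print(assign_tags(["Light Rum", "Lime", "Mint", "Sugar", "Soda water"], TAG_RULES))
# ['Citrus']
print(assign_tags(["Vodka", "Gin", "Lemon Juice"], TAG_RULES))
# ['Citrus', 'Strong']
```

Because the rules live in data rather than code, changing a threshold or adding an ingredient only requires editing the rule set.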

Silhouette Score Results

At the beginning of the project, the Silhouette Score was approximately 0.18, indicating relatively low clustering quality. Consequently, several iterations of modifications were made to the tags and data to enhance the results.
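
For context, the Silhouette Score ranges from -1 to 1: values near 1 mean points sit close to their own cluster and far from the others, while values near 0 mean clusters overlap. The plain-Python sketch below shows the computation. It assumes Euclidean distance, which this README does not confirm for the project, and the sample points are made up. scikit-learn offers the same metric as `sklearn.metrics.silhouette_score`.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: its coefficient is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up data: two well-separated groups of two points each.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the real grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # splits each group apart
print(f"good labels: {good:.2f}, bad labels: {bad:.2f}")
```

The well-matched labelling scores close to 1 and the mismatched one scores below 0, which is the kind of contrast the score is meant to expose.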

Actions Taken

Disabling Certain Tags: Disabling the assignment of specific tags allowed for clearer grouping. The following main tags were turned off:
- Classic
- Contemporary Classic
- New Era

Deleting Tags Completely: Some tags were found to be overly homogeneous or too specific, leading to their removal. For example:
- 'Chilli': too low presence
- 'Alcoholic': too high presence

Applying Weights: MinMaxScaler was used to apply weights.

Result: The Silhouette Score increased to above 0.3, indicating a significant improvement in clustering quality and better differentiation between groups.

These changes were implemented to achieve more coherent and interpretable clusters while enhancing the readability of the results and the accuracy of the cocktail grouping.
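
The scaling-and-weighting step mentioned under Actions Taken can be sketched in plain Python. This README does not say which features were scaled or how the weights were chosen, so the feature rows, the weights, and both function names below are hypothetical. The rescaling mirrors per-column min-max scaling to the range [0, 1].

```python
def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(value - lo) / span if span else 0.0
         for value, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def apply_weights(rows, weights):
    """Multiply each column by its weight so it counts more or less."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]


# Hypothetical feature rows: two tag flags plus an ingredient count.
features = [[1, 0, 3], [0, 1, 1], [1, 1, 5]]
scaled = min_max_scale(features)
print(scaled)    # [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

# Hypothetical weights: halve the second column, double the third.
weighted = apply_weights(scaled, [1.0, 0.5, 2.0])
print(weighted)  # [[1.0, 0.0, 1.0], [0.0, 0.5, 0.0], [1.0, 0.5, 2.0]]
```

Scaling first puts every column on the same range, so the weights alone decide how much each feature influences the distances used in clustering.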
