This project predicts wine quality from the chemical properties recorded in two datasets, one for red wine and one for white. It also addresses the class imbalance in the quality labels.
The main goal is to predict wine quality from chemical properties such as acidity, sugar content, and alcohol level.
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations.
- Scikit-learn: For machine learning models and metrics.
- Imbalanced-learn: For handling class imbalance using SMOTE.
- Seaborn & Matplotlib: For data visualization.
Data Loading:
- Load the datasets for red and white wine.
    import pandas as pd

    white_wine = pd.read_csv('winequality-white.csv', sep=';')
    red_wine = pd.read_csv('winequality-red.csv', sep=';')
Data Preparation:
- Add a feature indicating the type of wine (red or white).
- Merge the two datasets and shuffle the observations.
- Create a quality label based on the quality score.
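The preparation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `wine_type` column name, the helper `prepare_wines`, and the label thresholds (quality 5 and below is `low`, 6 to 7 is `medium`, 8 and above is `high`) are assumptions. Only the `quality` column and the `quality_label` name come from the dataset and the steps in this README.

```python
import pandas as pd


def prepare_wines(red_wine: pd.DataFrame, white_wine: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Tag each table with its wine type, merge, shuffle, and add a quality label."""
    red = red_wine.copy()
    white = white_wine.copy()
    red["wine_type"] = "red"      # hypothetical column name
    white["wine_type"] = "white"

    # Stack the two tables, then shuffle the rows reproducibly.
    wines = pd.concat([red, white], ignore_index=True)
    wines = wines.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Assumed thresholds: (0, 5] -> low, (5, 7] -> medium, (7, 10] -> high.
    wines["quality_label"] = pd.cut(
        wines["quality"],
        bins=[0, 5, 7, 10],
        labels=["low", "medium", "high"],
    )
    return wines
```

With the frames loaded earlier, `wines = prepare_wines(red_wine, white_wine)` would produce the merged table used in the following steps.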
Data Exploration:
- Visualize the distribution of wine quality labels to identify class imbalance.
    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.countplot(x=wines['quality_label'])
    plt.show()
Data Splitting:
- Split the data into training and test sets.
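One way to perform the split is shown below. It is a sketch under assumptions: the helper `split_wines`, the 80/20 ratio, the `wine_type` column, and the use of stratification are illustrative choices rather than details stated in this README. Stratifying on the label keeps the class proportions similar in both sets, which matters when some classes are rare.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_wines(wines: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Return X_train, X_test, y_train, y_test with class proportions preserved."""
    # Drop the raw score and the label so neither leaks into the features.
    X = wines.drop(columns=["quality", "quality_label"])
    # Encode the (assumed) wine_type column numerically.
    X = pd.get_dummies(X, columns=["wine_type"], drop_first=True)
    y = wines["quality_label"]
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
```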
Data Scaling:
- Scale the features using `StandardScaler`.
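A minimal sketch of the scaling step, assuming the split has already produced `X_train` and `X_test`. The helper name `scale_features` is hypothetical. The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_features(X_train, X_test):
    """Standardize features using the mean and variance of the training set."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train
    X_test_scaled = scaler.transform(X_test)        # apply the same transform
    return X_train_scaled, X_test_scaled, scaler
```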
Model Training:
- Train a Logistic Regression model on the imbalanced dataset.
    from sklearn.linear_model import LogisticRegression

    lg = LogisticRegression()
    lg.fit(X_train, y_train)
Handling Class Imbalance:
- Apply SMOTE to balance the classes in the training set.
    from imblearn.over_sampling import SMOTE

    smote = SMOTE()
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
Model Evaluation:
- Predict on the test set and evaluate the model using confusion matrix and classification report.
    from sklearn.metrics import confusion_matrix, classification_report

    # Refit on the SMOTE-resampled training data so the evaluation reflects the balanced model
    lg.fit(X_train_smote, y_train_smote)
    y_pred_smote = lg.predict(X_test)

    results = confusion_matrix(y_test, y_pred_smote)
    print("Confusion Matrix:\n", results)
    print("Classification Report:\n", classification_report(y_test, y_pred_smote))
- The confusion matrix and classification report will provide insights into the model's performance, particularly in predicting the minority class.
This project demonstrates the importance of addressing class imbalance in predictive modeling. Oversampling the training set with SMOTE can improve the model's ability to predict underrepresented quality classes.
- Experiment with other machine learning algorithms to further improve prediction accuracy.
- Implement hyperparameter tuning for better model performance.
- Explore additional feature engineering techniques to enhance the dataset.
- Wine Quality Dataset from the UCI Machine Learning Repository.
- Various libraries and frameworks that facilitate data science and machine learning tasks.
This project is licensed under the MIT License.