This project clusters product categories and generates insights using unsupervised learning, semantic embeddings, and a large language model. The goal is to support an affiliate marketing strategy by grouping similar products, analyzing customer sentiment, and producing actionable summaries for recommendations.
The project performs three main tasks:
- Sentiment Analysis of product reviews using VADER to classify reviews as positive, neutral, or negative.
- Product Category Clustering with K-Means to group products into 4 meaningful categories.
- Summarization of customer feedback and cluster-based recommendations using Cohere's large language model (LLM).
- Sentiment Analysis:
  - Used VADER to classify customer reviews into positive, neutral, or negative sentiments.
- Clustering:
  - Applied K-Means on embeddings generated by Google’s Universal Sentence Encoder (USE) to cluster product categories into 4 groups:
    - Tablets
    - Smart Home Devices
    - E-Readers
    - Others
- Summarization:
  - Leveraged Cohere's LLM to generate summaries and actionable recommendations for each product category.
- Libraries:
  - pandas: For data manipulation.
  - scikit-learn: For K-Means clustering.
  - TensorFlow & TensorFlow Hub: For Universal Sentence Encoder embeddings.
  - Matplotlib: For data visualization.
  - Cohere API: For LLM-based summarization.
  - VADER: For sentiment analysis.
- Pretrained Models:
  - Google’s Universal Sentence Encoder (USE): To generate semantic embeddings for clustering.
  - Cohere's LLM: For generating summaries.
- Source: Kaggle
- Description:
  - Approximately 34,000 product entries with columns like:
    - `name`: Product names.
    - `categories`: Product categories.
    - `reviews.text`: Customer reviews.
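
As a minimal sketch of how the dataset might be loaded, the snippet below reads a CSV with pandas and keeps only the three columns listed above. The function names and the `csv_path` argument are hypothetical, and the actual file name on Kaggle is not stated here, so treat this as an illustration rather than the project's loading code.

```python
# Columns named in the dataset description above.
REQUIRED_COLUMNS = ["name", "categories", "reviews.text"]


def missing_columns(header):
    """Return the required columns that are absent from a list of column names."""
    present = set(header)
    return [col for col in REQUIRED_COLUMNS if col not in present]


def load_reviews(csv_path):
    """Load the dataset and keep only the columns used in this project.

    `csv_path` is a placeholder for wherever the Kaggle CSV was saved.
    """
    import pandas as pd  # imported lazily so missing_columns() works without pandas

    df = pd.read_csv(csv_path)
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"Dataset is missing expected columns: {missing}")
    # Drop rows with no review text, since they cannot be scored for sentiment.
    return df[REQUIRED_COLUMNS].dropna(subset=["reviews.text"])
```

Checking the header first gives a clear error if a different export of the dataset uses other column names.
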
- Preprocessed the `reviews.text` column by:
  - Cleaning the text (removing special characters, stopwords).
  - Feeding the cleaned text into VADER.
- Output: A new column `sentiment` indicating positive, neutral, or negative reviews.
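
The sentiment step above can be sketched roughly as follows. This assumes the `vaderSentiment` package (the project may instead use NLTK's VADER wrapper), a tiny illustrative stopword list, and the commonly used compound-score cutoffs of ±0.05; none of these specifics are confirmed by the project, and the function names are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption; the project may use NLTK's or another list).
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}


def clean_text(text):
    """Lowercase the text, replace special characters with spaces, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def add_sentiment_column(df, text_column="reviews.text"):
    """Return a copy of a pandas DataFrame with a new `sentiment` column."""
    # Lazy import: requires the `vaderSentiment` package to be installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    cleaned = df[text_column].fillna("").map(clean_text)
    compound = cleaned.map(lambda t: analyzer.polarity_scores(t)["compound"])
    return df.assign(sentiment=compound.map(label_from_compound))
```

One design point worth checking: VADER treats capitalization and punctuation such as exclamation marks as intensity cues, so aggressive cleaning can shift scores compared with scoring the raw review text.
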
- Preprocessed `name` and `categories` by combining them into a single column (`name_and_category`).
- Generated semantic embeddings using Universal Sentence Encoder (USE).
- Applied K-Means clustering to group products into 4 clusters:
  - Tablets
  - Smart Home Devices
  - E-Readers
  - Others
- Validated the clusters using PCA visualization and word frequency analysis.
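
The clustering steps above might look roughly like the sketch below. The TensorFlow Hub URL (USE version 4), the random seed, `n_init=10`, and all function names are assumptions made for illustration; only the overall flow (combine text, embed, K-Means with 4 clusters, PCA and word counts for validation) comes from the description.

```python
from collections import Counter

# Assumed model version; the project only says "Universal Sentence Encoder".
USE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"


def combine_name_and_category(name, categories):
    """Join a product name and its categories into one string, skipping missing values."""
    parts = [str(p).strip() for p in (name, categories) if p is not None and str(p).strip()]
    return " ".join(parts)


def embed_texts(texts):
    """Embed strings with the Universal Sentence Encoder (one vector per string)."""
    import tensorflow_hub as hub  # lazy import; also requires tensorflow

    model = hub.load(USE_URL)
    return model(list(texts)).numpy()


def cluster_embeddings(embeddings, n_clusters=4, seed=42):
    """Run K-Means and return one integer cluster label per row."""
    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(embeddings)


def project_2d(embeddings, seed=42):
    """Reduce embeddings to 2 dimensions with PCA for a scatter plot."""
    from sklearn.decomposition import PCA

    return PCA(n_components=2, random_state=seed).fit_transform(embeddings)


def top_words_per_cluster(texts, labels, n=5):
    """List the n most frequent words in each cluster as a quick sanity check."""
    counters = {}
    for text, label in zip(texts, labels):
        counters.setdefault(label, Counter()).update(text.lower().split())
    return {label: [w for w, _ in c.most_common(n)] for label, c in counters.items()}
```

K-Means returns arbitrary integer labels, so names such as "Tablets" or "E-Readers" have to be assigned afterwards by inspecting each cluster, for example with the word counts from `top_words_per_cluster`.
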
- Used Cohere’s LLM to generate summaries for each cluster by feeding:
  - Cluster-specific product names, reviews, and categories.
- Output: Summarized recommendations and marketing content highlighting key products for each category.
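
A rough sketch of the summarization step follows. The prompt wording, the `max_reviews` cap, the `command-r` model name, and the function names are all assumptions; the Cohere SDK has changed between versions, so the `Client.chat` call shown here may need adjusting for the installed release.

```python
def build_cluster_prompt(cluster_name, product_names, reviews, max_reviews=20):
    """Assemble a summarization prompt from one cluster's products and reviews."""
    # Deduplicate product names so repeated entries do not crowd the prompt.
    products = "\n".join(f"- {name}" for name in sorted(set(product_names)))
    # Cap the number of reviews to keep the prompt at a manageable length.
    sample = "\n".join(f"- {review}" for review in reviews[:max_reviews])
    return (
        f"Product category: {cluster_name}\n\n"
        f"Products:\n{products}\n\n"
        f"Customer reviews:\n{sample}\n\n"
        "Summarize the main strengths and complaints, then recommend "
        "the key products to highlight in marketing content."
    )


def summarize_cluster(prompt, api_key, model="command-r"):
    """Send the prompt to Cohere and return the generated text.

    The model name is an assumption; use whichever model the account supports.
    """
    import cohere  # lazy import; requires the `cohere` package and a valid API key

    client = cohere.Client(api_key=api_key)
    response = client.chat(message=prompt, model=model)
    return response.text
```

Keeping prompt construction separate from the API call makes it possible to inspect or adjust the prompt for each cluster without spending API credits.
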