This project tackles the challenge of understanding customer feedback from diverse sources, including Amazon, YouTube, and Reddit, by aggregating, clustering, and analyzing product reviews across platforms.
Beyond review aggregation and topic modeling, the system implements a QA bot that lets users ask product-specific questions and receive answers grounded in actual user reviews, using a Retrieval-Augmented Generation (RAG) architecture.
The result is an end-to-end product insight system designed to help:
- Consumers make better purchase decisions
- Product managers detect sentiment patterns
- Analysts discover trending product themes
Product reviews are scattered across platforms and filled with noise, redundancy, and unstructured sentiment. While individual reviews are helpful, they lack aggregated insight or structured Q&A capabilities.
Challenges addressed:
- Inconsistent review formats across platforms (text, video, Reddit posts)
- Difficulty finding thematic consensus across thousands of reviews
- Inability to ask contextual questions like:
  "What do users say about battery life?" or "Is the camera good in low light?"
This project addresses those gaps using:
- Multi-source scraping
- Topic modeling and clustering
- QA via semantic search + LLM-generated answers
- YouTube transcripts scraped via `youtube_transcript_api`
- Amazon reviews (pre-collected dataset)
- Reddit product threads using `PRAW`
- Regex cleaning and normalization
- Sentence filtering
- Deduplication and source labeling
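The preprocessing steps above (regex cleaning, sentence filtering, deduplication, source labeling) could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual code: the `Sentence` record, the `clean` and `preprocess` helpers, the specific regexes, and the `min_words` threshold are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


# Hypothetical record type; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Sentence:
    text: str
    source: str  # e.g. "amazon", "youtube", "reddit"


URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")
SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary


def clean(text):
    """Strip URLs and HTML tags, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(raw_docs, min_words=4):
    """Turn (text, source) pairs into unique, source-labeled sentences."""
    seen = set()
    out = []
    for text, source in raw_docs:
        for sent in SPLIT_RE.split(clean(text)):
            if len(sent.split()) < min_words:
                continue  # sentence filtering: drop very short fragments
            # Deduplicate on a case- and punctuation-insensitive key.
            key = re.sub(r"[^a-z0-9 ]", "", sent.lower())
            if key in seen:
                continue
            seen.add(key)
            out.append(Sentence(sent, source))
    return out


# Made-up sample input, not real review data.
raw = [
    ("Battery life is great!   See https://example.com for details. Love it.", "amazon"),
    ("battery life is GREAT. <b>Camera</b> struggles badly in low light.", "reddit"),
]
sentences = preprocess(raw)
```

Here the second "battery life is great" sentence is dropped as a duplicate, and the short fragments fall below the assumed word threshold. A real pipeline would likely use a proper sentence tokenizer and possibly near-duplicate detection instead of exact key matching.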
- FAISS vector store for document retrieval
- LangChain/OpenAI for contextual QA generation
- Top-k similarity-based passage ranking
Result: Ask "What do people dislike about this product?" and get a grounded, review-based answer.
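To make the retrieval step concrete, here is a minimal sketch of top-k similarity ranking followed by prompt assembly. It deliberately swaps in a toy word-count embedding and a brute-force cosine search so that it runs with the standard library alone; the project itself uses a FAISS vector store and LangChain/OpenAI, whose calls are not shown here. The function names (`embed`, `cosine`, `top_k`, `build_prompt`), the prompt wording, and the sample passages are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Return up to k passages most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(p["text"])), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question, retrieved):
    """Assemble an LLM prompt that restricts the answer to retrieved excerpts."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in retrieved)
    return (
        "Answer the question using only the review excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up passages for illustration only.
passages = [
    {"source": "amazon", "text": "The battery easily lasts a full day."},
    {"source": "reddit", "text": "Battery drain got worse after a year."},
    {"source": "youtube", "text": "The camera struggles in low light."},
]
question = "What do users say about battery life?"
hits = top_k(question, passages, k=2)
prompt = build_prompt(question, hits)
```

The resulting `prompt` string is what would be sent to the language model. Keeping the source label next to each excerpt lets the generated answer attribute claims to Amazon, Reddit, or YouTube, as in the example answers in this README.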
Query: "What do users say about battery life of iPhone 15?"
Answer:
- Most users report that the battery lasts about 6 to 8 hours of heavy use.
- Several Reddit posts highlight fast battery drain after 1 year.
- YouTube reviewers suggest turning off background sync to save power.
Query: "Is the camera good in low light?"
Answer:
- Amazon reviews frequently mention poor detail in night shots.
- Reddit users recommend using manual mode for better results.
- A YouTube reviewer compares it unfavorably with a mid-tier DSLR.
- Multi-Source Scraping: Review data from Reddit, YouTube, and Amazon
- Smart Preprocessing: Text cleaning, deduplication, and sentence-level segmentation
- Topic Modeling with BERTopic: Identify pain points and praise patterns
- QA Bot Powered by RAG: Ask questions and get real, evidence-based answers
- Modular Design: Each step can be used independently or chained
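One way to read "used independently or chained" is that each stage is a plain function from a list of records to a list of records. The sketch below illustrates that idea only; the `chain` helper and the two stand-in steps are hypothetical and are not taken from the project's code.

```python
from functools import reduce


def chain(*steps):
    """Compose steps left to right; each step takes and returns a list of records."""
    return lambda records: reduce(lambda acc, step: step(acc), steps, records)


# Hypothetical stand-in steps; real stages would be scraping, cleaning, and so on.
def normalize(records):
    """Collapse repeated whitespace in each record."""
    return [" ".join(r.split()) for r in records]


def drop_duplicates(records):
    """Remove exact duplicates, keeping the first occurrence in order."""
    return list(dict.fromkeys(records))


# Each step works on its own...
cleaned = normalize(["Great  phone"])

# ...or as part of a chained pipeline.
pipeline = chain(normalize, drop_duplicates)
result = pipeline(["Great  phone", "Great phone", "Bad battery"])
```

Because every step shares the same input and output shape, a stage can be swapped or skipped without touching the others.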