Automate the collection, processing, and transformation of Trustpilot customer reviews into high-quality QA datasets for fine-tuning Gemma 3 and similar LLMs.
This repo provides a full workflow to:
- Collect customer reviews from Trustpilot using the Bright Data API.
- Convert reviews to Markdown for easy inspection or sharing.
- Chunk and refine reviews with OpenAI LLMs for coherence and quality.
- Generate insightful QA pairs suitable for fine-tuning customer service or product feedback models.
- Upload the dataset directly to Hugging Face Hub with a train/test split.
- collect_reviews.py: Download reviews from Trustpilot via Bright Data.
- json_to_md.py: Convert review JSON to Markdown format.
- md_chunker.py: Split/clean Markdown into chunks and enhance with an LLM.
- generate_qa_pairs.py: Use OpenAI to generate question-answer pairs from review chunks.
- upload_to_hf.py: Validate and upload results as a Hugging Face dataset.
Requires: Python 3.8+
Recommended: virtualenv
git clone https://github.com/youraccount/finetune-gemma3-bright.git
cd finetune-gemma3-bright
pip install -r requirements.txt
- Create a .env file and add:
    OPENAI_API_KEY=your-openai-key
    HF_TOKEN=your-huggingface-key
- Update your Bright Data API key and dataset ID in collect_reviews.py.
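Before running the pipeline, it can help to confirm that both keys are actually visible to Python. The check below is a minimal sketch, not part of the repo: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and it assumes the variables are already in the process environment (for example after python-dotenv's `load_dotenv()` has read `.env`).

```python
import os

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_keys(env=None):
    """Return the required keys that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching real secrets.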
- requests — robust HTTP requests
- python-dotenv — manage secrets
- openai — LLM completion and chat API
- tenacity — retry mechanisms
- langchain-text-splitters — advanced text chunking
- datasets — fast data handling for ML/NLP
- huggingface-hub — seamless model/data uploads
- Start by collecting reviews:
    python collect_reviews.py
  - Triggers a Bright Data snapshot collection of a Trustpilot page.
  - Polls until reviews are ready, then saves to trustpilot_reviews.json.
- Convert JSON reviews to pretty Markdown:
    python json_to_md.py --input trustpilot_reviews.json --output trustpilot_reviews.md
- Chunk markdown, improve coherence with GPT, save as JSONL:
    python md_chunker.py --input trustpilot_reviews.md --output trustpilot_reviews_chunks.jsonl
- Generate Question-Answer pairs using OpenAI:
    python generate_qa_pairs.py --input trustpilot_reviews_chunks.jsonl --output trustpilot_qa_pairs.json
- Upload dataset to Hugging Face Hub 🚀:
    python upload_to_hf.py --input trustpilot_qa_pairs.json --repo yourusername/trustpilot-reviews-qa-dataset
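The collection step polls until the snapshot is ready. The general pattern can be sketched as below. This is an illustration only: `poll_until_ready` and `check_status` are hypothetical names, the `"ready"` status string is an assumption, and the real Bright Data request and response format are not shown here.

```python
import time


def poll_until_ready(check_status, interval_s=5.0, max_attempts=60, sleep=time.sleep):
    """Call check_status() until it returns "ready"; return the attempt count.

    check_status is a hypothetical callable standing in for the real
    status request made by collect_reviews.py.
    """
    for attempt in range(1, max_attempts + 1):
        if check_status() == "ready":
            return attempt
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"snapshot not ready after {max_attempts} attempts")


# Usage with a fake status source that becomes ready on the third call.
statuses = iter(["running", "running", "ready"])
attempts = poll_until_ready(lambda: next(statuses), interval_s=0.0)
print(attempts)  # 3
```

Capping the number of attempts keeps a stalled snapshot from blocking the rest of the pipeline indefinitely.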
# Step 1: Collect data
python collect_reviews.py
# Step 2: Markdown conversion
python json_to_md.py
# Step 3: Chunk and enhance
python md_chunker.py
# Step 4: QA generation
python generate_qa_pairs.py
# Step 5: Upload to Hugging Face
python upload_to_hf.py
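The five commands above can also be chained from a single Python entry point so that a failing step stops the run. The sketch below is not part of the repo: `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes each script works when called without arguments, as in the listing above.

```python
import subprocess
import sys

# Script order taken from the steps above; arguments are omitted,
# which assumes each script has usable defaults.
PIPELINE = [
    "collect_reviews.py",
    "json_to_md.py",
    "md_chunker.py",
    "generate_qa_pairs.py",
    "upload_to_hf.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per script, using the current interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run commands in order; check=True stops at the first failing step."""
    for argv in commands:
        print("Running:", " ".join(argv))
        runner(argv, check=True)
```

Calling `run_pipeline(build_commands())` from the repo root would run all five steps in order.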
- Change Target Company: Edit TARGET_URL in collect_reviews.py.
- Replace Keys: Put your API keys/secrets in .env.
- Adjust Chunk/Overlap Sizes: Use the --chunk-size and --chunk-overlap flags in the chunker script.
- Customize QA Prompt: Modify SYSTEM_PROMPT in generate_qa_pairs.py.
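To get a feel for how chunk size and overlap interact, here is a simplified character-based sliding window. It is only an illustration: `chunk_text` is a hypothetical helper, not the splitter used by md_chunker.py, and it assumes both values are measured in characters, which the flags above do not confirm.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks, and therefore more LLM calls in the later steps.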
- Bright Data API Docs: API Reference
- OpenAI API Docs: Reference
- Hugging Face Hub: Upload Dataset
- LangChain Text Splitters: Docs
MIT
Happy Finetuning! 🎉