✨ finetune-gemma3-bright

Automate the collection, processing, and transformation of Trustpilot customer reviews into high-quality QA datasets for fine-tuning Gemma 3 and similar LLMs.


🚀 Overview

This repo provides a full workflow to:

  1. Collect customer reviews from Trustpilot using the Bright Data API.
  2. Convert reviews to Markdown for easy inspection or sharing.
  3. Chunk and refine reviews with OpenAI LLMs for coherence and quality.
  4. Generate insightful QA pairs suitable for fine-tuning customer service or product feedback models.
  5. Upload the dataset directly to Hugging Face Hub with a train/test split.

🏗️ Directory Structure

  • collect_reviews.py : Download reviews from Trustpilot via Bright Data.
  • json_to_md.py : Convert review JSON to Markdown format.
  • md_chunker.py : Split the Markdown into chunks, clean them up, and enhance each chunk with an LLM.
  • generate_qa_pairs.py : Use OpenAI to generate question-answer pairs from review chunks.
  • upload_to_hf.py : Validate the QA pairs and upload them as a Hugging Face dataset.

📦 Installation

Requires: Python 3.8+
Recommended: virtualenv

git clone https://github.com/youraccount/finetune-gemma3-bright.git
cd finetune-gemma3-bright
pip install -r requirements.txt

🔑 Environment Setup

  1. Create a .env file and add:

    OPENAI_API_KEY=your-openai-key
    HF_TOKEN=your-huggingface-key
    
  2. Update your Bright Data API key and dataset ID in collect_reviews.py.
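
The scripts can then pick these keys up at runtime. Below is a minimal sketch of that pattern, assuming the project loads the .env file with python-dotenv (check requirements.txt for the actual dependency):

    import os

    from dotenv import load_dotenv  # assumes python-dotenv is installed

    # Load OPENAI_API_KEY and HF_TOKEN from the .env file into the environment.
    load_dotenv()

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    HF_TOKEN = os.getenv("HF_TOKEN")

    if not OPENAI_API_KEY or not HF_TOKEN:
        raise RuntimeError("Missing OPENAI_API_KEY or HF_TOKEN in .env")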


📚 Key Dependencies

See requirements.txt for the full list of Python dependencies.

🔄 Review Collection Flow

  1. Start by collecting reviews

    python collect_reviews.py
    • Triggers a Bright Data snapshot collection of a Trustpilot page.
    • Polls until the reviews are ready, then saves them to trustpilot_reviews.json (the trigger-and-poll pattern is sketched after this list).
  2. Convert JSON reviews to pretty Markdown

    python json_to_md.py --input trustpilot_reviews.json --output trustpilot_reviews.md
  3. Chunk the Markdown, improve coherence with GPT, and save as JSONL

    python md_chunker.py --input trustpilot_reviews.md --output trustpilot_reviews_chunks.jsonl
  4. Generate Question-Answer pairs using OpenAI

    python generate_qa_pairs.py --input trustpilot_reviews_chunks.jsonl --output trustpilot_qa_pairs.json
  5. Upload dataset to Hugging Face Hub 🚀

    python upload_to_hf.py --input trustpilot_qa_pairs.json --repo yourusername/trustpilot-reviews-qa-dataset
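
For reference, the trigger-and-poll pattern from step 1 looks roughly like the sketch below. This is not the code in collect_reviews.py; the endpoint paths, payload shape, and response fields are illustrative assumptions about the Bright Data dataset API, so defer to the script and the official docs:

    import json
    import time

    import requests

    API_KEY = "your-brightdata-api-key"   # set in collect_reviews.py
    DATASET_ID = "your-dataset-id"        # Bright Data dataset ID
    TARGET_URL = "https://www.trustpilot.com/review/example.com"  # placeholder

    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    # 1. Trigger a snapshot collection for the Trustpilot page.
    #    (Endpoint path and payload shape are illustrative assumptions.)
    trigger = requests.post(
        "https://api.brightdata.com/datasets/v3/trigger",
        headers=HEADERS,
        params={"dataset_id": DATASET_ID},
        json=[{"url": TARGET_URL}],
    )
    snapshot_id = trigger.json()["snapshot_id"]

    # 2. Poll until the snapshot is ready (status field name is an assumption).
    while True:
        progress = requests.get(
            f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}",
            headers=HEADERS,
        ).json()
        if progress.get("status") == "ready":
            break
        time.sleep(10)

    # 3. Download the reviews and save them where the rest of the pipeline expects them.
    reviews = requests.get(
        f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}",
        headers=HEADERS,
        params={"format": "json"},
    ).json()

    with open("trustpilot_reviews.json", "w", encoding="utf-8") as f:
        json.dump(reviews, f, ensure_ascii=False, indent=2)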

📝 Example Workflow

# Step 1: Collect data
python collect_reviews.py

# Step 2: Markdown conversion
python json_to_md.py

# Step 3: Chunk and enhance
python md_chunker.py

# Step 4: QA generation
python generate_qa_pairs.py

# Step 5: Upload to Hugging Face
python upload_to_hf.py
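
Step 5 is where the train/test split mentioned in the overview happens. Here is a minimal sketch of that idea using the Hugging Face datasets library, assuming trustpilot_qa_pairs.json holds a JSON list of question/answer records (the real upload_to_hf.py may structure this differently):

    import json
    import os

    from datasets import Dataset       # pip install datasets
    from dotenv import load_dotenv     # see Environment Setup above

    load_dotenv()

    # Assumed format: a JSON list of {"question": ..., "answer": ...} dicts.
    with open("trustpilot_qa_pairs.json", encoding="utf-8") as f:
        qa_pairs = json.load(f)

    dataset = Dataset.from_list(qa_pairs)

    # Create the train/test split (the 90/10 ratio here is illustrative).
    splits = dataset.train_test_split(test_size=0.1, seed=42)

    # Push both splits to the Hub, authenticating with the HF_TOKEN from .env.
    splits.push_to_hub(
        "yourusername/trustpilot-reviews-qa-dataset",
        token=os.getenv("HF_TOKEN"),
    )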

💡 Extending & Customizing

  • Change Target Company:
    Edit TARGET_URL in collect_reviews.py.

  • Replace Keys:
    Put your API keys/secrets in .env.

  • Adjust Chunk/Overlap Sizes:
    Use the --chunk-size and --chunk-overlap flags in the chunker script.

  • Customize QA Prompt:
    Modify SYSTEM_PROMPT in generate_qa_pairs.py (a sketch of how the prompt is fed to the model follows below).
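
To see where a custom prompt would plug in, the sketch below shows the generic OpenAI chat-completions pattern for turning one review chunk into QA pairs. The prompt text, model name, and the JSONL field name ("text") are placeholders, not the actual values used by generate_qa_pairs.py:

    import json
    import os

    from dotenv import load_dotenv
    from openai import OpenAI  # official OpenAI Python SDK

    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Placeholder prompt; edit SYSTEM_PROMPT in generate_qa_pairs.py for the real pipeline.
    SYSTEM_PROMPT = (
        "You write question-answer pairs about customer reviews. "
        "Return JSON with 'question' and 'answer' fields."
    )

    def qa_pairs_for_chunk(chunk_text: str, model: str = "gpt-4o-mini") -> str:
        """Ask the model for QA pairs about one review chunk (model name is illustrative)."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": chunk_text},
            ],
        )
        return response.choices[0].message.content

    # Example: read one chunk from the JSONL produced by md_chunker.py.
    # The "text" key is an assumed field name; fall back to the raw record if it differs.
    with open("trustpilot_reviews_chunks.jsonl", encoding="utf-8") as f:
        first_chunk = json.loads(f.readline())
    print(qa_pairs_for_chunk(first_chunk.get("text", str(first_chunk))))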


🌎 Helpful Links


📝 License

MIT


Happy Finetuning! 🎉
