
Dataset Generator for Fine-tuning 📊

A Streamlit-based tool for generating training datasets from text files and PDFs to fine-tune language models. It uses your choice of AI model (Gemini, Claude, or OpenAI) to produce high-quality question-answer pairs in formats compatible with a range of target models.

Features ✨

  • Multiple AI Models: Choose from Gemini, Claude, or OpenAI for dataset generation
  • File Upload: Support for both text files (.txt) and PDF files (.pdf)
  • Smart Chunking: Split content by word count instead of manual delimiters
  • Customizable Generation: Control number of questions per chunk and conversation exchanges
  • Multiple Model Formats: Support for Gemma, Llama, ChatML, Alpaca, ShareGPT, and Generic formats (see the sample records below)
  • Custom Prompts: Write your own prompts for dataset generation
  • Real-time Progress: Live progress tracking during generation
  • Instant Download: Download generated datasets in JSONL format
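
A rough illustration of how one Q&A pair differs across formats: the two records below follow the common ShareGPT-style and Alpaca-style conventions and are placeholders, not necessarily the app's exact field names.

    {"conversations": [{"from": "human", "value": "What is gradient descent?"}, {"from": "gpt", "value": "An iterative optimization method that updates parameters along the negative gradient of a loss function."}]}
    {"instruction": "What is gradient descent?", "input": "", "output": "An iterative optimization method that updates parameters along the negative gradient of a loss function."}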

Installation 🛠️

  1. Clone or download the repository

  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up your API keys:

    • Create a .env file in the project directory (see the example below)
    • Add your API keys (see API Requirements section)
    • Or enter them directly in the app interface
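
A minimal .env covering all three providers might look like the following; include only the keys you actually have, and replace the placeholders with real values:

    GEMINI_API_KEY=your_gemini_key_here
    ANTHROPIC_API_KEY=your_anthropic_key_here
    OPENAI_API_KEY=your_openai_key_here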

Usage 🚀

  1. Start the application:

    streamlit run app.py
  2. Configure settings:

    • Select your AI model (Gemini, Claude, or OpenAI)
    • Enter your API key
    • Upload a text file or PDF
    • Set chunking parameters (words per chunk)
    • Choose number of questions per chunk
    • Select the number of conversation exchanges
    • Choose your target model format
  3. Generate dataset:

    • Write or customize your generation prompt
    • Click "Generate Dataset"
    • Monitor progress in real-time
    • Download the generated JSONL file
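
Each line of the downloaded file is one standalone JSON record. A quick way to sanity-check it in Python (the filename dataset.jsonl is just a placeholder for whatever you saved):

    import json

    # Every line of a JSONL file should parse as its own JSON object.
    with open("dataset.jsonl", "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    print(f"{len(records)} records loaded")
    print(records[0])  # inspect the first generated example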

API Requirements 🔑

You'll need an API key for at least one of the supported AI models:

Google Gemini

  1. Go to Google AI Studio
  2. Create an API key
  3. Add GEMINI_API_KEY=your_key_here to your .env file

Anthropic Claude

  1. Go to Anthropic Console
  2. Create an API key
  3. Add ANTHROPIC_API_KEY=your_key_here to your .env file

OpenAI

  1. Go to OpenAI Platform
  2. Create an API key
  3. Add OPENAI_API_KEY=your_key_here to your .env file
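
If you go the .env route, the keys end up as environment variables. A minimal sketch of how such loading typically works with python-dotenv; the app's actual loading code may differ:

    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    # Read KEY=value pairs from .env into the process environment.
    load_dotenv()

    keys = {
        "GEMINI_API_KEY": os.getenv("GEMINI_API_KEY"),
        "ANTHROPIC_API_KEY": os.getenv("ANTHROPIC_API_KEY"),
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    }
    if not any(keys.values()):
        raise RuntimeError("No API key found; set at least one in .env or enter it in the app")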

Configuration Options ⚙️

  • Words per chunk: 50-2000 words (default: 300); see the chunking sketch below
  • Questions per chunk: 1-10 questions (default: 3)
  • Conversation exchanges: 1-5 exchanges (default: 1)
  • Model format: Gemma, Llama, ChatML, Alpaca, ShareGPT, or Generic
  • Custom prompt: Personalize the generation instructions in the prompt text area
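
The word-count chunking can be pictured as a plain whitespace split; this is a sketch of the idea, not the app's actual implementation:

    def chunk_by_words(text: str, words_per_chunk: int = 300) -> list[str]:
        """Split text into chunks of roughly words_per_chunk words (300 mirrors the default above)."""
        words = text.split()
        return [
            " ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)
        ]

    # A 1,000-word document with the default setting yields four chunks:
    # three of 300 words and a final one of 100.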

Model Recommendations 💡

For Beginners

  • Gemini 2.5 Flash: Free tier available, good performance
  • Claude 3.5 Haiku: Fast and reliable for simple tasks

For High Quality

  • Claude Sonnet 4: Best overall quality and reasoning
  • Gemini 2.5 Pro: Excellent for complex content

For Cost Efficiency

  • OpenAI gpt-4o-mini: Good balance of cost and quality
  • Gemini 2.5 Flash: Free tier for testing

For Advanced Reasoning

  • OpenAI o3-mini: Specialized reasoning capabilities
  • Claude 3.7 Sonnet: Enhanced analytical thinking

Tips for Best Results 🎯

  1. Chunk Size: Use 200-500 words per chunk for optimal results
  2. Questions: Start with 2-4 questions per chunk
  3. Prompts: Be specific about the type of content you want to generate
  4. File Quality: Ensure your source files are well-formatted and readable
  5. Model Selection: Choose models based on your content complexity and budget

Troubleshooting 🔧

  • API Errors: Check your API key in the .env file and internet connection
  • PDF Issues: Ensure PDF files are text-based, not scanned images (see the check below)
  • Memory Issues: Reduce chunk size or questions per chunk for large files
  • Generation Failures: Try adjusting your custom prompt or reducing complexity
  • OpenAI o3-mini errors: o3-mini uses different API parameters, which the app handles automatically
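
To check whether a PDF actually contains extractable text before uploading it, a quick test such as the following helps; it uses pypdf, which is an assumption and not necessarily the library the app itself uses (source.pdf is a placeholder path):

    from pypdf import PdfReader  # pip install pypdf

    def has_extractable_text(path: str) -> bool:
        """True if any page yields non-empty extracted text."""
        reader = PdfReader(path)
        return any((page.extract_text() or "").strip() for page in reader.pages)

    # False usually means a scanned, image-only PDF that needs OCR first.
    print(has_extractable_text("source.pdf"))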

Example Use Cases 📝

  • Educational content fine-tuning
  • Domain-specific knowledge training
  • Customer service chatbot training
  • Technical documentation Q&A
  • Creative writing assistance
  • Multi-turn conversation datasets

License 📄

This project is open source and available under the MIT License.

Support 💬

For issues or questions, please create an issue in the repository or contact the maintainers.
