This repository manages the dataset used for model fine-tuning and text embedding of the Ondaum AI psychological counseling chatbot.
The original dataset is sourced from AI Hub's Emotional Dialogue Dataset.
- `dataset/`: Original and processed dataset files
  - `finetune_dataset.jsonl`: Processed dataset for model fine-tuning (see the inspection snippet below)
  - `reclassified_dataset.jsonl`: Processed dataset for text embedding
- `emotion_dataset_to_text_embedding.py`: Script to convert the original dataset into text embedding format
- `text_embedding_to_fine_tuning.py`: Script to convert text embedding data into fine-tuning format
- `create_tuned_model.py`: Script to create and run model fine-tuning
- `manage_tuned_models.py`: Script to manage (list, check status, delete) fine-tuned models
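To sanity-check the processed files, you can peek at a record directly. A minimal sketch, assuming the fine-tuning file lives under `dataset/` and uses the `text_input`/`output` pair convention of the Gemini tuning API; the actual schema may differ:

```python
import json

# Inspect the first record of the fine-tuning dataset.
# NOTE: "text_input"/"output" is the google.generativeai tuning-data
# convention; the actual field names in finetune_dataset.jsonl are an
# assumption here.
with open("dataset/finetune_dataset.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))
```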
For the copyright status of all datasets included in this repository, please refer to the ondaum-reference repository.
- Python 3.8 or higher
- Google Cloud API Key with Gemini API access
- Set environment variable:

  ```bash
  export GEMINI_API_KEY=your_api_key_here
  ```
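The processing and tuning scripts are assumed to read this variable and hand it to the SDK. A minimal sketch of that pattern with `google.generativeai`, not necessarily the scripts' exact code:

```python
import os

import google.generativeai as genai

# Fail fast with a clear error if the key was not exported.
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set")

genai.configure(api_key=api_key)
```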
- Set up Python virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # Linux/Mac
  # or
  .venv\Scripts\activate     # Windows
  ```
- Install required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Process dataset:

  ```bash
  python emotion_dataset_to_text_embedding.py
  python text_embedding_to_fine_tuning.py
  ```
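As a rough sketch of what the embedding step may involve (the script itself may batch, retry, or merely reshape the data without calling the API), a single utterance can be embedded with the Gemini API as shown below; the model name and task type are assumptions, not taken from the script:

```python
import google.generativeai as genai

# Embed one illustrative utterance; emotion_dataset_to_text_embedding.py
# presumably iterates over the whole AI Hub dataset instead.
result = genai.embed_content(
    model="models/text-embedding-004",  # assumed model; the script may differ
    content="요즘 회사 일 때문에 너무 스트레스를 받아요.",
    task_type="semantic_similarity",
)
print(len(result["embedding"]))  # dimensionality of the returned vector
```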
- Create and run model fine-tuning:

  ```bash
  python create_tuned_model.py
  ```
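Under the hood, this step presumably boils down to a `google.generativeai` tuning call with the default parameters listed below. A minimal sketch with illustrative placeholder data, not the script's actual code:

```python
import google.generativeai as genai

# Placeholder training rows; the real script presumably loads them
# from dataset/finetune_dataset.jsonl.
training_data = [
    {"text_input": "요즘 너무 불안해요.", "output": "불안한 마음이 크셨군요. 언제부터 그러셨나요?"},
]

operation = genai.create_tuned_model(
    source_model="models/gemini-1.5-flash-001-tuning",
    training_data=training_data,
    epoch_count=100,   # target epochs; the script adjusts this (see below)
    batch_size=4,
    learning_rate=0.001,
)
tuned_model = operation.result()  # blocks until tuning completes
print(tuned_model.name)
```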
- Manage fine-tuned models:

  ```bash
  # List all tuned models
  python manage_tuned_models.py --list

  # Check status of a specific model
  python manage_tuned_models.py --status "models/your-model-name"

  # Delete a specific model
  python manage_tuned_models.py --delete "models/your-model-name"
  ```
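The management script presumably wraps SDK calls like the ones below; note that in the `google.generativeai` SDK itself, tuned models are addressed by `tunedModels/...` resource names. A sketch, not the script's actual code:

```python
import google.generativeai as genai

# List every tuned model visible to this API key, with its training state.
for model in genai.list_tuned_models():
    print(model.name, model.state)

# Check the status of a specific model, then delete it.
# Replace the placeholder with a real resource name from the listing above.
name = "tunedModels/your-model-name"
print(genai.get_tuned_model(name).state)
genai.delete_tuned_model(name)
```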
The fine-tuning process uses the following default parameters:
- Base Model: `models/gemini-1.5-flash-001-tuning`
- Batch Size: 4
- Learning Rate: 0.001
- Target Epochs: 100 (automatically adjusted based on dataset size)
- Maximum Total Examples: 250,000
These parameters can be modified in `create_tuned_model.py` if needed.
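One plausible reading of the epoch adjustment above (an assumed reconstruction, not code from `create_tuned_model.py`): the epoch count is reduced whenever 100 epochs over the dataset would exceed the 250,000 total-example budget.

```python
# Hypothetical sketch of the epoch-count adjustment described above.
TARGET_EPOCHS = 100
MAX_TOTAL_EXAMPLES = 250_000

def effective_epochs(dataset_size: int) -> int:
    # Cap epochs so that epochs * dataset_size <= MAX_TOTAL_EXAMPLES.
    capped = MAX_TOTAL_EXAMPLES // max(dataset_size, 1)
    return max(1, min(TARGET_EPOCHS, capped))

print(effective_epochs(1_000))   # -> 100 (well under the cap)
print(effective_epochs(10_000))  # -> 25  (capped by the example budget)
```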