solutionchallenge/ondaum-dataset

Ondaum Dataset

This repository manages the dataset used for model fine-tuning and text embedding of the Ondaum AI psychological counseling chatbot.

Dataset Source

The original dataset is sourced from AI Hub's Emotional Dialogue Dataset.

Repository Structure

  • dataset/: Original and processed dataset files
    • finetune_dataset.jsonl: Processed dataset for model fine-tuning
    • reclassified_dataset.jsonl: Processed dataset for text-embedding
  • emotion_dataset_to_text_embedding.py: Script to convert the original dataset into text embedding format
  • text_embedding_to_fine_tuning.py: Script to convert text embedding data into fine-tuning format
  • create_tuned_model.py: Script to create and run model fine-tuning
  • manage_tuned_models.py: Script to manage (list, check status, delete) fine-tuned models
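Both processed files are JSONL, so each non-empty line should be one standalone JSON object. The following is a minimal sketch for inspecting them; the helper name `read_jsonl` is hypothetical and not part of this repository, and no assumption is made about the field names inside each record.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                # Tolerate blank lines between records.
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


if __name__ == "__main__":
    # Path taken from the repository structure listed above.
    dataset_path = Path("dataset/finetune_dataset.jsonl")
    if dataset_path.exists():
        records = list(read_jsonl(dataset_path))
        print(f"{len(records)} records in {dataset_path}")
```

Reading with an explicit UTF-8 encoding matters here because the source dialogues are Korean text.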

Copyright

For how copyright applies to the datasets included in this repository, please refer to the ondaum-reference repository.

Prerequisites

  • Python 3.8 or higher
  • Google Cloud API Key with Gemini API access
  • Set the GEMINI_API_KEY environment variable:
    export GEMINI_API_KEY=your_api_key_here
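A script can confirm the key is present before making any API call. This is a minimal sketch; the function name `get_gemini_api_key` is hypothetical, and the repository's scripts may read the variable differently.

```python
import os
from typing import Mapping, Optional


def get_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Gemini API key from the environment, or fail with a clear message."""
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the scripts."
        )
    return key
```

Failing early like this avoids an authentication error surfacing partway through a long-running job.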

Usage

  1. Set up Python virtual environment
    python -m venv .venv
    source .venv/bin/activate  # Linux/Mac
    # or
    .venv\Scripts\activate  # Windows
  2. Install required packages
    pip install -r requirements.txt
  3. Process dataset
    python emotion_dataset_to_text_embedding.py
    python text_embedding_to_fine_tuning.py
  4. Create and run model fine-tuning
    python create_tuned_model.py
  5. Manage fine-tuned models
    # List all tuned models
    python manage_tuned_models.py --list

    # Check status of a specific model
    python manage_tuned_models.py --status "models/your-model-name"

    # Delete a specific model
    python manage_tuned_models.py --delete "models/your-model-name"
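The three management flags shown above are mutually exclusive actions. As an illustration only, a command-line interface with that shape could be declared with `argparse` as follows. The flag names come from the usage above, but `build_parser` and its structure are assumptions, not the actual contents of manage_tuned_models.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts exactly one of --list, --status, or --delete."""
    parser = argparse.ArgumentParser(
        prog="manage_tuned_models.py",
        description="List, inspect, or delete fine-tuned models.",
    )
    # Require exactly one action per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--list", action="store_true", help="list all tuned models")
    group.add_argument("--status", metavar="MODEL_NAME", help="show the status of one model")
    group.add_argument("--delete", metavar="MODEL_NAME", help="delete one model")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.list:
        print("would list all tuned models")
    elif args.status:
        print(f"would check status of {args.status}")
    else:
        print(f"would delete {args.delete}")
```

Making the group required means that running the script with no flag, or with two flags at once, exits with a usage message instead of silently doing nothing.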

Model Fine-tuning Configuration

The fine-tuning process uses the following default parameters:

  • Base Model: models/gemini-1.5-flash-001-tuning
  • Batch Size: 4
  • Learning Rate: 0.001
  • Target Epochs: 100 (automatically adjusted based on dataset size)
  • Maximum Total Examples: 250,000

These parameters can be modified in create_tuned_model.py if needed.
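The exact rule used to adjust the epoch count is not stated here. One plausible reading, sketched below as an assumption, is that the number of epochs is capped so that epochs × dataset size stays within the maximum total examples. The function `adjusted_epochs` is hypothetical and may not match what create_tuned_model.py actually does.

```python
def adjusted_epochs(
    dataset_size: int,
    target_epochs: int = 100,
    max_total_examples: int = 250_000,
) -> int:
    """Cap the epoch count so dataset_size * epochs <= max_total_examples.

    Assumed rule for illustration; always returns at least 1 epoch.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    # Largest whole number of passes over the data that fits in the budget.
    cap = max_total_examples // dataset_size
    return max(1, min(target_epochs, cap))


if __name__ == "__main__":
    for size in (1_000, 5_000, 300_000):
        print(size, adjusted_epochs(size))
```

Under this assumption, a 1,000-example dataset keeps the full 100 epochs (100,000 examples seen in total), while a 5,000-example dataset is reduced to 50 epochs (250,000 in total).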

About

2025 Google Solution Challenge Tuning-Dataset Repository for Team Ondaum
