Language Model Fine-tuning Pipeline

This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.

Features

  • Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
  • LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
  • QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
  • Configurable Pipeline: YAML-based configuration for easy customization
  • Robust Logging: Comprehensive logging system for monitoring training progress
  • Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
  • Production Ready: Includes proper error handling, validation, and directory management

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (required for QLoRA, recommended for LoRA)
  • PyTorch
  • Transformers
  • Datasets
  • PEFT
  • Tokenizers
  • PyYAML
  • bitsandbytes (for QLoRA; an illustrative requirements.txt sketch follows this list)
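
The pinned versions live in requirements.txt, which is installed in the next section. As a rough orientation only, an unpinned sketch matching the list above would contain:

torch
transformers
datasets
peft
tokenizers
pyyaml
bitsandbytes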

Installation

  1. Clone the repository:
git clone https://github.com/happybear-21/genie.git
cd genie
  2. Install dependencies:
pip install -r requirements.txt

Project Structure

.
├── config/
│   └── models.yaml           # Configuration file
├── finetune.py              # Unified fine-tuning script with method selection
├── optimization_methods.py   # Optimization methods registry and base classes
├── tokenizer.py             # Custom tokenizer implementation
├── utils.py                 # Utility functions
├── examples/                # Comprehensive examples and guides
│   ├── README.md            # Examples overview and usage guide
│   ├── method_comparison.py # Compare all optimization methods
│   ├── custom_method_example.py # How to create custom methods
│   ├── method_selection_guide.py # Interactive method selection
│   ├── performance_benchmark.py # Benchmark different methods
│   └── method_specific_configs/ # Optimized configs for each method
│       ├── README.md        # Configuration guide
│       ├── qlora_config.yaml # QLoRA-optimized configuration
│       ├── dora_config.yaml # DoRA-optimized configuration
│       └── lora_xs_config.yaml # LoRA-XS-optimized configuration
└── README.md                # This file

Configuration

The project uses a YAML configuration file (config/models.yaml) to manage all parameters. Key configuration sections include:

  • Data Configuration: Dataset paths and processing parameters
  • Model Configuration: Model-specific parameters including LoRA settings
  • Training Configuration: Training hyperparameters
  • Output Configuration: Logging and model saving paths (a minimal sketch of this section follows the list)
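
The data, model, and training sections appear in full in the Usage section below. The output section is not reproduced there; the sketch below uses assumed key names, so check config/models.yaml for the actual ones:

output:
  log_dir: "logs/"                 # training logs (assumed key name)
  checkpoint_dir: "checkpoints/"   # saved checkpoints (assumed key name)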

Implemented Models

The following models are currently implemented and configured in the pipeline:

  1. StarCoder2-3B

    • Base Model: bigcode/starcoder2-3b
    • Vocabulary Size: 50,000
    • Batch Size: 8
  2. CodeLlama-7B

    • Base Model: codellama/CodeLlama-7b-hf
    • Vocabulary Size: 32,000
    • Batch Size: 8
  3. WizardCoder-34B

    • Base Model: WizardLM/WizardCoder-Python-34B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 4
  4. DeepSeek-Coder-6.7B

    • Base Model: deepseek-ai/deepseek-coder-6.7b-base
    • Vocabulary Size: 32,000
    • Batch Size: 8
  5. Codestral-22B

    • Base Model: mistralai/Codestral-22B-v0.1
    • Vocabulary Size: 32,000
    • Batch Size: 4
  6. WizardCoder-15B

    • Base Model: WizardLM/WizardCoder-15B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 6
  7. DeepSeek-Coder-33B

    • Base Model: deepseek-ai/deepseek-coder-33b-base
    • Vocabulary Size: 32,000
    • Batch Size: 2

All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.

Usage

1. Prepare Your Dataset

The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.

Sample dataset structure:

dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl

Each JSONL file should contain entries in the following format:

{"prompt": "input text", "code": "target text"}

For the Perl programming dataset, the format is:

{
    "prompt": "Write a Perl script to read a file and count word frequency",
    "code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n    chomp($line);\n    my @words = split(/\\s+/, $line);\n    foreach my $word (@words) {\n        $word_count{$word}++;\n    }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n    print \"$word: $word_count{$word}\\n\";\n}"
}

You can examine the sample dataset to understand:

  • Input format: Natural language descriptions of programming tasks
  • Output format: Complete, executable Perl code solutions
  • Data organization: Training, validation, and test splits

To use your own dataset:

  1. Follow the same JSONL format
  2. Split your data into train/valid/test files
  3. Place them in your configured data directory
  4. Ensure your input/output pairs are properly formatted (a quick validation sketch follows)
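
Before training, it is worth sanity-checking that every record in each split parses as JSON and carries both keys. The helper below is an illustrative sketch, not part of the repository; the paths follow the sample layout shown above:

import json
from pathlib import Path

def validate_split(path):
    """Count JSONL records that are missing the 'prompt' or 'code' field."""
    errors = 0
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            record = json.loads(line)
            if "prompt" not in record or "code" not in record:
                print(f"{path}:{line_no} is missing 'prompt' or 'code'")
                errors += 1
    return errors

for split in ("train", "valid", "test"):
    path = Path("dataset/processed") / f"{split}.jsonl"
    bad = validate_split(path)
    print(f"{split}: {'OK' if bad == 0 else str(bad) + ' malformed records'}")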

2. Configure Your Training

Edit config/models.yaml to set your desired parameters:

data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"

models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1

training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4
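
With these example values, each optimizer step sees an effective batch of batch_size × gradient_accumulation_steps = 8 × 4 = 32 samples; if you run out of GPU memory, prefer raising the accumulation steps over raising the batch size.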

3. Run the Pipeline

  1. Train the tokenizer:
python tokenizer.py
  2. Run fine-tuning with your preferred method:

Standard LoRA fine-tuning:

python finetune.py --method lora

QLoRA fine-tuning (CUDA-only):

python finetune.py --method qlora

Available methods:

  • lora: Standard LoRA with mixed precision training
  • qlora: Quantized LoRA with 4-bit quantization for memory efficiency

Additional options:

# Use custom config file
python finetune.py --method qlora --config path/to/config.yaml

# See all available methods
python finetune.py --help

Note: QLoRA requires CUDA support. If CUDA is not available, the script will automatically fall back to standard LoRA training.
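
To check ahead of time which path will be taken, you can query PyTorch directly; this is a generic check rather than part of the repository's scripts:

python -c "import torch; print(torch.cuda.is_available())"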

Technical Details

Tokenizer Implementation

The custom tokenizer (tokenizer.py) implements:

  • BPE tokenization with configurable vocabulary size (see the training sketch after this list)
  • Special tokens handling ([PAD], [BOS], [EOS], [UNK])
  • Efficient batch processing
  • Vocabulary statistics generation
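
For reference, the core of a BPE training loop with the Hugging Face tokenizers library looks roughly like the following. This is a minimal sketch, not a copy of tokenizer.py; the training file path and vocabulary size are placeholders taken from the configuration examples in this README:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Build a BPE tokenizer with the special tokens used by the pipeline.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=32000,                                    # configurable per model
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

# Train on the processed text (placeholder path) and save the vocabulary/merges.
tokenizer.train(files=["dataset/processed/train.jsonl"], trainer=trainer)
tokenizer.save("tokenizer.json")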

Fine-tuning Process

The unified fine-tuning script (finetune.py) implements multiple optimization approaches:

  1. Standard LoRA (--method lora):
  • LoRA adaptation for efficient fine-tuning
  • Mixed precision training (FP16) when available
  • Gradient accumulation for handling large batches
  • TensorBoard integration for monitoring
  • Automatic model checkpointing
  2. QLoRA (--method qlora):
  • Quantized LoRA implementation for extreme memory efficiency
  • 4-bit quantization of the base model
  • CUDA-only optimization with automatic fallback to standard LoRA
  • Same monitoring and checkpointing features as standard LoRA (a minimal loading sketch for both methods follows this list)
  3. Extensible Architecture:
  • Easy to add new optimization methods
  • Modular design with the OptimizationMethod base class
  • Method-specific configurations and training settings
  • Automatic method validation and error handling
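
The following is an illustrative sketch of how the two loading paths differ, using the transformers and peft APIs. It is not a copy of finetune.py; the model name and LoRA hyperparameters are placeholders taken from the configuration examples above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model_name = "bigcode/starcoder2-3b"      # placeholder base model
use_qlora = torch.cuda.is_available()     # QLoRA needs CUDA; otherwise fall back to LoRA

if use_qlora:
    # QLoRA path: load the base model in 4-bit NF4 and prepare it for k-bit training.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
else:
    # Standard LoRA path: regular base model, mixed precision handled by the trainer.
    model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach LoRA adapters; rank/alpha/dropout mirror the YAML example above.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()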

Utility Functions

The utilities module (utils.py) provides:

  • YAML configuration loading (see the sketch after this list)
  • Logging setup
  • Directory validation and creation
  • Dataset statistics computation
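
A minimal sketch of what such helpers typically look like, not a copy of utils.py; the function names here are illustrative:

import logging
from pathlib import Path

import yaml

def load_config(path: str) -> dict:
    """Load a YAML configuration file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)

def setup_logging(log_dir: str) -> None:
    """Create the log directory if needed and log to both a file and the console."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.FileHandler(Path(log_dir) / "training.log"),
            logging.StreamHandler(),
        ],
    )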

Monitoring

Training progress can be monitored through:

  • Log files in the specified log directory
  • TensorBoard metrics (see the launch command after this list)
  • Model checkpoints saved at regular intervals
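
TensorBoard can be launched against the configured log directory; the path below is a placeholder and should match your output configuration:

tensorboard --logdir logs/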

Best Practices

  • Always validate your configuration before training
  • Monitor GPU memory usage during training
  • Keep track of model checkpoints
  • Use appropriate batch sizes for your GPU
  • Evaluate model performance regularly
  • For QLoRA, ensure CUDA is available for optimal performance

Adding New Optimization Methods

The project is designed to be easily extensible with new optimization methods. To add a new method:

  1. Create a new class inheriting from OptimizationMethod:
from optimization_methods import OptimizationMethod
from peft import LoraConfig  # referenced by the setup_lora_config return type below

class YourNewMethod(OptimizationMethod):
    def __init__(self):
        super().__init__("your_method", "Description of your method")
    
    def load_model(self, pretrained_model: str, trust_remote_code: bool = True):
        # Your model loading logic
        pass
    
    def setup_lora_config(self, model_config: dict) -> LoraConfig:
        # Your LoRA configuration
        pass
    
    def get_training_config(self, use_cuda: bool) -> dict:
        # Your training configuration
        pass
  2. Register the method:
from optimization_methods import add_method
add_method("your_method", YourNewMethod())
  3. Use it in training:
python finetune.py --method your_method

See examples/custom_method_example.py for a complete working example.

Documentation

This project includes comprehensive documentation to help you get started and contribute, including the examples overview in examples/README.md and the configuration guide in examples/method_specific_configs/README.md.

Contributing

Please read the Contributing Guidelines before submitting pull requests.

Acknowledgments

  • Hugging Face Transformers library
  • PEFT library for LoRA implementation
  • Tokenizers library for efficient tokenization
  • bitsandbytes library for quantization support
