Language Model Fine-tuning Pipeline

This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.

Features

  • Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
  • LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
  • QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
  • Configurable Pipeline: YAML-based configuration for easy customization
  • Robust Logging: Comprehensive logging system for monitoring training progress
  • Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
  • Production Ready: Includes proper error handling, validation, and directory management

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (required for QLoRA, recommended for LoRA)
  • PyTorch
  • Transformers
  • Datasets
  • PEFT
  • Tokenizers
  • PyYAML
  • bitsandbytes (for QLoRA; an illustrative requirements.txt sketch follows this list)
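
The pinned versions live in requirements.txt, which is installed in the next section. As a rough orientation only, an unpinned sketch matching the list above would contain:

torch
transformers
datasets
peft
tokenizers
pyyaml
bitsandbytes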

Installation

  1. Clone the repository:
git clone https://github.com/happybear-21/genie.git
cd genie
  2. Install dependencies:
pip install -r requirements.txt

Project Structure

.
├── config/
│   └── models.yaml           # Configuration file
├── finetune.py              # Unified fine-tuning script with method selection
├── optimization_methods.py   # Optimization methods registry and base classes
├── tokenizer.py             # Custom tokenizer implementation
├── utils.py                 # Utility functions
├── examples/                # Comprehensive examples and guides
│   ├── README.md            # Examples overview and usage guide
│   ├── method_comparison.py # Compare all optimization methods
│   ├── custom_method_example.py # How to create custom methods
│   ├── method_selection_guide.py # Interactive method selection
│   ├── performance_benchmark.py # Benchmark different methods
│   └── method_specific_configs/ # Optimized configs for each method
│       ├── README.md        # Configuration guide
│       ├── qlora_config.yaml # QLoRA-optimized configuration
│       ├── dora_config.yaml # DoRA-optimized configuration
│       └── lora_xs_config.yaml # LoRA-XS-optimized configuration
└── README.md                # This file

Configuration

The project uses a YAML configuration file (config/models.yaml) to manage all parameters. Key configuration sections include:

  • Data Configuration: Dataset paths and processing parameters
  • Model Configuration: Model-specific parameters including LoRA settings
  • Training Configuration: Training hyperparameters
  • Output Configuration: Logging and model saving paths (a minimal sketch of this section follows the list)
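
The data, model, and training sections appear in full in the Usage section below. The output section is not reproduced there; the sketch below uses assumed key names, so check config/models.yaml for the actual ones:

output:
  log_dir: "logs/"                 # training logs (assumed key name)
  checkpoint_dir: "checkpoints/"   # saved checkpoints (assumed key name)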

Implemented Models

The following models are currently implemented and configured in the pipeline:

  1. StarCoder2-3B

    • Base Model: bigcode/starcoder2-3b
    • Vocabulary Size: 50,000
    • Batch Size: 8
  2. CodeLlama-7B

    • Base Model: codellama/CodeLlama-7b-hf
    • Vocabulary Size: 32,000
    • Batch Size: 8
  3. WizardCoder-34B

    • Base Model: WizardLM/WizardCoder-Python-34B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 4
  4. DeepSeek-Coder-6.7B

    • Base Model: deepseek-ai/deepseek-coder-6.7b-base
    • Vocabulary Size: 32,000
    • Batch Size: 8
  5. Codestral-22B

    • Base Model: mistralai/Codestral-22B-v0.1
    • Vocabulary Size: 32,000
    • Batch Size: 4
  6. WizardCoder-15B

    • Base Model: WizardLM/WizardCoder-15B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 6
  7. DeepSeek-Coder-33B

    • Base Model: deepseek-ai/deepseek-coder-33b-base
    • Vocabulary Size: 32,000
    • Batch Size: 2

All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.

Usage

1. Prepare Your Dataset

The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.

Sample dataset structure:

dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl

Each JSONL file should contain entries in the following format:

{"prompt": "input text", "code": "target text"}

For the Perl programming dataset, the format is:

{
    "prompt": "Write a Perl script to read a file and count word frequency",
    "code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n    chomp($line);\n    my @words = split(/\\s+/, $line);\n    foreach my $word (@words) {\n        $word_count{$word}++;\n    }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n    print \"$word: $word_count{$word}\\n\";\n}"
}

You can examine the sample dataset to understand:

  • Input format: Natural language descriptions of programming tasks
  • Output format: Complete, executable Perl code solutions
  • Data organization: Training, validation, and test splits

To use your own dataset:

  1. Follow the same JSONL format
  2. Split your data into train/valid/test files
  3. Place them in your configured data directory
  4. Ensure your input/output pairs are properly formatted (a quick validation sketch follows)
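
Before training, it is worth sanity-checking that every record in each split parses as JSON and carries both keys. The helper below is an illustrative sketch, not part of the repository; the paths follow the sample layout shown above:

import json
from pathlib import Path

def validate_split(path):
    """Count JSONL records that are missing the 'prompt' or 'code' field."""
    errors = 0
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            record = json.loads(line)
            if "prompt" not in record or "code" not in record:
                print(f"{path}:{line_no} is missing 'prompt' or 'code'")
                errors += 1
    return errors

for split in ("train", "valid", "test"):
    path = Path("dataset/processed") / f"{split}.jsonl"
    bad = validate_split(path)
    print(f"{split}: {'OK' if bad == 0 else str(bad) + ' malformed records'}")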

2. Configure Your Training

Edit config/models.yaml to set your desired parameters:

data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"

models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1

training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4
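
With these example values, each optimizer step sees an effective batch of batch_size × gradient_accumulation_steps = 8 × 4 = 32 samples; if you run out of GPU memory, prefer raising the accumulation steps over raising the batch size.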

3. Run the Pipeline

  1. Train the tokenizer:
python tokenizer.py
  2. Run fine-tuning with your preferred method:

Standard LoRA fine-tuning:

python finetune.py --method lora

QLoRA fine-tuning (CUDA-only):

python finetune.py --method qlora

Available methods:

  • lora: Standard LoRA with mixed precision training
  • qlora: Quantized LoRA with 4-bit quantization for memory efficiency

Additional options:

# Use custom config file
python finetune.py --method qlora --config path/to/config.yaml

# See all available methods
python finetune.py --help

Note: QLoRA requires CUDA support. If CUDA is not available, the script will automatically fall back to standard LoRA training.
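
To check ahead of time which path will be taken, you can query PyTorch directly; this is a generic check rather than part of the repository's scripts:

python -c "import torch; print(torch.cuda.is_available())"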

Technical Details

Tokenizer Implementation

The custom tokenizer (tokenizer.py) implements:

  • BPE tokenization with configurable vocabulary size (see the training sketch after this list)
  • Special tokens handling ([PAD], [BOS], [EOS], [UNK])
  • Efficient batch processing
  • Vocabulary statistics generation
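
For reference, the core of a BPE training loop with the Hugging Face tokenizers library looks roughly like the following. This is a minimal sketch, not a copy of tokenizer.py; the training file path and vocabulary size are placeholders taken from the configuration examples in this README:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Build a BPE tokenizer with the special tokens used by the pipeline.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=32000,                                    # configurable per model
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

# Train on the processed text (placeholder path) and save the vocabulary/merges.
tokenizer.train(files=["dataset/processed/train.jsonl"], trainer=trainer)
tokenizer.save("tokenizer.json")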

Fine-tuning Process

The unified fine-tuning script (finetune.py) implements multiple optimization approaches:

  1. Standard LoRA (--method lora):
  • LoRA adaptation for efficient fine-tuning
  • Mixed precision training (FP16) when available
  • Gradient accumulation for handling large batches
  • TensorBoard integration for monitoring
  • Automatic model checkpointing
  2. QLoRA (--method qlora):
  • Quantized LoRA implementation for extreme memory efficiency
  • 4-bit quantization of the base model
  • CUDA-only optimization with automatic fallback to standard LoRA
  • Same monitoring and checkpointing features as standard LoRA (a minimal loading sketch for both methods follows this list)
  3. Extensible Architecture:
  • Easy to add new optimization methods
  • Modular design with the OptimizationMethod base class
  • Method-specific configurations and training settings
  • Automatic method validation and error handling
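
The following is an illustrative sketch of how the two loading paths differ, using the transformers and peft APIs. It is not a copy of finetune.py; the model name and LoRA hyperparameters are placeholders taken from the configuration examples above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model_name = "bigcode/starcoder2-3b"      # placeholder base model
use_qlora = torch.cuda.is_available()     # QLoRA needs CUDA; otherwise fall back to LoRA

if use_qlora:
    # QLoRA path: load the base model in 4-bit NF4 and prepare it for k-bit training.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
else:
    # Standard LoRA path: regular base model, mixed precision handled by the trainer.
    model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach LoRA adapters; rank/alpha/dropout mirror the YAML example above.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()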

Utility Functions

The utilities module (utils.py) provides:

  • YAML configuration loading (see the sketch after this list)
  • Logging setup
  • Directory validation and creation
  • Dataset statistics computation
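
A minimal sketch of what such helpers typically look like, not a copy of utils.py; the function names here are illustrative:

import logging
from pathlib import Path

import yaml

def load_config(path: str) -> dict:
    """Load a YAML configuration file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)

def setup_logging(log_dir: str) -> None:
    """Create the log directory if needed and log to both a file and the console."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.FileHandler(Path(log_dir) / "training.log"),
            logging.StreamHandler(),
        ],
    )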

Monitoring

Training progress can be monitored through:

  • Log files in the specified log directory
  • TensorBoard metrics (see the launch command after this list)
  • Model checkpoints saved at regular intervals
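
TensorBoard can be launched against the configured log directory; the path below is a placeholder and should match your output configuration:

tensorboard --logdir logs/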

Best Practices

  • Always validate your configuration before training
  • Monitor GPU memory usage during training
  • Keep track of model checkpoints
  • Use appropriate batch sizes for your GPU
  • Evaluate model performance regularly
  • For QLoRA, ensure CUDA is available for optimal performance

Adding New Optimization Methods

The project is designed to be easily extensible with new optimization methods. To add a new method:

  1. Create a new class inheriting from OptimizationMethod:
from optimization_methods import OptimizationMethod
from peft import LoraConfig  # referenced by the setup_lora_config return type below

class YourNewMethod(OptimizationMethod):
    def __init__(self):
        super().__init__("your_method", "Description of your method")
    
    def load_model(self, pretrained_model: str, trust_remote_code: bool = True):
        # Your model loading logic
        pass
    
    def setup_lora_config(self, model_config: dict) -> LoraConfig:
        # Your LoRA configuration
        pass
    
    def get_training_config(self, use_cuda: bool) -> dict:
        # Your training configuration
        pass
  2. Register the method:
from optimization_methods import add_method
add_method("your_method", YourNewMethod())
  3. Use it in training:
python finetune.py --method your_method

See examples/custom_method_example.py for a complete working example.

Documentation

This project includes comprehensive documentation to help you get started and contribute, including the examples overview in examples/README.md and the configuration guide in examples/method_specific_configs/README.md.

Contributing

Please read the Contributing Guidelines before submitting pull requests.

Acknowledgments

  • Hugging Face Transformers library
  • PEFT library for LoRA implementation
  • Tokenizers library for efficient tokenization
  • bitsandbytes library for quantization support
