This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.
- Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
- LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
- QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
- Configurable Pipeline: YAML-based configuration for easy customization
- Robust Logging: Comprehensive logging system for monitoring training progress
- Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
- Production Ready: Includes proper error handling, validation, and directory management
- Python 3.8+
- CUDA-capable GPU (required for QLoRA, recommended for LoRA)
- PyTorch
- Transformers
- Datasets
- PEFT
- Tokenizers
- PyYAML
- bitsandbytes (for QLoRA)
- Clone the repository:

  ```bash
  git clone https://github.com/happybear-21/genie.git
  cd genie
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
```
.
├── config/
│   └── models.yaml                # Configuration file
├── finetune.py                    # Unified fine-tuning script with method selection
├── optimization_methods.py        # Optimization methods registry and base classes
├── tokenizer.py                   # Custom tokenizer implementation
├── utils.py                       # Utility functions
├── examples/                      # Comprehensive examples and guides
│   ├── README.md                  # Examples overview and usage guide
│   ├── method_comparison.py       # Compare all optimization methods
│   ├── custom_method_example.py   # How to create custom methods
│   ├── method_selection_guide.py  # Interactive method selection
│   ├── performance_benchmark.py   # Benchmark different methods
│   └── method_specific_configs/   # Optimized configs for each method
│       ├── README.md              # Configuration guide
│       ├── qlora_config.yaml      # QLoRA-optimized configuration
│       ├── dora_config.yaml       # DoRA-optimized configuration
│       └── lora_xs_config.yaml    # LoRA-XS-optimized configuration
└── README.md                      # This file
```
The project uses a YAML configuration file (`config/models.yaml`) to manage all parameters. Key configuration sections include:
- Data Configuration: Dataset paths and processing parameters
- Model Configuration: Model-specific parameters including LoRA settings
- Training Configuration: Training hyperparameters
- Output Configuration: Logging and model saving paths
The following models are currently implemented and configured in the pipeline:
- StarCoder2-3B
  - Base Model: `bigcode/starcoder2-3b`
  - Vocabulary Size: 50,000
  - Batch Size: 8
- CodeLlama-7B
  - Base Model: `codellama/CodeLlama-7b-hf`
  - Vocabulary Size: 32,000
  - Batch Size: 8
- WizardCoder-34B
  - Base Model: `WizardLM/WizardCoder-Python-34B-V1.0`
  - Vocabulary Size: 32,000
  - Batch Size: 4
- DeepSeek-Coder-6.7B
  - Base Model: `deepseek-ai/deepseek-coder-6.7b-base`
  - Vocabulary Size: 32,000
  - Batch Size: 8
- Codestral-22B
  - Base Model: `mistralai/Codestral-22B-v0.1`
  - Vocabulary Size: 32,000
  - Batch Size: 4
- WizardCoder-15B
  - Base Model: `WizardLM/WizardCoder-15B-V1.0`
  - Vocabulary Size: 32,000
  - Batch Size: 6
- DeepSeek-Coder-33B
  - Base Model: `deepseek-ai/deepseek-coder-33b-base`
  - Vocabulary Size: 32,000
  - Batch Size: 2
All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.
The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.
Sample dataset structure:
```
dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
```
Each JSONL file should contain entries in the following format:
{"prompt": "input text", "code": "target text"}
For the Perl programming dataset, the format is:
```json
{
  "prompt": "Write a Perl script to read a file and count word frequency",
  "code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n chomp($line);\n my @words = split(/\\s+/, $line);\n foreach my $word (@words) {\n $word_count{$word}++;\n }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n print \"$word: $word_count{$word}\\n\";\n}"
}
```
You can examine the sample dataset to understand:
- Input format: Natural language descriptions of programming tasks
- Output format: Complete, executable Perl code solutions
- Data organization: Training, validation, and test splits
To use your own dataset:
- Follow the same JSONL format
- Split your data into train/valid/test files
- Place them in your configured data directory
- Ensure your input/output pairs are properly formatted
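The snippet below is a minimal sketch of this preparation step using only the standard library; the field names come from the format above, while the example data, split ratios, and output paths are assumptions you should adapt:

```python
import json
import random
from pathlib import Path

# Illustrative prompt/code pairs; replace with your own data
examples = [
    {"prompt": "Write a Perl script that prints Hello, World!",
     "code": "#!/usr/bin/perl\nprint \"Hello, World!\\n\";"},
    # ... more examples ...
]

random.seed(42)
random.shuffle(examples)

out_dir = Path("dataset/processed")
out_dir.mkdir(parents=True, exist_ok=True)

# 80/10/10 train/valid/test split (ratios are an assumption)
n = len(examples)
splits = {
    "train": examples[: int(0.8 * n)],
    "valid": examples[int(0.8 * n): int(0.9 * n)],
    "test": examples[int(0.9 * n):],
}

for name, rows in splits.items():
    with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```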
Edit `config/models.yaml` to set your desired parameters:
```yaml
data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"

models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1

training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4
```
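With the sample values above, the effective batch size seen by the optimizer is `batch_size × gradient_accumulation_steps` = 8 × 4 = 32 examples per update.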
- Train the tokenizer:

  ```bash
  python tokenizer.py
  ```

- Run fine-tuning with your preferred method:

  Standard LoRA fine-tuning:

  ```bash
  python finetune.py --method lora
  ```

  QLoRA fine-tuning (CUDA-only):

  ```bash
  python finetune.py --method qlora
  ```
Available methods:
- `lora`: Standard LoRA with mixed precision training
- `qlora`: Quantized LoRA with 4-bit quantization for memory efficiency
Additional options:

```bash
# Use custom config file
python finetune.py --method qlora --config path/to/config.yaml

# See all available methods
python finetune.py --help
```
Note: QLoRA requires CUDA support. If CUDA is not available, the script will automatically fall back to standard LoRA training.
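The fallback is handled inside `finetune.py`; as a rough sketch of the idea (not the script's exact code), it amounts to a simple device check before the method is applied:

```python
import torch


def resolve_method(requested: str) -> str:
    """Fall back to standard LoRA when QLoRA is requested without CUDA."""
    if requested == "qlora" and not torch.cuda.is_available():
        print("CUDA not available; falling back to standard LoRA training.")
        return "lora"
    return requested
```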
The custom tokenizer (`tokenizer.py`) implements:
- BPE tokenization with configurable vocabulary size
- Special tokens handling ([PAD], [BOS], [EOS], [UNK])
- Efficient batch processing
- Vocabulary statistics generation
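For orientation, training a BPE tokenizer with these special tokens via the Hugging Face `tokenizers` library looks roughly like the sketch below; the corpus path and vocabulary size are placeholders, not the exact code in `tokenizer.py`:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32000,  # set per model in config/models.yaml
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

# "corpus.txt" stands in for the project's extracted training text
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```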
The unified fine-tuning script (`finetune.py`) implements multiple optimization approaches:
- Standard LoRA (`--method lora`):
  - LoRA adaptation for efficient fine-tuning
  - Mixed precision training (FP16) when available
  - Gradient accumulation for handling large batches
  - TensorBoard integration for monitoring
  - Automatic model checkpointing
- QLoRA (`--method qlora`):
  - Quantized LoRA implementation for extreme memory efficiency
  - 4-bit quantization of the base model
  - CUDA-only optimization with automatic fallback to standard LoRA
  - Same monitoring and checkpointing features as standard LoRA
- Extensible Architecture:
  - Easy to add new optimization methods
  - Modular design with `OptimizationMethod` base class
  - Method-specific configurations and training settings
  - Automatic method validation and error handling
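As a rough illustration of how the two methods differ at model-load time (a sketch using PEFT and bitsandbytes, not the exact code in `finetune.py`; the model name and LoRA values are placeholders taken from the sample config):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_NAME = "bigcode/starcoder2-3b"  # any base model from models.yaml
method = "qlora"                      # or "lora"

lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")

if method == "qlora":
    # QLoRA: load the base model in 4-bit and patch it for k-bit training
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
else:
    # Standard LoRA: base model in full or half precision
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach the LoRA adapters; only these parameters are trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```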
The utilities module (`utils.py`) provides:
- YAML configuration loading
- Logging setup
- Directory validation and creation
- Dataset statistics computation
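Configuration loading and logging setup of this kind typically amount to a few lines of PyYAML plus the standard `logging` module; the function names and paths below are illustrative, not necessarily those used in `utils.py`:

```python
import logging
from pathlib import Path

import yaml


def load_config(path: str = "config/models.yaml") -> dict:
    """Load the YAML configuration into a plain dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def setup_logging(log_dir: str = "logs") -> logging.Logger:
    """Log to both a file and the console."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.FileHandler(f"{log_dir}/train.log"),
            logging.StreamHandler(),
        ],
    )
    return logging.getLogger("genie")
```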
Training progress can be monitored through:
- Log files in the specified log directory
- TensorBoard metrics
- Model checkpoints saved at regular intervals
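If TensorBoard event files are written to your configured log directory, they can be viewed with the standard TensorBoard CLI (the directory below is an assumption; use the path from your config):

```bash
tensorboard --logdir logs/
```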
- Always validate your configuration before training
- Monitor GPU memory usage during training
- Keep track of model checkpoints
- Use appropriate batch sizes for your GPU
- Evaluate model performance regularly
- For QLoRA, ensure CUDA is available for optimal performance
The project is designed to be easily extensible with new optimization methods. To add a new method:
- Create a new class inheriting from `OptimizationMethod`:
  ```python
  from peft import LoraConfig

  from optimization_methods import OptimizationMethod


  class YourNewMethod(OptimizationMethod):
      def __init__(self):
          super().__init__("your_method", "Description of your method")

      def load_model(self, pretrained_model: str, trust_remote_code: bool = True):
          # Your model loading logic
          pass

      def setup_lora_config(self, model_config: dict) -> LoraConfig:
          # Your LoRA configuration
          pass

      def get_training_config(self, use_cuda: bool) -> dict:
          # Your training configuration
          pass
  ```
- Register the method:

  ```python
  from optimization_methods import add_method

  add_method("your_method", YourNewMethod())
  ```
- Use it in training:

  ```bash
  python finetune.py --method your_method
  ```
See `example_add_method.py` for a complete working example.
This project includes comprehensive documentation to help you get started and contribute:
- README.md - Project overview and setup instructions
- CONTRIBUTING.md - Guidelines for contributing to the project
- LICENSE - Software license terms
Please read the Contributing Guidelines before submitting pull requests.
- Hugging Face Transformers library
- PEFT library for LoRA implementation
- Tokenizers library for efficient tokenization
- bitsandbytes library for quantization support