Finetune Llama 3 on Appian SAIL code (~12,000 samples) and run it locally on a single Nvidia GPU with 24 GB of VRAM.
Fine-tuning Technology in brief:
- QLoRA (4-bit quantization with LoRA); the final GGUF model is around 16 GB
- Base model: Meta-Llama-3-8B-Instruct
- LoRA config: rank=4, alpha=16, targeting the attention layers
- Training: 2 epochs, batch size=1, gradient accumulation=32
Merging & Conversion Process:
- Merge: PEFT to combine the LoRA weights with the base model
- Convert: llama.cpp tools to create the GGUF format
- Output: an f16-precision GGUF file for local inference
Notable Features:
- Memory-efficient: merging runs on the CPU with explicit garbage collection
- Custom dataset handler for structured code data
- Final output compatible with LM Studio and llama.cpp
===============================================================================
Finetuning Process Overview
The process consists of two main phases:
- Finetuning with LoRA (in sail-llama-finetune.py)
- Merging and Converting (in merge_and_convert_sail.py)
Phase 1: Finetuning with LoRA
The sail-llama-finetune.py script handles the actual finetuning process:
- Authentication Setup: Verifies access to the Meta Llama 3 model on Hugging Face
- Model Preparation: Loads the base model with 4-bit quantization
- LoRA Configuration: Sets up Low-Rank Adaptation parameters
- Dataset Preparation: Loads and processes the SAIL code samples (a sketch of this step follows the list)
- Training: Finetunes the model using the Hugging Face Trainer API
- Saving: Saves the resulting LoRA weights
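The dataset step is where the ~12,000 SAIL samples become training examples. Below is a minimal sketch of what such a handler can look like; the field names ("instruction", "code"), the JSON input format, and the use of the Llama 3 chat template are assumptions for illustration, not the script's actual implementation.

```python
# Hypothetical sketch of the dataset-preparation step. The field names
# ("instruction", "code") and the chat-template formatting are assumptions;
# sail-llama-finetune.py may structure its samples differently.
from datasets import Dataset


def load_sail_dataset(json_path: str, tokenizer, max_length: int = 2048) -> Dataset:
    """Load SAIL code samples and turn them into tokenized training examples."""
    raw = Dataset.from_json(json_path)

    def format_sample(sample):
        # Wrap each (instruction, SAIL code) pair in the Llama 3 chat template.
        messages = [
            {"role": "user", "content": sample["instruction"]},
            {"role": "assistant", "content": sample["code"]},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        tokens = tokenizer(text, truncation=True, max_length=max_length)
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return raw.map(format_sample, remove_columns=raw.column_names)
```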
Phase 2: Merging and Converting
The merge_and_convert_sail.py script takes the finetuned LoRA weights and:
- Merges LoRA Weights: Combines the LoRA weights with the base model (see the merge sketch after this list)
- Sets up llama.cpp: Clones and builds the llama.cpp repository
- Converts to GGUF: Transforms the merged model to GGUF format for efficient inference
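The merge itself is a short PEFT operation. The sketch below uses placeholder paths (./sail-llama-lora, ./sail-llama-merged); loading the base model on the CPU and calling gc.collect() mirrors the memory-efficient merging noted in the feature list above.

```python
# Minimal sketch of the merge step; both output paths are placeholders,
# not the script's actual directories.
import gc

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the base model on the CPU in fp16 so the 24 GB GPU is not needed for merging.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="cpu")
model = PeftModel.from_pretrained(base, "./sail-llama-lora")

# Fold the low-rank adapters back into the base weights.
merged = model.merge_and_unload()
merged.save_pretrained("./sail-llama-merged")
AutoTokenizer.from_pretrained(BASE).save_pretrained("./sail-llama-merged")

# Free memory explicitly between the large intermediate objects.
del base, model, merged
gc.collect()
```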
Technologies Used
The finetuning process leverages several key technologies:
- Parameter-Efficient Fine-Tuning (PEFT)
  - LoRA (Low-Rank Adaptation): Instead of updating all model parameters, LoRA adds small trainable low-rank "adapter" matrices to specific attention projections (q_proj, k_proj, v_proj, o_proj)
  - The configuration uses rank=4, alpha=16, and dropout=0.05 (see the LoraConfig sketch below)
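Expressed as a peft.LoraConfig, the configuration above looks roughly like this (variable names are illustrative; the hyperparameters match the ones listed in this section):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
```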
- Quantization
  - 4-bit Quantization: Uses the bitsandbytes library to reduce memory requirements
  - Specifically NF4 (normalized float 4) quantization with double quantization (see the BitsAndBytesConfig sketch below)
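In code, this corresponds to a BitsAndBytesConfig passed at model load time. The compute dtype below is an assumption (fp16, matching the mixed-precision training); the rest follows the settings listed above.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized float 4
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute, matching the training setup
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```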
- Training Framework
  - Hugging Face Transformers: For model loading and the training infrastructure
  - PyTorch: As the underlying deep learning framework
  - Gradient Checkpointing: To reduce memory usage during training (see the sketch below)
  - Mixed Precision Training (fp16): For faster training
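For a 4-bit QLoRA run these pieces typically come together as shown below; this is a sketch that assumes the quantized model and the lora_config from the earlier snippets.

```python
from peft import get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (casts layer norms, enables input grads,
# and turns on gradient checkpointing by default).
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False   # the KV cache is incompatible with checkpointing during training

# Attach the LoRA adapters; only these small matrices are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```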
- Optimization Parameters (collected in the TrainingArguments sketch below)
  - Learning Rate: 2e-4 with a cosine scheduler
  - Batch Size: 1 per device with gradient accumulation over 32 steps
  - Training Duration: 2 epochs
  - Optimizer: AdamW with beta1=0.9, beta2=0.999, epsilon=1e-8
  - Gradient Clipping: max norm of 0.3
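Collected into a transformers TrainingArguments object, these hyperparameters look roughly as follows; the output directory, logging, and save settings are placeholders not specified in this document.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sail-llama-lora",   # placeholder path
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_grad_norm=0.3,
    fp16=True,                        # mixed precision, as noted above
    gradient_checkpointing=True,
    logging_steps=10,                 # assumption; not specified here
    save_strategy="epoch",            # assumption; not specified here
)
```

These arguments are passed to the Hugging Face Trainer together with the prepared model and the tokenized dataset.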
- Model Conversion
  - llama.cpp: For efficient inference on consumer hardware
  - GGUF Format: The modern replacement for the GGML format, allowing optimized inference (see the conversion sketch below)
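A sketch of the conversion step: clone llama.cpp and run its Hugging Face-to-GGUF converter on the merged model. The script name and flags reflect recent llama.cpp versions and the paths are placeholders; check the llama.cpp repository for the exact invocation.

```python
import subprocess

# Clone llama.cpp, which ships the HF-to-GGUF conversion script.
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp"], check=True)

# Convert the merged model to an f16 GGUF file (paths are placeholders).
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "./sail-llama-merged",                     # merged model from the previous phase
        "--outfile", "sail-llama-3-8b-f16.gguf",
        "--outtype", "f16",                        # keep f16 precision, as described above
    ],
    check=True,
)
```

The resulting f16 GGUF file can then be loaded directly in llama.cpp or LM Studio.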