This repository contains the code, data, and benchmarks for the research paper: "Assessing the Impact of QLoRA Fine-Tuning on Sub-10B Parameter LLMs for Reasoning and Fact Recall."
This project evaluates the impact of QLoRA (Quantized Low-Rank Adaptation) fine-tuning on seven state-of-the-art Large Language Models (LLMs) with fewer than 10 billion parameters. The study focuses on enhancing two key capabilities:
- Fact Recall in the BioMedical domain 🩺
- Reasoning in Math Word Problems 🔢
By leveraging computationally efficient techniques like 4-bit quantization and LoRA, we fine-tuned models such as Llama 3.1, Mistral, and Gemma 2 on specialized datasets. Our findings highlight the trade-offs between a model's baseline performance, gains from fine-tuning, and potential for overfitting, offering insights for optimizing smaller LLMs for domain-specific tasks.
- For Math Tasks 🧮: Llama 3.1 showed the most significant performance improvement, achieving the top score of 4.97 after three epochs of fine-tuning.
- For BioMedical Tasks 🧬: Mistral and Llama 3.1 demonstrated the most stability and consistency, making them reliable choices. Many models had strong baseline scores, and extended fine-tuning sometimes led to overfitting (e.g., Qwen, Gemma).
- Overfitting Risk ⚠️: Models like Qwen and Gemma 2 exhibited sensitivity to extended fine-tuning, with performance declining after the first epoch in some cases.
We selected seven state-of-the-art, sub-10B-parameter models, using their 4-bit quantized versions from Unsloth (a loading sketch follows the model list below).
- Llama 2 (7B)
- Llama 3 (8B)
- Llama 3.1 (8B)
- Gemma 1 (7B)
- Gemma 2 (9B)
- Mistral v3 (7.3B)
- Qwen 2 (7B)
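For orientation, here is a minimal sketch of how one of these 4-bit checkpoints can be loaded and wrapped with LoRA adapters via Unsloth. The checkpoint name and the LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions; the actual values used in this work live in the training notebooks.

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized checkpoint from Unsloth's hub (the name here is an
# assumption; substitute any of the seven models listed above).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of low-rank matrices is trained.
# Rank, alpha, and target modules are typical defaults, not the paper's exact values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```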
- Orca-Math-200k: Used for fine-tuning on mathematical reasoning tasks. We used the first 10,000 samples for training.
- BioInstruct: A dataset with over 25,000 instructions covering a wide range of biomedical NLP tasks. We used the first 10,000 samples for training (see the loading sketch below for both datasets).
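Both training subsets can be pulled from the Hugging Face Hub. The dataset identifiers and column names below are assumptions (the notebooks may reference different mirrors), but the `train[:10000]` slice mirrors the 10,000-sample setup described above.

```python
from datasets import load_dataset

# Dataset IDs are assumptions; adjust to the ones referenced in the training notebooks.
math_train = load_dataset("microsoft/orca-math-word-problems-200k",
                          split="train[:10000]")  # first 10k math word problems
bio_train = load_dataset("bio-nlp-umass/bioinstruct",
                         split="train[:10000]")   # first 10k biomedical instructions

print(math_train.column_names)  # e.g. ['question', 'answer']
print(bio_train.column_names)   # e.g. ['instruction', 'input', 'output']
```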
- Frameworks: We used QLoRA for memory-efficient fine-tuning, combined with the Unsloth library for its reported 30x speedup and 60% memory reduction.
- Hardware: Training was conducted on Kaggle's T4 GPUs (16GB VRAM).
- Epochs: Models were trained for an initial epoch, evaluated, and then fine-tuned for two additional epochs (a representative training configuration is sketched below).
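A representative QLoRA training run in the Unsloth/TRL style on a 16 GB T4 might look like the sketch below. The batch size, learning rate, prompt field, and other hyperparameters are assumptions chosen to fit the GPU; `train_math.ipynb` and `train_bio.ipynb` hold the settings actually used.

```python
from trl import SFTTrainer
from transformers import TrainingArguments

# Illustrative hyperparameters only; sized for a 16 GB T4, not copied from the paper.
trainer = SFTTrainer(
    model=model,                   # LoRA-wrapped 4-bit model from the loading sketch
    tokenizer=tokenizer,
    train_dataset=math_train,      # or bio_train for the biomedical run
    dataset_text_field="text",     # assumes prompts are pre-formatted into a "text" column
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        num_train_epochs=1,        # evaluate after epoch 1, then resume for two more epochs
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,                 # T4 GPUs lack bf16 support
        logging_steps=25,
        optim="adamw_8bit",
        seed=42,
    ),
)
trainer.train()
```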
```
.
├── Code/
│   ├── train_math.ipynb             # Jupyter notebook for training models on the Orca-Math dataset.
│   ├── train_bio.ipynb              # Jupyter notebook for training models on the BioInstruct dataset.
│   └── Automated_benchmarking.ipynb # Notebook to run and evaluate all models.
│
├── Benchmarks/
│   ├── Base Model Benchmarks/       # Evaluation results for the original, non-finetuned models.
│   └── QLoRA Benchmarks/            # Evaluation results for the fine-tuned models (Epoch 1 & 3).
│
├── Benchmark Data/
│   ├── Llama-math.xlsx              # Human evaluation scores for Llama models on math tasks.
│   ├── Qwen-bio.xlsx                # Human evaluation scores for Qwen models on bio tasks.
│   └── ...                          # (And so on for all models and tasks.)
│
└── Questions with solutions/
    ├── Maths_Benchmark.txt          # The set of questions used for evaluating math reasoning.
    └── bio_benchmark.txt            # The set of questions used for evaluating biomedical recall.
```
- Setup: Ensure you have a suitable environment (like Kaggle or Google Colab with a T4 GPU).
- Train Models:
  - Run the `train_math.ipynb` notebook to fine-tune the models on the math dataset.
  - Run the `train_bio.ipynb` notebook to fine-tune the models on the biomedical dataset.
- Evaluate Performance:
  - Use the `Automated_benchmarking.ipynb` notebook to generate responses from both the base and fine-tuned models using the questions in the `Questions with solutions/` directory (a minimal generation sketch follows this list).
- Analyze Results: The raw model outputs are saved in the `Benchmarks/` directory, and the compiled human-rated scores are available in the `Benchmark Data/` spreadsheets.
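As a rough illustration of the evaluation step, the sketch below loads a fine-tuned adapter and generates answers to the math benchmark questions. The adapter path, prompt template, and one-question-per-line parsing are assumptions; `Automated_benchmarking.ipynb` implements the actual pipeline.

```python
from unsloth import FastLanguageModel

# The saved adapter path is hypothetical; point it at your own output directory.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/llama-3.1-math-epoch-3",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth into its faster inference mode

# Simplified parsing: assumes one question per non-empty line, which the real
# benchmark file (questions with solutions) may not follow exactly.
with open("Questions with solutions/Maths_Benchmark.txt", encoding="utf-8") as f:
    questions = [line.strip() for line in f if line.strip()]

for q in questions:
    inputs = tokenizer(f"Question: {q}\nAnswer:", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```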