This project fine-tunes the Gemma-3-1B-IT model using Reinforcement Learning (RL) with the Group Relative Policy Optimization (GRPO) algorithm. The goal is to enhance the model's reasoning and solution generation for mathematical problem-solving using the GSM8K dataset.
- LoRA fine-tuning with Unsloth for efficient training (see the setup sketch after this list).
- GRPO reward-based training with multiple reward functions.
- Dataset processing for GSM8K, extracting structured answers.
- Custom reward functions to ensure correct formatting and correctness.
- Fast inference support with vLLM.
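As a rough illustration of the LoRA + Unsloth feature above, the sketch below shows how a model is typically loaded and wrapped with LoRA adapters through Unsloth's standard API. The model name matches this project, but every hyperparameter value here is an illustrative placeholder rather than the project's actual setting:

```python
# Minimal sketch of LoRA setup with Unsloth (illustrative values only).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",  # pretrained model ID
    max_seq_length=1024,                # sequence length for training
    load_in_4bit=True,                  # quantized loading to save VRAM
)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                               # LoRA rank ("lora_r" in config.json)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```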
```
abdo-ashraf-gemma-3-rl-finetuning/
├── README.md                 # Project documentation
├── LICENSE                   # MIT License
├── __init__.py               # Package initialization
├── config.json               # Training configurations
├── main.py                   # Entry point to start training
├── requirements.txt          # Dependencies
├── train.py                  # Training script
├── Notebooks/                # Jupyter notebooks for analysis
├── outputs/                  # Model checkpoints and logs
└── utils/                    # Utility scripts
    ├── __init__.py
    ├── load_gsm8k_dataset.py # Dataset loader
    └── reward_funcs.py       # Reward functions for RL
```
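The dataset loader in `utils/load_gsm8k_dataset.py` extracts structured answers from GSM8K. GSM8K reference solutions end with `#### <final answer>`, so the ground-truth answer can be split off the annotated solution text. Here is a minimal sketch of how such a loader typically works; the function names are assumptions, not the project's actual API:

```python
# Hypothetical sketch of GSM8K answer extraction; GSM8K solutions
# end with "#### <final answer>", which serves as the ground truth.
from datasets import load_dataset

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def load_gsm8k(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": x["question"],
        "answer": extract_hash_answer(x["answer"]),
    })
```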
- Python 3.9+
- CUDA-enabled GPU (recommended)
1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/abdo-ashraf-gemma-3-rl-finetuning.git
   cd abdo-ashraf-gemma-3-rl-finetuning
   ```
2. Create a virtual environment and activate it:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
To start the training process, run:
```bash
python main.py --config_path config.json
```
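Under the hood, `main.py` presumably reads the JSON config and hands it to the training code. A minimal sketch of that wiring (the `train` import is an assumption about `train.py`'s interface, not the actual code):

```python
# Hypothetical sketch of main.py's entry point.
import argparse
import json

from train import train  # assumed entry point exposed by train.py

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", default="config.json")
    args = parser.parse_args()

    with open(args.config_path) as f:
        config = json.load(f)  # training hyperparameters from config.json

    train(config)  # kick off GRPO fine-tuning with the loaded settings
```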
Modify `config.json` to adjust parameters such as:

- `model_id`: Pretrained model ID
- `max_seq_length`: Sequence length for training
- `lora_r`: Rank for LoRA fine-tuning
- `learning_rate`: Learning rate for GRPO training
- `max_steps`: Total training steps
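For reference, an illustrative `config.json` using the keys above; the values are placeholders, not the project's defaults:

```json
{
  "model_id": "google/gemma-3-1b-it",
  "max_seq_length": 1024,
  "lora_r": 16,
  "learning_rate": 5e-6,
  "max_steps": 250,
  "output_dir": "outputs"
}
```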
Reward functions guide reinforcement learning by scoring model outputs:
- `correctness_reward_func`: Rewards correct answers.
- `int_reward_func`: Rewards integer answers.
- `strict_format_reward_func`: Ensures strict XML formatting.
- `soft_format_reward_func`: Allows flexible formatting.
- `xmlcount_reward_func`: Penalizes extra content outside XML tags.
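As a rough sketch of what one of these looks like, here is an illustrative `correctness_reward_func` written against the TRL `GRPOTrainer` convention (reward functions receive prompts, completions, and dataset columns, and return one float per completion). The exact signatures and reward magnitudes in `utils/reward_funcs.py` may differ:

```python
import re

def extract_xml_answer(text: str) -> str:
    # Pull the text between <answer> tags; empty string if absent.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Completions arrive as chat-style message lists under the TRL
    # GRPOTrainer convention; reward exact matches with 2.0, else 0.0
    # (the magnitudes here are illustrative).
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_xml_answer(r) == a else 0.0
            for r, a in zip(responses, answer)]
```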
- Trained models and logs are saved in the `outputs/` directory.
- Modify `output_dir` in `config.json` to change the save location.
The examples below show the model's responses to the same question and system prompt, before and after fine-tuning.
System Prompt:

```
You are given a problem.
Think about the problem and provide your working out.
Place it between <reasoning> and </reasoning>.
Then, provide your solution between <answer></answer>
```
Question:

```
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
```
Model Response (before fine-tuning):

```
<answer>100.0</answer>
```
Model Response (after fine-tuning):

```
<reasoning>
Let the cost of the wallet be $100.
Betty has half of the money she needs, so she has $\frac{1}{2} \times 100 = 50$.
Her parents gave her $15.
Her grandparents gave her twice as much as her parents, so grandparents gave her $2 \times 15 = 30$.
The total amount of money her grandparents gave her is $30$.
So, the total amount of money Betty has is $50 + 15 + 30 = 95$.
The amount of money she needs is $100 - 95 = 5$.
Therefore, Betty needs $5 more to buy the wallet.
</reasoning>
<answer>5</answer>
```
This demonstrates that fine-tuning significantly improved the model's ability to break down the problem and produce a structured response with correct reasoning.
This project is licensed under the MIT License. See LICENSE for details.
For questions, feel free to reach out:
- Author: Abdelrhman Ashraf
- Email: abdoashraff185@gmail.com