This project fine-tunes the Gemma-3-1B-IT model using Reinforcement Learning (RL) with the Group Relative Policy Optimization (GRPO) algorithm. The goal is to enhance the model's reasoning and solution generation for mathematical problem-solving using the GSM8K dataset.
- LoRA fine-tuning with Unsloth for efficient training (see the setup sketch after this list).
- GRPO reward-based training with multiple reward functions.
- Dataset processing for GSM8K, extracting structured answers.
- Custom reward functions to ensure correct formatting and correctness.
- Fast inference support with vLLM.
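As a rough illustration of the LoRA + Unsloth feature above, the sketch below shows how a model is typically loaded and wrapped with LoRA adapters through Unsloth's standard API. The model name matches this project, but every hyperparameter value here is an illustrative placeholder rather than the project's actual setting:

```python
# Minimal sketch of LoRA setup with Unsloth (illustrative values only).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",  # pretrained model ID
    max_seq_length=1024,                # sequence length for training
    load_in_4bit=True,                  # quantized loading to save VRAM
)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                               # LoRA rank ("lora_r" in config.json)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```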
```
abdo-ashraf-gemma-3-rl-finetuning/
├── README.md                 # Project documentation
├── LICENSE                   # MIT License
├── __init__.py               # Package initialization
├── config.json               # Training configurations
├── main.py                   # Entry point to start training
├── requirements.txt          # Dependencies
├── train.py                  # Training script
├── Notebooks/                # Jupyter notebooks for analysis
├── outputs/                  # Model checkpoints and logs
└── utils/                    # Utility scripts
    ├── __init__.py
    ├── load_gsm8k_dataset.py # Dataset loader
    └── reward_funcs.py       # Reward functions for RL
```
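The dataset loader in `utils/load_gsm8k_dataset.py` extracts structured answers from GSM8K. GSM8K reference solutions end with `#### <final answer>`, so the ground-truth answer can be split off the annotated solution text. Here is a minimal sketch of how such a loader typically works; the function names are assumptions, not the project's actual API:

```python
# Hypothetical sketch of GSM8K answer extraction; GSM8K solutions
# end with "#### <final answer>", which serves as the ground truth.
from datasets import load_dataset

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def load_gsm8k(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": x["question"],
        "answer": extract_hash_answer(x["answer"]),
    })
```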
- Python 3.9+
- CUDA-enabled GPU (recommended)
1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/abdo-ashraf-gemma-3-rl-finetuning.git
   cd abdo-ashraf-gemma-3-rl-finetuning
   ```
2. Create a virtual environment and activate it:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
To start the training process, run:
```bash
python main.py --config_path config.json
```
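Under the hood, `main.py` presumably reads the JSON config and hands it to the training code. A minimal sketch of that wiring (the `train` import is an assumption about `train.py`'s interface, not the actual code):

```python
# Hypothetical sketch of main.py's entry point.
import argparse
import json

from train import train  # assumed entry point exposed by train.py

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", default="config.json")
    args = parser.parse_args()

    with open(args.config_path) as f:
        config = json.load(f)  # training hyperparameters from config.json

    train(config)  # kick off GRPO fine-tuning with the loaded settings
```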
Modify `config.json` to adjust parameters such as:

- `model_id`: Pretrained model ID
- `max_seq_length`: Sequence length for training
- `lora_r`: Rank for LoRA fine-tuning
- `learning_rate`: Learning rate for GRPO training
- `max_steps`: Total training steps
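For reference, an illustrative `config.json` using the keys above; the values are placeholders, not the project's defaults:

```json
{
  "model_id": "google/gemma-3-1b-it",
  "max_seq_length": 1024,
  "lora_r": 16,
  "learning_rate": 5e-6,
  "max_steps": 250,
  "output_dir": "outputs"
}
```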
Reward functions guide reinforcement learning by scoring model outputs:
- `correctness_reward_func`: Rewards correct answers.
- `int_reward_func`: Rewards integer answers.
- `strict_format_reward_func`: Ensures strict XML formatting.
- `soft_format_reward_func`: Allows flexible formatting.
- `xmlcount_reward_func`: Penalizes extra content outside XML tags.
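As a rough sketch of what one of these looks like, here is an illustrative `correctness_reward_func` written against the TRL `GRPOTrainer` convention (reward functions receive prompts, completions, and dataset columns, and return one float per completion). The exact signatures and reward magnitudes in `utils/reward_funcs.py` may differ:

```python
import re

def extract_xml_answer(text: str) -> str:
    # Pull the text between <answer> tags; empty string if absent.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Completions arrive as chat-style message lists under the TRL
    # GRPOTrainer convention; reward exact matches with 2.0, else 0.0
    # (the magnitudes here are illustrative).
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_xml_answer(r) == a else 0.0
            for r, a in zip(responses, answer)]
```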
- Trained models and logs are saved in the `outputs/` directory.
- Modify `output_dir` in `config.json` to change the save location.
The examples below show the model's responses to the same question and system prompt, before and after fine-tuning.
System Prompt:

```
You are given a problem.
Think about the problem and provide your working out.
Place it between <reasoning> and </reasoning>.
Then, provide your solution between <answer></answer>
```
Question:

```
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
```
Model Response (before fine-tuning):

```
<answer>100.0</answer>
```
Model Response (after fine-tuning):

```
<reasoning>
Let the cost of the wallet be $100.
Betty has half of the money she needs, so she has $\frac{1}{2} \times 100 = 50$.
Her parents gave her $15.
Her grandparents gave her twice as much as her parents, so grandparents gave her $2 \times 15 = 30$.
The total amount of money her grandparents gave her is $30$.
So, the total amount of money Betty has is $50 + 15 + 30 = 95$.
The amount of money she needs is $100 - 95 = 5$.
Therefore, Betty needs $5 more to buy the wallet.
</reasoning>
<answer>5</answer>
```
This demonstrates that fine-tuning significantly improved the model's ability to break down the problem and produce a structured response with correct reasoning.
This project is licensed under the MIT License. See LICENSE for details.
For questions, feel free to reach out:
- Author: Abdelrhman Ashraf
- Email: abdoashraff185@gmail.com