RL4LMS is a flexible library for fine-tuning large language models (LLMs) with reinforcement learning, centered on the GRPO (Group Relative Policy Optimization) algorithm. It gives researchers and practitioners a robust framework for implementing custom reward functions, environments, and training loops to optimize language models for specific tasks.
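For orientation, GRPO replaces PPO's learned value baseline with a group-relative advantage: for each prompt, a group of G completions is sampled and each completion's reward is normalized against the group's mean and standard deviation. The simplified, sequence-level form below follows the original GRPO formulation and is shown only for reference; the exact loss used by this library lives in `rl4lms/losses/grpo_loss.py`.

```math
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}, \qquad
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
\min\big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right]
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```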
## Table of Contents

- Features
- Installation
- Quick Start
- Project Structure
- Custom Reward Functions
- Documentation
- Contributing
- License
- Contact
- Acknowledgments
## Features

RL4LMS comes packed with features designed to streamline the process of fine-tuning language models:

- 🚀 Flexible Reward Function API: Intuitive interface for defining custom reward functions tailored to your specific task
- 🤗 HuggingFace Integration: Seamless compatibility with HuggingFace Transformers models
- ⚡ Efficient Training: Optimized for both single- and multi-GPU training with minimal setup
- 🧩 Extensible Architecture: Modular design that makes it easy to add new components and environments
- 📊 Built-in Evaluation: Comprehensive tools for monitoring and evaluating model performance
- 🎮 Wordle Environment: Built-in Wordle game environment for RL training and experimentation (see the sketch after this list)
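The Wordle environment lives in `rl4lms/envs/wordle_env.py`. Its interface is not documented in this README, so the snippet below is only a hypothetical sketch that assumes a gym-style `reset()`/`step()` API; check `wordle_env.py` for the actual method names and signatures.

```python
# Hypothetical sketch -- assumes WordleEnv exposes a gym-style reset/step API.
# Verify against rl4lms/envs/wordle_env.py before relying on these names.
from rl4lms.envs.wordle_env import WordleEnv

env = WordleEnv()                # assumed constructor with a default word list
observation = env.reset()        # assumed: returns the initial game state/prompt
print(observation)

# Play one illustrative guess; the step() signature is an assumption.
observation, reward, done, info = env.step("crane")
print(f"reward={reward}, done={done}")
```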
## Installation

RL4LMS can be installed in just a few steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/YanCotta/reinforcement-fine-tuning-llms-with-grpo.git
   cd reinforcement-fine-tuning-llms-with-grpo
   ```

2. Set up a virtual environment (recommended):

   ```bash
   # Create and activate a virtual environment
   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On macOS/Linux:
   source venv/bin/activate
   ```

3. Install the package in development mode:

   ```bash
   pip install -e .
   ```

4. Install additional dependencies:

   ```bash
   pip install -r requirements.txt
   ```

For contributing to the project or running tests, install the development extras:

```bash
pip install -e ".[dev]"
```
## Quick Start

RL4LMS includes a ready-to-use implementation for fine-tuning language models on the Wordle game. Here's how to get started:

1. Prepare your environment as described in the Installation section.

2. Run the example script:

   ```bash
   python examples/wordle_finetuning.py
   ```
Here's a minimal example showing how to use RL4LMS to fine-tune a model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from rl4lms.trainer import GRPOTrainer
from rl4lms.reward_functions.wordle import WordleRewardFunction
from rl4lms.envs.wordle_env import WordleEnv

# Initialize components
model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_fn = WordleRewardFunction()

# Create trainer and start training
# (train_dataset and eval_dataset must be prepared beforehand)
trainer = GRPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    reward_fn=reward_fn,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    batch_size=8,
    num_epochs=3,
    learning_rate=1e-5,
    output_dir="./wordle_grpo_output",
)

trainer.train()
```
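The example above references `train_dataset` and `eval_dataset` without constructing them, and the README does not specify the dataset format the trainer expects. Purely as a hypothetical illustration, the sketch below assumes plain lists of prompt strings; consult `rl4lms/trainer/grpo_trainer.py` for the actual expected format.

```python
# Hypothetical sketch -- the dataset format expected by GRPOTrainer is an
# assumption here (plain lists of prompt strings); check grpo_trainer.py.
train_dataset = [
    "You are playing Wordle. Guess a five-letter word.",
    "You are playing Wordle. Your last guess was CRANE; C and E are absent. Guess again.",
]
eval_dataset = [
    "You are playing Wordle. Guess a five-letter word.",
]
```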
## Project Structure

```text
rl4lms/
├── envs/                    # Environment implementations
│   ├── __init__.py
│   └── wordle_env.py        # Wordle game environment
├── losses/
│   ├── __init__.py
│   └── grpo_loss.py         # GRPO loss implementation
├── models/                  # Model architectures
│   └── __init__.py
├── reward_functions/        # Reward function implementations
│   ├── __init__.py
│   ├── base.py              # Base reward function class
│   └── wordle.py            # Wordle-specific reward functions
├── trainer/
│   ├── __init__.py
│   └── grpo_trainer.py      # Training loop implementation
└── utils/                   # Utility functions
    └── __init__.py
examples/                    # Example scripts
└── wordle_finetuning.py     # Wordle fine-tuning example
tests/                       # Unit tests
└── test_reward_functions.py
```
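The package layout above maps directly onto the import paths used throughout this README:

```python
# Import paths corresponding to the modules listed above
# (the same imports used elsewhere in this README).
from rl4lms.trainer import GRPOTrainer                            # trainer/grpo_trainer.py
from rl4lms.reward_functions import RewardFunction                # reward_functions/base.py
from rl4lms.reward_functions.wordle import WordleRewardFunction   # reward_functions/wordle.py
from rl4lms.envs.wordle_env import WordleEnv                      # envs/wordle_env.py
```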
## Custom Reward Functions

To create a custom reward function, inherit from the `RewardFunction` base class and implement the `__call__` method:
```python
from rl4lms.reward_functions import RewardFunction
import torch


class MyRewardFunction(RewardFunction):
    def __init__(self, **kwargs):
        super().__init__()
        # Initialize any parameters

    def __call__(self, prompt_texts, generated_texts, **kwargs):
        """
        Calculate rewards for generated text.

        Args:
            prompt_texts: List of input prompts
            generated_texts: List of generated texts to score
            **kwargs: Additional metadata

        Returns:
            torch.Tensor: Tensor of rewards for each generated text
        """
        # Calculate rewards here
        rewards = torch.ones(len(generated_texts))  # Example: return 1 for each text
        return rewards
```
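A custom reward function can be sanity-checked on its own before plugging it into the trainer, for example:

```python
# Quick check of the reward function defined above, outside of training.
reward_fn = MyRewardFunction()

prompts = ["Guess the five-letter word.", "Guess the five-letter word."]
generations = ["crane", "slate"]

rewards = reward_fn(prompts, generations)
print(rewards)  # tensor([1., 1.]) for this placeholder implementation

# The instance can then be passed to GRPOTrainer via the reward_fn argument,
# exactly like WordleRewardFunction in the Quick Start example.
```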
## Documentation

For detailed documentation, including API references, advanced usage examples, and tutorials, please visit our documentation site.
## Contributing

We welcome contributions from the community! Whether you're fixing bugs, adding new features, or improving documentation, your help is greatly appreciated.

1. Fork the repository on GitHub
2. Clone your fork locally
3. Create a new branch for your changes
4. Commit your changes with clear, descriptive messages
5. Push your changes to your fork
6. Open a Pull Request with a clear description of your changes
### Development Setup

1. Install development dependencies:

   ```bash
   pip install -e ".[dev]"
   ```

2. Run tests:

   ```bash
   pytest tests/
   ```

3. Format your code:

   ```bash
   black .
   isort .
   ```

4. Check for code style issues:

   ```bash
   flake8 src tests
   mypy src
   ```
## Contact

For questions, suggestions, or support, please reach out:
- Email: yanpcotta@gmail.com
- GitHub: @YanCotta
- Issues: Open an issue
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- This project was inspired by the course "Reinforcement Fine-Tuning LLMs With GRPO".
- Built with ❤️ using PyTorch and HuggingFace Transformers.