This repo provides an official implementation of *Behavior Injection: Preparing Language Models for Reinforcement Learning*. In this paper, we analyze the per-step training influence in RL finetuning and identify two key factors: (1) rollout accuracy, and (2) data co-influence, which quantifies how much a training sample affects performance on other samples. We then propose to inject exploration and exploitation behaviors to prepare LLMs for reinforcement learning finetuning.
We use Anaconda or Miniconda to manage the Python environment.

- Create the conda env:

```shell
cd Bridge-LLM-reasoning
conda create -n bridge python=3.10
conda activate bridge
```
- Install PyTorch according to your platform and CUDA version; we use PyTorch 2.6.0 with CUDA 12.4 here:

```shell
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```
- Install flash-attention and vLLM (we use vLLM 0.8.1):

```shell
pip install flash-attn --no-build-isolation
pip install vllm==0.8.1
```
- Install iGSM-reasoning:

```shell
cd iGSM-reasoning
pip install -e .
```

See `iGSM-reasoning/README.md` for a detailed introduction to iGSM-reasoning.
- Install simple VeRL (it is mainly derived from VeRL; our main modifications simplify some of the code):

```shell
cd ../simple_verl
pip install -e .
```
- Log in to wandb and Hugging Face:

```shell
wandb login
huggingface-cli login
```
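Optionally, once the steps above are done, you can sanity-check the environment with a short script (not part of the repo; the expected versions match the pins used above):

```python
# Optional environment sanity check (not part of the repo's scripts).
import torch
import flash_attn
import vllm

print("torch:", torch.__version__)              # expect 2.6.0+cu124
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
print("vllm:", vllm.__version__)                # expect 0.8.1
```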
We will prepare the dataset for both SFT and RL training.

- Go back to the top-level directory `Bridge-LLM-reasoning/` (`cd ..`).
- Generate datasets for SFT and RL: run `experiment/process_data/data_generation.sh`. It will create the `data/iGSM` directory and save the dataset there.
- Preprocess the datasets and convert them to `.parquet` files: run `experiment/process_data/preprocess_igsm_data.sh`. The preprocessing adds the system prompt, applies the SFT query and answer templates, and handles other miscellaneous steps.
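To peek at the processed data, here is a minimal sketch (the file name under `data/iGSM` is a placeholder; check the actual output of the preprocessing script):

```python
# Inspect a processed parquet file (the file name is a placeholder;
# check data/iGSM for the actual outputs of preprocess_igsm_data.sh).
import pandas as pd

df = pd.read_parquet("data/iGSM/train.parquet")
print(df.columns.tolist())
print(df.iloc[0])
```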
We assume the experiments are run on a server with 2x A100 (80GB) GPUs. Run the experiments (SFT + RL) via `experiment/run_qwen2.5-1.5B-igsm.sh`. You can use the other scripts to run other models. The script includes two parts:

- SFT training. The model will be saved to the `model/sft` dir. Remember to modify the training data path if you use a different dataset.
- RL training. The model will be saved to the `checkpoints/{project_name}` dir. You may need to convert the `.pt` files to `.safetensors` when using vLLM for inference or pushing the model to Hugging Face (see the sketch after this list).
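A minimal conversion sketch, assuming the checkpoint has already been consolidated into a single flat state dict (file names are placeholders; sharded FSDP checkpoints must be merged first):

```python
# Convert a consolidated .pt state dict to .safetensors (file names are placeholders).
import torch
from safetensors.torch import save_file

state_dict = torch.load("checkpoints/{project_name}/model.pt", map_location="cpu")
# safetensors expects contiguous, non-aliased tensors
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "checkpoints/{project_name}/model.safetensors")
```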
You can also:

- modify batch sizes such as `xxx_batch_size_per_gpu` according to the memory usage;
- decrease `gpu_memory_utilization` if GPU memory is not enough;
- offload the parameters / optimizer states if GPU memory is still not enough, but this will significantly slow down the experiment;
- un-comment `export VLLM_ATTENTION_BACKEND=XFORMERS` at the beginning of the script if you encounter vLLM bugs with the V1 engine; the V0 engine will then be used for inference.
Currently, we manually run vLLM inference from the checkpoints. See the evaluation `readme.md` for instructions.
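For reference, a minimal offline-inference sketch with vLLM from a converted checkpoint (the model path and prompt are placeholders, not the actual evaluation pipeline):

```python
# Minimal offline inference with vLLM from a converted checkpoint.
# The model path and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/{project_name}/converted_hf_model")
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["<your iGSM question here>"], sampling_params)
print(outputs[0].outputs[0].text)
```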