Flow-GRPO: Training Flow Matching Models via Online RL

🤗 Model

Task                         Model
GenEval                      🤗GenEval
Text Rendering               🤗Text
Human Preference Alignment   🤗PickScore

🚀 Quick Start

1. Environment Setup

Clone this repository and install packages.

git clone https://github.com/yifan123/flow_grpo.git
cd flow_grpo
conda create -n flow_grpo python=3.10.16
conda activate flow_grpo
pip install -e .

2. Reward Preparation

The steps above only install the current repository. Because different reward models may depend on conflicting package versions, installing them all in one Conda environment can cause version conflicts. To avoid this, we adopt a remote-server setup inspired by ddpo-pytorch: each reward model runs as its own service, and you only need to install the specific reward model you plan to use.
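
As a rough illustration of the remote-server pattern, the snippet below posts base64-encoded images and their prompts to a reward service and reads back scalar scores. The endpoint URL, port, and payload layout are placeholders, not the actual reward-server API; follow the reward-server instructions for the real interface.

import base64, io
import requests

def query_reward_server(images, prompts, url="http://localhost:18085/score"):
    # Encode each PIL image as base64 PNG so it can travel inside a JSON payload.
    payload = []
    for img, prompt in zip(images, prompts):
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        payload.append({
            "image": base64.b64encode(buf.getvalue()).decode("utf-8"),
            "prompt": prompt,
        })
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()  # e.g. one scalar reward per image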

GenEval

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

OCR

Please install PaddleOCR and python-Levenshtein:

pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein

Then pre-download the OCR model from a Python shell:

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
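
For a quick sanity check of the OCR pipeline, the sketch below scores an image by the normalized Levenshtein similarity between the recognized text and a target string. This only illustrates the idea; the exact scoring used during training may differ, and the function name is ours.

import Levenshtein
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

def ocr_reward(image_path, target_text):
    # Recognize text in the image, then compare it to the target string.
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []
    recognized = " ".join(line[1][0] for line in lines)
    dist = Levenshtein.distance(recognized.lower(), target_text.lower())
    return max(0.0, 1.0 - dist / max(len(target_text), 1))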

PickScore

PickScore requires no additional installation.

DeQA

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

UnifiedReward

We use sglang to deploy the reward service and recommend sglang or vllm for serving VLM-based reward models. After installing sglang, run the following command to launch UnifiedReward:

python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85
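
Once the server is up, you can sanity-check it through sglang's OpenAI-compatible endpoint. The prompt and image URL below are placeholders; the training code constructs its own scoring prompts.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:17140/v1", api_key="flowgrpo")
response = client.chat.completions.create(
    model="CodeGoat24/UnifiedReward-7b-v1.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            {"type": "text", "text": "Rate how well this image matches the prompt 'a red cube on a blue sphere' from 1 to 10."},
        ],
    }],
)
print(response.choices[0].message.content)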

3. Start Training

Single-node training:

bash scripts/single_node/main.sh

Multi-node training:

# Master node
bash scripts/multi_node/main.sh
# Other nodes
bash scripts/multi_node/main1.sh
bash scripts/multi_node/main2.sh

🏁 Multi-Reward Training

For multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight. For example:

{
    "pickscore": 0.5,
    "ocr": 0.2,
    "aesthetic": 0.3
}

This means the final reward is a weighted sum of the individual rewards.
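
Conceptually, the weighted sum can be computed as in the sketch below. The reward_fns mapping and the function name are illustrative, not the repo's actual interface.

weights = {"pickscore": 0.5, "ocr": 0.2, "aesthetic": 0.3}

def combined_reward(images, prompts, reward_fns):
    # reward_fns maps each reward name to a callable returning one score per image.
    scores = {name: reward_fns[name](images, prompts) for name in weights}
    return [
        sum(weights[name] * scores[name][i] for name in weights)
        for i in range(len(images))
    ]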

The following reward models are currently supported:

  • GenEval evaluates T2I models on complex compositional prompts.
  • OCR provides an OCR-based reward.
  • PickScore is a general-purpose T2I reward model trained on human preferences.
  • DeQA is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.
  • ImageReward is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.
  • QwenVL is an experimental reward model using prompt engineering.
  • Aesthetic is a CLIP-based linear regressor predicting image aesthetic scores.
  • JPEG_Compressibility measures the JPEG-compressed image size as a proxy for quality (a minimal sketch follows this list).
  • UnifiedReward is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.
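
As one concrete example of the simpler rewards, here is a minimal JPEG-compressibility sketch in the spirit of ddpo-pytorch: encode the image as JPEG and use the negative size in kB, so images that compress well score higher. The function name and quality setting are illustrative.

import io
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image) -> float:
    # Smaller JPEG-encoded size => higher (less negative) reward.
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=95)
    return -buf.tell() / 1024.0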

✨ Important Hyperparameters

You can adjust the parameters in config/dgx.py to tune different hyperparameters. An empirical finding is that config.sample.train_batch_size * num_gpu / config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48 works well, i.e., group_number = 48 and group_size = 24, where group_size is config.sample.num_image_per_prompt. Additionally, setting config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch // 2 also yields good performance.
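
For example, the following hypothetical values (not the defaults in config/dgx.py) satisfy both rules:

# Hypothetical values, shown only to make the arithmetic concrete.
num_gpu = 8
train_batch_size = 12        # config.sample.train_batch_size
num_image_per_prompt = 24    # config.sample.num_image_per_prompt (group_size)
num_batches_per_epoch = 12   # config.sample.num_batches_per_epoch

group_number = train_batch_size * num_gpu / num_image_per_prompt * num_batches_per_epoch
assert group_number == 48

gradient_accumulation_steps = num_batches_per_epoch // 2  # = 6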

🤗 Acknowledgement

This repo is based on ddpo-pytorch and diffusers. We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent ddpo-pytorch repo.

⭐ Citation

@misc{liu2025flowgrpo,
      title={Flow-GRPO: Training Flow Matching Models via Online RL}, 
      author={Jie Liu and Gongye Liu and Jiajun Liang and Yangguang Li and Jiaheng Liu and Xintao Wang and Pengfei Wan and Di Zhang and Wanli Ouyang},
      year={2025},
      eprint={2505.05470},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05470}, 
}
