Flow-GRPO: Training Flow Matching Models via Online RL

🤗 Model

Task                         Model
GenEval                      🤗GenEval
Text Rendering               🤗Text
Human Preference Alignment   🤗PickScore

🚀 Quick Start

1. Environment Setup

Clone this repository and install packages.

git clone https://github.com/yifan123/flow_grpo.git
cd flow_grpo
conda create -n flow_grpo python=3.10.16
conda activate flow_grpo
pip install -e .

2. Reward Preparation

The steps above only install the current repository. Because different reward models may depend on conflicting package versions, installing them all in one Conda environment can cause version conflicts. To avoid this, we adopt a remote-server setup inspired by ddpo-pytorch: each reward model runs as its own service, and you only need to install the specific reward model you plan to use.
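
As a rough illustration of the remote-server pattern, the snippet below posts base64-encoded images and their prompts to a reward service and reads back scalar scores. The endpoint URL, port, and payload layout are placeholders, not the actual reward-server API; follow the reward-server instructions for the real interface.

import base64, io
import requests

def query_reward_server(images, prompts, url="http://localhost:18085/score"):
    # Encode each PIL image as base64 PNG so it can travel inside a JSON payload.
    payload = []
    for img, prompt in zip(images, prompts):
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        payload.append({
            "image": base64.b64encode(buf.getvalue()).decode("utf-8"),
            "prompt": prompt,
        })
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()  # e.g. one scalar reward per image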

GenEval

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

OCR

Please install PaddleOCR and python-Levenshtein:

pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein

Then pre-download the OCR model from a Python shell:

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
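
For a quick sanity check of the OCR pipeline, the sketch below scores an image by the normalized Levenshtein similarity between the recognized text and a target string. This only illustrates the idea; the exact scoring used during training may differ, and the function name is ours.

import Levenshtein
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

def ocr_reward(image_path, target_text):
    # Recognize text in the image, then compare it to the target string.
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []
    recognized = " ".join(line[1][0] for line in lines)
    dist = Levenshtein.distance(recognized.lower(), target_text.lower())
    return max(0.0, 1.0 - dist / max(len(target_text), 1))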

PickScore

PickScore requires no additional installation.

DeQA

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

UnifiedReward

We use sglang to deploy the reward service and recommend sglang or vllm for serving VLM-based reward models. After installing sglang, run the following command to launch UnifiedReward:

python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85
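
Once the server is up, you can sanity-check it through sglang's OpenAI-compatible endpoint. The prompt and image URL below are placeholders; the training code constructs its own scoring prompts.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:17140/v1", api_key="flowgrpo")
response = client.chat.completions.create(
    model="CodeGoat24/UnifiedReward-7b-v1.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            {"type": "text", "text": "Rate how well this image matches the prompt 'a red cube on a blue sphere' from 1 to 10."},
        ],
    }],
)
print(response.choices[0].message.content)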

3. Start Training

Single-node training:

bash scripts/single_node/main.sh

Multi-node training:

# Master node
bash scripts/multi_node/main.sh
# Other nodes
bash scripts/multi_node/main1.sh
bash scripts/multi_node/main2.sh

🏁 Multi-Reward Training

For multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight. For example:

{
    "pickscore": 0.5,
    "ocr": 0.2,
    "aesthetic": 0.3
}

This means the final reward is a weighted sum of the individual rewards.
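
Conceptually, the weighted sum can be computed as in the sketch below. The reward_fns mapping and the function name are illustrative, not the repo's actual interface.

weights = {"pickscore": 0.5, "ocr": 0.2, "aesthetic": 0.3}

def combined_reward(images, prompts, reward_fns):
    # reward_fns maps each reward name to a callable returning one score per image.
    scores = {name: reward_fns[name](images, prompts) for name in weights}
    return [
        sum(weights[name] * scores[name][i] for name in weights)
        for i in range(len(images))
    ]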

The following reward models are currently supported:

  • GenEval evaluates T2I models on complex compositional prompts.
  • OCR provides an OCR-based reward.
  • PickScore is a general-purpose T2I reward model trained on human preferences.
  • DeQA is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.
  • ImageReward is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.
  • QwenVL is an experimental reward model using prompt engineering.
  • Aesthetic is a CLIP-based linear regressor predicting image aesthetic scores.
  • JPEG_Compressibility measures the JPEG-compressed image size as a proxy for quality (a minimal sketch follows this list).
  • UnifiedReward is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.
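
As one concrete example of the simpler rewards, here is a minimal JPEG-compressibility sketch in the spirit of ddpo-pytorch: encode the image as JPEG and use the negative size in kB, so images that compress well score higher. The function name and quality setting are illustrative.

import io
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image) -> float:
    # Smaller JPEG-encoded size => higher (less negative) reward.
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=95)
    return -buf.tell() / 1024.0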

✨ Important Hyperparameters

You can adjust the parameters in config/dgx.py to tune different hyperparameters. An empirical finding is that config.sample.train_batch_size * num_gpu / config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48 works well, i.e., group_number = 48 and group_size = 24, where group_size is config.sample.num_image_per_prompt. Additionally, setting config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch // 2 also yields good performance.
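
For example, the following hypothetical values (not the defaults in config/dgx.py) satisfy both rules:

# Hypothetical values, shown only to make the arithmetic concrete.
num_gpu = 8
train_batch_size = 12        # config.sample.train_batch_size
num_image_per_prompt = 24    # config.sample.num_image_per_prompt (group_size)
num_batches_per_epoch = 12   # config.sample.num_batches_per_epoch

group_number = train_batch_size * num_gpu / num_image_per_prompt * num_batches_per_epoch
assert group_number == 48

gradient_accumulation_steps = num_batches_per_epoch // 2  # = 6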

🤗 Acknowledgement

This repo is based on ddpo-pytorch and diffusers. We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent ddpo-pytorch repo.

⭐ Citation

@misc{liu2025flowgrpo,
      title={Flow-GRPO: Training Flow Matching Models via Online RL}, 
      author={Jie Liu and Gongye Liu and Jiajun Liang and Yangguang Li and Jiaheng Liu and Xintao Wang and Pengfei Wan and Di Zhang and Wanli Ouyang},
      year={2025},
      eprint={2505.05470},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05470}, 
}
