We release UnifiedReward -- the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment.
🔥🔥 We release UnifiedReward-qwen-[3b/7b/32b], more powerful unified reward models built upon Qwen2.5-VL-Instruct!!
🔥 We release vLLM inference code for UnifiedReward-qwen in the `vllm_qwen` directory!
🔥 We release SGLang inference code for UnifiedReward-llava in the `sglang_llava` directory!
We appreciate the excellent work Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO, which provides further evidence of the robustness and effectiveness of UnifiedReward in image generation RL tasks.
Method | HPS | ImageReward | UnifiedReward |
---|---|---|---|
Janus-Pro + DPO | 77.3 | 77.7 | 80.0 |
Janus-Pro + GRPO | 79.2 | 79.3 | 81.0 |
Janus-Pro + Best-of-4 | 82.1 | 82.4 | 84.5 |
We appreciate the Flow-GRPO team for using UnifiedReward-7B as their image generation quality evaluation metric!
We appreciate the mradermacher team for providing the GGUF version of our models!!
We sincerely thank the Hunyuan team of Tencent for providing the evaluation results on several T2I models using UnifiedReward-qwen-7b!! The evaluation was conducted on 400 prompts sourced from here.
Model | Alignment | Coherence | Style |
---|---|---|---|
Flux-pro-ultra | 3.6453 | 3.8193 | 3.4971 |
Imagen-4.0 | 3.6792 | 3.8049 | 3.4756 |
Recraft-v3 | 3.6611 | 3.8409 | 3.5158 |
OpenAI-GPT-image-1 | 3.6890 | 3.8448 | 3.4960 |
Imagen-3.0 | 3.6733 | 3.8027 | 3.4674 |
Seedream-3.0 | 3.6927 | 3.8218 | 3.4887 |
We release UnifiedReward-Think -- the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.
Please refer to the project page for details.
🔥🔥 We release UnifiedReward-Think-qwen-7b, a more powerful unified multimodal CoT reward model built upon UnifiedReward-qwen-7b!!
🔥🔥 We released a Gradio demo for UnifiedReward-Think!
We are actively gathering feedback from the community to improve our models. We welcome your input and encourage you to stay updated through our repository!!
Please leave us a star ⭐ if you find our work helpful.
- [2025/5] 🔥🔥 We released UnifiedReward-qwen-[3b/7b/32b], more powerful unified reward models built upon [Qwen2.5-VL-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)! All inference and evaluation code is provided in the `./inference_qwen` and `./benchmark_evaluation` directories, respectively.
- [2025/5] 🔥🔥 We released UnifiedReward-Think-7b, the first unified multimodal CoT reward model. See the project page for details.
- [2025/4] 🔥🔥 We released UnifiedReward-0.5B. Feel free to use it based on your needs.
- [2025/4] 🔥🔥 We updated UnifiedReward-7B based on valuable feedback from the community and released UnifiedReward-7B-v1.5, which introduces pointwise scoring for generated images across three dimensions, each rated on a continuous scale from 1 to 5:
  - Alignment quantifies how well an image matches its prompt.
  - Coherence assesses the logical consistency of the image and the absence of artifacts or visual glitches.
  - Style reflects the visual appeal of the image, independent of the prompt.
  You are welcome to try the latest version; the inference code is in `inference_qwen/image_generation/qwen_point_score_ACS_image_generation.py` and `./inference/point_score_ACS_image_generation.py`.
- [2025/3] 🔥🔥 We released all training datasets and model checkpoints.
- [2025/3] 🔥🔥 We released all training, inference, and evaluation code.
- [2025/3] 🔥 We launched the project page and paper.
Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding |
---|---|---|---|---|---|
PickScore | Point | √ | | | |
HPS | Point | √ | | | |
ImageReward | Point | √ | | | |
LLaVA-Critic | Pair/Point | | √ | | |
IXC-2.5-Reward | Pair/Point | | √ | | √ |
VideoScore | Point | | | √ | |
LiFT | Point | | | √ | |
VisionReward | Point | √ | | √ | |
VideoReward | Point | | | √ | |
UnifiedReward (Ours) | Pair/Point | √ | √ | √ | √ |
VLRewardBench Comparison Results
Models | General | Hallucination | Reasoning | Overall Accuracy | Macro Accuracy |
---|---|---|---|---|---|
Gemini-1.5-Pro | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
GPT-4o | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
LLaVA-Critic | 47.4 | 38.5 | 53.8 | 46.9 | 46.6 |
OV-7B | 32.2 | 20.1 | 57.1 | 29.6 | 36.5 |
UnifiedReward | 76.5 | 58.1 | 65.1 | 67.5 | 66.6 |
GenAI-Bench (Image) Comparison Results
Method | tau | diff |
---|---|---|
PickScore | 53.2 | 67.2 |
HPSv2 | 51.6 | 68.4 |
ImageReward | 47.8 | 65.0 |
VisionReward | 46.8 | 66.4 |
OV-7B | 39.7 | 53.2 |
UnifiedReward | 54.8 | 70.9 |
GenAI-Bench (Video) and VideoGen-Reward Comparison Results
Method | GenAI-Bench (tau) | GenAI-Bench (diff) | VideoGen-Reward (tau) | VideoGen-Reward (diff) |
---|---|---|---|---|
VideoScore | 46.2 | 70.6 | 42.1 | 49.9 |
LiFT | 41.2 | 60.1 | 40.6 | 58.3 |
VisionReward | 52.1 | 73.1 | 57.4 | 68.2 |
VideoReward | 50.2 | 73.3 | 60.1 | 73.9 |
OV-7B | 40.8 | 51.4 | 40.4 | 50.2 |
UnifiedReward | 60.7 | 77.2 | 66.6 | 79.3 |
- Clone this repository and navigate to the UnifiedReward folder:
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward
- Install the inference package:
conda create -n unifiedreward python=3.10 -y
conda activate unifiedreward
pip install --upgrade pip
pip install -e ".[train]"
pip install flash_attn==2.5.8 --no-build-isolation
For Qwen2.5-VL-based UnifiedReward models, first install the inference packages as follows:
pip install git+https://github.com/huggingface/transformers accelerate qwen-vl-utils[decord]==0.0.8
We provide reference pairwise ranking and pointwise scoring inference code for each task in the `./inference` and `./inference_qwen` directories.
inference
├── image_generation
│   ├── pair_rank_image_generation.py
│   └── point_score_image_generation.py
├── video_understanding
│   ├── pair_rank_video_understanding.py
│   └── point_score_video_understanding.py
...
Note that our model is not constrained to a fixed input prompt style. You can flexibly adjust inputs based on your requirements.
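For example, with the Qwen2.5-VL-based checkpoints, a pointwise scoring call through transformers might look like the minimal sketch below. The model id, image path, and prompt wording are placeholders rather than the released prompt templates; see the scripts in `./inference_qwen` for the exact prompts.

```python
# Minimal sketch: pointwise scoring with a Qwen2.5-VL-based UnifiedReward checkpoint.
# The model id, image path, and prompt wording below are placeholders.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "CodeGoat24/UnifiedReward-qwen-7b"  # placeholder; use the checkpoint you downloaded
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask the reward model to rate how well a generated image matches its prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "outputs/sample.png"},
        {"type": "text", "text": (
            "You are given a text prompt and a generated image. "
            "Rate how well the image matches the prompt on a scale from 1 to 5.\n"
            "Prompt: a red bicycle leaning against a brick wall"
        )},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```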
We provide vLLM inference code for UnifiedReward-qwen in the `vllm_qwen` directory.
- Install vLLM
pip install vllm==0.9.0.1 transformers==4.52.4
- Deploy vLLM Server
bash vllm_qwen/vllm_server.sh
- Inference Request to vLLM Server
python vllm_qwen/vllm_inference.py
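Once the server is up, requests can also be sent through vLLM's OpenAI-compatible API. The sketch below assumes the default port 8000 and that the served model name matches the checkpoint configured in `vllm_server.sh`; these are assumptions, and `vllm_qwen/vllm_inference.py` shows the released request format.

```python
# Minimal sketch: pointwise scoring request against the vLLM OpenAI-compatible server.
# Port, served model name, and prompt wording are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="CodeGoat24/UnifiedReward-qwen-7b",  # must match the model served by vllm_server.sh
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url("outputs/sample.png")}},
            {"type": "text", "text": (
                "Rate how well this image matches the prompt "
                "'a red bicycle leaning against a brick wall' on a scale from 1 to 5."
            )},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```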
We provide SGLang inference code for UnifiedReward-llava in the `sglang_llava` directory.
- Install SGLang
pip install "sglang[all]"
- Deploy SGLang Server
bash sglang_llava/sglang_server.sh
- Inference Request to SGLang Server
python sglang_llava/sglang_inference.py
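The SGLang server likewise exposes an OpenAI-compatible endpoint, so a pairwise-ranking request can be sketched the same way. The port (SGLang's default 30000), model name, and prompt wording below are assumptions; `sglang_llava/sglang_inference.py` shows the released request format.

```python
# Minimal sketch: pairwise ranking of two candidate images via the SGLang
# OpenAI-compatible endpoint. Port, model name, and prompt wording are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

prompt = "a red bicycle leaning against a brick wall"
response = client.chat.completions.create(
    model="UnifiedReward-llava",  # placeholder; must match the checkpoint loaded by sglang_server.sh
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url("outputs/image_1.png")}},
            {"type": "image_url", "image_url": {"url": to_data_url("outputs/image_2.png")}},
            {"type": "text", "text": (
                f"Both images were generated from the prompt '{prompt}'. "
                "Which one is better, Image 1 or Image 2? Explain briefly, then give your choice."
            )},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```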
We use LLaMA-Factory to train the SFT model.
- Clone the LLaMA-Factory repository and install the dependencies.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
Follow its README to prepare our released datasets.
- Run the following command to train the SFT model.
llamafactory-cli train examples/train_full/qwen2_5vl_full_sft.yaml
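For orientation, a LLaMA-Factory full-SFT config for Qwen2.5-VL typically exposes fields like the following. This is a hedged sketch with placeholder values, not the released configuration; in particular, the dataset name `unified_reward_sft` is hypothetical and must match an entry you register in LLaMA-Factory's `data/dataset_info.json`.

```yaml
# Hedged sketch of examples/train_full/qwen2_5vl_full_sft.yaml; values are placeholders.
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: unified_reward_sft        # hypothetical name registered in data/dataset_info.json
template: qwen2_vl
cutoff_len: 4096
output_dir: saves/unifiedreward-qwen-7b-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
bf16: true
```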
Please download our constructed unified preference dataset from Huggingface and put it in `./dataset/`.
dataset
├── EvalMuse
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── HPD
├── LiFT-HRA
├── LLaVA-Critic
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── OIP
├── ShareGPTVideo
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── VideoDPO
├── VideoFeedback
└── train_data.yaml
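The `train_data.yaml` file lists which per-dataset annotation files participate in training. A hedged sketch in the LLaVA-NeXT data-mixture style is shown below; the JSON file names are illustrative, not the released layout.

```yaml
# Hedged sketch of train_data.yaml; json_path values are illustrative.
datasets:
  - json_path: dataset/LLaVA-Critic/pairwise/pairwise.json
    sampling_strategy: all
  - json_path: dataset/EvalMuse/pointwise/pointwise.json
    sampling_strategy: all
  - json_path: dataset/VideoFeedback/videofeedback.json
    sampling_strategy: all
```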
Then start training:
bash train.sh
The data for preference data construction should adhere to the following structure:
[
{
"prompt": "",
"image": "",
},
...
]
Then run:
# image understanding
cd preference_data_construction/image_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file
# video understanding
cd preference_data_construction/video_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file
The training data format in `data.json` should adhere to the following structure:
[
{
"id": "",
"image": "",
"prompt": "",
"chosen": "",
"rejected": ""
},
...
]
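For reference, a filled-in entry might look like the example below; all values are made up for illustration, and `chosen`/`rejected` hold the full preferred and dispreferred responses rather than file paths.

```json
[
  {
    "id": "img_und_000001",
    "image": "images/coco/000000215677.jpg",
    "prompt": "How many people are in the image?",
    "chosen": "There are three people in the image: two adults standing near the bench and one child sitting on it.",
    "rejected": "There are five people in the image."
  }
]
```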
Then start training:
# image understanding
bash dpo_image_understand_ov7b.sh
# video understanding
bash dpo_video_understand_llava_video_7b.sh
Prepare Environments
cd DiffusionDPO
conda create -n diffdpo python=3.10 -y
conda activate diffdpo
pip install -r requirements.txt
Image Generation
The data for preference data construction should adhere to the following structure:
[
{
"prompt": "",
},
...
]
Then run:
python data_generation.py # you need to fill the 'data_path' in this file
Preference Pair Data Construction
python sift_dpo_data.py
The training data format in `data.json` should adhere to the following structure:
[
{
"id": "",
"caption": "",
"jpg_0": "", #chosen image path
"jpg_1": "", #rejected image path
"label_0": 1,
},
...
]
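Conceptually, the sifting step scores each generated candidate with UnifiedReward and keeps the highest- and lowest-scoring images as the chosen/rejected pair. The sketch below illustrates that selection logic only; the function name, data layout, and scores are assumptions, not the released `sift_dpo_data.py`.

```python
# Hedged sketch: turn per-image reward scores into one DPO training entry.
# `scores` maps candidate image paths to scalar rewards (e.g. UnifiedReward pointwise scores).
import json

def build_pair(sample_id: str, prompt: str, scores: dict[str, float]) -> dict:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, worst = ranked[0][0], ranked[-1][0]
    return {
        "id": sample_id,
        "caption": prompt,
        "jpg_0": best,   # chosen image path
        "jpg_1": worst,  # rejected image path
        "label_0": 1,    # marks jpg_0 as the preferred sample
    }

if __name__ == "__main__":
    # Made-up scores for four candidates of one prompt.
    scores = {
        "gen/cat_0.png": 3.1,
        "gen/cat_1.png": 4.4,
        "gen/cat_2.png": 2.8,
        "gen/cat_3.png": 3.9,
    }
    print(json.dumps(build_pair("000001", "a cat wearing a tiny wizard hat", scores), indent=2))
```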
Then start training:
bash launchers/turbo_dpo.sh
Prepare Environments
cd VideoDPO
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt
Prepare Checkpoints
Run the following command to download the VideoCrafter checkpoint.
mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt
Please download our constructed T2V-Turbo model and its reference model from Huggingface and put them in `./checkpoints/t2v-turbo`.
Video Generation
The data for preference data construction should adhere to the following structure:
[
{
"prompt": "",
},
...
]
Then run:
bash data_generation.sh # you need to fill '--prompts_file' in this file
Preference Pair Data Construction
python sift_dpo_data.py
The training data format in `data.json` should adhere to the following structure:
[
{
"id": "",
"caption": "",
"chosen": "", # chosen video path
"rejected": "", # rejected video path
},
...
]
Then start training:
bash run.sh
We provide evaluation code for several benchmarks in the `./benchmark_evaluation` directory.
We provide evaluation code for GenAI-Bench-Video, GenAI-Bench-Image, VideoGen-RewardBench and VL-RewardBench benchmarks.
We provide evaluation code for the MSRVTT, MSVD, and TGIF benchmarks, and use the VLMEvalKit toolkit to evaluate LongVideoBench, MLVU, and Video-MME with 64 input frames.
We use the LMMs-Eval toolkit to evaluate the LLaVABench, WildVision, LLaVABench-Wilder, LiveBench, and MMHal benchmarks.
We utilize image reward models, i.e., PickScore, HPS, and ImageReward, for image generation quality assessment.
VBench is used for video generation assessment.
If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.
In this work, the reward model and the image/video understanding DPO code are based on LLaVA-NeXT, while the image and video generation DPO code is based on DiffusionDPO and VideoDPO, respectively.
We also utilize LMMs-Eval and VLMEvalKit toolkits for evaluation.
Thanks to all the contributors!
@article{UnifiedReward-Think,
title={Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning.},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2505.03318},
year={2025}
}
@article{UnifiedReward,
title={Unified Reward Model for Multimodal Understanding and Generation.},
author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2503.05236},
year={2025}
}