
Awesome-Visual-Generation-Alignment-Survey

A collection of awesome papers on the alignment of visual generation models (including AR and diffusion models).

We also use this repo as a reference for our survey.

We welcome community contributions!

Tutorial on Reinforcement Learning

First Two Works in Each Subfield:

Traditional Policy Gradient

  • DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, [pdf]
  • Training Diffusion Models with Reinforcement Learning, [pdf]
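
The two works above both cast the denoising chain as a multi-step MDP and optimize it with REINFORCE-style policy gradients: sample a trajectory, score the final image with a (possibly non-differentiable) reward, and reinforce the log-probabilities of the sampled transitions. A minimal sketch of the per-trajectory loss; the names `logprobs` and `reward` are illustrative, not taken from either paper:

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """REINFORCE-style loss for one sampled denoising trajectory.

    logprobs: (T,) log p_theta(x_{t-1} | x_t, c) for each denoising step
    reward:   scalar r(x_0, c) from a reward model or human-preference scorer
    """
    # The reward is a constant w.r.t. theta; gradients flow only
    # through the log-probabilities of the sampled transitions.
    return -(reward * logprobs).sum()

# Toy usage with dummy values (50 denoising steps)
logprobs = torch.randn(50, requires_grad=True)
policy_gradient_loss(logprobs, reward=0.7).backward()
```

DPOK additionally regularizes with a KL term against the pretrained model; DDPO adds importance sampling and clipping, PPO-style.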

GRPO

  • DanceGRPO: Unleashing GRPO on Visual Generation, [pdf]
  • Flow-GRPO: Training Flow Matching Models via Online RL, [pdf]
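
Both GRPO works drop PPO's learned value baseline: a group of samples is drawn per prompt, and each sample's advantage is its reward normalized within the group. A hedged sketch of just the advantage computation (the clipped importance-ratio surrogate around it is omitted):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards for G samples generated from the same prompt."""
    # Normalizing within the group removes the need for a value network.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([0.2, 0.9, 0.5, 0.4]))
```

DanceGRPO applies this recipe across diffusion and rectified-flow models; Flow-GRPO converts the deterministic flow-matching ODE into an equivalent SDE so that per-step log-probabilities exist to reinforce.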

DPO

  • Diffusion Model Alignment Using Direct Preference Optimization, [pdf]
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model, [pdf]
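
Both papers derive a DPO objective for diffusion: since exact likelihoods are intractable, the per-sample denoising error acts as a proxy for negative log-likelihood, compared against a frozen reference model on preferred/dispreferred pairs. A minimal sketch, assuming the four epsilon-prediction MSEs have already been computed (variable names are ours):

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """err_*: per-sample denoising MSEs under the trained model;
    ref_err_*: the same under the frozen reference model;
    _w = preferred ("winner") image, _l = dispreferred ("loser")."""
    # Lower denoising error stands in for higher log-likelihood.
    # beta is the DPO temperature; Diffusion-DPO folds the timestep
    # weighting into it and reports much larger effective values.
    diff = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * diff).mean()
```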

Reward Feedback Learning

  • ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf]
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards, [pdf]
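
Reward feedback learning differs from the RL methods above in that the reward model is itself differentiable, so its score can be maximized by backpropagating directly through the last denoising step(s) rather than by estimating policy gradients. A sketch, assuming a differentiable `reward_model` and generated `images` still attached to the sampler's autograd graph:

```python
import torch

def reward_feedback_loss(images: torch.Tensor, reward_model) -> torch.Tensor:
    """images: generated batch, not detached from the sampling graph;
    reward_model: differentiable scorer (e.g. an ImageReward-style net)."""
    # Maximize the reward directly; gradients flow from the scorer
    # back into the generator's parameters.
    return -reward_model(images).mean()
```

ImageReward's ReFL truncates backpropagation to a randomly chosen late denoising step; DRaFT studies truncated backprop through the sampling chain.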

Alignment on AR models

  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, [pdf]
  • Autoregressive Image Generation Guided by Chains of Thought, [pdf]
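
For AR image generators the LLM alignment recipe transfers almost unchanged: the image is a sequence of discrete tokens, so a trajectory log-probability is just a sum of per-token log-probabilities, which the policy-gradient and GRPO losses above can consume directly. A toy sketch of that gather (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (T, V) next-token logits; tokens: (T,) sampled image-token ids (long)."""
    logp = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability of each sampled token and sum over the sequence.
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum()
```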

Reward Models

  • ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf]
  • Human Preference Score: Better Aligning Text-to-Image Models with Human Preference, [pdf]
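
Both reward models are trained on human preference pairs with a Bradley-Terry objective: the scorer should rank the chosen image above the rejected one for the same prompt. A minimal sketch of that loss (the scores would come from the CLIP/BLIP-style backbones these papers fine-tune):

```python
import torch.nn.functional as F

def bradley_terry_loss(score_chosen, score_rejected):
    """Scores s(prompt, image) for the preferred and dispreferred image."""
    # P(chosen > rejected) = sigmoid(s_w - s_l); minimize its negative log.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```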

Alignment on Diffusion/Flow Models

Reinforcement Learning-based (RLHF)

  • DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, [pdf]
  • Training Diffusion Models with Reinforcement Learning, [pdf]
  • Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning, [pdf]
  • Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards, [pdf]
  • Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation, [pdf]
  • DanceGRPO: Unleashing GRPO on Visual Generation, [pdf]
  • Flow-GRPO: Training Flow Matching Models via Online RL, [pdf]
  • MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE, [pdf]
  • TempFlow-GRPO: When Timing Matters for GRPO in Flow Models, [pdf]
  • Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning, [pdf]

DPO-based

  • Diffusion Model Alignment Using Direct Preference Optimization, [pdf]
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model, [pdf]
  • A Dense Reward View on Aligning Text-to-Image Diffusion with Preference, [pdf]
  • Aligning Diffusion Models by Optimizing Human Utility, [pdf]
  • Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization, [pdf]
  • Scalable Ranked Preference Optimization for Text-to-Image Generation, [pdf]
  • Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation, [pdf]
  • PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation, [pdf]
  • SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation, [pdf]
  • Margin-aware Preference Optimization for Aligning Diffusion Models without Reference, [pdf]
  • DSPO: Direct Score Preference Optimization for Diffusion Model Alignment, [pdf]
  • Direct Distributional Optimization for Provable Alignment of Diffusion Models, [pdf]
  • Boost Your Human Image Generation Model via Direct Preference Optimization, [pdf]
  • Curriculum Direct Preference Optimization for Diffusion and Consistency Models, [pdf]
  • Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization, [pdf]
  • Personalized Preference Fine-tuning of Diffusion Models, [pdf]
  • Calibrated Multi-Preference Optimization for Aligning Diffusion Models, [pdf]
  • InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment, [pdf]
  • CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation, [pdf]
  • D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples, [pdf]
  • Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences, [pdf]
  • Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking, [pdf]
  • Aligning Text to Image in Diffusion Models is Easier Than You Think, [pdf]
  • OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization, [pdf]
  • VideoDPO: Omni-Preference Alignment for Video Diffusion Generation, [pdf]
  • DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization, [pdf]
  • Flow-DPO: Improving Video Generation with Human Feedback, [pdf]
  • HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment, [pdf]
  • Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models, [pdf]
  • Fine-Tuning Diffusion Generative Models via Rich Preference Optimization, [pdf]

Reward Feedback Learning

  • ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf]
  • Aligning Text-to-Image Diffusion Models with Reward Backpropagation, [pdf]
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards, [pdf]
  • Feedback Efficient Online Fine-Tuning of Diffusion Models, [pdf]
  • Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models, [pdf]
  • Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward, [pdf]
  • InstructVideo: Instructing Video Diffusion Models with Human Feedback, [pdf]
  • IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation, [pdf]

Technical Reports

We list only the technical reports that use alignment methods:

  • Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model, [pdf]
  • Seedream 3.0 Technical Report, [pdf]
  • Seedance 1.0: Exploring the Boundaries of Video Generation Models, [pdf]
  • Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, [pdf]
  • Qwen-Image Technical Report, [pdf]
  • Skywork-UniPic2, [pdf]

Alignment on AR models

  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, [pdf]
  • Autoregressive Image Generation Guided by Chains of Thought, [pdf]
  • LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization, [pdf]
  • SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL, [pdf]
  • UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation, [pdf]
  • UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning, [pdf]
  • ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL, [pdf]
  • Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation, [pdf]
  • Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO, [pdf]
  • CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step, [pdf]
  • X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again, [pdf]
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT, [pdf]
  • AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning, [pdf]

Benchmarks & Reward Models

Benchmarks

  • DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [pdf]
  • Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark, [pdf]
  • LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, [pdf]
  • VPGen & VPEval: Visual Programming for Text-to-Image Generation and Evaluation, [pdf]
  • GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment, [pdf]
  • Holistic Evaluation of Text-to-Image Models, [pdf]
  • Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community, [pdf]
  • Rich Human Feedback for Text to Image Generation, [pdf]
  • Learning Multi-Dimensional Human Preference for Text-to-Image Generation, [pdf]
  • Evaluating Text-to-Visual Generation with Image-to-Text Generation, [pdf]
  • Multimodal Large Language Models Make Text-to-Image Generative Models Align Better, [pdf]
  • Measuring Style Similarity in Diffusion Models, [pdf]
  • T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation, [pdf]
  • DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation, [pdf]
  • PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment, [pdf]
  • Video-Bench: Human-Aligned Video Generation Benchmark, [pdf]

Reward Models

  • Human Preference Score: Better Aligning Text-to-Image Models with Human Preference, [pdf]
  • ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf]
  • Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, [pdf]
  • Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis, [pdf]
  • Improving Video Generation with Human Feedback, [pdf]
  • VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation, [pdf]
  • VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation, [pdf]
  • Unified Reward Model for Multimodal Understanding and Generation, [pdf]
  • LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment, [pdf]
  • HPSv3: Towards Wide-Spectrum Human Preference Score, [pdf]

Alignment with Prompt Engineering

  • Optimizing Prompts for Text-to-Image Generation, [pdf]
  • RePrompt: Automatic Prompt Editing to Refine AI-Generative Art Towards Precise Expressions, [pdf]
  • Improving Text-to-Image Consistency via Automatic Prompt Optimization, [pdf]
  • Dynamic Prompt Optimizing for Text-to-Image Generation, [pdf]
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT, [pdf]
  • RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning, [pdf]
