This paper list records papers I read from the daily arXiv feed for personal reference. I hope it contributes to the unified multimodal generation and understanding community. If you find that I have missed any important or exciting work, it would be super helpful to let me know. Thanks!
🎉 [2025-06-10] Project Beginning 🥳
This repository is constantly updated 🤗 ...
You can click on a title to jump directly to the corresponding PDF.
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception. Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui. ICLR 2025.
- LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng. ACL 2025.
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu. ICLR 2025.
- VITA: Towards Open-Source Interactive Omni Multimodal LLM. Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, Haoyu Cao, Di Yin, Long Ma, Xiawu Zheng, Rongrong Ji, Yunsheng Wu, Ran He, Caifeng Shan, Xing Sun. arXiv 2024/08/09.
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions. Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu. CVPR 2025.
- OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis. Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang. arXiv 2024/09/15.
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. Zhifei Xie, Changqiao Wu. arXiv 2024/10/15.
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He. arXiv 2025/01/03.