Awesome-Unified-Multimodal-Understanding-and-Generation


This paper list records papers I read from the daily arXiv for personal needs. I hope it will contribute to the unified multimodal understanding and generation community. If you find that I have missed any important and exciting work, it would be super helpful to let me know. Thanks!

📢 News

🎉 [2025-06-10] Project Beginning 🥳

📜 Notice

This repository is constantly being updated 🤗 ...

You can click directly on a paper title to jump to the corresponding PDF.

🔍 Method

1️⃣ Image and Text

  1. DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception. Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui. ICLR 2025.

2️⃣ Audio and Text

  1. LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng. ACL 2025.

3️⃣ Image, Video and Text

  1. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu. ICLR 2025.

4️⃣ Image, Audio and Text

  1. VITA: Towards Open-Source Interactive Omni Multimodal LLM. Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, Haoyu Cao, Di Yin, Long Ma, Xiawu Zheng, Rongrong Ji, Yunsheng Wu, Ran He, Caifeng Shan, Xing Sun. arXiv 2024/08/09.
  2. EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions. Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu. CVPR 2025.
  3. OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis. Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang. arXiv 2024/09/15.
  4. Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. Zhifei Xie, Changqiao Wu. arXiv 2024/10/15.
  5. VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He. arXiv 2025/01/03.

5️⃣ Image, Audio, Video and Text
