This paper list records papers I read from the daily arXiv feed for personal reference. I hope it contributes to the unified multimodal generation and understanding community. If you find that I have missed any important or exciting work, it would be super helpful to let me know. Thanks!
🎉 [2025-06-10] Project Beginning 🥳
This repository is constantly updated 🤗 ...
You can click on a title to jump directly to the corresponding PDF.
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception. Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui. ICLR 2025.
- LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng. ACL 2025.
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu. ICLR 2025.
- VITA: Towards Open-Source Interactive Omni Multimodal LLM. Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, Haoyu Cao, Di Yin, Long Ma, Xiawu Zheng, Rongrong Ji, Yunsheng Wu, Ran He, Caifeng Shan, Xing Sun. arXiv 2024/08/09.
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions. Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu. CVPR 2025.
- OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis. Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang. arXiv 2024/09/15.
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. Zhifei Xie, Changqiao Wu. arXiv 2024/10/15.
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He. arXiv 2025/01/03.