Connector-S

🔥 A collection of must-read papers and resources related to connectors in MLLMs.

The organization of the papers follows our survey "Connector-S: A Survey of Connectors in Multi-modal Large Language Models". [Paper page]

Please let us know if you find a mistake or have any suggestions.

If you find our survey useful for your research, please cite the following paper:

@article{zhu2025connector,
  title={Connector-S: A Survey of Connectors in Multi-modal Large Language Models},
  author={Zhu, Xun and Zhang, Zheng and Chen, Xi and Shi, Yiming and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2502.11453},
  year={2025}
}

🔔 News

  • 🎉 [2025/04/29] Connector-S has been accepted to the IJCAI 2025 Survey Track!

  • ✨ [2025/03/16] We created this repository to maintain and expand a paper list on connectors in MLLMs. More papers are coming soon!

  • 💥 [2025/02/18] Our survey is released! See Connector-S for the paper!

🌟 Introduction

With the rapid advancements in multi-modal large language models (MLLMs), connectors play a pivotal role in bridging diverse modalities and enhancing model performance. However, the design and evolution of connectors have not been comprehensively analyzed, leaving gaps in understanding how these components function and hindering the development of more powerful connectors.

We systematically review the current progress of connectors in MLLMs and present a structured taxonomy that categorizes connectors into atomic operations (mapping, compression, mixture of experts) and holistic designs (multi-layer, multi-encoder, multi-modal scenarios), highlighting their technical contributions and advancements. Furthermore, we outline several promising research frontiers and challenges, including high-resolution input, dynamic compression, guide information selection, combination strategy, and interpretability.

Table of Contents

Paper List

Atomic Connector Operations

Atomic connector operations are the basic building blocks of MLLM connectors: simple yet versatile units tailored to the functional requirements of common scenarios. Using these atomic operations, connectors can perform mapping, compression, and expert integration. Furthermore, the operations can be combined into more complex connectors that bridge the modality gap in a targeted and flexible way.
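
To make the composition concrete, the sketch below is a minimal, hypothetical PyTorch example (not taken from any paper listed below; all module names and dimensions are assumptions) that chains the three atomic operations: average-pooling compression, MLP mapping, and a tiny token-level mixture of expert projectors.

```python
# Hypothetical composition of the three atomic operations (illustrative only):
# average-pooling compression + MLP mapping + token-level mixture of experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyComposedConnector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, grid=24, pooled=12, num_experts=2):
        super().__init__()
        self.grid, self.pooled = grid, pooled
        # Mixture of experts: each expert is itself an MLP mapping operation.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(vis_dim, num_experts)

    def forward(self, patch_feats):
        # patch_feats: (B, grid*grid, vis_dim) patch features from a frozen vision encoder.
        b, n, d = patch_feats.shape
        # Compression: restore the 2D layout and average-pool to fewer visual tokens.
        x = patch_feats.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = F.adaptive_avg_pool2d(x, self.pooled)                  # (B, d, pooled, pooled)
        x = x.flatten(2).transpose(1, 2)                           # (B, pooled*pooled, d)
        # Mixture of experts: softly route every compressed token to the expert MLPs.
        weights = self.router(x).softmax(dim=-1)                   # (B, tokens, num_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)           # (B, tokens, llm_dim)


if __name__ == "__main__":
    feats = torch.randn(2, 24 * 24, 1024)
    print(ToyComposedConnector()(feats).shape)                     # torch.Size([2, 144, 4096])
```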

Mapping

Mapping operations first flatten 2D or 3D features into a 1D sequence in a fixed order, and then directly align the dimension of the resulting non-text representations with that of the textual token embeddings.
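
For instance, a minimal sketch of a linear mapping operation might look as follows (the 24x24 patch grid and the 1024-to-4096 projection are assumed dimensions; the pattern mirrors the simple projectors in the Linear and MLP lists below, not any specific paper's code):

```python
# Minimal mapping sketch: flatten the 2D feature grid in row-major order, then
# linearly project to the LLM token-embedding dimension so visual tokens can be
# concatenated with text embeddings. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class LinearProjector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feat_grid):
        # feat_grid: (B, H, W, vis_dim) feature map from the vision encoder.
        b, h, w, d = feat_grid.shape
        tokens = feat_grid.reshape(b, h * w, d)   # flatten 2D -> 1D token sequence
        return self.proj(tokens)                  # (B, H*W, llm_dim)


visual_tokens = LinearProjector()(torch.randn(2, 24, 24, 1024))
print(visual_tokens.shape)                        # torch.Size([2, 576, 4096])
```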

Linear
  • [NeurIPS 23] "Visual Instruction Tuning". Liu et al. [Paper] [Resource]
  • [arXiv 23] "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models". Zhang et al. [Paper] [Resource]
  • [NeurIPS 24] "Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models". Jiao et al. [Paper]
  • [NeurIPS 24] "VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing". Fei et al. [Paper] [Resource]
  • [EMNLP 24] "M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning". Wang et al. [Paper] [Resource]
  • [CIKM 24] "ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation". Li et al. [Paper] [Resource]
  • [CVPR 24] "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs". Ranasinghe et al. [Paper]
  • [CVPR 24] "VTimeLLM: Empower LLM to Grasp Video Moments". Huang et al. [Paper] [Resource]
  • [CVPR 24] "LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model". Wang et al. [Paper] [Resource]
  • [ACM MM 24]"LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description". Jin et al. [Paper] [Resource]
  • [Computers and Education: Artificial Intelligence] "LLaVA-docent: Instruction tuning with multimodal large language model to support art appreciation education". Lee et al. [Paper]
  • [AAAI 24] "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions". Hu et al. [Paper] [Resource]
  • [arXiv 24] "LM4LV: A Frozen Large Language Model for Low-level Vision Tasks". Zheng et al. [Paper] [Resource]
  • [ICLR 25] "mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models". Ye et al. [Paper] [Resource]
MLP
  • [ECCV 24] "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions". Chen et al. [Paper] [Resource]
  • [ACL 24] "GeoGPT4V:Towards Geometric Multi-modal Large Language Models with Geometric Image Generation". Cai et al. [Paper] [Resource]
  • [Science China Information Sciences 24] "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites". Chen et al. [Paper] [Resource]
  • [CVPR 24] "CogVLM: Visual Expert for Pretrained Language Models". Wang et al. [Paper] [Resource]
  • [CVPR 24] "Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception". He et al. [Paper] [Resource]
  • [CVPR 24] "Improved Baselines with Visual Instruction Tuning". Liu et al. [Paper] [Resource]
  • [EMNLP 24] "MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model". Huo et al. [Paper] [Resource]
  • [EMNLP 24] "Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models". Jiang et al. [Paper] [Resource]
  • [EMNLP 24] "Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering". Hao et al. [Paper] [Resource]
  • [NeurIPS 24] "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models". Wu et al. [Paper] [Resource]
  • [ICLR 24] "DREAMLLM: SYNERGISTIC MULTIMODAL COMPREHENSION AND CREATION". Dong et al. [Paper] [Resource]
  • [ICME 24] "3DMIT: 3D MULTI-MODAL INSTRUCTION TUNING FOR SCENE UNDERSTANDING". Li et al. [Paper] [Resource]
  • [arXiv 23] "VCoder: Versatile Vision Encoders for Multimodal Large Language Models". Jain et al. [Paper] [Resource]
  • [arXiv 24] "ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models". Ge et al. [Paper] [Resource]
  • [arXiv 24] "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks". Chen et al. [Paper] [Resource]
  • [arXiv 24] "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs". Ranasinghe et al. [Paper]
  • [arXiv 24] "Yi: Open Foundation Models by 01.AI". Young et al. [Paper] [Resource]
  • [arXiv 24] "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models". Lin et al. [Paper] [Resource]
  • [arXiv 24] "RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models". Lv et al. [Paper] [Resource]

Compression

Spatial Relation
Simple Operation
  • [NeurIPS 24] "DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ". belouadi et al. [Paper] [Resource]
  • [CVPR 24] "Generative Multimodal Models are In-Context Learners". Sun et al. [Paper] [Resource]
  • [ACM MM 24] "EAGLE: Egocentric AGgregated Language-video Engine". Bi et al. [Paper] [Resource]
  • [ACL 24] "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models". Yao et al. [Paper] [Resource]
  • [ICLR 25] "PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning". Xu et al. [Paper] [Resource]
  • [arXiv 23] "MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning". Chen et al. [Paper] [Resource]
  • [arXiv 24] "Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight". Huang et al. [Paper] [Resource]
  • [arXiv 24] "DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models". Yao et al. [Paper] [Resource]
CNN
  • [arXiv 23] "MACAW-LLM: MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO, AND TEXT INTEGRATION". Lyu et al. [Paper] [Resource]
  • [CVPR 24] "Honeybee: Locality-enhanced Projector for Multimodal LLM". Cha et al. [Paper] [Resource]
  • [ECCV 24] "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training". McKinzie et al. [Paper]
Variants
  • [CVPR 24] "Honeybee: Locality-enhanced Projector for Multimodal LLM". Cha et al. [Paper] [Resource]
  • [NeurIPS 24] "MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models". Shen et al. [Paper] [Resource]
Semantic Perception
Q-Former
  • [NeurIPS 23] "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning". Dai et al. [Paper] [Resource]
  • [ICLR 23] "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". Li et al. [Paper] [Resource]
  • [ACL ARR 24] "UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation". Zhao et al. [Paper] [Resource]
  • [CVPR 24] "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection". Qi et al. [Paper] [Resource]
  • [CVPR 24] "Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval". Jang et al. [Paper] [Resource
  • [CVPR 24] "Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld". Yang et al. [Paper] [Resource]
  • [ACM TKDD 24] "TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model". Chen et al. [Paper]
  • [ICLR 24] "MMICL: EMPOWERING VISION-LANGUAGE MODEL WITH MULTI-MODAL IN-CONTEXT LEARNING". Zhao et al. [Paper] [Resource]
  • [ICLR 24] "MINIGPT-4:ENHANCING VISION-LANGUAGE UNDERSTANDING WITH ADVANCED LARGE LANGUAGE MODELS". Zhu et al. [Paper] [Resource]
  • [ICLR 24] "EMU: GENERATIVE PRETRAINING IN MULTIMODALITY". Sun et al. [Paper] [Resource]
  • [ICML 24] "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones". Yuan et al. [Paper]
  • [arXiv 24] "Towards Event-oriented Long Video Understanding". Du et al. [Paper] [Resource]
  • [ACL 24] "UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion". Li et al. [Paper] [Resource]
  • [ACM MM 24] "Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding". Wu et al. [Paper] [Resource]
  • [ACM MM 24] "CREAM: Coarse-to-Fine Retrieval and Multi-modal Efficient Tuning for Document VQA". Zhang et al. [Paper]
  • [ACM MM 24] "GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware". Wang et al. [Paper]
  • [MICCAI 24] "PathAlign: A vision–language model for whole slide images in histopathology". Ahmed et al. [Paper]
  • [BIBM 24] "C2RG: Parameter-efficient Adaptation of 3D Vision and Language Foundation Model for Coronary CTA Report Generation". Ye et al. [Paper]
  • [NeurIPS 24] "What matters when building vision-language models?". Laurençon et al. [Paper] [Resource]
  • [AAAI 24] "Structure-aware multimodal sequential learning for visual dialog". Kim et al. [Paper]
  • [AAAI 24] "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions". Hu et al. [Paper] [Resource]
  • [AAAI 25] "PlanLLM: Video Procedure Planning with Refinable Large Language Models". Yang et al. [Paper] [Resource]
Resampler
  • [NeurIPS 22] "Flamingo: a Visual Language Model for Few-Shot Learning". Alayrac et al. [Paper]
  • [NeurIPS 24] "Voila-A: Aligning Vision-Language Models with User's Gaze Attention". Yan et al. [Paper]
  • [CVPR 24] "Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models". Li et al. [Paper] [Resource]
  • [ACL 24] "InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model". Liu et al. [Paper] [Resource]
Abstractor
  • [ACM MM 24] "mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model". Hu et al. [Paper] [Resource]
  • [CVPR 24] "mPLUG-OwI2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration". Ye et al. [Paper] [Resource]
  • [ICML 24] "GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model". Li et al. [Paper] [Resource]
  • [arXiv 24] "Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare". Zhu et al. [Paper] [Resource]
  • [arXiv 24] "Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels". Wu et al. [Paper] [Resource]
Variants
  • [EMNLP 24] "Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM". Eom et al. [Paper]
  • [arXiv 24] "ParGo: Bridging Vision-Language with Partial and Global Views". Wang et al. [Paper] [Resource]

Mixture of Experts

Vanilla MoE
  • [NeurIPS 24] "CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts". Cha et al. [Paper] [Resource]
  • [BIBM 24]"SurgFC: Multimodal Surgical Function Calling Framework on the Demand of Surgeons". Chen et al. [Paper] [Resource]
  • [ICLR 25] "CHARTMOE: MIXTURE OF DIVERSELY ALIGNED EXPERT CONNECTOR FOR CHART UNDERSTANDING". Xu et al. [Paper] [Resource]
X-Guided MoE
Modality-Guided
  • [CVPR 24] "OneLLM: One Framework to Align All Modalities with Language". Han et al. [Paper] [Resource]
Text-Guided
  • [ACM MM 24] "Q-MoE: Connector for MLLMs with Text-Driven Routing". Wang et al. [Paper]
Task-Guided
  • [NeurIPS 24] "Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE". Zhu et al. [Paper] [Resource]
Variant MoE
  • [CVPR 24] "V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs". Wu and Xie [Paper] [Resource]

Holistic Connector Designs

Multi-Layer Scenario
  • [ECCV 24] "Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models". Ma et al. [Paper] [Resource]
  • [CVPR 24] "GLaMM: Pixel Grounding Large Multimodal Model". Rasheed et al. [Paper] [Resource]
  • [CVPR 24] "LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge". Chen et al. [Paper] [Resource]
  • [NeurIPS 24] "Dense Connector for MLLMs". Yao et al. [Paper] [Resource]
  • [arXiv 24] "TokenPacker:Efficient Visual Projector for Multimodal LLM". Li et al. [Paper] [Resource]
  • [arXiv 24] "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". Cao et al. [Paper] [Resource]
  • [arXiv 24] "TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models". Yu et al. [Paper] [Resource]
Multi-Encoder Scenario
  • [arXiv 24] "DeepSeek-VL: Towards Real-World Vision-Language Understanding". Lu et al. [Paper] [Resource]
  • [arXiv 24] "SPHINX: THE JOINT MIXING OF WEIGHTS, TASKS,AND VISUAL EMBEDDINGS FOR MULTI-MODAL LARGE LANGUAGE MODELS". Lin et al. [Paper] [Resource]
  • [ICML 24] "SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models". Liu et al. [Paper] [Resource]
  • [CoRL 24] "OpenVLA:An Open-Source Vision-Language-Action Model". Kim et al. [Paper] [Resource]
  • [ICLR 24] "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models". Jiang et al. [Paper]
  • [CVPR 24] "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs". Tong et al. [Paper]
  • [ECCV 24] "BRAVE : Broadening the visual encoding of vision-language models". Kar et al. [Paper] [Resource]
  • [ACM MM 24] "LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound". Guo et al. [Paper]
  • [NeurIPS 24] "MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models". Shen et al. [Paper] [Resource]
  • [NeurIPS 24] "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs". Tong et al. [Paper] [Resource]
  • [NeurIPS 24] "MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model". Jiang et al. [Paper]
  • [ICLR 25] "EAGLE: EXPLORING THE DESIGN SPACE FOR MULTIMODAL LLMS WITH MIXTURE OF ENCODERS". Shi et al. [Paper] [Resource]
  • [AAAI 25] "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference". Zhao et al. [Paper] [Resource]
Multi-Modal Scenario
  • [TMLR 24] "LLaVA-OneVision: Easy Visual Task Transfer". Li et al. [Paper] [Resource]
  • [CVPR 24] "OneLLM: One Framework to Align All Modalities with Language". Han et al. [Paper] [Resource]
  • [ACL 24] "Recognizing Everything from All Modalities at Once:Grounded Multimodal Universal Information Extraction". Zhang et al. [Paper] [Resource]
  • [ACL 24] "GroundingGPT: Language Enhanced Multi-modal Grounding Model". Li et al. [Paper] [Resource]
  • [ICML 24] "MACAW-LLM: MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO, AND TEXT INTEGRATION". Lyu et al. [Paper] [Resource]
  • [ECCV 24] "Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time". Chowdhury et al. [Paper] [Resource]
  • [ECCV 25] "CAT : Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios". Ye et al. [Paper] [Resource]
  • [EMNLP 24] "AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model". Moon et al. [Paper]
  • [arXiv 23] "VideoPoet: A Large Language Model for Zero-Shot Video Generation". Kondratyuk et al. [Paper] [Resource]
  • [arXiv 23] "PandaGPT:One Model To Instruction-Follow Them All". Su et al. [Paper] [Resource]
  • [arXiv 24] "Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution". Wang et al. [Paper] [Resource]
  • [arXiv 24] "World Knowledge-Enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving". Zhai et al. [Paper]

Future Directions and Challenges

High-Resolution Input
  • [EMNLP 23] "UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model". Ye et al. [Paper]
  • [Science China Information Sciences 24] "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites". Chen et al. [Paper] [Resource]
  • [arXiv 24] "SPHINX: THE JOINT MIXING OF WEIGHTS, TASKS,AND VISUAL EMBEDDINGS FOR MULTI-MODAL LARGE LANGUAGE MODELS". Lin et al. [Paper] [Resource]
  • [ICML 24] "SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models". Liu et al. [Paper] [Resource]
  • [BIBM 24] "PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding". Dai et al. [Paper] [Resource]
  • [ECCV 24] "LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images". Xu et al. [Paper] [Resource]
  • [CVPR 24] "Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models". Li et al. [Paper] [Resource]
  • [NeurIPS 24] "VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks". Wu et al. [Paper] [Resource]
  • [NeurIPS 24] "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs". Meng et al. [Paper] [Resource]
  • [NeurIPS 24] "InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD". Dong et al. [Paper] [Resource]
  • [AAAI 25] "HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models". Arif et al. [Paper] [Resource]
Dynamic Compression
  • [arXiv 24] "FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression". Zhu et al. [Paper]
  • [NeurIPS 24] "Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model". Liu et al. [Paper] [Resource]
  • [ICLR 24] "UNIFIED LANGUAGE-VISION PRETRAINING IN LLM WITH DYNAMIC DISCRETE VISUAL TOKENIZATION". Jin et al. [Paper] [Resource]
  • [AAAI 25] "DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming". Zhang et al. [Paper]
Guide Information Selection
  • [NeurIPS 24] "MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model". Jiang et al. [Paper]
  • [ACM MM 24] "Semantic Alignment for Multimodal Large Language Models". Wu et al. [Paper] [Resource]
  • [ACL 24] "MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering". Xu et al. [Paper] [Resource]
  • [AAAI 25] "Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine". Huang et al. [Paper] [Resource]
  • [arXiv 24] "TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings". Yan et al. [Paper]
  • [arXiv 24] "PPLLAVA: VARIED VIDEO SEQUENCE UNDERSTANDING WITH PROMPT GUIDANCE". Liu et al. [Paper] [Resource]
  • [arXiv 24] "World Knowledge-Enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving". Zhai et al. [Paper]
Combination Strategy
  • [NeurIPS 24] "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs". Tong et al. [Paper] [Resource]
  • [ICLR 25] "EAGLE: EXPLORING THE DESIGN SPACE FOR MULTIMODAL LLMS WITH MIXTURE OF ENCODERS". Shi et al. [Paper] [Resource]
Interpretability
  • [EMNLP 24] "MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model". Huo et al. [Paper] [Resource]
  • [arXiv 24] "DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models". Yao et al. [Paper] [Resource]
