
Some questions about encoder-free VLMs #3

@bollossom


Mono-InternVL is wonderful work! I have a few questions I would like to ask, and I would be very grateful for any answers you could provide.

Firstly, regarding convolutional architectures: can the dynamic resolution scheme used in the InternVL series [1] improve performance? I have observed that directly increasing the input resolution, as in ConvLLaVA (which does not use dynamic resolution), keeps the visual token count growing slowly while performance stays normal. However, other encoder-free models such as Mono-InternVL [2] and HoVLE [3] do employ dynamic resolution. In your opinion, should encoder-free models use dynamic resolution? (A rough token-count comparison is sketched below.)
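
To make this concrete, here is a rough sketch of how visual token counts scale under the two strategies. None of this is code from either repo; the effective conv stride (32), tile size (448), and tokens-per-tile (256) are my own assumptions for illustration.

```python
import math

def tokens_conv_direct(h: int, w: int, stride: int = 32) -> int:
    """Directly raise the input resolution of a conv backbone (ConvLLaVA-style)."""
    return (h // stride) * (w // stride)

def tokens_dynamic_tiling(h: int, w: int, tile: int = 448,
                          max_tiles: int = 12, tokens_per_tile: int = 256) -> int:
    """InternVL-style dynamic resolution: cover the image with up to
    `max_tiles` tiles of size `tile`x`tile`, plus one global thumbnail tile."""
    n_tiles = min(max_tiles, math.ceil(h / tile) * math.ceil(w / tile)) + 1
    return n_tiles * tokens_per_tile

for size in (448, 896, 1344):
    print(f"{size}px  conv-direct: {tokens_conv_direct(size, size):5d}  "
          f"dynamic-tiling: {tokens_dynamic_tiling(size, size):5d}")
```

Under these assumptions the direct conv path produces far fewer tokens at the same resolution, which is what I mean by "slow token growth".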

Secondly, for both encoder-free and encoder-based models, the attention maps in the first few layers typically show relatively weak interaction between user-prompt tokens and vision tokens [2]. Do you think this is a key factor limiting the performance of encoder-free models?
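
Here is roughly the kind of probe I have in mind for this second question: per layer, how much attention mass the prompt (text) tokens place on the vision tokens. The tensors below are random placeholders standing in for the per-layer attention maps of shape (batch, heads, seq, seq) that a real model would return (e.g. with `output_attentions=True` in HF Transformers); the layer count and token-range split are made up.

```python
import torch

def text_to_vision_attention(attentions, vision_slice, text_slice):
    """Return, per layer, the average attention weight that text query
    positions assign to vision key positions."""
    scores = []
    for layer_attn in attentions:  # (batch, heads, seq, seq)
        block = layer_attn[:, :, text_slice, vision_slice]   # text queries -> vision keys
        scores.append(block.sum(dim=-1).mean().item())        # mass on vision, averaged
    return scores

# Placeholder: 24 layers, 8 heads, 576 vision tokens followed by 64 text tokens.
seq = 576 + 64
fake_attn = [torch.softmax(torch.randn(1, 8, seq, seq), dim=-1) for _ in range(24)]
per_layer = text_to_vision_attention(fake_attn,
                                     vision_slice=slice(0, 576),
                                     text_slice=slice(576, seq))
for i, s in enumerate(per_layer):
    print(f"layer {i:2d}: text->vision attention mass = {s:.3f}")
```

With real attention maps, a low score in the early layers would correspond to the weak prompt-vision interaction reported in [2].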

[1] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

[2] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

[3] HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
