Description
Mono-InternVL is wonderful work! I have the following questions and would be very grateful if you could provide some answers.
Firstly, regarding the convolutional architecture, can the dynamic resolution used in the InternVL series [1] enhance performance? I have observed that directly increasing the input resolution without dynamic resolution, as in ConvLLaVA, leads to slow growth in the number of visual tokens while performance remains normal. However, encoder-free models such as Mono-InternVL [2] and HoVLE [3] do employ dynamic resolution. In your opinion, should encoder-free models use dynamic resolution?
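To make the token-growth comparison concrete, here is a rough sketch of how the visual token count might scale under the two strategies. All numbers (tile size, tokens per tile, tile cap, thumbnail, total stride) are illustrative assumptions rather than values taken from any of the cited papers, and the tiling is simplified to a plain ceiling grid instead of a real aspect-ratio matching scheme.

```python
import math

def tiled_token_count(width, height, tile=448, tokens_per_tile=256,
                      max_tiles=12, add_thumbnail=True):
    """Visual tokens when the image is cut into fixed-size tiles.

    Hypothetical settings: every tile yields `tokens_per_tile` tokens,
    and a global thumbnail tile is added when more than one tile is used.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    n_tiles = min(cols * rows, max_tiles)
    if add_thumbnail and n_tiles > 1:
        n_tiles += 1
    return n_tiles * tokens_per_tile

def direct_token_count(width, height, total_stride=64):
    """Visual tokens when the full image passes through one backbone
    whose overall downsampling factor is `total_stride` (assumed)."""
    return math.ceil(width / total_stride) * math.ceil(height / total_stride)

if __name__ == "__main__":
    for side in (448, 896, 1344):
        print(side, tiled_token_count(side, side), direct_token_count(side, side))
```

Under these assumed settings the tiled count grows much faster with resolution than the direct count, which is the trade-off the question is about.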
Secondly, for both encoder-free and encoder-based models, the attention values in the first few layers typically show relatively weak interaction between user prompt tokens and vision tokens [2]. Do you think this is a key factor affecting the performance of encoder-free models?
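One way the interaction I am referring to could be quantified is the share of attention mass that prompt-token queries place on vision-token keys in a given layer. The sketch below is only an assumed metric for illustration, not the measurement used in [2]; the function name and the toy attention matrix are hypothetical.

```python
def text_to_vision_attention(attn, vision_idx, text_idx):
    """Mean attention mass that text queries place on vision keys.

    attn: nested list shaped [heads][query][key], each row already
          softmax-normalized (sums to 1).
    vision_idx / text_idx: positions of vision and prompt tokens.
    """
    per_query = []
    for head in attn:
        for q in text_idx:
            # Sum the attention this text query gives to all vision keys.
            per_query.append(sum(head[q][k] for k in vision_idx))
    return sum(per_query) / len(per_query)

if __name__ == "__main__":
    # Toy single-head causal attention: tokens 0-1 are vision, 2-3 are text.
    toy = [[
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.25, 0.25, 0.5, 0.0],
        [0.1, 0.1, 0.4, 0.4],
    ]]
    print(text_to_vision_attention(toy, vision_idx=[0, 1], text_idx=[2, 3]))
```

Computing this value per layer would show whether the early layers really have a lower value than the later ones.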
[1] InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks