Hi OpenGVLab team, thank you very much for all your excellent models.
Section 3.1 of the InternVideo 2.5 paper mentions:
(1) uniform token pruning in early layers to maintain structural integrity while reducing computational overhead, and (2) attention-guided token selection in deeper layers to retain task-relevant essences.
Regarding the second point, attention-guided token selection, could you please share the specific method you used? Since this step needs the attention weights, it may not be compatible with Flash Attention 2, which does not materialize the full attention matrix. Does computing the weights separately lead to excessive memory consumption?
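To make the question concrete, here is a minimal framework-free sketch of what I imagine the two stages could look like. It is my guess, not taken from the paper or the released code: the function names `uniform_prune` and `attention_guided_select` are hypothetical, and I am assuming that token importance is the softmax attention a token receives from a small set of query vectors (for example, the text tokens), summed over those queries.

```python
import math

def uniform_prune(tokens, keep_every=2):
    """Stage 1 (assumed): keep every `keep_every`-th token, ignoring content."""
    return tokens[::keep_every]

def attention_guided_select(keys, queries, k):
    """Stage 2 (assumed): keep the k tokens that receive the most attention.

    Only a len(queries) x len(keys) score matrix is ever computed,
    never the full len(keys) x len(keys) attention matrix.
    Returns the kept indices in their original order.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product logits of this query against every key.
        logits = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        # Numerically stable softmax over the keys.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            importance[i] += e / total
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: four 2-dimensional visual tokens, one query vector.
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.2]]
queries = [[1.0, 0.0]]
kept = attention_guided_select(keys, queries, k=2)
```

If the selection works roughly like this, the extra score matrix would be q × N per head instead of N × N, so the remaining layers could presumably still run with Flash Attention 2. Is that close to what you actually do, or do you need the full attention map?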