In the paper, it is stated: "We also learn a CLIP-style InternVideo2 indicated by InternVideo2_clip. It is post-pretrained from InternVideo2_s2 by only preserving video and text encoders and contrastive loss."
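For context, my understanding is that the "CLIP-style contrastive loss" mentioned here is the standard symmetric InfoNCE objective over paired video and text embeddings. Below is a minimal sketch of that objective; the function name, argument names, and temperature value are illustrative assumptions, not InternVideo2's actual implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric video-to-text / text-to-video InfoNCE loss.

    video_emb, text_emb: (batch, dim) paired embeddings, where
    row i of each tensor comes from the same video-text pair.
    """
    # L2-normalize so dot products become cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions, then average.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```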
May I ask what training dataset was used for InternVideo2_clip? Does it include any Chinese data? Approximately how much data would be required to fine-tune it effectively?