Hello! During training, the number of images (I) and the number of texts (T) in a batch is usually the same. At the zero-shot inference stage, however, the number of input images and the number of texts are likely to differ. How did you handle this mismatch? Thanks!
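To make the question concrete, here is a minimal sketch of the setup I have in mind. It assumes a CLIP-style dual encoder in which images and texts are embedded independently and compared by cosine similarity; this is my assumption, not necessarily what this repository does. Under that assumption the similarity matrix has shape (I, T) and does not need to be square, so only the training loss would rely on I == T. All names, the toy embeddings, and the `scale` value are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity of every image with every text: I rows, T columns."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def softmax(row, scale=100.0):
    """Softmax over one row; `scale` is an assumed temperature, not a known setting."""
    m = max(row)
    exps = [math.exp(scale * (x - m)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(image_embs, text_embs):
    """For each image, return the index of the most similar text prompt."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]

# Hypothetical toy embeddings: I = 2 images, T = 3 class prompts, dimension 2.
image_embs = [[1.0, 0.1], [0.0, 2.0]]
text_embs = [[0.9, 0.0], [0.1, 1.0], [-1.0, 0.0]]

sims = similarity_matrix(image_embs, text_embs)    # 2 x 3, not square
probs = [softmax(row) for row in sims]             # one distribution per image
predictions = zero_shot_predict(image_embs, text_embs)
```

In this sketch each image gets its own distribution over the T prompts, so the image count never has to match the prompt count. Is that roughly how inference works here, or is something else done?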