Zero-shot inference

Hello! As I understand it, the number of images (I) and texts (T) is usually the same in each batch during training, but at the zero-shot inference stage the number of input images and texts is likely to differ. How did you handle this mismatch? Thanks!
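To make the question concrete, here is a minimal sketch of the shapes I have in mind. It assumes a CLIP-style setup in which images and texts are embedded separately and compared by cosine similarity; that is my assumption, not something confirmed for this model. The names `similarity_logits`, `softmax`, `dim`, and `scale` are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np

def similarity_logits(image_emb, text_emb, scale=100.0):
    """Cosine-similarity logits with shape (num_images, num_texts)."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return scale * image_emb @ text_emb.T

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding size

# Training-like batch: 4 images paired with 4 texts gives a square matrix.
train_logits = similarity_logits(rng.normal(size=(4, dim)),
                                 rng.normal(size=(4, dim)))

# Zero-shot-like case: 5 images against 3 class prompts gives a 5 x 3 matrix.
zs_logits = similarity_logits(rng.normal(size=(5, dim)),
                              rng.normal(size=(3, dim)))

# One possible reading: softmax over the text axis, one prediction per image.
zs_probs = softmax(zs_logits, axis=1)
pred = zs_probs.argmax(axis=1)

print(train_logits.shape, zs_logits.shape, pred.shape)
```

Under this assumption the rectangular matrix causes no problem by itself, since each image is scored against every text independently. What I would like to know is whether the model here works the same way or needs the counts to match.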