[Docs] About video/multi-image training format: auxiliary sequential tags like Image-1 or Frame-1 #1036

shuoyinn · 2025-05-08T10:20:29Z

Hello, thanks for your awesome work!

I have two questions about the data format details:

1. In the multi-image and video data part of your technical report (Intern-VL 2.5), you mention adding auxiliary sequential tags like "Image-1: " or "Frame-1: " before the placeholder. However, your doc at https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html does not show these tags. Since you trained with them, will it hurt model performance if I leave them out during inference or fine-tuning?
2. Your Hugging Face repository also mentions using these tags for multi-image/video data. However, in the video scenario, the example auxiliary tag is "Frame1" instead of "Frame-1". Which form should I use?
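
To make the two variants concrete, here is a minimal sketch of what I mean. It is only my own illustration, not code from your repository: `build_tagged_prefix` is a hypothetical helper, and I am assuming that the placeholder is the literal string `<image>` and that each tagged placeholder sits on its own line.

```python
def build_tagged_prefix(
    num_items: int,
    label: str = "Frame",          # "Frame" for video, "Image" for multi-image
    sep: str = "-",                # "-" gives "Frame-1", "" gives "Frame1"
    placeholder: str = "<image>",  # assumed placeholder token
) -> str:
    """Return one tagged placeholder line per image or frame (hypothetical helper)."""
    return "".join(
        f"{label}{sep}{i}: {placeholder}\n" for i in range(1, num_items + 1)
    )


question = "Describe the video."

# Variant from the technical report: "Frame-1: <image>"
prompt_report_style = build_tagged_prefix(2) + question

# Variant from the Hugging Face example: "Frame1: <image>"
prompt_hf_style = build_tagged_prefix(2, sep="") + question

# Variant without any tags, as in the chat data format doc
prompt_no_tags = "<image>\n" * 2 + question
```

My questions boil down to which of these three prompt shapes matches what the models saw during training.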

I just want to make sure that I can make the best use of your great models!

