I'm currently trying to reproduce the results reported in Table 3 of the paper, specifically for the EgoSchema benchmark. However, despite trying various instruction formats, the best accuracy I have been able to achieve is 36.04%, which is significantly lower than the reported 46.5%. The best-performing prompt I found is the one used in Video-LLaMA-2:
Select the best answer to the following multiple-choice question based on the video.
{question}
Options:
(A) {a0}
(B) {a1}
(C) {a2}
(D) {a3}
(E) {a4}
Answer with the option's letter from the given choices directly and only give the best option. The best answer is:
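In case it matters, this is roughly how I instantiate that template in my own evaluation script (a minimal sketch; `build_prompt` and the `option 0` .. `option 4` field names come from the EgoSchema questions JSON I downloaded, not from this repo):

```python
# Minimal sketch of how I fill the template above for one EgoSchema sample.
# build_prompt and the field names are from my own script, not this repo.
PROMPT_TEMPLATE = (
    "Select the best answer to the following multiple-choice question based on the video.\n"
    "{question}\n"
    "Options:\n"
    "(A) {a0}\n"
    "(B) {a1}\n"
    "(C) {a2}\n"
    "(D) {a3}\n"
    "(E) {a4}\n"
    "Answer with the option's letter from the given choices directly "
    "and only give the best option. The best answer is:"
)

def build_prompt(sample: dict) -> str:
    """sample is one record from the EgoSchema questions JSON,
    with keys 'question' and 'option 0' .. 'option 4'."""
    return PROMPT_TEMPLATE.format(
        question=sample["question"],
        **{f"a{i}": sample[f"option {i}"] for i in range(5)},
    )
```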
Could you kindly clarify:
1. Which instruction prompt(s) were used when evaluating the video comprehension benchmarks?
2. What generation configuration (e.g., temperature, top_p, do_sample, num_beams, max_new_tokens) was used for each video comprehension task, for both the multiple-choice and the open-ended VQA benchmarks?
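For reference, this is the decoding setup I have been using in my attempts, so you can see exactly what I would need to change. It is only my guess (greedy decoding), not the paper's configuration, and the `images=` kwarg and variable names are placeholders for my pipeline:

```python
import torch

def generate_answer(model, tokenizer, input_ids, video_tensor):
    """Decoding settings from my reproduction attempts (my own guesses,
    NOT the paper's configuration). `images=` and `video_tensor` are
    placeholders for however the model consumes video features."""
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=video_tensor,   # placeholder kwarg for the video features
            do_sample=False,       # greedy decoding
            temperature=0.0,       # ignored when do_sample=False
            top_p=1.0,
            num_beams=1,
            max_new_tokens=16,     # the expected answer is a single letter
        )
    # Strip the prompt tokens and keep only the newly generated answer.
    return tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
```

If the paper used a different decoding strategy (e.g., sampling or beam search) or a different max_new_tokens for the open-ended VQA benchmarks, knowing the exact values would help a lot.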