
Clarification on Instructions Used for Video Comprehension Benchmarks & Cannot Reproduce Results #5


Description

@junghye01

Hi, thank you for your great work!

I'm currently trying to reproduce the results reported in Table 3 of the paper, specifically for the EgoSchema benchmark. However, despite trying various instruction formats, the best accuracy I have been able to achieve is 36.04%, which is significantly lower than the reported 46.5%. The best-performing prompt I found was the one used in Video-LLaMA-2:

Select the best answer to the following multiple-choice question based on the video.
{question}
Options:
(A) {a0}
(B) {a1}
(C) {a2}
(D) {a3}
(E) {a4}
Answer with the option's letter from the given choices directly and only give the best option. The best answer is:
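
For context, here is a minimal sketch of how I fill that template for each multiple-choice item. The function and argument names are my own and purely illustrative, not taken from your evaluation code:

```python
# Minimal sketch of how I build the prompt for one multiple-choice item.
# build_prompt and its arguments are my own illustrative names, not from the repo.
PROMPT_TEMPLATE = (
    "Select the best answer to the following multiple-choice question based on the video.\n"
    "{question}\n"
    "Options:\n"
    "(A) {a0}\n"
    "(B) {a1}\n"
    "(C) {a2}\n"
    "(D) {a3}\n"
    "(E) {a4}\n"
    "Answer with the option's letter from the given choices directly "
    "and only give the best option. The best answer is:"
)

def build_prompt(question: str, options: list[str]) -> str:
    """Fill the template with one question and its five candidate answers."""
    assert len(options) == 5
    return PROMPT_TEMPLATE.format(
        question=question,
        a0=options[0], a1=options[1], a2=options[2],
        a3=options[3], a4=options[4],
    )
```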

Could you kindly clarify:

  1. Which instruction prompt(s) were used for evaluating the video comprehension benchmarks?

  2. What generation configurations (e.g., temperature, top_p, do_sample, num_beams, max_new_tokens, etc.) were used for each video comprehension task (both multi-choice and open-ended VQA benchmarks)?
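
On point 2, for reference, these are the decoding settings I fell back to in my own runs. They are only my guesses at reasonable defaults (assuming a standard Hugging Face `generate()` interface), not values taken from the paper:

```python
# Greedy-decoding defaults I used while trying to reproduce the numbers;
# these are my own assumptions, not settings confirmed by the paper.
generation_kwargs = dict(
    do_sample=False,    # deterministic decoding
    num_beams=1,        # plain greedy search
    temperature=1.0,    # ignored when do_sample=False
    top_p=1.0,          # ignored when do_sample=False
    max_new_tokens=16,  # enough to emit a single option letter
)

# outputs = model.generate(**inputs, **generation_kwargs)  # model/inputs come from my eval loop
```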
