I'm currently trying to reproduce the results reported in Table 3 of the paper, specifically for the EgoSchema benchmark. However, despite trying various instruction formats, the best accuracy I have been able to achieve is 36.04%, which is significantly lower than the reported 46.5%. The best-performing prompt I found is the one used in Video-LLaMA-2:
Select the best answer to the following multiple-choice question based on the video.
{question}
Options:
(A) {a0}
(B) {a1}
(C) {a2}
(D) {a3}
(E) {a4}
Answer with the option's letter from the given choices directly and only give the best option. The best answer is:
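In case it matters, this is roughly how I instantiate that template in my own evaluation script (a minimal sketch; `build_prompt` and the `option 0` .. `option 4` field names come from the EgoSchema questions JSON I downloaded, not from this repo):

```python
# Minimal sketch of how I fill the template above for one EgoSchema sample.
# build_prompt and the field names are from my own script, not this repo.
PROMPT_TEMPLATE = (
    "Select the best answer to the following multiple-choice question based on the video.\n"
    "{question}\n"
    "Options:\n"
    "(A) {a0}\n"
    "(B) {a1}\n"
    "(C) {a2}\n"
    "(D) {a3}\n"
    "(E) {a4}\n"
    "Answer with the option's letter from the given choices directly "
    "and only give the best option. The best answer is:"
)

def build_prompt(sample: dict) -> str:
    """sample is one record from the EgoSchema questions JSON,
    with keys 'question' and 'option 0' .. 'option 4'."""
    return PROMPT_TEMPLATE.format(
        question=sample["question"],
        **{f"a{i}": sample[f"option {i}"] for i in range(5)},
    )
```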
Could you kindly clarify:
1. Which instruction prompt(s) were used when evaluating the video comprehension benchmarks?
2. What generation configuration (e.g., temperature, top_p, do_sample, num_beams, max_new_tokens) was used for each video comprehension task, for both the multiple-choice and the open-ended VQA benchmarks?
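For reference, this is the decoding setup I have been using in my attempts, so you can see exactly what I would need to change. It is only my guess (greedy decoding), not the paper's configuration, and the `images=` kwarg and variable names are placeholders for my pipeline:

```python
import torch

def generate_answer(model, tokenizer, input_ids, video_tensor):
    """Decoding settings from my reproduction attempts (my own guesses,
    NOT the paper's configuration). `images=` and `video_tensor` are
    placeholders for however the model consumes video features."""
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=video_tensor,   # placeholder kwarg for the video features
            do_sample=False,       # greedy decoding
            temperature=0.0,       # ignored when do_sample=False
            top_p=1.0,
            num_beams=1,
            max_new_tokens=16,     # the expected answer is a single letter
        )
    # Strip the prompt tokens and keep only the newly generated answer.
    return tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
```

If the paper used a different decoding strategy (e.g., sampling or beam search) or a different max_new_tokens for the open-ended VQA benchmarks, knowing the exact values would help a lot.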