Hi, I ran the test and found that many models can only answer 320 questions correctly. Why is that, and why exactly 320? If I enter 330 questions, only the answers within the initial "context window" are correct, and all subsequent ones fail.

I tested gemini2.0-flash, gemini2.5-flash, gemma3-27b, and qwen3_235b-a22b.
Of these, gemini2.0-flash, gemini2.5-flash, and qwen3_235b-a22b all get exactly 320 correct.


This is my prompt:
```
Here are n five-digit additions in the form of
Qn. xn+yn,

You need to answer in the form of
An. {anwser}
, do not group,

example:
`
A1. 79281
A2. 138779
A3. 139180
...
`
```
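
To make the 320 cutoff easier to reproduce, here is a minimal sketch of how the questions could be generated and a model reply scored. It is not the harness used for the results above. The function names (`make_questions`, `format_questions`, `score`) are hypothetical, and the reply text is assumed to come from whatever client you already use to call the model.

```python
import random
import re

def make_questions(n, seed=0):
    """Return n pairs of random five-digit integers (hypothetical helper)."""
    rng = random.Random(seed)
    return [(rng.randint(10000, 99999), rng.randint(10000, 99999)) for _ in range(n)]

def format_questions(pairs):
    """Render the pairs as 'Qn. x+y' lines, matching the prompt format above."""
    return "\n".join(f"Q{i}. {x}+{y}" for i, (x, y) in enumerate(pairs, start=1))

# Matches lines such as 'A12. 138779'; anything else in the reply is ignored.
ANSWER_RE = re.compile(r"^A(\d+)\.[ \t]*(-?\d+)[ \t]*$", re.MULTILINE)

def score(pairs, reply):
    """Return (number of correct answers, index of the first wrong or missing answer)."""
    answers = {int(i): int(v) for i, v in ANSWER_RE.findall(reply)}
    correct = [answers.get(i) == x + y for i, (x, y) in enumerate(pairs, start=1)]
    first_wrong = next((i for i, ok in enumerate(correct, start=1) if not ok), None)
    return sum(correct), first_wrong
```

Calling `score(pairs, reply)` on a reply to 330 questions shows whether the failures start at a single index (which would suggest the output was cut off there) or are scattered throughout.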
