Hi DeepSeek Team,
First of all, thank you for open-sourcing DeepSeek-V3.2-Exp and sharing the impressive benchmark results.
Our team has been working to reproduce the results reported in your official release, but we’ve encountered some discrepancies that we’d like to understand better.
Reproduction Setup
- Inference: SGLang v0.5.4.post1 on 8× H200
- Model weights: from the official Hugging Face repository (commit 9d2f599, main branch)
- Sampling parameters (a request sketch follows this list):
  - temperature = 0.6
  - top_p = 0.95
  - max_tokens = 32768
  - k = 64 (samples per problem for pass@1 computation, following the DeepSeek-R1 paper)
- Datasets: AIME 2025
- Prompt format:
  > Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem. {question} Remember to put your answer on its own line after "Answer:", and you do not need to use a \boxed command.
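For concreteness, below is a minimal sketch of how we issue each of the k samples against SGLang's OpenAI-compatible endpoint. The base URL, port, and served model name are placeholders for our local deployment, and the actual harness batches requests rather than looping:

```python
# Minimal sketch: one request per sample against the SGLang OpenAI-compatible
# server. base_url and model name are placeholders for our local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

PROMPT_TEMPLATE = (
    "Solve the following math problem step by step. The last line of your "
    "response should be of the form Answer: $ANSWER (without quotes) where "
    "$ANSWER is the answer to the problem.\n\n{question}\n\n"
    'Remember to put your answer on its own line after "Answer:", and you do '
    "not need to use a \\boxed command."
)

def sample_completions(question: str, k: int = 64) -> list[str]:
    """Draw k independent samples with the sampling parameters listed above."""
    outputs = []
    for _ in range(k):
        resp = client.chat.completions.create(
            model="deepseek-ai/DeepSeek-V3.2-Exp",
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(question=question)}],
            temperature=0.6,
            top_p=0.95,
            max_tokens=32768,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs
```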
Our Reproduction Result
| Mode | AIME 2025 Pass@1 (reported → ours) |
|---|---|
| Thinking | 89.3 → 64.9 |
We have double-checked dataset integrity, prompts, and sampling settings, but still observed a noticeable performance gap.
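To rule out a scoring difference on our side, the sketch below shows roughly how we extract answers and compute pass@1. The exact-match check is simplified here (our harness normalizes answers before comparison), and the function names are illustrative:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the content of the last 'Answer: ...' line, per the prompt format."""
    matches = re.findall(r"(?im)^answer:\s*(.+?)\s*$", completion)
    return matches[-1] if matches else None

def pass_at_1(samples: list[str], reference: str) -> float:
    """pass@1 for one problem: the fraction of the k samples that are correct,
    following the averaging described in the DeepSeek-R1 paper."""
    correct = sum(extract_answer(s) == reference for s in samples)  # simplified exact match
    return correct / len(samples)

def benchmark_pass_at_1(per_problem: list[tuple[list[str], str]]) -> float:
    """Benchmark score: mean of per-problem pass@1 over all AIME 2025 problems."""
    return sum(pass_at_1(samples, ref) for samples, ref in per_problem) / len(per_problem)
```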
Questions
Could you please share a bit more about your evaluation configuration? For example:
- The exact inference environment or framework version used
- Whether any custom decoding, filtering, or re-ranking steps were applied
- Any additional prompt preprocessing or formatting details
Any clarification would be greatly appreciated!
Thanks again for releasing such a high-quality model.