Reproduction Inquiry: DeepSeek-V3.2-Exp AIME2025 Result #44

@HermitSun

Description

Hi DeepSeek Team,

First of all, thank you for open-sourcing DeepSeek-V3.2-Exp and sharing the impressive benchmark results.
Our team has been working to reproduce the results reported in your official release, but we’ve encountered some discrepancies that we’d like to understand better.


Reproduction Setup

  • Inference: SGLang v0.5.4.post1 on 8× NVIDIA H200 GPUs
  • Model weights: from the official Hugging Face repository (commit 9d2f599, main branch)
  • Sampling parameters (a minimal request sketch using these settings follows this list):
    temperature = 0.6
    top_p = 0.95
    max_tokens = 32768
    k = 64  # number of samples per question for pass@1, following the DeepSeek-R1 paper
    
  • Dataset: AIME 2025
  • Prompt format:
    Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.
    
    {question}
    
    Remember to put your answer on its own line after "Answer:", and you do not need to use a \boxed command.
    
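For reference, the snippet below is a minimal sketch of how we issue the sampling requests against SGLang's OpenAI-compatible endpoint. The base URL, served model name, and the use of a single request with n = k are assumptions about our local setup, not details taken from the official evaluation.

```python
# Minimal sketch (assumptions: local SGLang server at port 30000, served model
# name "deepseek-ai/DeepSeek-V3.2-Exp"; if the server rejects n > 1, we fall
# back to issuing k separate requests).
from openai import OpenAI

PROMPT_TEMPLATE = (
    "Solve the following math problem step by step. The last line of your "
    "response should be of the form Answer: $ANSWER (without quotes) where "
    "$ANSWER is the answer to the problem.\n\n"
    "{question}\n\n"
    'Remember to put your answer on its own line after "Answer:", '
    "and you do not need to use a \\boxed command."
)

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def sample_responses(question: str, k: int = 64) -> list[str]:
    """Draw k independent completions for one AIME question."""
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.2-Exp",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}],
        temperature=0.6,
        top_p=0.95,
        max_tokens=32768,
        n=k,
    )
    return [choice.message.content for choice in resp.choices]
```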

Our Reproduction Result

| Mode | AIME2025 Pass@1 (reported → ours) |
| --- | --- |
| Thinking | 89.3 → 64.9 |

We have double-checked dataset integrity, prompts, and sampling settings, but still observe a gap of roughly 24 points. Our answer extraction and pass@1 scoring are sketched below.
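
To rule out grading as the cause, here is a minimal sketch of the scoring we use. The helper names and the integer-normalization step are our own assumptions, and pass@1 is taken as the mean correctness over the k = 64 samples, as we understand the DeepSeek-R1 paper's protocol.

```python
import re

# Scoring sketch (assumptions: AIME answers are integers 0-999, and the
# reference answers come from our own copy of the dataset; helper names are ours).
ANSWER_RE = re.compile(r"^Answer:\s*(.+?)\s*$", re.MULTILINE)

def extract_answer(response: str) -> str | None:
    """Return the content of the last 'Answer: ...' line, if any."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None

def is_correct(response: str, reference: str) -> bool:
    """Compare after integer normalization, falling back to exact string match."""
    predicted = extract_answer(response)
    if predicted is None:
        return False
    try:
        return int(predicted) == int(reference)
    except ValueError:
        return predicted == reference.strip()

def pass_at_1(responses: list[str], reference: str) -> float:
    """pass@1 = mean correctness over the k sampled responses (k = 64 here)."""
    return sum(is_correct(r, reference) for r in responses) / len(responses)
```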


Questions

Could you please share a bit more about your evaluation configuration? For example:

  • The exact inference environment or framework version used
  • Whether any custom decoding, filtering, or re-ranking steps were applied
  • Any additional prompt preprocessing or formatting details

Any clarification would be greatly appreciated!
Thanks again for releasing such a high-quality model.
