DCLM baseline low score #3

@kazakovaanastasia

Description

Good day! Thank you for the article. You have an interesting approach to taxonomy, which I have not seen before and which seems promising. The idea of training Qwen 0.5B for this task, and of taking as a basis the taxonomy that Qwen already knows from its pretraining data, is really cool.

I have a question about the data evaluation. Why does your DCLM baseline score barely above random guessing (27.7% vs. 25%)? I did not find the exact benchmark you used, but the MMLU-STEM questions available on Hugging Face are not rocket science: https://huggingface.co/datasets/TIGER-Lab/MMLU-STEM/viewer/default/test?p=4
It seems to me that the accuracy of the DCLM baseline should be higher.
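For reference, this is roughly how I would sanity-check a small checkpoint against that split with log-likelihood scoring (just a sketch: the model name and prompt format are placeholders, and I am assuming the dataset follows the standard MMLU schema with "question", "choices", and an integer "answer" index):

```python
# Sketch of a log-likelihood multiple-choice eval on MMLU-STEM.
# Assumptions: standard MMLU schema; any HF causal LM checkpoint works.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-0.5B"  # placeholder checkpoint, not the article's model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

ds = load_dataset("TIGER-Lab/MMLU-STEM", split="test")

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probs of the option tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # next-token prediction: logits at position t score the token at t+1
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        logprobs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

correct = 0
for ex in ds:
    prompt = f"Question: {ex['question']}\nAnswer:"
    scores = [option_logprob(prompt, " " + c) for c in ex["choices"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == ex["answer"])

print(f"accuracy = {correct / len(ds):.3f}  (random ≈ 0.25 for 4 options)")
```

With four answer options, anything consistently near 0.25 means the model is effectively guessing, which is why the 27.7% figure surprised me.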

Can you please describe the experiment in more detail?

DCLM-baseline. DCLM-baseline is a 3.6T token pre-training dataset based on Common Crawl. It is deduplicated, heuristically filtered, and labeled using a model-based classifier. For classification, the authors train a fastText classifier on instruction-formatted data from OpenHermes 2.5 [Teknium, 2023] and the r/ExplainLikeImFive subreddit. DCLM-baseline is curated by selecting the top 10% of documents, after de-duplication and heuristic filters, based on this classifier score [Li et al., 2025]. The HuggingFace dataset card notes the dataset is not intended for domains such as code and math.
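As I understand it, the filtering step described there amounts to something like the following (my own sketch, not the authors' code; the training file, labels, and documents are hypothetical):

```python
# Sketch of DCLM-style model-based quality filtering: train a fastText
# classifier on "high quality" (instruction-formatted) vs. "random web"
# text, then keep the top 10% of documents by high-quality probability.
import numpy as np
import fasttext

# train.txt (hypothetical): one labeled example per line, e.g.
#   __label__hq <OpenHermes / ELI5-style text>
#   __label__lq <random Common Crawl text>
model = fasttext.train_supervised(input="train.txt")

def hq_score(doc: str) -> float:
    """Probability the classifier assigns to the high-quality label."""
    # fastText predict() rejects newlines, so flatten the document first
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

docs = [
    "An example deduplicated, heuristically filtered web document ...",
    "Another candidate document ...",
]
scores = np.array([hq_score(d) for d in docs])
threshold = np.quantile(scores, 0.90)  # keep the top 10% by score
kept = [d for d, s in zip(docs, scores) if s >= threshold]
```

If I am reading this correctly, such a classifier selects for conversational, explanatory prose rather than STEM knowledge, which might be relevant to the low MMLU-STEM score, but I would like to hear how the evaluation itself was set up.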

Thank you, looking forward to your answer :)
