DCLM baseline low score #3

@kazakovaanastasia

Description

Good day! Thank you for the article. You have an interesting approach to taxonomy, which I have not seen before and which seems promising. The idea of training Qwen 0.5B for this task, and of taking as a basis the taxonomy that Qwen already knows from its pretraining data, is really cool.

I have a question about the data evaluation. Why does your DCLM baseline score barely above random guessing (27.7% vs. 25%)? I did not find the exact benchmark you used, but the MMLU-STEM questions available on Hugging Face are not rocket science: https://huggingface.co/datasets/TIGER-Lab/MMLU-STEM/viewer/default/test?p=4
It seems to me that the accuracy of the DCLM baseline should be higher.
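For reference, this is roughly how I would sanity-check a small checkpoint against that split with log-likelihood scoring (just a sketch: the model name and prompt format are placeholders, and I am assuming the dataset follows the standard MMLU schema with "question", "choices", and an integer "answer" index):

```python
# Sketch of a log-likelihood multiple-choice eval on MMLU-STEM.
# Assumptions: standard MMLU schema; any HF causal LM checkpoint works.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-0.5B"  # placeholder checkpoint, not the article's model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

ds = load_dataset("TIGER-Lab/MMLU-STEM", split="test")

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probs of the option tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # next-token prediction: logits at position t score the token at t+1
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        logprobs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

correct = 0
for ex in ds:
    prompt = f"Question: {ex['question']}\nAnswer:"
    scores = [option_logprob(prompt, " " + c) for c in ex["choices"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == ex["answer"])

print(f"accuracy = {correct / len(ds):.3f}  (random ≈ 0.25 for 4 options)")
```

With four answer options, anything consistently near 0.25 means the model is effectively guessing, which is why the 27.7% figure surprised me.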

Can you please describe the experiment in more detail?

DCLM-baseline. DCLM-baseline is a 3.6T token pre-training dataset based on Common Crawl. It is deduplicated, heuristically filtered, and labeled using a model-based classifier. For classification, the authors train a fastText classifier on instruction-formatted data from OpenHermes 2.5 [Teknium, 2023] and the r/ExplainLikeImFive subreddit. DCLM-baseline is curated by selecting the top 10% of documents, after de-duplication and heuristic filters, based on this classifier score [Li et al., 2025]. The HuggingFace dataset card notes the dataset is not intended for domains such as code and math.
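As I understand it, the filtering step described there amounts to something like the following (my own sketch, not the authors' code; the training file, labels, and documents are hypothetical):

```python
# Sketch of DCLM-style model-based quality filtering: train a fastText
# classifier on "high quality" (instruction-formatted) vs. "random web"
# text, then keep the top 10% of documents by high-quality probability.
import numpy as np
import fasttext

# train.txt (hypothetical): one labeled example per line, e.g.
#   __label__hq <OpenHermes / ELI5-style text>
#   __label__lq <random Common Crawl text>
model = fasttext.train_supervised(input="train.txt")

def hq_score(doc: str) -> float:
    """Probability the classifier assigns to the high-quality label."""
    # fastText predict() rejects newlines, so flatten the document first
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

docs = [
    "An example deduplicated, heuristically filtered web document ...",
    "Another candidate document ...",
]
scores = np.array([hq_score(d) for d in docs])
threshold = np.quantile(scores, 0.90)  # keep the top 10% by score
kept = [d for d, s in zip(docs, scores) if s >= threshold]
```

If I am reading this correctly, such a classifier selects for conversational, explanatory prose rather than STEM knowledge, which might be relevant to the low MMLU-STEM score, but I would like to hear how the evaluation itself was set up.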

Thank you, looking forward to your answer :)
