Description
Good day! Thank you for the article. Your approach to taxonomy is interesting, I have not seen it before, and it seems promising. The idea of training Qwen 0.5B for such a task, and taking as a basis the taxonomy that Qwen already knows from its pretraining data, is cool.
I have a question about the evaluation. Why does your DCLM baseline score only about 3 percentage points above random guessing (27.7% vs. 25%)? I did not see the exact benchmark, but the MMLU-STEM questions on HF are not rocket science: https://huggingface.co/datasets/TIGER-Lab/MMLU-STEM/viewer/default/test?p=4 It seems to me that the accuracy of the DCLM baseline should be higher.
Can you please describe the experiment in more detail?
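For reference, here is a minimal sketch of how one could verify the random-guess baseline and spot-check the questions. It assumes the standard MMLU schema (question / choices / answer fields); please check the actual dataset card if it differs:

```python
# Hypothetical sketch: inspect TIGER-Lab/MMLU-STEM and compute the
# random-guess baseline. Field names assume the usual MMLU schema.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-STEM", split="test")

# Random guessing over k answer choices scores 1/k on average;
# with 4 choices that is 25%, so 27.7% is only ~2.7 points above chance.
n_choices = len(ds[0]["choices"])
print(f"random-guess baseline: {1 / n_choices:.1%}")

# Spot-check a few questions to gauge their difficulty.
for ex in ds.select(range(3)):
    print(ex["question"], ex["choices"], ex["answer"])
```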
DCLM-baseline. DCLM-baseline is a 3.6T-token pre-training dataset based on Common Crawl. It is deduplicated, heuristically filtered, and labeled using a model-based classifier. For classification, the authors train a fastText classifier on instruction-formatted data from OpenHermes 2.5 and the r/ExplainLikeImFive subreddit [Teknium, 2023]. DCLM-baseline is curated by selecting the top 10% of documents after de-duplication and heuristic filters based on this classifier score [Li et al., 2025]. The HuggingFace dataset card notes the dataset is not intended for domains such as code and math.
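To make sure I understand the filtering step you describe: here is a rough sketch of that kind of fastText-based quality filtering. The training examples, labels, and cutoff below are made up for illustration; DCLM actually trains on OpenHermes 2.5 and r/ExplainLikeImFive positives against Common Crawl negatives at scale:

```python
# Hypothetical sketch of DCLM-style fastText quality filtering.
import fasttext, tempfile, os

# Tiny made-up training file: __label__hq = instruction-style positives,
# __label__lq = generic web-text negatives.
train = "\n".join([
    "__label__hq Explain like I'm five: why is the sky blue?",
    "__label__hq Write a short proof that sqrt(2) is irrational.",
    "__label__lq click here for cheap pills best deals now",
    "__label__lq lorem ipsum dolor sit amet footer copyright 2019",
])
with tempfile.NamedTemporaryFile("w", suffix=".train", delete=False) as f:
    f.write(train)
model = fasttext.train_supervised(input=f.name, epoch=25)
os.unlink(f.name)

def quality_score(doc: str) -> float:
    """P(doc looks 'high quality') under the classifier."""
    labels, probs = model.predict(doc.replace("\n", " "))
    p = float(probs[0])
    return p if labels[0] == "__label__hq" else 1.0 - p

# DCLM-baseline keeps the top 10% of documents by this score.
docs = ["Why do magnets attract iron?", "cheap pills best deals click"]
scores = [quality_score(d) for d in docs]
cutoff = sorted(scores)[int(0.9 * len(scores))]  # illustrative top-10% cutoff
kept = [d for d, s in zip(docs, scores) if s >= cutoff]
print(kept)
```

Is this roughly the pipeline behind the baseline numbers, and was MMLU-STEM evaluated on a model pretrained on this data?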


Thank you, I look forward to your answer :)