(ICLR Oral, DiSF) Combatting Dimensional Collapse in LLM Pre-training Data via Diversified File Selection.
| 📑 Paper | 🐱 Github Repo |
Ziqing Fan1,2 , Siyuan Du2,3 , Shengchao Hu1,2, Pingjie Wang1,2, Li Shen4, Ya Zhang1,2, Dacheng Tao5, Yanfeng Wang1,2
1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 Fudan University, 4 Shenzhen Campus of Sun Yat-sen University, 5 Nanyang Technological University
- [x] Visualization code release
- [ ] Other baseline code release
- [ ] Improvement on recent code
- [ ] Extracted data and model release
- [ ] Evaluation
For the environment, we provide the following setup commands, based on the TinyLlama repo (https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md):
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
pip install -r ./requirements.txt tokenizers sentencepiece
The detailed environment used in our experiments is provided in the environment.txt file.
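After installation, a quick import check (our own sketch, not part of the repo) can confirm that the key packages are usable:

```python
# Minimal environment sanity check (not part of the repo): verify that
# the key packages import and that CUDA is visible.
import torch
import xformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
print("flash-attn:", flash_attn.__version__)
```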
For data, you can download SlimPajama-627B with the following commands:
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
or try another source:
git clone https://gitee.com/hf-datasets/SlimPajama-627B
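Alternatively, the dataset can be fetched with the huggingface_hub Python API (our suggestion, assuming `pip install huggingface_hub`):

```python
# Alternative download via the huggingface_hub API, in case git-lfs
# is inconvenient (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cerebras/SlimPajama-627B",
    repo_type="dataset",
    local_dir="/path/to/dataset/SlimPajama-627B",
)
```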
You should first tokenize the dataset and divide it into chunks:
python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama --destination_path data/slim_star_combined \
--split validation --percentage 1.0
python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama --destination_path data/slim_star_combined \
--split train --percentage 1.0
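To confirm that tokenization produced output, a quick check like the following can help (our own sketch; it assumes the TinyLlama prepare script writes packed .bin chunk files under the destination path):

```python
# Quick sanity check (our own sketch) that tokenized chunks were written.
# Assumes the TinyLlama prepare script emits packed .bin chunk files
# under the destination path.
from pathlib import Path

dest = Path("data/slim_star_combined")
chunks = sorted(dest.rglob("*.bin"))
print(f"found {len(chunks)} chunk files under {dest}")
for p in chunks[:5]:
    print(p, f"{p.stat().st_size / 1e6:.1f} MB")
```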
In parallel, you need to extract text features:
cd ./DISF
python extract_feature.py
Note that you should run this command 10 times, modifying the path in the Python file each time, so that all chunk files are extracted into features.
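If you prefer not to edit the file by hand between runs, one option is to adapt extract_feature.py to read a chunk index from an environment variable and drive it in a loop. The CHUNK_ID variable below is our own assumption and is not in the released script:

```python
# Optional convenience (hypothetical): drive the 10 extraction runs
# programmatically. Assumes extract_feature.py has been modified to read
# the chunk path from the CHUNK_ID environment variable instead of a
# hard-coded path; the released script requires editing the path by hand.
import os
import subprocess

for chunk_id in range(10):
    env = dict(os.environ, CHUNK_ID=str(chunk_id))
    subprocess.run(["python", "extract_feature.py"], env=env, check=True)
```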
See ./DISF to select files via DiSF.
See ./TinyLLama to pre-train the model.
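To give intuition for what the selection step optimizes, here is a toy sketch: greedily add the file whose feature keeps the Frobenius norm of the running covariance smallest, which encourages a more uniform (less collapsed) eigenvalue spectrum. This is an illustration only, with our own function and variable names; the implementation in ./DISF is the reference and differs in details and efficiency.

```python
# Toy illustration of diversified greedy selection (not the actual DiSF
# code): at each step, add the file whose feature keeps the Frobenius
# norm of the running covariance smallest.
import numpy as np

def greedy_diverse_select(feats: np.ndarray, budget: int) -> list[int]:
    n, d = feats.shape
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize
    cov = np.zeros((d, d))
    remaining = set(range(n))
    selected: list[int] = []
    for _ in range(budget):
        # Score each candidate by the Frobenius norm after adding it.
        scores = {i: np.linalg.norm(cov + np.outer(feats[i], feats[i]))
                  for i in remaining}
        best = min(scores, key=scores.get)
        cov += np.outer(feats[best], feats[best])
        remaining.remove(best)
        selected.append(best)
    return selected
```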
To be continued.
- You should first extract features of your selected files (see the data pre-processing part above).
- Before using the following commands to visualize dimensional collapse, define your data path and figure save path in ./Visual&verify/collapse.py (a sketch of the underlying computation follows after this list):
cd ./Visual&verify/
python collapse.py
- Similarly, before using the following commands to calculate the dominance score, define your data path in ./Visual&verify/dominance_score.py:
cd ./Visual&verify/
python dominance_score.py
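For intuition about what these scripts compute, here is a minimal sketch (ours, not the repo code), assuming features are stored as an (N, d) numpy array. Dimensional collapse shows up as a fast-decaying eigenvalue spectrum of the feature covariance; the top-k variance ratio printed at the end is only an illustration of "dominance" and may differ from the exact score defined in the paper and in dominance_score.py:

```python
# Sketch of the underlying computations; collapse.py and
# dominance_score.py in ./Visual&verify are the reference.
import numpy as np
import matplotlib.pyplot as plt

feats = np.load("path/to/features.npy")            # (N, d) text features
feats = feats - feats.mean(axis=0, keepdims=True)  # center
# Eigenvalues of the feature covariance, in descending order.
eigvals = np.linalg.eigvalsh(feats.T @ feats / len(feats))[::-1]

plt.plot(np.log(eigvals + 1e-12))
plt.xlabel("eigenvalue index")
plt.ylabel("log eigenvalue")
plt.savefig("collapse_spectrum.png")

# Illustrative "dominance": fraction of variance in the top-k eigenvalues.
k = 10
print(f"top-{k} variance ratio: {eigvals[:k].sum() / eigvals.sum():.3f}")
```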
If you find this work relevant to your research or applications, please feel free to cite it!
@article{fan2025combatting,
title={Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection},
author={Fan, Ziqing and Du, Siyuan and Hu, Shengchao and Wang, Pingjie and Shen, Li and Zhang, Ya and Tao, Dacheng and Wang, Yanfeng},
journal={arXiv preprint arXiv:2504.20644},
year={2025}
}