
(ICLR Oral, DiSF) Combatting Dimensional Collapse in LLM Pre-training Data via Diversified File Selection.


Ziqing Fan1,2 , Siyuan Du2,3 , Shengchao Hu1,2, Pingjie Wang1,2, Li Shen4, Ya Zhang1,2, Dacheng Tao5, Yanfeng Wang1,2

1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 Fudan University, 4 Shenzhen Campus of Sun Yat-sen University, 5 Nanyang Technological University

To-do list

visualization code release (done)
other baseline code release
improvement on recent code
extracted data and model release
evaluation

1. Environment

The environment can be built with the following commands, adapted from the TinyLlama repo (https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md):

pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
pip install -r ./requirements.txt tokenizers sentencepiece

The detailed package versions used in our experiments are provided in the environment.txt file.
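To check that the core dependencies installed correctly, you can run a quick sanity check like the one below (a minimal sketch; it only assumes the packages installed above import cleanly):

# Quick sanity check that the pre-training dependencies import cleanly.
import torch
import xformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
print("flash-attn:", flash_attn.__version__)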

2. Data Preparation

You can download SlimPajama-627B with the following commands:

cd /path/to/dataset  
git lfs install  
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B  

or try an alternative mirror:

git clone https://gitee.com/hf-datasets/SlimPajama-627B
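Once the download finishes, it can be worth sanity-checking a shard before pre-processing. The sketch below decompresses one .jsonl.zst file and prints the first record; the exact shard path is illustrative, so adjust it to your clone (requires the zstandard package):

# Inspect one SlimPajama shard; the path below is illustrative.
import io
import json
import zstandard as zstd

path = "/path/to/dataset/SlimPajama-627B/train/chunk1/example_train_0.jsonl.zst"
with open(path, "rb") as fh:
    stream = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
    record = json.loads(stream.readline())  # one JSON object per line
print(record["text"][:200])  # preview the "text" field of the first document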

3. Data Pre-processing

You should first tokenize the datasets and divide them into chunks:

python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama  --destination_path data/slim_star_combined \
--split validation --percentage 1.0  
python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama  --destination_path data/slim_star_combined \
--split train --percentage 1.0

In parallel, you need to extract text features:

cd ./DISF
python extract_feature.py  

Note that you need to run this command 10 times, modifying the data path in the Python file each time, so that all chunk files are converted into features.
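If you would rather not edit the script by hand between runs, a small driver like the sketch below can launch all ten extractions. It assumes a hypothetical modification in which extract_feature.py reads its chunk index from a CHUNK_ID environment variable; the released script hard-codes the path instead:

# Hypothetical driver for the ten extraction runs. It assumes you modify
# extract_feature.py to read its chunk index from CHUNK_ID; as released,
# the script uses a hard-coded path that you edit manually.
import os
import subprocess

for chunk_id in range(1, 11):  # SlimPajama training chunks 1..10
    env = dict(os.environ, CHUNK_ID=str(chunk_id))
    subprocess.run(["python", "extract_feature.py"], env=env, check=True)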

4. Data Selection

See ./DISF to select files via DiSF.
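The code in ./DISF is the reference implementation. Purely as an illustration of the underlying idea, the sketch below greedily adds the file whose features keep the covariance spectrum of the selected set as flat as possible, which is the intuition behind combatting dimensional collapse; the unit normalization, the Frobenius-norm objective, and the feature shapes are assumptions, not the exact criterion used in the paper:

# Conceptual sketch of diversified file selection, NOT the ./DISF code.
# Greedily pick files that keep the feature covariance spectrum flat
# (a small Frobenius norm of the covariance of unit-norm features).
import numpy as np

def greedy_diverse_select(feats: np.ndarray, budget: int) -> list[int]:
    """feats: (n_files, dim) array with one feature row per file."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit norm
    selected: list[int] = []
    for _ in range(budget):
        best_idx, best_score = -1, np.inf
        for i in range(len(feats)):
            if i in selected:
                continue
            subset = feats[selected + [i]]
            cov = subset.T @ subset / len(subset)
            score = np.linalg.norm(cov, ord="fro")  # smaller => flatter spectrum
            if score < best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

# Toy usage on random features.
rng = np.random.default_rng(0)
print(greedy_diverse_select(rng.normal(size=(50, 8)), budget=5))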

5. Pre-train

See ./TinyLLama to pre-train the model.

6. Model Version Transfer and Evaluation

to be continued

7. Visualization and Verification of Dimensional Collapse

  1. First extract the features of your selected files (see the Data Pre-processing section above).
  2. Before running the following commands to visualize dimensional collapse, set your data path and figure save path in ./Visual&verify/collapse.py:
cd ./Visual&verify/
python collapse.py
  3. Similarly, before running the following commands to compute the dominance score, set your data path in ./Visual&verify/dominance_score.py:
cd ./Visual&verify/
python dominance_score.py
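For intuition, the sketch below shows the kind of computation these scripts involve: the sorted eigenvalues of the feature covariance plotted on a log scale (a near-flat spectrum indicates diverse features, while a few dominant eigenvalues indicate collapse), followed by a simple top-k variance ratio as a dominance-style score. The file path, array layout, and exact score definition are assumptions; collapse.py and dominance_score.py remain the reference:

# Sketch of a dimensional-collapse check on extracted features.
# Assumption: features are stored as an (n_samples, dim) array.
import numpy as np
import matplotlib.pyplot as plt

feats = np.load("features.npy")                 # assumed layout
feats = feats - feats.mean(axis=0)              # center before covariance
cov = feats.T @ feats / len(feats)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

plt.semilogy(eigvals)
plt.xlabel("eigenvalue index")
plt.ylabel("eigenvalue (log scale)")
plt.savefig("collapse_sketch.png")

k = 10  # arbitrary choice for illustration
print(f"top-{k} variance ratio: {eigvals[:k].sum() / eigvals.sum():.3f}")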

Citation

If you find this work relevant to your research or applications, please cite it:

@article{fan2025combatting,
  title={Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection},
  author={Fan, Ziqing and Du, Siyuan and Hu, Shengchao and Wang, Pingjie and Shen, Li and Zhang, Ya and Tao, Dacheng and Wang, Yanfeng},
  journal={arXiv preprint arXiv:2504.20644},
  year={2025}
}
