(ICLR Oral, DiSF) Combatting Dimensional Collapse in LLM Pre-training Data via Diversified File Selection.
| 📑 Paper | 🐱 Github Repo |
Ziqing Fan1,2 , Siyuan Du2,3 , Shengchao Hu1,2, Pingjie Wang1,2, Li Shen4, Ya Zhang1,2, Dacheng Tao5, Yanfeng Wang1,2
1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 Fudan University, 4 Shenzhen Campus of Sun Yat-sen University, 5 Nanyang Technological University
- [x] Visualization code release
- [ ] Other baseline code release
- [ ] Improvement on recent code
- [ ] Extracted data and model release
- [ ] Evaluation
For the environment, we provide the following setup commands, based on the TinyLlama repo (https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md):
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
pip install -r ./requirements.txt tokenizers sentencepiece
The detailed environment used in our experiments is provided in the environment.txt file.
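After installation, a quick import check (our own sketch, not part of the repo) can confirm that the key packages are usable:

```python
# Minimal environment sanity check (not part of the repo): verify that
# the key packages import and that CUDA is visible.
import torch
import xformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
print("flash-attn:", flash_attn.__version__)
```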
For data, you can download SlimPajama-627B with the following commands:
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
or try another source:
git clone https://gitee.com/hf-datasets/SlimPajama-627B
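Alternatively, the dataset can be fetched with the huggingface_hub Python API (our suggestion, assuming `pip install huggingface_hub`):

```python
# Alternative download via the huggingface_hub API, in case git-lfs
# is inconvenient (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cerebras/SlimPajama-627B",
    repo_type="dataset",
    local_dir="/path/to/dataset/SlimPajama-627B",
)
```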
You should first tokenize the dataset and divide it into chunks:
python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama --destination_path data/slim_star_combined \
--split validation --percentage 1.0
python ./TinyLLama/scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama \
--tokenizer_path data/llama --destination_path data/slim_star_combined \
--split train --percentage 1.0
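To confirm that tokenization produced output, a quick check like the following can help (our own sketch; it assumes the TinyLlama prepare script writes packed .bin chunk files under the destination path):

```python
# Quick sanity check (our own sketch) that tokenized chunks were written.
# Assumes the TinyLlama prepare script emits packed .bin chunk files
# under the destination path.
from pathlib import Path

dest = Path("data/slim_star_combined")
chunks = sorted(dest.rglob("*.bin"))
print(f"found {len(chunks)} chunk files under {dest}")
for p in chunks[:5]:
    print(p, f"{p.stat().st_size / 1e6:.1f} MB")
```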
In parallel, you need to extract text features:
cd ./DISF
python extract_feature.py
Note that you should run this command 10 times, modifying the path in the Python file each time, so that all chunk files are extracted into features.
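If you prefer not to edit the file by hand between runs, one option is to adapt extract_feature.py to read a chunk index from an environment variable and drive it in a loop. The CHUNK_ID variable below is our own assumption and is not in the released script:

```python
# Optional convenience (hypothetical): drive the 10 extraction runs
# programmatically. Assumes extract_feature.py has been modified to read
# the chunk path from the CHUNK_ID environment variable instead of a
# hard-coded path; the released script requires editing the path by hand.
import os
import subprocess

for chunk_id in range(10):
    env = dict(os.environ, CHUNK_ID=str(chunk_id))
    subprocess.run(["python", "extract_feature.py"], env=env, check=True)
```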
See ./DISF to select files via DiSF.
See ./TinyLLama to pre-train the model.
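To give intuition for what the selection step optimizes, here is a toy sketch: greedily add the file whose feature keeps the Frobenius norm of the running covariance smallest, which encourages a more uniform (less collapsed) eigenvalue spectrum. This is an illustration only, with our own function and variable names; the implementation in ./DISF is the reference and differs in details and efficiency.

```python
# Toy illustration of diversified greedy selection (not the actual DiSF
# code): at each step, add the file whose feature keeps the Frobenius
# norm of the running covariance smallest.
import numpy as np

def greedy_diverse_select(feats: np.ndarray, budget: int) -> list[int]:
    n, d = feats.shape
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize
    cov = np.zeros((d, d))
    remaining = set(range(n))
    selected: list[int] = []
    for _ in range(budget):
        # Score each candidate by the Frobenius norm after adding it.
        scores = {i: np.linalg.norm(cov + np.outer(feats[i], feats[i]))
                  for i in remaining}
        best = min(scores, key=scores.get)
        cov += np.outer(feats[best], feats[best])
        remaining.remove(best)
        selected.append(best)
    return selected
```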
To be continued.
- You should first extract features of your selected files (see the data pre-processing part above).
- Before using the following commands to visualize dimensional collapse, define your data path and figure save path in ./Visual&verify/collapse.py (a sketch of the underlying computation follows after this list):
cd ./Visual&verify/
python collapse.py
- Similarly, before using the following commands to calculate the dominance score, define your data path in ./Visual&verify/dominance_score.py:
cd ./Visual&verify/
python dominance_score.py
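For intuition about what these scripts compute, here is a minimal sketch (ours, not the repo code), assuming features are stored as an (N, d) numpy array. Dimensional collapse shows up as a fast-decaying eigenvalue spectrum of the feature covariance; the top-k variance ratio printed at the end is only an illustration of "dominance" and may differ from the exact score defined in the paper and in dominance_score.py:

```python
# Sketch of the underlying computations; collapse.py and
# dominance_score.py in ./Visual&verify are the reference.
import numpy as np
import matplotlib.pyplot as plt

feats = np.load("path/to/features.npy")            # (N, d) text features
feats = feats - feats.mean(axis=0, keepdims=True)  # center
# Eigenvalues of the feature covariance, in descending order.
eigvals = np.linalg.eigvalsh(feats.T @ feats / len(feats))[::-1]

plt.plot(np.log(eigvals + 1e-12))
plt.xlabel("eigenvalue index")
plt.ylabel("log eigenvalue")
plt.savefig("collapse_spectrum.png")

# Illustrative "dominance": fraction of variance in the top-k eigenvalues.
k = 10
print(f"top-{k} variance ratio: {eigvals[:k].sum() / eigvals.sum():.3f}")
```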
If you find this work relevant to your research or applications, please feel free to cite it!
@article{fan2025combatting,
title={Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection},
author={Fan, Ziqing and Du, Siyuan and Hu, Shengchao and Wang, Pingjie and Shen, Li and Zhang, Ya and Tao, Dacheng and Wang, Yanfeng},
journal={arXiv preprint arXiv:2504.20644},
year={2025}
}