This is the repository for the paper "Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining".
We will release the model checkpoints, datasets, and code within the next few weeks.
- [21 October, 2024]: We release the labeled SlimPajama datasets.
- [14 October, 2024]: We release our 1.3B model checkpoints and BERT Topic Classifier.
TODOs:
- Model Checkpoints
- BERT Topic Model Checkpoint
- Labeled SlimPajama-670B datasets
- Code for baselines and methods - will be released after acceptance
- Summarize data-efficient pretraining methods ......