Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.
UCSC-REAL Lab, University of California, Santa Cruz
- [2025.01.22] 👏👏 Accepted by ICLR 2025.
- [2024.11.10] 📢📢 Release the curated dataset.
- [2024.10.08] 🚀🚀 Release the code of DS2.
This project is motivated by the observation that errors in LLM-generated raw rating scores are widespread and vary significantly across different LLMs. To address this, we introduce DS2, a diversity-aware score curation method for data selection.
- Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
- Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
- Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters, which tend to be more distinct.
- Final Data Selection: We prioritize data by sorting first on the curated scores and then on the long-tail diversity scores (see the sketch after this list). This dual sorting strategy removes poor-quality outliers while keeping the selected subset diverse and high quality.
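The snippet below is a minimal, self-contained sketch of the last two steps (long-tail diversity scoring and dual sorting). The embedding model, field names, and k-NN parameters are illustrative assumptions, not the exact implementation in this repository.

```python
# Minimal sketch of diversity scoring + dual-sorting selection.
# Embedding model, field names, and k are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

def select_subset(samples, curated_scores, budget, k=10):
    """samples: list of dicts with a 'text' field (assumed format).
    curated_scores: curated quality score per sample (higher = better)."""
    # 1) Embed every sample and measure how far it sits from its neighbors.
    #    A large mean k-NN distance means the sample lies in a sparse,
    #    "long-tail" region of the embedding space.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode([s["text"] for s in samples], normalize_embeddings=True)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
    dist, _ = nn.kneighbors(emb)
    diversity = dist[:, 1:].mean(axis=1)  # skip the self-distance in column 0

    # 2) Dual sorting: primary key = curated quality score,
    #    secondary key = long-tail diversity score.
    order = np.lexsort((-diversity, -np.asarray(curated_scores)))
    return [samples[i] for i in order[:budget]]
```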
You can download the evaluation and training data by running:
# eval data
bash model_finetune/scripts/prepare_eval_data.sh
# train data
bash model_finetune/scripts/prepare_train_data.sh
To run training, evaluation, or inference for finetuned models, you need to install the required packages by running the following command (after installing pytorch):
pip install -r requirements.txt
In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. In particular, you can use the GPT API to generate the rating scores by executing the code located in the LLM_scoring path:
cd LLM_scoring && bash labeling_datasets_api.sh
For open-source models such as LLaMA and Mistral, you can submit the jobs to the cluster via launcher, e.g.,
cd LLM_scoring && launcher run job_labeling.yaml
or generate scores locally using
cd LLM_scoring && bash scoring_datasets_local.sh
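For reference, here is a minimal sketch of what prompt-based rating with the GPT API looks like. The prompt wording, score scale, and output parsing are illustrative assumptions rather than the exact prompt shipped in LLM_scoring.

```python
# Illustrative sketch of prompt-based quality rating with the OpenAI API.
# The prompt, score scale, and parsing are assumptions, not the exact prompt
# used in LLM_scoring. Requires OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def rate_sample(instruction: str, response: str) -> int | None:
    prompt = (
        "Rate the quality of the following instruction-response pair "
        "on an integer scale from 0 (worst) to 5 (best). "
        "Reply with the score only.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else None  # None if no digit in the reply
```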
The score curation codebase builds on Docta and is located in the ./score_curation path. You can run the score curation via
cd score_curation && bash diagnose_tulu.sh
The corresponding curation report files can be found in the path ./score_curation/results.
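To give an intuition for what score curation does, the sketch below flags and corrects raw scores that disagree with the consensus of nearby samples in embedding space. This is only a conceptual illustration of the idea behind score curation; it is not the Docta pipeline used in ./score_curation, and the threshold and k are assumptions.

```python
# Conceptual illustration of score curation: compare each sample's raw LLM
# score with the consensus of its nearest neighbors in embedding space and
# treat strong disagreements as likely rating errors. NOT the Docta pipeline.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def curate_scores(embeddings, raw_scores, k=10):
    """embeddings: (n, d) array; raw_scores: integer quality score per sample."""
    raw_scores = np.asarray(raw_scores)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)

    curated = raw_scores.copy()
    for i, neighbors in enumerate(idx[:, 1:]):  # drop self-match at column 0
        consensus, votes = Counter(raw_scores[neighbors]).most_common(1)[0]
        # Replace the raw score only when the neighborhood clearly disagrees.
        if consensus != raw_scores[i] and votes >= 0.7 * k:
            curated[i] = consensus
    return curated
```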
Given the existing score curation reports, you can directly use the jupyter notebook data_gen_baselines_all.ipynb to perform data selection for all baselines. The generated subsets can then be used for LLM instruction tuning. Other selected datasets used in the ablation studies can also be generated from the jupyter notebooks data_gen_score_curation.ipynb and data_gen_data_scale.ipynb, located in the ./score_curation path. In particular, we use data_gen_score_curation.ipynb to generate subsets after curating the machine-generated raw scores.
We implement nine baselines: Random, Perplexity, KNN, LESS, Completion_length, Full data, Alpagasus (label-filtered), DEITA (diversity-filtered), and Ours w/o curation, in addition to Ours.
Given the selected subsets in the path model_finetune/selected_data/, you can use the codebase from TULU to finetune base models (Mistral or LLaMA) and then run the evaluation.
In particular, you can submit the jobs via launcher under the path model_finetune/. For example, you can submit a job by running:
cd model_finetune/ && launcher run job_pipeline_all.yaml
Alternatively, you can also execute the pipeline locally, e.g.,
cd model_finetune/ && bash run_pipeline_all.sh
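For orientation, the sketch below shows one way to instruction-tune a base model on a selected subset with Hugging Face transformers. It is not the TULU training code; the base model, data file name, chat template, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of instruction tuning on a selected subset with transformers.
# NOT the TULU training code; model name, data path, template, and
# hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Meta-Llama-3.1-8B"             # assumed base model
data_path = "model_finetune/selected_data/ours.jsonl"   # hypothetical file name

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def format_and_tokenize(example):
    # Assumes each record has "instruction" and "output" fields.
    text = f"<|user|>\n{example['instruction']}\n<|assistant|>\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=2048)

dataset = load_dataset("json", data_files=data_path, split="train")
dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=2,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           learning_rate=2e-5, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```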
You can print the final results by running:
python model_finetune/read_results.py
The final results of LLM judging against the human-annotated LIMA dataset can be found in lima_compare_plot.ipynb. For the tabular results, see the reading_results.ipynb jupyter notebook.
If you use this repository, please cite our work:
@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}