Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.
UCSC-REAL Lab, University of California, Santa Cruz
- [2025.01.22] 👏👏 Accepted by ICLR 2025.
- [2024.11.10] 📢📢 Release the curated dataset.
- [2024.10.08] 🚀🚀 Release the code of DS2.
This project is motivated by the observation that errors in LLM-generated raw rating scores are widespread and vary significantly across different LLMs. To address this, we introduce DS2, a diversity-aware score curation method for data selection.
- Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
- Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
- Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters, which tend to be more distinct.
- Final Data Selection: We prioritize data by sorting first on the curated scores and then on the long-tail diversity scores (see the sketch after this list). This dual sorting strategy removes poor-quality outliers while keeping the selected subset diverse and high quality.
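The snippet below is a minimal, self-contained sketch of the last two steps (long-tail diversity scoring and dual sorting). The embedding model, field names, and k-NN parameters are illustrative assumptions, not the exact implementation in this repository.

```python
# Minimal sketch of diversity scoring + dual-sorting selection.
# Embedding model, field names, and k are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

def select_subset(samples, curated_scores, budget, k=10):
    """samples: list of dicts with a 'text' field (assumed format).
    curated_scores: curated quality score per sample (higher = better)."""
    # 1) Embed every sample and measure how far it sits from its neighbors.
    #    A large mean k-NN distance means the sample lies in a sparse,
    #    "long-tail" region of the embedding space.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode([s["text"] for s in samples], normalize_embeddings=True)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
    dist, _ = nn.kneighbors(emb)
    diversity = dist[:, 1:].mean(axis=1)  # skip the self-distance in column 0

    # 2) Dual sorting: primary key = curated quality score,
    #    secondary key = long-tail diversity score.
    order = np.lexsort((-diversity, -np.asarray(curated_scores)))
    return [samples[i] for i in order[:budget]]
```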
You can download the evaluation and training data by running:
# eval data
bash model_finetune/scripts/prepare_eval_data.sh
# train data
bash model_finetune/scripts/prepare_train_data.sh
To run training, evaluation, or inference for finetuned models, you need to install the required packages by running the following command (after installing pytorch):
pip install -r requirements.txt
In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. In particular, you can use the GPT API to generate the rating scores by executing the code located in the LLM_scoring path:
cd LLM_scoring && bash labeling_datasets_api.sh
For open-source models such as LLaMA and Mistral, you can submit the jobs to the cluster via launcher, e.g.,
cd LLM_scoring && launcher run job_labeling.yaml
or generate scores locally using
cd LLM_scoring && bash scoring_datasets_local.sh
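For reference, here is a minimal sketch of what prompt-based rating with the GPT API looks like. The prompt wording, score scale, and output parsing are illustrative assumptions rather than the exact prompt shipped in LLM_scoring.

```python
# Illustrative sketch of prompt-based quality rating with the OpenAI API.
# The prompt, score scale, and parsing are assumptions, not the exact prompt
# used in LLM_scoring. Requires OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def rate_sample(instruction: str, response: str) -> int | None:
    prompt = (
        "Rate the quality of the following instruction-response pair "
        "on an integer scale from 0 (worst) to 5 (best). "
        "Reply with the score only.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else None  # None if no digit in the reply
```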
The score curation codebase builds on Docta and is located in the ./score_curation path. You can run the score curation via
cd score_curation && bash diagnose_tulu.sh
The corresponding curation report files can be found in the path ./score_curation/results.
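To give an intuition for what score curation does, the sketch below flags and corrects raw scores that disagree with the consensus of nearby samples in embedding space. This is only a conceptual illustration of the idea behind score curation; it is not the Docta pipeline used in ./score_curation, and the threshold and k are assumptions.

```python
# Conceptual illustration of score curation: compare each sample's raw LLM
# score with the consensus of its nearest neighbors in embedding space and
# treat strong disagreements as likely rating errors. NOT the Docta pipeline.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def curate_scores(embeddings, raw_scores, k=10):
    """embeddings: (n, d) array; raw_scores: integer quality score per sample."""
    raw_scores = np.asarray(raw_scores)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)

    curated = raw_scores.copy()
    for i, neighbors in enumerate(idx[:, 1:]):  # drop self-match at column 0
        consensus, votes = Counter(raw_scores[neighbors]).most_common(1)[0]
        # Replace the raw score only when the neighborhood clearly disagrees.
        if consensus != raw_scores[i] and votes >= 0.7 * k:
            curated[i] = consensus
    return curated
```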
Given the existing score curation reports, you can directly use the jupyter notebook data_gen_baselines_all.ipynb to perform data selection for all baselines. The generated subsets can then be used for LLM instruction tuning. Other selected datasets used in the ablation studies can also be generated from the jupyter notebooks data_gen_score_curation.ipynb and data_gen_data_scale.ipynb, located in the ./score_curation path. In particular, we use data_gen_score_curation.ipynb to generate subsets after curating the machine-generated raw scores.
We implement nine baselines: Random, Perplexity, KNN, LESS, Completion_length, Full data, Alpagasus (label-filtered), DEITA (diversity-filtered), and Ours w/o curation, in addition to Ours.
Given the selected subsets in the path model_finetune/selected_data/, you can use the codebase from TULU to finetune base models (Mistral or LLaMA) and then run the evaluation.
In particular, you can submit the jobs via launcher under the path model_finetune/. For example, you can submit a job by running:
cd model_finetune/ && launcher run job_pipeline_all.yaml
Alternatively, you can also execute the pipeline locally, e.g.,
cd model_finetune/ && bash run_pipeline_all.sh
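For orientation, the sketch below shows one way to instruction-tune a base model on a selected subset with Hugging Face transformers. It is not the TULU training code; the base model, data file name, chat template, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of instruction tuning on a selected subset with transformers.
# NOT the TULU training code; model name, data path, template, and
# hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Meta-Llama-3.1-8B"             # assumed base model
data_path = "model_finetune/selected_data/ours.jsonl"   # hypothetical file name

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def format_and_tokenize(example):
    # Assumes each record has "instruction" and "output" fields.
    text = f"<|user|>\n{example['instruction']}\n<|assistant|>\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=2048)

dataset = load_dataset("json", data_files=data_path, split="train")
dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=2,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           learning_rate=2e-5, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```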
You can print the final results by running:
python model_finetune/read_results.py
The final results of LLM judging against the human-annotated LIMA dataset can be found in lima_compare_plot.ipynb. For the tabular results, see the reading_results.ipynb jupyter notebook.
If you use this repository, please cite our work:
@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}