Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

This repository contains the official implementation of the paper "Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification", presented as a Spotlight poster at ICLR 2025. The code builds upon the Hugging Face Transformers text classification script to implement our weighted-loss methods (IMP-Loss and DIMP-Loss) for text classification tasks using synthetic data generated by GPT-3.5.

Abstract

Synthetic data augmentation via Large Language Models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data are scarce. However, the generated data can deviate from real-world data, and this misalignment can degrade performance when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our methods on multiple text classification tasks; the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data-weighting approaches, providing a potential solution for effectively leveraging synthetic data from any suitable data generator.
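
To make the core idea concrete, the following is a minimal, illustrative sketch (not the exact IMP-Loss or DIMP-Loss formulas) of a per-example weighted cross-entropy, where the weights come from an auxiliary checker model's predicted probability of the gold label. The exact weight definitions follow the paper, run.py contains the actual implementation, and the tensors below are dummy stand-ins.

import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, weights):
    # logits: (B, C), labels: (B,), weights: (B,) per-example weights
    per_example = F.cross_entropy(logits, labels, reduction="none")  # (B,)
    return (weights * per_example).mean()

# Dummy batch standing in for the classifier's and the checker's outputs.
batch, num_classes = 4, 2
model_logits = torch.randn(batch, num_classes, requires_grad=True)
checker_logits = torch.randn(batch, num_classes)
labels = torch.randint(0, num_classes, (batch,))

with torch.no_grad():
    checker_probs = checker_logits.softmax(dim=-1)                     # (B, C)
    weights = checker_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # (B,)

loss = weighted_cross_entropy(model_logits, labels, weights)
loss.backward()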

Installation

If you prefer manual installation, ensure the following packages are installed:

  • torch==2.4.1
  • transformers==4.46.2
  • accelerate==1.2.1
  • wandb
  • datasets
  • evaluate
  • scikit-learn

conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install transformers==4.46.2
pip install accelerate==1.2.1
pip install wandb
pip install datasets
pip install evaluate
pip install scikit-learn
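
Once installed, a quick way to confirm that the main dependencies resolve (the versions printed should match the pins above; the CUDA check is optional):

import torch, transformers, accelerate, datasets, evaluate, sklearn, wandb

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())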

Fine-tuning with DIMP-Loss for MRPC

To train the BERT model with DIMP-Loss, you can either use your own trained quality checker model or the one we provide. If using your own model, update the quality_checker_model parameter with the corresponding W&B artifact name. Ensure the quality checker model is uploaded as an artifact to W&B.
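
The upload itself can be done with the standard W&B artifact API. A minimal sketch, in which the project name, artifact name, and checkpoint path are placeholders rather than values used by this repository:

import wandb

run = wandb.init(project="DIMP-Loss", job_type="upload-quality-checker")
artifact = wandb.Artifact("my-quality-checker_glue_mrpc_bert", type="model")
artifact.add_dir("path/to/quality_checker_checkpoint")  # Hugging Face checkpoint directory
run.log_artifact(artifact)
run.finish()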

Alternatively, to use our provided quality checker model, simply run:

python run.py configs/config_DIMP.json

This command will automatically download the quality checker model from W&B and train the BERT model with DIMP-Loss using the configuration specified in configs/config_DIMP.json. During training, a W&B link will be generated, enabling you to monitor the training process and results in real time. You can either log in to your W&B account or use anonymous mode to access the run. For example, a typical run might look like this: Example Run.

Fine-tuning with IMP-Loss for MRPC

To use your own trained models as the quality checker and diversity checker, update the quality_checker_model and diversity_checker_model parameters, respectively. Ensure both models are uploaded as artifacts to W&B.

Alternatively, to use our provided quality checker and diversity checker models, run the following command:

python run.py configs/config_IMP.json

This command will automatically download the provided quality checker and diversity checker models from W&B and train the BERT model with IMP-Loss using the configuration specified in configs/config_IMP.json. During training, a W&B link will be generated, enabling you to monitor the training process and results in real time. You can either log in to your W&B account or use anonymous mode to access the run. For example, a typical run might look like this: Example Run.
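
If you want to inspect one of the checker models yourself, a sketch along the following lines should work, assuming the artifact stores a standard Hugging Face checkpoint directory (the artifact name is the example given in the parameter list below):

import wandb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

run = wandb.init(project="DIMP-Loss", job_type="inspect-checker")
artifact = run.use_artifact("hsunyu/DIMP-Loss/quality-checker_glue_mrpc_bert:v0")
model_dir = artifact.download()

model = AutoModelForSequenceClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
run.finish()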

Fine-tuning with CE-Loss (baseline) for MRPC

python run.py configs/config_baseline.json

Important Parameters

  • model_name_or_path:
    Specifies the pretrained model path or identifier from Hugging Face's model hub (e.g., bert-base-uncased, vinai/bertweet-base, hsunyu/epfl_ml_project2/twitter_full_bertweet_large:v1). If using a W&B model, set use_wandb_model to True and specify the model name in the wandb_model key.

  • problem_type:
    Defines the task type. Examples include:

    • "single_label_classification": For text classification with cross-entropy loss (CE-Loss).
    • "single_label_classification_dimp": For the DIMP-Loss approach.
    • "single_label_classification_imp": For the IMP-Loss approach.
  • wandb_dataset:
    Specifies the W&B dataset artifact name used for training and evaluation. Example:

    • hsunyu/DIMP-Loss/quality-checker_glue_mrpc_bert:v0: LLM-generated data for the MRPC benchmark.
  • use_wandb_model:
    Boolean indicating whether to load a pretrained model from a W&B artifact. Useful for reproducibility. Must be set to True in this repository.

  • quality_checker_model:
    Refers to the W&B artifact for the quality checker model used in IMP-Loss and DIMP-Loss training. Example: hsunyu/DIMP-Loss/quality-checker_glue_mrpc_bert:v0.

  • diversity_checker_model:
    Refers to the W&B artifact for the diversity checker model used in IMP-Loss training. Example: hsunyu/DIMP-Loss/diversity-checker_IMP_glue_mrpc_bert_5:v1.

  • per_device_train_batch_size:
    Defines the batch size per device during training. Default: 128.

  • num_train_epochs:
    Specifies the total number of training epochs. Default: 3.0.
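
For reference, a hypothetical configuration combining the parameters above could be assembled as follows. The key names mirror the descriptions in this section, the wandb_dataset value is a placeholder, and configs/config_DIMP.json, configs/config_IMP.json, and configs/config_baseline.json remain the authoritative examples.

import json

# Hypothetical config sketch; check configs/config_DIMP.json for the
# authoritative schema and values.
config = {
    "model_name_or_path": "bert-base-uncased",
    "problem_type": "single_label_classification_dimp",  # DIMP-Loss
    "wandb_dataset": "hsunyu/DIMP-Loss/<your-dataset-artifact>:v0",  # placeholder
    "use_wandb_model": True,
    "quality_checker_model": "hsunyu/DIMP-Loss/quality-checker_glue_mrpc_bert:v0",
    "per_device_train_batch_size": 128,
    "num_train_epochs": 3.0,
}

with open("configs/my_config_dimp.json", "w") as f:
    json.dump(config, f, indent=2)

# Then train with: python run.py configs/my_config_dimp.json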

Citation

If you find our work or code useful in your research, please cite it using the following BibTeX entry:

@inproceedings{
kuo2025not,
title={Not All {LLM}-Generated Data Are Equal: Rethinking Data Weighting in Text Classification},
author={Hsun-Yu Kuo and Yin-Hsiang Liao and Yu-Chieh Chao and Wei-Yun Ma and Pu-Jen Cheng},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=oI5tZaWkF9}
}

Acknowledgements

Apart from the individuals and organisations acknowledged in the paper, we would also like to extend our sincere gratitude to agbld (Chia-Yu Yeh) for their invaluable contributions, particularly for providing essential computational resources and support for this release.
