This repository contains the official code for the ACL 2025 paper: "From Lists to Emojis: How Format Bias Affects Model Alignment"
This project reveals that preference models in RLHF are strongly biased toward specific text formats like lists and emojis. We show LLMs can exploit this bias to inflate benchmark scores. Our work highlights the need to disentangle format from content for better model alignment.
Check our paper for more information.
Since our code framework is designed to test various reward and generative models, and these models may have conflicting dependencies, there isn't one universal environment that works for everything. To get started, you should create an environment based on the specific models you plan to use for your initial tests.
We identify seven distinct patterns within responses: length, emoji, bold, exclamation, list, link, and affirmative. For each pattern, we compare the proportion of preferred responses that exhibit it against the proportion of unpreferred responses that do. When a significant difference in these proportions is observed, we flag the pattern as a potential bias.
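As a rough intuition for what this measures, the check boils down to counting how often each format appears on the preferred versus the unpreferred side. The sketch below is illustrative only: the regexes, the field names (`chosen`/`rejected`), and the file name are assumptions, not the exact rules used in `eval_dataset.py`.

```python
import json
import re

# Illustrative detectors; eval_dataset.py may use different rules for each pattern.
PATTERNS = {
    "bold": lambda t: "**" in t,
    "list": lambda t: bool(re.search(r"^\s*(?:[-*]|\d+\.)\s", t, re.MULTILINE)),
    "emoji": lambda t: bool(re.search(r"[\U0001F300-\U0001FAFF]", t)),
    "exclamation": lambda t: "!" in t,
    "link": lambda t: bool(re.search(r"https?://", t)),
}

def pattern_rates(responses):
    """Fraction of responses containing each pattern."""
    n = max(len(responses), 1)
    return {name: sum(check(r) for r in responses) / n for name, check in PATTERNS.items()}

# Hypothetical preference file with "chosen"/"rejected" fields per example.
with open("preference_data.json") as f:
    data = json.load(f)

chosen_rates = pattern_rates([ex["chosen"] for ex in data])
rejected_rates = pattern_rates([ex["rejected"] for ex in data])

for name in PATTERNS:
    gap = chosen_rates[name] - rejected_rates[name]
    print(f"{name:12s} chosen={chosen_rates[name]:.3f} rejected={rejected_rates[name]:.3f} gap={gap:+.3f}")
```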
To evaluate format bias in the preference dataset, run the following command:
```bash
python src/eval_script/eval_dataset.py --dataset_path <your_dataset_path>
```

We provide three methods to evaluate format bias in different types of reward models. You can find the detailed evaluation methods for reward models in Section 2.2 of our paper.
- Numerical Reward Models: These models return a numerical score. Examples include models trained with the Bradley-Terry (BT) loss.
- Pairwise Preference Models: These models output a discrete preference indicating which of two responses to the same prompt is better.
- Implicit Reward Models: These models are inferred from the log-likelihoods of the original Supervised Fine-Tuning (SFT) model and the model aligned with Direct Preference Optimization (DPO); see the scoring sketch after this list. An example is:
  - Zephyr-Beta-Mistral-7B (This model was aligned with DPO on the base Mistral-SFT-Beta-7B model).
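For the implicit reward model, a response's score can be recovered (up to the β scale) as the log-likelihood ratio between the DPO-aligned model and its SFT reference, i.e. r(x, y) = β [log π_DPO(y | x) − log π_SFT(y | x)]. Below is a minimal scoring sketch; the Hub identifiers, the β value, and the plain prompt-plus-response formatting (no chat template) are simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub identifiers for the DPO model and its SFT reference.
dpo_name = "HuggingFaceH4/zephyr-7b-beta"
ref_name = "HuggingFaceH4/mistral-7b-sft-beta"

tokenizer = AutoTokenizer.from_pretrained(dpo_name)
dpo_model = AutoModelForCausalLM.from_pretrained(dpo_name, torch_dtype=torch.bfloat16, device_map="auto")
ref_model = AutoModelForCausalLM.from_pretrained(ref_name, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of token log-probs of `response` given `prompt` (assumes a clean token boundary)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]        # position t predicts token t+1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1 :].sum().item() # keep only the response tokens

def implicit_reward(prompt: str, response: str, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_DPO - log pi_SFT); the beta value is an assumption."""
    return beta * (
        response_logprob(dpo_model, prompt, response) - response_logprob(ref_model, prompt, response)
    )
```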
To evaluate a model, use the following commands. Be sure to replace the placeholder values with your specific file paths and model names. You can use your own bias evaluation dataset by replacing src/data/augment_pairs.json.
```bash
# Numerical Reward Model
python src/eval_script/eval_rm.py --dataset_name_or_path src/data/augment_pairs.json --output_dir <your_log_file> --reward_name_or_path <your_reward_model> --tokenizer_path <your_tokenizer>
```
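Internally, a BT-style numerical reward model is typically a sequence-classification model that emits one scalar per (prompt, response) pair. The following sketch shows how such a model could score a plain answer against a list/bold rewrite; it assumes a single-logit classification head and a chat template, which may not match every reward model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: any BT-style reward model with a single-logit classification head.
rm_name = "<your_reward_model>"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def score(prompt: str, response: str) -> float:
    """Return the scalar reward for one (prompt, response) pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    return model(input_ids).logits[0, 0].item()   # assumes num_labels == 1

# Format bias shows up when a list/bold rewrite outscores a plain answer with the same content.
prompt = "How do I boil an egg?"
plain = score(prompt, "Put the egg in boiling water for about eight minutes.")
listy = score(prompt, "**Steps:**\n1. Boil water.\n2. Add the egg.\n3. Wait eight minutes.")
print(f"plain={plain:.3f}  list/bold={listy:.3f}")
```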
```bash
# Pairwise Preference Model
python src/eval_script/eval_pm.py --dataset_name_or_path src/data/augment_pairs.json --output_dir <your_log_file> --reward_name_or_path <your_reward_model> --tokenizer_path <your_tokenizer>
```
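A pairwise preference model instead reads the prompt together with both candidate responses and emits a discrete choice. The exact input format is model-specific; the sketch below assumes a causal LM queried for the next-token logits of "A" versus "B", so adapt the template and token handling to your model's card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: a pairwise preference model; the input template below is an assumption.
pm_name = "<your_pairwise_preference_model>"
tokenizer = AutoTokenizer.from_pretrained(pm_name)
model = AutoModelForCausalLM.from_pretrained(pm_name, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def prefers_a(prompt: str, response_a: str, response_b: str) -> bool:
    """True if the model puts more next-token mass on 'A' than on 'B'."""
    query = (
        f"Instruction: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Answer A or B: "
    )
    input_ids = tokenizer(query, return_tensors="pt").input_ids.to(model.device)
    next_logits = model(input_ids).logits[0, -1]   # logits for the next token
    a_id = tokenizer.convert_tokens_to_ids("A")    # token handling is tokenizer-dependent
    b_id = tokenizer.convert_tokens_to_ids("B")
    return (next_logits[a_id] > next_logits[b_id]).item()
```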
```bash
# Implicit Reward Model
python src/eval_script/eval_dpo.py --dataset_name_or_path src/data/augment_pairs.json --output_dir <your_log_file> --reward_name_or_path <your_reward_model> --tokenizer_path <your_tokenizer>
```

Next, we conduct controllable experiments to investigate how biases transfer from preference data to the reward model, and further to the downstream RLHF-aligned model. For simplicity, we focus on the bold pattern and the list pattern.
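Conceptually, the controllable setup injects a fixed number of pairs whose only systematic difference is the target format. The sketch below shows one way such pairs could be constructed for the bold pattern; the field names and the rewrite rule are illustrative, not the exact augmentation used in the paper.

```python
import random

def add_bold(text: str, k: int = 3) -> str:
    """Wrap a few random words in ** markers so the response carries the bold pattern."""
    words = text.split()
    for i in random.sample(range(len(words)), min(k, len(words))):
        words[i] = f"**{words[i]}**"
    return " ".join(words)

def inject_bold_pairs(base_pairs, n_injected=100, seed=0):
    """Append n_injected pairs whose chosen side is a bolded rewrite of the rejected side."""
    random.seed(seed)
    injected = []
    for pair in random.sample(base_pairs, n_injected):
        injected.append({
            "prompt": pair["prompt"],
            "chosen": add_bold(pair["rejected"]),   # same content, bold format on the preferred side
            "rejected": pair["rejected"],
        })
    return base_pairs + injected
```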
We borrow the training script from RLHFlow to train the reward models. The reward models and their training datasets are as follows:
| Huggingface Reward Model Name | Training Dataset |
|---|---|
| 1231czx/llama3_it_unbiased_ver3 | Base |
| 1231czx/llama3_it_ultra_original_100 | Base + 100 Bold |
| 1231czx/llama3_it_ultra_original_250 | Base + 250 Bold |
| 1231czx/llama3_it_ultra_original_500 | Base + 500 Bold |
| 1231czx/llama3_it_list_attack100_v3 | Base + 100 List |
| 1231czx/llama3_it_ultra_list250 | Base + 250 List |
| 1231czx/llama3_it_ultra_list500 | Base + 500 List |
| 1231czx/llama3_it_listattack1000 | Base + 1000 List |
| 1231czx/llama3_it_listattack1000_bold500 | Base + 500 Bold + 1000 List |
Additionally, we aligned the Llama-3-8B-it model with various reward models using both online and offline DPO algorithms. The resulting aligned chat models are as follows:
You can find the detailed experiment results in Section 3 of our paper.
We implement a simple debiasing method with a reordering trick to address the sparsity of certain patterns. You can find the detailed debiasing method and experimental results in Section 4 of our paper.
To train an unbiased reward model from a biased preference dataset, run the following command:

```bash
cd src/train_script && ./debias.sh
```

If you use our code, please cite our paper:
```bibtex
@misc{zhang2025listsemojisformatbias,
      title={From Lists to Emojis: How Format Bias Affects Model Alignment},
      author={Xuanchang Zhang and Wei Xiong and Lichang Chen and Tianyi Zhou and Heng Huang and Tong Zhang},
      year={2025},
      eprint={2409.11704},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11704},
}
```