Official implementation of Calibrating LLMs with Information-Theoretic Evidential Deep Learning (ICLR 2025)
- Install "setuptools":
pip install setuptools
. - Git clone this repository.
- Navigate to the root level of the repository (where
setup.cfg
is located) and runpip install -e .
(Note: Don't forget the dot.
at the end of the command). - (Optional) Run
huggingface-cli login
to log in to HuggingFace-Hub, and usewandb login
to log in to WandB. - (Optional) Go through the Docs of mmengine.Config
to know how to use the
Config
.
Assume you want to fine-tune Llama3-8B on the OBQA dataset using IB-EDL. Run the following command:
```bash
python tools/evidential_ft.py configs/obqa_llama3_8b/ib_obqa_llama3_8b.yaml \
    -w workdirs/ib_edl/obqa/ \
    -n ib_obqa_llama3_8b
```
To run the training with a different IB regularization strength, add `-o vib.beta=NEW-VALUE` to the training command.
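For example, the full command with an overridden `vib.beta` could look like this (the value `0.01` is purely illustrative):

```bash
# Same fine-tuning command as above, with the IB regularization strength
# overridden on the command line (0.01 is an illustrative value).
python tools/evidential_ft.py configs/obqa_llama3_8b/ib_obqa_llama3_8b.yaml \
    -w workdirs/ib_edl/obqa/ \
    -n ib_obqa_llama3_8b \
    -o vib.beta=0.01
```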
Since the configuration file contains the following entry:

```yaml
process_preds:
  npz_file: "obqa.npz"
```

the training program will save the predictions to a file named `obqa.npz`.
After completing the fine-tuning using the command above, you should have:
- A LoRA checkpoint of Llama3-8B trained on OBQA.
- The `obqa.npz` file, which contains predictions on the OBQA dataset (assumed to be the in-distribution (ID) dataset for OOD detection).
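As a quick sanity check, you can inspect the saved predictions with `numpy`. The sketch below makes no assumption about the array names inside the file and simply lists them:

```python
import numpy as np

# Load the predictions saved during fine-tuning and list the stored
# arrays (the key names depend on the IB-EDL implementation).
preds = np.load("workdirs/ib_edl/obqa/obqa.npz")
print(preds.files)
for key in preds.files:
    print(key, preds[key].shape)
```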
OOD detection can be done as follows:
Step 1: Obtain predictions on the OOD dataset. Assume that CSQA is the OOD dataset. You can generate predictions on it using the checkpoint trained on OBQA:
```bash
python tools/evidential_ft.py configs/ood_llama3_8b/ib_obqa_csqa_llama3_8b.yaml \
    -w workdirs/ib_edl/csqa/ \
    -s \
    -o model.peft_path=workdirs/ib_edl/obqa/checkpoint-XXX
```
This command evaluates the model on CSQA and stores the predictions in a file named `csqa.npz`.
Step 2: Run the OOD detection script:

```bash
python tools/ood.py workdirs/ib_edl/obqa/obqa.npz workdirs/ib_edl/csqa/csqa.npz
```
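Conceptually, OOD detection here amounts to comparing per-sample uncertainty scores between the ID and OOD prediction files. Below is a minimal sketch of one standard way to score this with AUROC; the `"uncertainty"` key is a hypothetical placeholder (check `preds.files` for the actual array names), and `tools/ood.py` may compute additional metrics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# "uncertainty" is a hypothetical key -- inspect the .npz files for the
# actual array names used by the implementation.
id_scores = np.load("workdirs/ib_edl/obqa/obqa.npz")["uncertainty"]
ood_scores = np.load("workdirs/ib_edl/csqa/csqa.npz")["uncertainty"]

# Label ID samples 0 and OOD samples 1; if higher uncertainty signals
# OOD inputs, AUROC measures how well the scores separate the datasets.
labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
scores = np.concatenate([id_scores, ood_scores])
print(f"OOD detection AUROC: {roc_auc_score(labels, scores):.4f}")
```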
In the Appendix of the paper, we introduced a post-hoc calibration technique to further enhance the performance of IB-EDL. To use this technique, follow these steps:
Step 1: Visualize the calibration curve. Open a Jupyter notebook, load the predictions using `numpy`, and use `ib_edl.plot_calibration_curve_and_ece` to visualize the calibration curve.
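A minimal notebook cell might look like the sketch below; note that the exact signature of `ib_edl.plot_calibration_curve_and_ece` is an assumption here, so check its docstring before use:

```python
import numpy as np
import ib_edl

# Load the predictions saved during fine-tuning and inspect the arrays.
preds = np.load("workdirs/ib_edl/obqa/obqa.npz")
print(preds.files)

# Plot the calibration (reliability) curve and report the ECE.
# NOTE: passing the loaded .npz object directly is an assumption about
# the function's signature -- consult its docstring.
ib_edl.plot_calibration_curve_and_ece(preds)
```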
Step 2: Set the sigma multiplier value. Based on the calibration curve, choose an appropriate value for the sigma multiplier in the post-hoc calibration technique. Then re-run inference on the validation set with the chosen sigma multiplier:

```bash
python tools/evidential_ft.py path/to/config.yaml \
    -s \
    -o model.peft_path=path/to/model/checkpoint-XXX vib.sigma_mult=NEW-VALUE
```
Alternatively, you can modify the configuration file directly by updating the following entry:

```yaml
vib:
  sigma_mult: NEW-VALUE
```
It is recommended to repeat this process on the validation set to determine the best value for `vib.sigma_mult`.
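For example, a simple sweep over candidate multipliers could look like this (the candidate values are placeholders; compare the resulting calibration curves or ECE values on the validation set):

```bash
# Illustrative sweep over candidate sigma multipliers; the values below
# are placeholders, and the config/checkpoint paths must be filled in.
for SM in 0.5 1.0 1.5 2.0; do
    python tools/evidential_ft.py path/to/config.yaml \
        -s \
        -o model.peft_path=path/to/model/checkpoint-XXX vib.sigma_mult=$SM
done
```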
Currently, IB-EDL is implemented only for multiple-choice QA tasks, which follow a classification setting. To extend it for open-ended generation tasks, some adaptations to the implementation are required. Contributions are welcome—feel free to submit a pull request!
```bibtex
@inproceedings{li2025calibrating,
  title={Calibrating {LLM}s with Information-Theoretic Evidential Deep Learning},
  author={Yawei Li and David R{\"u}gamer and Bernd Bischl and Mina Rezaei},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=YcML3rJl0N}
}
```