Beyond Labels: Aligning Large Language Models with Human-like Reasoning

Official implementation of the paper [ICPR 2024]


DFAR: Dataset for Aligning Reasons

This work introduces the Dataset for Aligning Reasons (DFAR), a modified version of the ETHICS dataset by Hendrycks et al. [1]. DFAR consists of ethical statements, their corresponding labels, and reasons explaining the ethical judgments.

The DFAR dataset is available here

Train-test split: The DFAR dataset is divided into a training set and a test set with a 90/10 split, yielding 4,500 training data points and 500 test data points.

NOTE: The label takes one of two distinct values, 0 and 1, where 0 represents ethical and 1 represents unethical.
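For reference, the 90/10 split can be reproduced with the `datasets` library listed under Installation below. This is a minimal sketch, assuming the DFAR file is a CSV named `dfar.csv` with `statement`, `label`, and `reason` columns; adjust the path and column names to match the released file.

```python
from datasets import load_dataset

# Hypothetical file name and schema; replace with the actual DFAR CSV.
dataset = load_dataset("csv", data_files="dfar.csv")["train"]

# 90% train / 10% test, as described above.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))  # expected: ~4500 / ~500
```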

DFAR Dataset Statistics and Demographic Profile of Dataset Annotators

| Dataset Statistics | | Annotator's Details | |
| --- | --- | --- | --- |
| Types of Domains | Commonsense, Justice | Total no. of annotators | 12 |
| Min. Text Length | 151 | No. of female annotators | 6 |
| Max. Text Length | 1171 | No. of male annotators | 6 |
| Avg. Text Length | 467.45 | Avg. age | 23 |
| Ethical Instances | 2886 (57.7%) | Annotators with prior AI knowledge | 5 |
| Unethical Instances | 2114 (42.3%) | Profession | Student, Engineer, Housewife |
| Total Instances | 5000 | Education Background | High School, Undergraduate |

Methodology

Figure: (a) Fine-tuning using labels only and (b) fine-tuning using both labels and reasons on the DFAR dataset. The first approach trains the model on the ethical-unethical labels without the accompanying reasons: the LLM $L$ produces $\hat{y_i}$ from the input $x_i$ after it passes through the embedding layer, and the LLM's weights are updated based on the loss. In our approach, the LLM $L$ generates both $\hat{y_i}$ and $\hat{r_i}$ from the input $x_i$, and is fine-tuned on the loss $\mathcal{L}$ between the embeddings of $\hat{y_i}$, $\hat{r_i}$ and the ground-truth $y_i$, $r_i$ from the dataset.

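As a rough illustration of the two setups in the figure, the sketch below serializes a DFAR example into training text for (a) label-only and (b) label-plus-reason fine-tuning. The prompt template and field names are assumptions for illustration, not the exact format used in the paper.

```python
# Maps the 0/1 labels of DFAR to words the model can generate.
LABEL_NAMES = {0: "ethical", 1: "unethical"}

def format_label_only(example):
    # (a) Fine-tuning using labels only: the target contains just y_i.
    return (
        f"Statement: {example['statement']}\n"
        f"Label: {LABEL_NAMES[example['label']]}"
    )

def format_label_and_reason(example):
    # (b) Fine-tuning using labels & reasons: the target contains y_i and r_i.
    return (
        f"Statement: {example['statement']}\n"
        f"Label: {LABEL_NAMES[example['label']]}\n"
        f"Reason: {example['reason']}"
    )

# Example: add a "text" column that a supervised fine-tuning trainer can consume.
# train_ds = train_ds.map(lambda ex: {"text": format_label_and_reason(ex)})
```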


Results

Comparison of Evaluation Results on DFAR and ETHOS

Note: ↑ (higher is better), ↓ (lower is better). `-` denotes results that are not applicable.

| Method | Models | DFAR MAR (%) ↓ | DFAR Acc. (%) ↑ | ETHOS MAR (%) ↓ | ETHOS Acc. (%) ↑ |
| --- | --- | --- | --- | --- | --- |
| Non-Generative Methods | SVM | - | 69.4 | - | 66.4 |
| | Random Forests | - | 78.6 | - | 65.0 |
| | Gradient Boosting | - | 63.2 | - | 64.3 |
| | Logistic Regression | - | 67.8 | - | 66.9 |
| | BERT | - | 78.6 | - | 79.9 |
| | DistilBERT | - | 78.2 | - | 80.4 |
| Generative Models | Mistral 7B (Pre-trained) | 35.4 | 45.4 | 9.6 | 54.7 |
| | Mistral 7B (Fine-tuned L) | 18.6 | 47.4 | 10.6 | 56.8 |
| | Mistral 7B (Ours L+R) | 12.2 | 82.2 | 5.3 | 59.6 |
| | Llama-2 7B (Pre-trained) | 52.0 | 36.4 | 32.8 | 12.0 |
| | Llama-2 7B (Fine-tuned L) | 38.4 | 62.8 | 33.7 | 54.1 |
| | Llama-2 7B (Ours L+R) | 9.4 | 89.4 | 18.6 | 78.8 |

Notes

  1. The non-generative models were fine-tuned on both the DFAR and ETHOS datasets and evaluated within each dataset.
  2. The generative models were fine-tuned solely on the DFAR dataset and evaluated both in-dataset (DFAR) and cross-dataset (ETHOS). They could not be fine-tuned on ETHOS because that dataset does not include reasons.
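For the generative models, accuracy requires mapping free-form generations back to the 0/1 labels. The sketch below shows one simple way to do that scoring; the keyword matching is an illustrative assumption, not the paper's exact evaluation protocol.

```python
def parse_label(generated_text: str) -> int:
    """Extract a 0/1 label from a model generation (illustrative heuristic)."""
    text = generated_text.lower()
    # Check "unethical" first, because it contains the substring "ethical".
    if "unethical" in text:
        return 1
    if "ethical" in text:
        return 0
    return -1  # no recognizable label in the output

def accuracy(generations, gold_labels):
    preds = [parse_label(g) for g in generations]
    return sum(p == y for p, y in zip(preds, gold_labels)) / len(gold_labels)
```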

Installation

pip install torch -qU
pip install transformers -qU
pip install trl -qU
pip install accelerate -qU
pip install bitsandbytes -qU
pip install peft -qU
pip install datasets -qU
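These packages support QLoRA-style fine-tuning of the generative models. The sketch below loads Mistral 7B in 4-bit, attaches a LoRA adapter, and hands it to trl's `SFTTrainer`. The model ID, hyperparameters, and trainer arguments are placeholder assumptions (and the `SFTTrainer` API differs across trl versions), not the configuration used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint

# 4-bit quantization so the 7B model fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter; rank and scaling are placeholder values.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

# Assumes train_ds has a "text" column (see the formatting sketch above).
trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    peft_config=peft_config,
)
trainer.train()
```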

Citation

If you wish to cite this work, feel free to use this BibTeX reference:

@article{kabir2024beyond,
  title={Beyond Labels: Aligning Large Language Models with Human-like Reasoning},
  author={Kabir, Muhammad Rafsan and Sultan, Rafeed Mohammad and Asif, Ihsanul Haque and Ahad, Jawad Ibn and Rahman, Fuad and Amin, Mohammad Ruhul and Mohammed, Nabeel and Rahman, Shafin},
  journal={arXiv preprint arXiv:2408.11879},
  year={2024}
}

References

Footnotes

  1. Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., Steinhardt, J.: Aligning AI with shared human values. In: International Conference on Learning Representations (2021)
