Official implementation of the ICPR 2024 paper "Beyond Labels: Aligning Large Language Models with Human-like Reasoning".
This work introduces the Dataset for Aligning Reasons (DFAR), a modified version of the ETHICS dataset by Hendrycks et al. DFAR consists of ethical statements, their corresponding labels, and human-written reasons explaining the ethical judgments.
The DFAR dataset is available here
Train-test split: The DFAR dataset is divided into a training set and a test set with a 90%/10% split, so the training set contains 4,500 data points and the test set contains 500.
NOTE: The label takes two distinct values, 0 and 1, where 0 represents ethical and 1 represents unethical.
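As a quick illustration of this encoding, the integer labels can be decoded to class names when loading the data. This is a minimal sketch; the row structure and field names (`text`, `label`) are assumptions for illustration, not the dataset's confirmed column names:

```python
# Map DFAR's integer labels to human-readable class names.
# 0 = ethical, 1 = unethical (per the dataset's label convention).
LABEL_NAMES = {0: "ethical", 1: "unethical"}

def decode_label(label: int) -> str:
    """Return the class name for a DFAR label, rejecting anything else."""
    try:
        return LABEL_NAMES[label]
    except KeyError:
        raise ValueError(f"unexpected label value: {label!r}")

# Hypothetical example rows, not actual DFAR instances.
rows = [
    {"text": "I returned the lost wallet to its owner.", "label": 0},
    {"text": "I lied to get my coworker fired.", "label": 1},
]
names = [decode_label(r["label"]) for r in rows]
print(names)  # -> ['ethical', 'unethical']
```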
| Dataset Statistics | | Annotators' Details | |
|---|---|---|---|
| Types of Domains | Commonsense, Justice | Total no. of annotators | 12 |
| Min. Text Length | 151 | No. of female annotators | 6 |
| Max. Text Length | 1171 | No. of male annotators | 6 |
| Avg. Text Length | 467.45 | Avg. age | 23 |
| Ethical Instances | 2886 (57.7%) | Annotators with prior AI knowledge | 5 |
| Unethical Instances | 2114 (42.3%) | Profession | Student, Engineer, Housewife |
| Total Instances | 5000 | Education Background | High School, Undergraduate |
Figure: (a) Fine-tuning using labels only and (b) fine-tuning using both labels and reasons on the DFAR dataset. The first approach trains the model on the ethical/unethical labels without incorporating the accompanying reasons; the second trains on the labels together with their reasons.
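The two fine-tuning setups differ only in the supervision text attached to each statement. A minimal sketch of how the training strings might be assembled, assuming a simple prompt template (the template and field order are illustrative, not the paper's exact format):

```python
def format_labels_only(statement: str, label: int) -> str:
    """Setup (a): supervise on the ethical/unethical label alone."""
    verdict = "ethical" if label == 0 else "unethical"
    return f"Statement: {statement}\nVerdict: {verdict}"

def format_labels_and_reasons(statement: str, label: int, reason: str) -> str:
    """Setup (b): supervise on the label plus its human-written reason."""
    return format_labels_only(statement, label) + f"\nReason: {reason}"

# Hypothetical example, not an actual DFAR instance.
example = format_labels_and_reasons(
    "I donated my old clothes to a shelter.",
    0,
    "Helping people in need is a generous, harmless act.",
)
print(example)
```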
Note: ↑ (higher is better), ↓ (lower is better). `-` denotes results that are not applicable.
| Method | Models | DFAR MAR (%) ↓ | DFAR Acc. (%) ↑ | ETHOS MAR (%) ↓ | ETHOS Acc. (%) ↑ |
|---|---|---|---|---|---|
| Non-Generative Methods | SVM | - | 69.4 | - | 66.4 |
| | Random Forests | - | 78.6 | - | 65.0 |
| | Gradient Boosting | - | 63.2 | - | 64.3 |
| | Logistic Regression | - | 67.8 | - | 66.9 |
| | BERT | - | 78.6 | - | 79.9 |
| | DistilBERT | - | 78.2 | - | 80.4 |
| Generative Models | Mistral 7B (Pre-trained) | 35.4 | 45.4 | 9.6 | 54.7 |
| | Mistral 7B (Fine-tuned L) | 18.6 | 47.4 | 10.6 | 56.8 |
| | Mistral 7B (Ours L+R) | 12.2 | 82.2 | 5.3 | 59.6 |
| | Llama-2 7B (Pre-trained) | 52.0 | 36.4 | 32.8 | 12.0 |
| | Llama-2 7B (Fine-tuned L) | 38.4 | 62.8 | 33.7 | 54.1 |
| | Llama-2 7B (Ours L+R) | 9.4 | 89.4 | 18.6 | 78.8 |
- The non-generative models were fine-tuned on both the DFAR and ETHOS datasets and evaluated on the same datasets.
- The generative models were fine-tuned solely on the DFAR dataset and evaluated both in-dataset (DFAR) and cross-dataset (ETHOS). They could not be fine-tuned on ETHOS because that dataset lacks accompanying reasons.
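For intuition about the reported numbers: accuracy is the fraction of predictions matching the gold labels, and MAR can be read as a misalignment rate, i.e. the percentage of responses judged not aligned (this reading is an assumption for illustration, not the paper's formal definition). Both reduce to simple counting:

```python
def accuracy(preds, golds):
    """Percentage of predictions matching the gold labels."""
    assert len(preds) == len(golds) and golds, "need equal-length, non-empty lists"
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def misalignment_rate(aligned_flags):
    """Percentage of responses flagged as misaligned (hedged reading of MAR)."""
    return 100.0 * sum(not a for a in aligned_flags) / len(aligned_flags)

# Toy example: 3 of 4 predictions correct.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # -> 75.0
```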
```shell
pip install torch -qU
pip install transformers -qU
pip install trl -qU
pip install accelerate -qU
pip install bitsandbytes -qU
pip install peft -qU
pip install datasets -qU
```
If you wish to cite this work, feel free to use this BibTeX reference:
```bibtex
@article{kabir2024beyond,
  title={Beyond Labels: Aligning Large Language Models with Human-like Reasoning},
  author={Kabir, Muhammad Rafsan and Sultan, Rafeed Mohammad and Asif, Ihsanul Haque and Ahad, Jawad Ibn and Rahman, Fuad and Amin, Mohammad Ruhul and Mohammed, Nabeel and Rahman, Shafin},
  journal={arXiv preprint arXiv:2408.11879},
  year={2024}
}
```