Official implementation of the ICPR 2024 paper "Beyond Labels: Aligning Large Language Models with Human-like Reasoning".
This work introduces the Dataset for Aligning Reasons (DFAR), a modified version of the ETHICS dataset by Hendrycks et al. DFAR consists of ethical statements, their corresponding labels, and human-written reasons explaining the ethical judgments.
The DFAR dataset is available here
Train-test split: The DFAR dataset is divided into a training set and a test set with a 90%/10% split, so the training set contains 4,500 data points and the test set contains 500.
NOTE: The label takes two distinct values, 0 and 1, where 0 represents ethical and 1 represents unethical.
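As a quick illustration of this encoding, the integer labels can be decoded to class names when loading the data. This is a minimal sketch; the row structure and field names (`text`, `label`) are assumptions for illustration, not the dataset's confirmed column names:

```python
# Map DFAR's integer labels to human-readable class names.
# 0 = ethical, 1 = unethical (per the dataset's label convention).
LABEL_NAMES = {0: "ethical", 1: "unethical"}

def decode_label(label: int) -> str:
    """Return the class name for a DFAR label, rejecting anything else."""
    try:
        return LABEL_NAMES[label]
    except KeyError:
        raise ValueError(f"unexpected label value: {label!r}")

# Hypothetical example rows, not actual DFAR instances.
rows = [
    {"text": "I returned the lost wallet to its owner.", "label": 0},
    {"text": "I lied to get my coworker fired.", "label": 1},
]
names = [decode_label(r["label"]) for r in rows]
print(names)  # -> ['ethical', 'unethical']
```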
| Dataset Statistics | | Annotators' Details | |
|---|---|---|---|
| Types of Domains | Commonsense, Justice | Total no. of annotators | 12 |
| Min. Text Length | 151 | No. of female annotators | 6 |
| Max. Text Length | 1171 | No. of male annotators | 6 |
| Avg. Text Length | 467.45 | Avg. age | 23 |
| Ethical Instances | 2886 (57.7%) | Annotators with prior AI knowledge | 5 |
| Unethical Instances | 2114 (42.3%) | Profession | Student, Engineer, Housewife |
| Total Instances | 5000 | Education Background | High School, Undergraduate |
Figure: (a) Fine-tuning using labels only and (b) fine-tuning using both labels and reasons on the DFAR dataset. The first approach trains the model on the ethical/unethical labels without incorporating the accompanying reasons; the second trains on the labels together with their reasons.
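The two fine-tuning setups differ only in the supervision text attached to each statement. A minimal sketch of how the training strings might be assembled, assuming a simple prompt template (the template and field order are illustrative, not the paper's exact format):

```python
def format_labels_only(statement: str, label: int) -> str:
    """Setup (a): supervise on the ethical/unethical label alone."""
    verdict = "ethical" if label == 0 else "unethical"
    return f"Statement: {statement}\nVerdict: {verdict}"

def format_labels_and_reasons(statement: str, label: int, reason: str) -> str:
    """Setup (b): supervise on the label plus its human-written reason."""
    return format_labels_only(statement, label) + f"\nReason: {reason}"

# Hypothetical example, not an actual DFAR instance.
example = format_labels_and_reasons(
    "I donated my old clothes to a shelter.",
    0,
    "Helping people in need is a generous, harmless act.",
)
print(example)
```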
Note: ↑ (higher is better), ↓ (lower is better). `-` denotes results that are not applicable.
| Method | Models | DFAR MAR (%) ↓ | DFAR Acc. (%) ↑ | ETHOS MAR (%) ↓ | ETHOS Acc. (%) ↑ |
|---|---|---|---|---|---|
| Non-Generative Methods | SVM | - | 69.4 | - | 66.4 |
| | Random Forests | - | 78.6 | - | 65.0 |
| | Gradient Boosting | - | 63.2 | - | 64.3 |
| | Logistic Regression | - | 67.8 | - | 66.9 |
| | BERT | - | 78.6 | - | 79.9 |
| | DistilBERT | - | 78.2 | - | 80.4 |
| Generative Models | Mistral 7B (Pre-trained) | 35.4 | 45.4 | 9.6 | 54.7 |
| | Mistral 7B (Fine-tuned L) | 18.6 | 47.4 | 10.6 | 56.8 |
| | Mistral 7B (Ours L+R) | 12.2 | 82.2 | 5.3 | 59.6 |
| | Llama-2 7B (Pre-trained) | 52.0 | 36.4 | 32.8 | 12.0 |
| | Llama-2 7B (Fine-tuned L) | 38.4 | 62.8 | 33.7 | 54.1 |
| | Llama-2 7B (Ours L+R) | 9.4 | 89.4 | 18.6 | 78.8 |
- The non-generative models were fine-tuned on both the DFAR and ETHOS datasets and evaluated on the same datasets.
- The generative models were fine-tuned solely on the DFAR dataset and evaluated both in-dataset (DFAR) and cross-dataset (ETHOS). They could not be fine-tuned on ETHOS because that dataset lacks accompanying reasons.
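For intuition about the reported numbers: accuracy is the fraction of predictions matching the gold labels, and MAR can be read as a misalignment rate, i.e. the percentage of responses judged not aligned (this reading is an assumption for illustration, not the paper's formal definition). Both reduce to simple counting:

```python
def accuracy(preds, golds):
    """Percentage of predictions matching the gold labels."""
    assert len(preds) == len(golds) and golds, "need equal-length, non-empty lists"
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def misalignment_rate(aligned_flags):
    """Percentage of responses flagged as misaligned (hedged reading of MAR)."""
    return 100.0 * sum(not a for a in aligned_flags) / len(aligned_flags)

# Toy example: 3 of 4 predictions correct.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # -> 75.0
```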
```shell
pip install torch -qU
pip install transformers -qU
pip install trl -qU
pip install accelerate -qU
pip install bitsandbytes -qU
pip install peft -qU
pip install datasets -qU
```
If you wish to cite this work, feel free to use this BibTeX reference:
```bibtex
@article{kabir2024beyond,
  title={Beyond Labels: Aligning Large Language Models with Human-like Reasoning},
  author={Kabir, Muhammad Rafsan and Sultan, Rafeed Mohammad and Asif, Ihsanul Haque and Ahad, Jawad Ibn and Rahman, Fuad and Amin, Mohammad Ruhul and Mohammed, Nabeel and Rahman, Shafin},
  journal={arXiv preprint arXiv:2408.11879},
  year={2024}
}
```