This repository provides code and data associated with the paper entitled "Which Method(s) to Pick when Evaluating Large Language Models with Humans? -- A comparison of 6 methods."

This README gives an overview of the datasets created in the course of investigating methods for evaluating LLM-generated texts with humans.

Overview of the Dataset

The dataset consists of five data files:

  1. LLM_generated_texts
  2. Results_AB
  3. Results_DirectQuality
  4. Results_BinaryDecision
  5. Results_BWS

Below, you will find a detailed explanation of each dataset, its structure, and how to interpret the variables.

Text Labels for Variables in Datasets 2-5

| Text Number | Text Description |
| --- | --- |
| 1 | Chat GPT Best |
| 2 | Chat GPT Worst |
| 3 | LLaMA Best |
| 4 | LLaMA Worst |
| 5 | Mistral Best |
| 6 | Mistral Worst |
| 7 | Luminous Best |
| 8 | Luminous Worst |

1. LLM_generated_texts

This dataset contains the answers given by the different LLMs to the versions of our prompts outlined in the paper. It has two columns: "index" and "answer". The "index" column indicates which LLM generated a given text, and the "answer" column stores that LLM's response.
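
To give a sense of the structure, here is a minimal loading sketch in Python; the file name and CSV format are assumptions, only the two column names come from this README:

```python
import pandas as pd

# Assumed file name and format; adjust to the actual file in the repository.
texts = pd.read_csv("LLM_generated_texts.csv")

# "index" identifies the generating LLM, "answer" holds the generated text.
for llm, answer in zip(texts["index"], texts["answer"]):
    print(llm, "->", str(answer)[:80])
```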

2. Results_AB

This dataset contains data for A/B Testing. The variable names are structured as follows:

  • Prefix: Represents the metric measured (e.g., Honesty).
  • Number: Denotes the combination tested (e.g., GPT best and GPT worst text).

Example: Understanding the Metric Honesty

The 28 combinations for Honesty are as follows; a short sketch of how this pairing scheme can be enumerated programmatically follows the table:

| Variable Name | Text Combination |
| --- | --- |
| Honesty_1 | GPT best vs GPT worst |
| Honesty_2 | GPT best vs LLaMA best |
| Honesty_3 | GPT best vs LLaMA worst |
| Honesty_4 | GPT best vs Mistral best |
| Honesty_5 | GPT best vs Mistral worst |
| Honesty_6 | GPT best vs Luminousbase best |
| Honesty_7 | GPT best vs Luminousbase worst |
| Honesty_8 | GPT worst vs LLaMA best |
| Honesty_9 | GPT worst vs LLaMA worst |
| Honesty_10 | GPT worst vs Mistral best |
| Honesty_11 | GPT worst vs Mistral worst |
| Honesty_12 | GPT worst vs Luminousbase best |
| Honesty_13 | GPT worst vs Luminousbase worst |
| Honesty_14 | LLaMA best vs LLaMA worst |
| Honesty_15 | LLaMA best vs Mistral best |
| Honesty_16 | LLaMA best vs Mistral worst |
| Honesty_17 | LLaMA best vs Luminousbase best |
| Honesty_18 | LLaMA best vs Luminousbase worst |
| Honesty_19 | LLaMA worst vs Mistral best |
| Honesty_20 | LLaMA worst vs Mistral worst |
| Honesty_21 | LLaMA worst vs Luminousbase best |
| Honesty_22 | LLaMA worst vs Luminousbase worst |
| Honesty_23 | Mistral best vs Mistral worst |
| Honesty_24 | Mistral best vs Luminousbase best |
| Honesty_25 | Mistral best vs Luminousbase worst |
| Honesty_26 | Mistral worst vs Luminousbase best |
| Honesty_27 | Mistral worst vs Luminousbase worst |
| Honesty_28 | Luminousbase best vs Luminousbase worst |
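
The numbering above simply enumerates all unordered pairs of the eight texts in ascending order of their text numbers. A short illustrative sketch of this enumeration (labels and naming scheme taken from the table; the snippet is not part of the released code):

```python
from itertools import combinations

# Text-number-to-label mapping from this README.
texts = {
    1: "GPT best", 2: "GPT worst",
    3: "LLaMA best", 4: "LLaMA worst",
    5: "Mistral best", 6: "Mistral worst",
    7: "Luminousbase best", 8: "Luminousbase worst",
}

# All 28 unordered pairs, in the same order as the table above.
pairs = {
    f"Honesty_{i}": f"{texts[a]} vs {texts[b]}"
    for i, (a, b) in enumerate(combinations(sorted(texts), 2), start=1)
}

print(len(pairs))          # 28
print(pairs["Honesty_1"])  # GPT best vs GPT worst
```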

3. Results_DirectQuality

This dataset contains the ratings for each text, where every metric is paired with the corresponding text number. Each variable is rated on a scale of 1 to 5, where:

  • 1: Low agreement
  • 5: Strong agreement

Variable Interpretation

| Text Number | Text Description |
| --- | --- |
| 1 | Chat GPT Best |
| 2 | Chat GPT Worst |
| 3 | LLaMA Best |
| 4 | LLaMA Worst |
| 5 | Mistral Best |
| 6 | Mistral Worst |
| 7 | Luminous Best |
| 8 | Luminous Worst |

Note: When calculating averages:

  • Individual Texts: Calculated independently.
  • LLMs: Best and worst texts are combined (e.g., Honesty_1 and Honesty_2 are combined into Honesty_GPT).
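
To illustrate the two averaging schemes, here is a sketch in pandas, shown for Honesty; the file name is hypothetical and the column names are assumed to follow the Metric_TextNumber pattern described above:

```python
import pandas as pd

# Assumed file name; adjust to the actual file in the repository.
df = pd.read_csv("Results_DirectQuality.csv")

# Per-text averages: each rating column is averaged on its own.
per_text_mean = df[[f"Honesty_{i}" for i in range(1, 9)]].mean()

# Per-LLM averages: the best and worst text of one model are pooled,
# e.g., Honesty_1 (GPT best) and Honesty_2 (GPT worst) -> Honesty_GPT.
llm_texts = {"GPT": (1, 2), "LLaMA": (3, 4), "Mistral": (5, 6), "Luminous": (7, 8)}
per_llm_mean = {
    f"Honesty_{name}": df[[f"Honesty_{a}", f"Honesty_{b}"]].stack().mean()
    for name, (a, b) in llm_texts.items()
}

print(per_text_mean)
print(per_llm_mean)
```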

4. Results_BinaryDecision

This dataset records binary decisions (e.g., Yes/No) for each text. The metrics and text descriptions match the structure in Results_DirectQuality.

Note on Calculations

  • Individual Texts: Calculated independently.
  • LLMs: Best and worst texts are combined (e.g., Honesty_1 and Honesty_2 are combined into Honesty_GPT).
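
Analogously, the binary decisions can be summarised as the share of positive answers per text or per LLM. A brief sketch, assuming the answers are coded numerically; both the file name and the coding (1 = Yes, 0 = No) are assumptions:

```python
import pandas as pd

# Assumed file name and coding (1 = Yes, 0 = No).
df = pd.read_csv("Results_BinaryDecision.csv")

# Share of "Yes" decisions per individual text ...
yes_per_text = df[[f"Honesty_{i}" for i in range(1, 9)]].mean()

# ... and per LLM, pooling the best and worst text of each model.
yes_gpt = df[["Honesty_1", "Honesty_2"]].stack().mean()  # -> Honesty_GPT

print(yes_per_text)
print(yes_gpt)
```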

5. Results_BWS

This dataset contains two BIBD (Balanced Incomplete Block Design) configurations for the metrics Honesty and Comprehensibility:

  • BIBD1: Honesty
  • BIBD2: Comprehensibility

Generated BIBD Tables

BIBD1 (Honesty):

| Row | [1] | [2] | [3] | [4] |
| --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 4 | 6 |
| 2 | 1 | 3 | 5 | 7 |
| 3 | 1 | 4 | 6 | 7 |
| 4 | 4 | 5 | 7 | 8 |
| 5 | 1 | 5 | 6 | 8 |
| 6 | 2 | 5 | 6 | 8 |
| 7 | 3 | 4 | 7 | 8 |
| 8 | 3 | 4 | 5 | 8 |

BIBD2 (Comprehensibility):

| Row | [1] | [2] | [3] | [4] |
| --- | --- | --- | --- | --- |
| 1 | 1 | 5 | 6 | 7 |
| 2 | 1 | 2 | 3 | 6 |
| 3 | 4 | 5 | 6 | 8 |
| 4 | 3 | 5 | 7 | 8 |
| 5 | 1 | 5 | 6 | 8 |
| 6 | 1 | 2 | 4 | 8 |
| 7 | 2 | 3 | 5 | 7 |
| 8 | 2 | 4 | 7 | 8 |

Text-to-Number Mapping

| Text Number | Text Description |
| --- | --- |
| 1 | Chat GPT Best |
| 2 | Chat GPT Worst |
| 3 | LLaMA Best |
| 4 | LLaMA Worst |
| 5 | Mistral Best |
| 6 | Mistral Worst |
| 7 | Luminous Best |
| 8 | Luminous Worst |

Example Interpretation of BIBD

  • A combination like H1 refers to Combination 1 of Honesty.
  • Prefix:
    • B: Selected as the best text.
    • W: Selected as the worst text.

Example: BH1 → Best text for Honesty in combination 1.

Calculation Notes

  • Individual Texts: Best and worst texts are calculated independently.
  • LLMs: Best and worst texts are combined (a minimal scoring sketch follows below).
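
A minimal sketch of a standard best-minus-worst count for Honesty, assuming the Results_BWS file stores in BH1..BH8 the number of the text chosen as best and in WH1..WH8 the number of the text chosen as worst for each combination; the file name and this column layout are assumptions, not the authors' analysis script:

```python
import pandas as pd

# Assumed file name and column layout (see the note above).
df = pd.read_csv("Results_BWS.csv")

best_cols = [f"BH{i}" for i in range(1, 9)]
worst_cols = [f"WH{i}" for i in range(1, 9)]

# How often was each of the eight texts picked as best / as worst?
best_counts = df[best_cols].stack().value_counts().reindex(range(1, 9), fill_value=0)
worst_counts = df[worst_cols].stack().value_counts().reindex(range(1, 9), fill_value=0)

# Best-minus-worst score per individual text (texts scored independently).
bws_score = best_counts - worst_counts

# Per-LLM scores: texts 1+2 belong to GPT, 3+4 to LLaMA, 5+6 to Mistral, 7+8 to Luminous.
bws_per_llm = bws_score.groupby(lambda t: (t + 1) // 2).sum().rename(
    {1: "GPT", 2: "LLaMA", 3: "Mistral", 4: "Luminous"}
)

print(bws_score)
print(bws_per_llm)
```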

Visual Representations

The visualizations of the BIBD configurations are provided in the attached images:

  • BIBD1.png: Honesty
  • BIBD2.png: Comprehensibility

Results for Study Reproduction

The file Tables_Results_for_Study_Reproduction_Methods_LLM.pdf contains detailed results that we deem useful for identifying differences and similarities when this study is reproduced.
