Amulet-DA-and-Maxim

AMULET - (A)nalyze (MUL)ti-(T)urn conversations

This is the official code and associated datasets for the paper titled

AMULET: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries. Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren.

For any questions, please contact Sahana Ramnath [sramnath@usc.edu].

Prompts

We provide all GPT-4o prompts in all_prompts.py.

io_prompt_final: input - conversation with preference responses, output - preferred response 1 or 2
wexpl_prompt_final: input - conversation with preference responses, output - preferred response 1 or 2 + NL explanation
dialogact_prompt_final: input - conversation with preference responses, output - conversation annotated with dialog acts + preferred response 1 or 2 + NL explanation
maxim_prompt_final: input - conversation with preference responses, output - maxim satisfaction + preferred response 1 or 2 + NL explanation
dialogact_prompt_final_claude (for Claude): input - conversation with preference responses, output - conversation annotated with dialog acts in a JSON format + preferred response 1 or 2 + NL explanation

Loading the dataset

We work with four evaluation datasets: anthropic-train/test, wildfeedback and nectar.

load_all_data.py contains the code to load all datasets.
all_valid_data - folder which contains all the cleaned evaluation sets.

Note: wildfeedback-train's data can be found here due to Github's upload limit.

Running the judges

api_[io,wexpl,da,maxim].py have the GPT-4o and Claude generation codes. You will have to fill in OpenAI / Claude keys.
lm_any.py has the code to run a local model such as Qwen-2.5-32B-it: python lm_any.py --which-dataset anthropic-test --save-path fill-here --which-model Qwen/Qwen2.5-32B-Instruct --which-mode maxim

The folder final_outputs has all the DA and maxim annotations we obtained with GPT-4o, Claude and Qwen-2.5-32B-it.

Note: wildfeedback-train's outputs can be found here due to Github's upload limit.

Running the reward models

The reward-models folder has the codes to run INF-ORM, QRM, and Skywork-v0.2.

The folder final_outputs have all the RM scores we obtained.

Calculating accuracy of judge and jury

accuracy.py gives the accuracy for every judge, RM and jury we have experimented on

All analyses

Codes and graphs for all analyses in Section 3 in the paper is in the folder analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
all_valid_data		all_valid_data
analysis		analysis
final_outputs		final_outputs
reward-models		reward-models
README.md		README.md
accuracy.py		accuracy.py
all_prompts.py		all_prompts.py
api_da.py		api_da.py
api_io.py		api_io.py
api_maxim.py		api_maxim.py
api_wexpl.py		api_wexpl.py
lm_any.py		lm_any.py
load_all_data.py		load_all_data.py
loading_nectar.py		loading_nectar.py
loading_wildfeedback.py		loading_wildfeedback.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Amulet-DA-and-Maxim

AMULET - (A)nalyze (MUL)ti-(T)urn conversations

Prompts

Loading the dataset

Running the judges

Running the reward models

Calculating accuracy of judge and jury

All analyses

About

Uh oh!

Releases

Packages

Languages

INK-USC/Amulet-DA-and-Maxim

Folders and files

Latest commit

History

Repository files navigation

Amulet-DA-and-Maxim

AMULET - (A)nalyze (MUL)ti-(T)urn conversations

Prompts

Loading the dataset

Running the judges

Running the reward models

Calculating accuracy of judge and jury

All analyses

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages