Feedback Forensics

Feedback Forensics is an open-source toolkit to measure AI personality changes. Beyond raw capabilities, model personality traits, such as tone and sycophancy, also matter to users. Feedback Forensics can help you track (1) personality changes encouraged by your human (or AI) feedback datasets (tutorial), and (2) personality traits exhibited by your AI models (tutorial). Feedback Forensics includes a Python API, an annotation CLI, and a Gradio visualisation app. We also provide a corresponding online platform tracking personality traits in popular models and datasets.

Use-case 1: Finding personality changes encouraged by feedback data
What personality traits is Chatbot Arena encouraging?

Use-case 2: Measuring personality changes across models
What personality traits changed between Llama 3 and Llama 4?

Docs 📖

See https://docs.feedbackforensics.com/

Online usage 🌍

See our online platform to track personality traits in popular models and datasets. No local installation required.

Local usage 🌓

To track personality traits in your own datasets and models, install Feedback Forensics locally.

Installation

pip install feedback-forensics

Getting started

To start the app locally, run the following command in your terminal:

feedback-forensics -d data/output/example/annotated_pairs.json

This starts the Gradio interface on localhost port 7860, i.e. http://localhost:7860.

Note

The online results are currently not available when running locally.

Next steps

See the getting started guides in the docs to analyse your own feedback datasets and models.

Python interface

Feedback Forensics can also be used to analyse annotator data directly within Python. Below is a minimal example:

import feedback_forensics as ff

# load dataset from AnnotatedPairs json file produced by ICAI package
dataset = ff.DatasetHandler()
dataset.add_data_from_path("data/output/example/annotated_pairs.json")

# compute dataset-level metrics
overall_metrics = dataset.get_overall_metrics()

# compute metrics for each individual annotator
annotator_metrics = dataset.get_annotator_metrics()
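
To inspect the results, you can print the returned metrics objects directly:

# print the computed metrics to inspect their structure
print(overall_metrics)
print(annotator_metrics)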

How does it work?

(Figure: overview of the Feedback Forensics method, from pairwise response data to personality metrics.)

Input. As shown in the figure above, we take pairwise model response data as input, where each datapoint consists of a prompt (yellow) and two corresponding model responses (white).
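
For intuition, a single pairwise datapoint can be pictured as follows (a hypothetical sketch for illustration; the field names do not necessarily match the actual AnnotatedPairs schema):

# hypothetical sketch of one pairwise datapoint; field names are illustrative,
# not the actual AnnotatedPairs schema
datapoint = {
    "prompt": "How do I reverse a list in Python?",
    "response_a": "Use reversed(my_list) or my_list[::-1].",
    "response_b": "Lists cannot be reversed in Python.",
}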

Step 1: Annotate Data. In the first step, we add annotations to each datapoint, selecting response A, response B, both, or neither. To understand the personality traits encouraged by human preferences, we include a (1) human annotation (green) selecting the human-preferred response. Such annotations can be imported from external sources (e.g. Chatbot Arena) alongside the pairwise model response data. To understand the personality traits exhibited by a target model (e.g. a Claude model), we add a (2) target model annotation (red), using hard-coded rules on response metadata to select the response generated by that model (if available). Finally, using AI annotators, we add (3) personality annotations (blue) that select the response exhibiting a given trait more strongly (e.g. the more confident response). We collect one such annotation per datapoint and tested trait.
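
Continuing the hypothetical sketch above, the three annotation types can be pictured as additional fields on each datapoint (again with illustrative field names, not the actual schema):

# hypothetical annotations for the datapoint above; each annotation selects
# response A, response B, both, or neither
annotations = {
    "human": "a",                   # (1) human-preferred response
    "target_model": "a",            # (2) response generated by the target model
    "personality/confidence": "a",  # (3) AI annotator: more confident response
}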

Step 2: Compute Metrics. In the second step, we compare these annotations to compute personality metrics. To understand how much a specific personality trait is encouraged by human feedback (Result A), we compare the human annotations to the personality annotations for that trait. High agreement (measured via a strength metric) indicates that the trait (or a highly correlated one) is encouraged by human feedback; low agreement indicates that the trait is discouraged. Similarly, to observe how much a target model exhibits a certain trait (Result B), we compare the target model annotations to that trait's personality annotations. High agreement indicates that the trait uniquely identifies the model relative to the other models in the dataset, i.e. the model exhibits the trait more than the others; low agreement indicates it exhibits the trait less.
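
As a rough illustration of the comparison in Step 2 (a minimal sketch, not the exact strength metric used by Feedback Forensics), agreement between two annotation columns could be computed as a simple match rate:

# illustrative agreement computation between two annotation columns;
# Feedback Forensics' actual strength metric may differ
def agreement(annotations_1: list[str], annotations_2: list[str]) -> float:
    """Fraction of datapoints where both annotators select the same response."""
    matches = sum(a == b for a, b in zip(annotations_1, annotations_2))
    return matches / len(annotations_1)

human_annotations = ["a", "b", "a", "a"]
confidence_annotations = ["a", "b", "b", "a"]
print(agreement(human_annotations, confidence_annotations))  # 0.75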

Citation ✍︎

If you find Feedback Forensics useful in your research, please consider citing the project:

@software{feedbackforensics,
  author = {Findeis, Arduin and Kaufmann, Timo and H{\"u}llermeier, Eyke and Mullins, Robert},
  title = {Feedback Forensics: An open-source toolkit to measure AI personality changes},
  url = {https://github.com/rdnfn/feedback-forensics},
  year = {2025}
}

License ✌︎

Apache 2.0