Commit e074263 (1 parent: 9190d08)

1 file changed: README.md — 130 additions, 69 deletions

**Before (removed — README.md at parent 9190d08):**

# LLM-Eval – Automatic Evaluation of Dialogues with LLMs

**LLM-Eval** is a two-phase research project that explores how Large Language Models (LLMs) can be used to **automatically evaluate the quality of dialogues** between humans and conversational agents.

The project investigates the use of the **LLM-EVAL framework**, testing its ability to reproduce human-like evaluations across different models and datasets.

---

## 🌐 Project Overview

- Phase 1: Evaluate four LLMs on a benchmark dataset (`ConvAI2`):
  - Claude 3
  - Claude 3.5
  - GPT-4o
  - GPT-4o-mini
- Phase 2: Evaluate how dataset structure affects performance (using Claude 3):
  - FED
  - PC
  - TC
  - DSTC9
- Metrics: Accuracy, Cohen’s Kappa, Pearson, Spearman, Kendall-Tau correlations (see the sketch after this list)
- Evaluation schema follows the paper: *LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations*
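
The agreement metrics listed above map directly onto the libraries named in the Technologies section below (`sklearn` for accuracy and Cohen’s Kappa, `pandas` for the correlation coefficients). As a minimal sketch only — the input file and column names here are hypothetical, and the project’s own scripts under `prog/` remain the reference implementation:

```python
# Minimal sketch: compare LLM-assigned scores with human reference scores.
# The CSV file and column names are hypothetical; adapt them to the real outputs.
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

df = pd.read_csv("scores.csv")  # expected columns: human_score, llm_score (integer ratings)

# Exact-match accuracy and chance-corrected agreement on the discrete ratings
accuracy = accuracy_score(df["human_score"], df["llm_score"])
kappa = cohen_kappa_score(df["human_score"], df["llm_score"])

# Linear and rank correlations between the two score series
pearson = df["human_score"].corr(df["llm_score"], method="pearson")
spearman = df["human_score"].corr(df["llm_score"], method="spearman")
kendall = df["human_score"].corr(df["llm_score"], method="kendall")

print(f"Accuracy={accuracy:.3f}  Kappa={kappa:.3f}")
print(f"Pearson={pearson:.3f}  Spearman={spearman:.3f}  Kendall-Tau={kendall:.3f}")
```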

---

## 🛠️ Technologies & Tools

- **Programming Language**: Python 3
- **API Access**: OpenAI + Anthropic APIs
- **Environment Management**: `venv` + a `.env` file for API keys
- **Libraries**: `json`, `os`, `tqdm`, `anthropic`, `openai`, `sklearn`, `pandas`, `matplotlib`

---

## 📁 Repository Structure

```plaintext
LLM-Eval/
├── docs/            → Project report, presentation, paper
├── prog/
│   ├── dataset1/    → Phase 1: Model-based evaluation (Claude, GPT)
│   │   ├── Claude3/
│   │   ├── Claude3-5/
│   │   ├── GPT-4o/
│   │   ├── GPT-4o-mini/
│   │   └── convai2_data.json
│   ├── dataset2/    → Phase 2: Dataset-based evaluation (FED, PC, TC, DSTC9)
│   │   ├── DSTC9/
│   │   ├── FED/
│   │   ├── PC/
│   │   └── TC/
└── README.md        → Project documentation (this file)
```

---

## 📄 Documentation

- 📘 [LLM-Eval_Report.pdf](docs/LLM-Eval_Report.pdf) – Full project report
- 📰 [LLM-Eval_Paper.pdf](docs/LLM-Eval_Paper.pdf) – Original paper on LLM-Eval
- 📊 [LLM-Eval_Presentation.pptx](docs/LLM-Eval_Presentation.pptx) – Slide deck
- 📝 [LLM-Eval_Guidelines.pdf](docs/LLM-Eval_Guidelines.pdf) – Project guidelines

All of these are located inside `docs/`.

---

## 👥 Contributors

- Giovanni Arcangeli
- [Vittorio Ciancio](https://github.com/VittorioCiancio)
- [Marco Di Maio](https://github.com/Marco210210)

Project presented for the Artificial Intelligence course – University of Salerno (2025).

---

## ✨ Notes

This project demonstrates both the limitations and the potential of automatic dialogue evaluation. It highlights the differences between LLM generations and emphasizes the influence of dataset structure on evaluation performance.

For questions, feel free to contact the authors or open an issue on GitHub.

---

**After (added — README.md at e074263):**

# LLM Eval Analysis 🚀

![GitHub Release](https://img.shields.io/badge/Release-v1.0-blue)

Welcome to the **LLM Eval Analysis** repository! This project focuses on the automatic multi-metric evaluation of human-bot dialogues using large language models (LLMs) such as Claude and GPT-4o. It aims to provide insights into chatbot performance across various datasets and settings, and was developed for the Artificial Intelligence course at the University of Salerno.

## Table of Contents

- [Features](#features)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Metrics](#metrics)
- [Datasets](#datasets)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
- [Releases](#releases)

## Features 🌟

- **Multi-Metric Evaluation**: Evaluate dialogues on several metrics for a comprehensive assessment.
- **Multiple LLM Support**: Use different large language models for analysis, including Claude and GPT-4o.
- **Dataset Compatibility**: Work with multiple datasets to test and validate chatbot performance.
- **User-Friendly Interface**: Designed for ease of use, accessible to both students and researchers.
- **Detailed Reporting**: Generate detailed reports on chatbot performance to guide improvements.

## Getting Started 🛠️

To get started with LLM Eval Analysis, follow these steps:

1. **Clone the Repository**: Use the following command to clone the repository to your local machine:

   ```bash
   git clone https://github.com/Gaganv882/llm-eval-analysis.git
   ```

2. **Install Dependencies**: Navigate to the project directory and install the required packages:

   ```bash
   cd llm-eval-analysis
   pip install -r requirements.txt
   ```

3. **Download the Latest Release**: Visit the [Releases section](https://github.com/Gaganv882/llm-eval-analysis/releases) to download the latest version, then follow the release notes for any additional setup steps.
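
You will also need model credentials before running an evaluation. The previous version of this README mentions OpenAI and Anthropic API access with keys kept in a `.env` file; assuming that still applies, a quick check along these lines (the variable names are the conventional ones for those SDKs, but verify against the project’s own `.env` handling) can catch missing keys early:

```python
# Hedged sketch: verify that the API keys the evaluation is assumed to need are set.
# OPENAI_API_KEY / ANTHROPIC_API_KEY are the usual SDK variable names; the project
# may use different ones, so treat these as placeholders.
import os

for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    if not os.getenv(var):
        raise SystemExit(f"Missing environment variable: {var}")
print("API keys found; ready to run the evaluation.")
```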

## Usage 📊

To evaluate dialogues, you can use the following command:

```bash
python evaluate.py --input your_dialogue_file.json --model gpt-4o
```

Replace `your_dialogue_file.json` with the path to your dialogue data. You can choose between different models by adjusting the `--model` parameter.

### Example

Here’s a simple example of how to structure your input file:

```json
[
  {
    "user": "Hello, how are you?",
    "bot": "I'm fine, thank you! How can I assist you today?"
  },
  {
    "user": "What is the weather like?",
    "bot": "It's sunny and warm today!"
  }
]
```

### Output

The evaluation will generate a report detailing the performance metrics of the chatbot based on the provided dialogues.
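
For orientation, here is a minimal sketch of what a script like `evaluate.py` could look like for the command shown above. The argument names match that command; everything else (the `score_dialogue` helper and the report format) is a placeholder rather than the repository’s actual implementation, which would query the selected LLM for its scores:

```python
# Minimal sketch of an evaluation entry point; placeholder scoring logic only.
import argparse
import json


def score_dialogue(dialogue: dict, model: str) -> dict:
    """Placeholder: the real project would prompt the chosen LLM (e.g. gpt-4o)
    to rate the exchange on each metric and parse the returned scores."""
    return {"model": model, "length": len(dialogue["user"]) + len(dialogue["bot"])}


def main() -> None:
    parser = argparse.ArgumentParser(description="Evaluate human-bot dialogues")
    parser.add_argument("--input", required=True, help="Path to the dialogue JSON file")
    parser.add_argument("--model", default="gpt-4o", help="LLM used as the evaluator")
    args = parser.parse_args()

    with open(args.input, encoding="utf-8") as f:
        dialogues = json.load(f)  # list of {"user": ..., "bot": ...} objects

    report = [score_dialogue(d, args.model) for d in dialogues]
    print(json.dumps(report, indent=2))


if __name__ == "__main__":
    main()
```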

## Metrics 📈

The evaluation includes several key metrics:

- **Response Accuracy**: Measures how accurately the bot responds to user queries.
- **Engagement Score**: Assesses how engaging the conversation is.
- **Sentiment Analysis**: Evaluates the sentiment of both user and bot responses.
- **Turn-Taking Efficiency**: Analyzes how well the conversation flows.

These metrics provide a comprehensive view of chatbot performance, allowing for targeted improvements.
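
As an illustration of how per-dialogue scores for these metrics might be rolled up into the report mentioned above (a sketch only — the metric names are taken from the list above and the numbers are dummy values, not project output):

```python
# Sketch: aggregate per-dialogue metric scores into a summary report.
# The scores below are dummy values; in practice they would come from the evaluator model.
import pandas as pd

scores = pd.DataFrame(
    [
        {"response_accuracy": 0.9, "engagement": 0.7, "sentiment": 0.8, "turn_taking": 0.85},
        {"response_accuracy": 0.6, "engagement": 0.5, "sentiment": 0.7, "turn_taking": 0.60},
    ]
)

# One row per metric with mean, std, min/max across all evaluated dialogues
print(scores.describe().T[["mean", "std", "min", "max"]])
```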

## Datasets 📚

This project supports multiple datasets for evaluation. You can find datasets in the `datasets` folder, and you are free to add your own as needed.

### Example Datasets

- **Conversational Dataset**: A collection of dialogues between users and bots.
- **Customer Support Dataset**: Simulated customer interactions for support scenarios.
- **General Chat Dataset**: A mix of casual conversations to evaluate engagement.

## Contributing 🤝

We welcome contributions to improve the LLM Eval Analysis project. If you would like to contribute, please follow these steps:

1. **Fork the Repository**: Click on the "Fork" button in the top right corner.
2. **Create a New Branch**: Create a new branch for your feature or fix.

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. **Make Changes**: Implement your changes and commit them.

   ```bash
   git commit -m "Add your message here"
   ```

4. **Push Changes**: Push your changes to your forked repository.

   ```bash
   git push origin feature/your-feature-name
   ```

5. **Open a Pull Request**: Go to the original repository and open a pull request.

We appreciate your contributions and feedback!

## License 📜

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.

## Contact 📧

For questions or suggestions, feel free to reach out:

- **Author**: Gagan V
- **Email**: gagan@example.com
- **LinkedIn**: [Gagan's LinkedIn](https://www.linkedin.com/in/gagan)

## Releases 📦

To stay up to date with the latest features and improvements, visit the [Releases section](https://github.com/Gaganv882/llm-eval-analysis/releases), where you can download the latest files and find setup guidance in the release notes.

## Conclusion 🎉

Thank you for exploring the LLM Eval Analysis project. We hope it serves as a valuable tool for evaluating human-bot dialogues. Your feedback and contributions help make the project better. Happy coding!
