-# LLM-Eval – Automatic Evaluation of Dialogues with LLMs
-
-**LLM-Eval** is a two-phase research project that explores how Large Language Models (LLMs) can be used to **automatically evaluate the quality of dialogues** between humans and conversational agents.
-
-The project investigates the use of the **LLM-EVAL framework**, testing its ability to reproduce human-like evaluations across different models and datasets.
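
To make the idea concrete, here is a minimal sketch of the single-prompt, multi-dimensional scoring scheme that the LLM-EVAL paper describes. The dimension names, the 0–5 scale, and the prompt wording below are illustrative assumptions, not necessarily the exact schema used in this project.

```python
import json

# Illustrative dimensions and scale; the project's actual prompt may differ.
DIMENSIONS = ["appropriateness", "content", "grammar", "relevance"]

PROMPT_TEMPLATE = """\
Score the following chatbot response on each dimension from 0 to 5.
Return a JSON object with the keys: {dims}.

Dialogue context:
{context}

Response to evaluate:
{response}
"""


def build_prompt(context: str, response: str) -> str:
    """Fill the single evaluation prompt for one (context, response) pair."""
    return PROMPT_TEMPLATE.format(
        dims=", ".join(DIMENSIONS), context=context, response=response
    )


def parse_scores(llm_output: str) -> dict:
    """Parse the JSON scores the LLM is asked to return."""
    scores = json.loads(llm_output)
    return {dim: float(scores[dim]) for dim in DIMENSIONS}


# Example with a made-up model reply:
print(build_prompt("Hello, how are you?", "I'm fine, thank you!"))
print(parse_scores('{"appropriateness": 5, "content": 4, "grammar": 5, "relevance": 5}'))
```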
-
----
-
-## 🌐 Project Overview
-
-- Phase 1: Evaluate four LLMs on a benchmark dataset (`ConvAI2`):
-  - Claude 3
-  - Claude 3.5
-  - GPT-4o
-  - GPT-4o-mini
-- Phase 2: Evaluate how dataset structure affects performance (using Claude 3):
-  - FED
-  - PC
-  - TC
-  - DSTC9
-- Metrics: Accuracy, Cohen’s Kappa, Pearson, Spearman, Kendall-Tau correlations (computed as in the sketch after this list)
-- Evaluation schema follows the paper: *LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations*
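
As an illustration of how the listed agreement and correlation metrics can be computed between human annotations and LLM scores, here is a minimal sketch using `pandas` and `sklearn`; the two rating arrays are made-up examples.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Made-up ratings: one value per evaluated response (e.g. overall quality 1-5).
human = pd.Series([4, 2, 5, 3, 4, 1, 5, 3])
model = pd.Series([4, 3, 5, 3, 3, 1, 4, 3])

# Agreement metrics treat the ratings as discrete labels.
print("Accuracy:     ", accuracy_score(human, model))
print("Cohen's Kappa:", cohen_kappa_score(human, model))

# Correlation metrics treat the ratings as ordinal values.
print("Pearson:      ", human.corr(model, method="pearson"))
print("Spearman:     ", human.corr(model, method="spearman"))
print("Kendall's Tau:", human.corr(model, method="kendall"))
```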
-
----
-
-## 🛠️ Technologies & Tools
-
-- **Programming Language**: Python 3
-- **API Access**: OpenAI + Anthropic APIs
-- **Environment Management**: `venv` + `.env` for keys (see the sketch after this list)
-- **Libraries**: `json`, `os`, `tqdm`, `anthropic`, `openai`, `sklearn`, `pandas`, `matplotlib`
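
As a rough sketch (not the project's actual setup code) of how the `.env`-managed keys and the two API clients fit together, assuming the keys are exposed as `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` once the `.env` file has been loaded or exported:

```python
import os

import anthropic
import openai

# Assumes the .env file has already been loaded (e.g. via python-dotenv or the
# shell), so the keys are available as environment variables.
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
claude_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Example call to each API; the model names are placeholders, not necessarily
# the exact versions used in the project.
gpt_reply = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Rate this response from 0 to 5: 'Hello!'"}],
)
claude_reply = claude_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    messages=[{"role": "user", "content": "Rate this response from 0 to 5: 'Hello!'"}],
)

print(gpt_reply.choices[0].message.content)
print(claude_reply.content[0].text)
```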
-
----
-
-## 📁 Repository Structure
-
-```plaintext
-LLM-Eval/
-├── docs/              → Project report, presentation, paper
-├── prog/
-│   ├── dataset1/      → Phase 1: Model-based evaluation (Claude, GPT)
-│   │   ├── Claude3/
-│   │   ├── Claude3-5/
-│   │   ├── GPT-4o/
-│   │   ├── GPT-4o-mini/
-│   │   └── convai2_data.json
-│   ├── dataset2/      → Phase 2: Dataset-based evaluation (FED, TC, etc.)
-│   │   ├── DSTC9/
-│   │   ├── FED/
-│   │   ├── PC/
-│   │   └── TC/
-├── README.md          → Project documentation (this file)
+# LLM Eval Analysis 🚀
+
+Welcome to the **LLM Eval Analysis** repository! This project focuses on the automatic multi-metric evaluation of human-bot dialogues using large language models (LLMs) like Claude and GPT-4o. It aims to provide insights into chatbot performance across various datasets and settings. This project is part of the Artificial Intelligence course at the University of Salerno.
+
+## Table of Contents
+
+- [Features](#features)
+- [Getting Started](#getting-started)
+- [Usage](#usage)
+- [Metrics](#metrics)
+- [Datasets](#datasets)
+- [Contributing](#contributing)
+- [License](#license)
+- [Contact](#contact)
+- [Releases](#releases)
+
+## Features 🌟
+
+- **Multi-Metric Evaluation**: Evaluate dialogues based on various metrics to ensure a comprehensive assessment.
+- **Multiple LLM Support**: Utilize different large language models for analysis, including Claude and GPT-4o.
+- **Dataset Compatibility**: Work with multiple datasets to test and validate chatbot performance.
+- **User-Friendly Interface**: Designed for ease of use, making it accessible for both students and researchers.
+- **Detailed Reporting**: Generate detailed reports on chatbot performance to facilitate improvements.
+
+## Getting Started 🛠️
+
+To get started with LLM Eval Analysis, follow these steps:
+
+1. **Clone the Repository**: Use the following command to clone the repository to your local machine:
+
+   ```bash
+   git clone https://github.com/Gaganv882/llm-eval-analysis.git
+   ```
+
+2. **Install Dependencies**: Navigate to the project directory and install the required packages:
+
+   ```bash
+   cd llm-eval-analysis
+   pip install -r requirements.txt
+   ```
+
+3. **Download the Latest Release**: Visit the [Releases section](https://github.com/Gaganv882/llm-eval-analysis/releases) to download the latest version, and follow the release notes for any files that need to be run.
+
+## Usage 📊
+
+To evaluate dialogues, you can use the following command:
+
+```bash
+python evaluate.py --input your_dialogue_file.json --model gpt-4o
+```
+
+Replace `your_dialogue_file.json` with the path to your dialogue data. You can choose between different models by adjusting the `--model` parameter.
+
+### Example
+
+Here’s a simple example of how to structure your input file:
+
+```json
+[
+  {
+    "user": "Hello, how are you?",
+    "bot": "I'm fine, thank you! How can I assist you today?"
+  },
+  {
+    "user": "What is the weather like?",
+    "bot": "It's sunny and warm today!"
+  }
+]
 ```
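
The repository's actual `evaluate.py` is not reproduced in this README, so the following is only a sketch of the command-line interface described above; `score_turn` is a hypothetical placeholder for whatever LLM call the real script performs.

```python
import argparse
import json


def score_turn(turn: dict, model: str) -> float:
    """Hypothetical placeholder: the real script would query the chosen LLM here."""
    return 0.0


def main() -> None:
    parser = argparse.ArgumentParser(description="Evaluate human-bot dialogues with an LLM.")
    parser.add_argument("--input", required=True, help="Path to the dialogue JSON file")
    parser.add_argument("--model", default="gpt-4o", help="Model to use for evaluation")
    args = parser.parse_args()

    # The input file is a list of {"user": ..., "bot": ...} turns, as in the example above.
    with open(args.input, encoding="utf-8") as f:
        turns = json.load(f)

    scores = [score_turn(turn, args.model) for turn in turns]
    mean_score = sum(scores) / len(scores)
    print(f"Evaluated {len(turns)} turns with {args.model}; mean score = {mean_score:.2f}")


if __name__ == "__main__":
    main()
```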

----
+### Output
+
+The evaluation will generate a report detailing the performance metrics of the chatbot based on the provided dialogues.
+
+## Metrics 📈
+
+The evaluation includes several key metrics:
+
+- **Response Accuracy**: Measures how accurately the bot responds to user queries.
+- **Engagement Score**: Assesses how engaging the conversation is.
+- **Sentiment Analysis**: Evaluates the sentiment of both user and bot responses.
+- **Turn-Taking Efficiency**: Analyzes how well the conversation flows.
+
+These metrics provide a comprehensive view of chatbot performance, allowing for targeted improvements.
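
The README does not spell out how each metric is computed, so the snippet below only illustrates, under that caveat, how per-dialogue scores could be aggregated into the kind of report mentioned above using `pandas`; the column names and values are made up.

```python
import pandas as pd

# Hypothetical per-dialogue scores; in practice these would come from the evaluation run.
scores = pd.DataFrame(
    {
        "response_accuracy":      [0.9, 0.7, 0.8],
        "engagement_score":       [0.6, 0.8, 0.7],
        "sentiment":              [0.5, 0.9, 0.6],
        "turn_taking_efficiency": [0.8, 0.7, 0.9],
    },
    index=["dialogue_1", "dialogue_2", "dialogue_3"],
)

# One row per metric: mean, min, and max across the evaluated dialogues.
report = scores.agg(["mean", "min", "max"]).T
print(report)
```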
+
+## Datasets 📚
+
+This project supports multiple datasets for evaluation. You can find datasets in the `datasets` folder. Feel free to add your own datasets as needed.
+
+### Example Datasets
+
+- **Conversational Dataset**: A collection of dialogues between users and bots.
+- **Customer Support Dataset**: Simulated customer interactions for support scenarios.
+- **General Chat Dataset**: A mix of casual conversations to evaluate engagement.
+
+## Contributing 🤝
+
+We welcome contributions to improve the LLM Eval Analysis project. If you would like to contribute, please follow these steps:
+
+1. **Fork the Repository**: Click on the "Fork" button in the top right corner.
+2. **Create a New Branch**: Create a new branch for your feature or fix.
+
+   ```bash
+   git checkout -b feature/your-feature-name
+   ```
+
+3. **Make Changes**: Implement your changes and commit them.
+
+   ```bash
+   git commit -m "Add your message here"
+   ```
+
+4. **Push Changes**: Push your changes to your forked repository.
+
+   ```bash
+   git push origin feature/your-feature-name
+   ```

-## 📄 Documentation
+5. **Open a Pull Request**: Go to the original repository and open a pull request.

-- 📘 [LLM-Eval_Report.pdf](docs/LLM-Eval_Report.pdf) – Full project report
-- 📰 [LLM-Eval_Paper.pdf](docs/LLM-Eval_Paper.pdf) – Original paper on LLM-Eval
-- 📊 [LLM-Eval_Presentation.pptx](docs/LLM-Eval_Presentation.pptx) – Slide deck
-- 📝 [LLM-Eval_Guidelines.pdf](docs/LLM-Eval_Guidelines.pdf) – Project guidelines
+We appreciate your contributions and feedback!

-All located inside `docs/`.
+## License 📜

----
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.

-## 👥 Contributors
+## Contact 📧

-- Giovanni Arcangeli
-- [Vittorio Ciancio](https://github.com/VittorioCiancio)
-- [Marco Di Maio](https://github.com/Marco210210)
+For questions or suggestions, feel free to reach out:

-Project presented for the Artificial Intelligence course – University of Salerno (2025)
+- **Author**: Gagan V
+- **Email**: gagan@example.com
+- **LinkedIn**: [Gagan's LinkedIn](https://www.linkedin.com/in/gagan)

----
+## Releases 📦

-## ✨ Notes
+To stay updated with the latest features and improvements, visit the [Releases section](https://github.com/Gaganv882/llm-eval-analysis/releases), where you can download the latest files and follow the release notes for setup and execution.

-This project demonstrates the limitations and potential of automatic dialogue evaluation. It highlights the differences between LLM generations and emphasizes the influence of dataset structure on evaluation performance.
+## Conclusion 🎉

-For questions, feel free to contact the authors or open an issue on GitHub.
+Thank you for exploring the LLM Eval Analysis project. We hope it serves as a valuable tool for evaluating human-bot dialogues. Your feedback and contributions are essential for making this project even better. Happy coding!