
LLM Evaluation / Preference Measurement and Alignment / Interpretability

Overview

This repository contains a collection of presentation slides I created on evaluating Large Language Models (LLMs), together with links to recent papers on search and LLM application evaluation. The slides outline an evaluation framework, performance metrics, and preference measurement methods.

Features

  • LLM evaluation presentation (including LLM 101)
  • Sentence Embedding Applications: Semantic Reformulation and Topic Modeling (including a sample Jupyter Notebook)
  • LLM model summary
  • Links to LLM evaluation frameworks and papers
  • Case studies and examples of model comparisons

LLM Evaluation and Preference Measurement (including LLM 101)

https://github.com/DrSquare/LLM_Evaluation/blob/main/LLM_Evaluation%20and%20Preference%20Measurement.pdf

Sentence Embedding Applications

https://github.com/DrSquare/LLM_Evaluation/blob/main/Sentence_Embeddings_Share_vF.ipynb
https://github.com/DrSquare/LLM_Evaluation/blob/main/Semantic_Sentence_Embedding.pdf
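
As a minimal illustration of the semantic reformulation idea covered in the notebook, the sketch below embeds a query and candidate reformulations with the sentence-transformers library and ranks them by cosine similarity. The model name and example queries are illustrative placeholders, not taken from the notebook itself.

```python
# Sentence-embedding sketch (assumes the sentence-transformers package;
# the model name and example queries are illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

query = "cheap flights to tokyo"
candidates = [
    "affordable airfare to Tokyo",
    "Tokyo travel guide",
    "low-cost plane tickets to Tokyo, Japan",
]

# Encode the query and candidate reformulations into dense vectors.
q_emb = model.encode(query, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)

# Rank candidates by cosine similarity to the original query.
scores = util.cos_sim(q_emb, c_embs)[0].tolist()
for cand, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```

The same encode-then-compare pattern underlies topic modeling with embeddings: cluster the sentence vectors instead of ranking them against a single query.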

LLM model summary

https://github.com/DrSquare/LLM_Evaluation/blob/main/LLM_Models.pdf

Case studies and examples of LLM model comparisons

  1. Chatbot Arena (LMSYS): Comparative Eval and Leaderboard (see the rating sketch after this list) https://lmarena.ai/ https://lmarena.ai/?leaderboard https://arxiv.org/abs/2403.04132 https://arxiv.org/abs/2504.20879
  2. Multi-agent evaluation: simulated environment with task completion https://github.com/TheAgentCompany/TheAgentCompany https://the-agent-company.com/#/leaderboard
  3. Perplexity: Evaluating Online LLMs and Toloka Perplexity Case Study https://www.perplexity.ai/hub/blog/introducing-pplx-online-llms https://toloka.ai/blog/perplexity-case/
  4. LLM Online A/B Testing Framework https://www.microsoft.com/en-us/research/articles/how-to-evaluate-llms-a-complete-metric-framework/
  5. RAG Evaluation https://www.linkedin.com/pulse/evaluating-rag-systems-comprehensive-approach-assessing-kakkar-esm9c/
  6. Evaluating Large Language Models: https://www.linkedin.com/posts/minha-hwang-7440771_ai-llm-machinelearning-activity-7297997796480032769-zAva?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  7. Machine Learning Meets Cognitive Science https://www.linkedin.com/posts/minha-hwang-7440771_tom-griffiths-on-using-machine-learning-and-activity-7296558609356701697-j2P9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  8. NOLIMA: Long-Context Evaluation Beyond Literal Matching https://www.linkedin.com/posts/minha-hwang-7440771_nolima-long-context-evaluation-beyond-activity-7295798550905425920-4Joh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  9. LLM Evaluation - Measurement Scale Options https://www.linkedin.com/posts/minha-hwang-7440771_llm-evaluation-measurement-scale-options-activity-7295417284582330368-PXS-?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  10. The Resurgence of Survey Research in the AI Era https://www.linkedin.com/posts/minha-hwang-7440771_ai-humandata-alignment-activity-7292901672811405312-cjZf?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  11. LLM as a judge: Search Engine Ranking Relevance (see the judging sketch after this list) https://www.linkedin.com/posts/minha-hwang-7440771_large-language-models-can-accurately-predict-activity-7287267842343813121-57is?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  12. Comparing Traditional and LLM-based Search for Consumer Choice https://www.linkedin.com/posts/minha-hwang-7440771_comparing-traditional-and-llm-based-search-activity-7287260733388529665-HmrK?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  13. Interpretable User Satisfaction Estimation for Conversational Systems with LLMs https://www.linkedin.com/posts/minha-hwang-7440771_240312388-activity-7281121918601121792-7rRh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
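
To make item 1 concrete, here is a minimal sketch of Elo-style rating updates from pairwise "battle" outcomes, the basic mechanism behind comparative leaderboards. The K-factor, starting rating, and battle log are illustrative assumptions; Chatbot Arena itself fits a Bradley-Terry model over all battles rather than running sequential Elo updates.

```python
# Elo-style rating from pairwise LLM battles (illustrative sketch only;
# the K-factor, starting rating, and battle log are assumptions).
from collections import defaultdict

K = 32  # update step size (assumed, not LMSYS's value)
ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000

def expected(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(model_a, model_b, score_a):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, outcome for A).
battles = [("gpt-x", "llama-y", 1.0), ("llama-y", "claude-z", 0.5),
           ("claude-z", "gpt-x", 0.0)]
for a, b, s in battles:
    update(a, b, s)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.1f}")
```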
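Item 11's LLM-as-a-judge setup can be sketched as a prompt template plus a parser. The prompt wording, the 0-3 relevance scale, and the `call_llm` stub below are hypothetical placeholders; the actual paper's prompt and scale are not reproduced here.

```python
# LLM-as-a-judge sketch for search relevance labeling (hypothetical
# prompt and scale; call_llm is a stub for whichever LLM API you use).
import re

JUDGE_PROMPT = """You are a search relevance rater.
Query: {query}
Document: {document}
Rate the document's relevance to the query on a 0-3 scale
(0 = irrelevant, 3 = perfectly relevant). Answer with a single digit."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call for your provider.
    raise NotImplementedError("wire up your LLM client here")

def judge_relevance(query: str, document: str) -> int:
    """Ask the judge model for a graded relevance label and parse it."""
    reply = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    match = re.search(r"[0-3]", reply)  # tolerate extra words around the digit
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

# Usage (once call_llm is implemented):
# label = judge_relevance("best hiking boots", "Review of 2024 trail boots...")
```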

Usage

  1. Download the slides or the sample Jupyter Notebook from the repository.
  2. Open the files with Adobe Acrobat, PowerPoint, Google Slides, Visual Studio Code, or any compatible viewer.
  3. Use the slides for presentations, research, or internal analysis.
  4. Modify or extend the slides and the sample Jupyter Notebook as needed for your specific use case.

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch (feature-name or bugfix-name).
  3. Add or update slide decks or the Jupyter notebook.
  4. Commit your changes and push the branch.
  5. Open a Pull Request with a detailed description.

License

This project is licensed under the MIT License. See LICENSE for more details.

Contact

For inquiries or support, open an issue or reach out to minha.hwang@gmail.com.
