This repository contains a collection of presentation slides I created for evaluating Large Language Models (LLMs), along with links to recent papers on search and LLM application evaluation. The slides cover evaluation frameworks, model performance metrics, and preference measurement methods.
- LLM evaluation presentation (including LLM 101)
- Sentence Embedding Applications: Semantic Reformulation and Topic Modeling (including sample Jupyter Notebook)
- LLM model summary
- Links to LLM evaluation frameworks and LLM evaluation papers
- Case studies and examples of model comparisons
- Sentence Embeddings sample notebook (illustrated in the sketch below): https://github.com/DrSquare/LLM_Evaluation/blob/main/Sentence_Embeddings_Share_vF.ipynb
- Semantic Sentence Embedding slides: https://github.com/DrSquare/LLM_Evaluation/blob/main/Semantic_Sentence_Embedding.pdf
- LLM model summary slides: https://github.com/DrSquare/LLM_Evaluation/blob/main/LLM_Models.pdf
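The sample notebook demonstrates semantic reformulation and topic modeling with sentence embeddings. As a minimal sketch of the core idea, assuming the sentence-transformers package (the model name and example queries below are illustrative, not necessarily what the notebook uses):

```python
# Minimal sketch of semantic similarity with sentence embeddings.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two reformulations of the same intent (made-up examples).
queries = ["how to return a purchase", "refund policy for online orders"]
embeddings = model.encode(queries, normalize_embeddings=True)

# Cosine similarity between the two query embeddings.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```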
- Chatbot Arena (LMSYS): Comparative Eval and Leaderboard (see the Elo rating sketch after this list) https://lmarena.ai/ https://lmarena.ai/?leaderboard https://arxiv.org/abs/2403.04132 https://arxiv.org/abs/2504.20879
- Multi-agent evaluation: simulated environment with task completion https://github.com/TheAgentCompany/TheAgentCompany https://the-agent-company.com/#/leaderboard
- Perplexity: Evaluating Online LLMs and Toloka Perplexity Case Study https://www.perplexity.ai/hub/blog/introducing-pplx-online-llms https://toloka.ai/blog/perplexity-case/
- LLM Online A/B Testing Framework (see the A/B test sketch after this list) https://www.microsoft.com/en-us/research/articles/how-to-evaluate-llms-a-complete-metric-framework/
- RAG Evaluation (see the LLM-as-judge sketch after this list) https://www.linkedin.com/pulse/evaluating-rag-systems-comprehensive-approach-assessing-kakkar-esm9c/
- Evaluating Large Language Models: https://www.linkedin.com/posts/minha-hwang-7440771_ai-llm-machinelearning-activity-7297997796480032769-zAva?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- Machine Learning Meets Cognitive Science https://www.linkedin.com/posts/minha-hwang-7440771_tom-griffiths-on-using-machine-learning-and-activity-7296558609356701697-j2P9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- NOLIMA: Long-Context Evaluation Beyond Literal Matching https://www.linkedin.com/posts/minha-hwang-7440771_nolima-long-context-evaluation-beyond-activity-7295798550905425920-4Joh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- LLM Evaluation - Measurement Scale Options https://www.linkedin.com/posts/minha-hwang-7440771_llm-evaluation-measurement-scale-options-activity-7295417284582330368-PXS-?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- The Resurgence of Survey Research in the AI Era https://www.linkedin.com/posts/minha-hwang-7440771_ai-humandata-alignment-activity-7292901672811405312-cjZf?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- LLM as a Judge: Search Engine Ranking Relevance (see the LLM-as-judge sketch after this list) https://www.linkedin.com/posts/minha-hwang-7440771_large-language-models-can-accurately-predict-activity-7287267842343813121-57is?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- Comparing Traditional and LLM-based Search for Consumer Choice https://www.linkedin.com/posts/minha-hwang-7440771_comparing-traditional-and-llm-based-search-activity-7287260733388529665-HmrK?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
- Interpretable User Satisfaction Estimation for Conversational Systems with LLMs https://www.linkedin.com/posts/minha-hwang-7440771_240312388-activity-7281121918601121792-7rRh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
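Chatbot Arena (linked above) ranks models from pairwise human votes; its published leaderboard fits a Bradley-Terry model. The simpler online Elo update below is a sketch of the same idea of turning pairwise preferences into ratings (model names and the K factor are made up, not Arena's settings):

```python
# Illustrative online Elo update over pairwise "battles" (model_a vs model_b).
# Chatbot Arena's leaderboard uses a Bradley-Terry fit; this online Elo
# variant conveys the same idea of ranking models from pairwise preferences.
from collections import defaultdict

K = 32  # update step size (assumed for illustration)

def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, model_a, model_b, winner):
    """winner is 'a' or 'b'; updates are zero-sum across the pair."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if winner == "a" else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
battles = [("model-x", "model-y", "a"), ("model-y", "model-x", "b"),
           ("model-x", "model-z", "a")]  # hypothetical vote log
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(dict(ratings))
```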
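For the online A/B testing of LLM features discussed above, a two-proportion z-test on a binary success metric (e.g., thumbs-up rate) is a common starting point. A minimal sketch with made-up counts:

```python
# Two-proportion z-test for an LLM A/B experiment on a binary metric
# (e.g., thumbs-up rate). The counts below are made up for illustration.
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(successes_a=4_210, n_a=10_000,
                             successes_b=4_420, n_b=10_000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```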
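Several items above (search relevance judging, RAG evaluation, satisfaction estimation) rely on an LLM as a judge. A minimal skeleton, assuming a hypothetical `call_llm(prompt)` helper that stands in for whatever chat-completion client you use; the 0-3 graded-relevance scale mirrors common search labeling schemes:

```python
# Skeleton of an LLM-as-judge relevance grader. `call_llm` is a hypothetical
# helper (pass in your own client wrapper); it should return the model's
# text response for a prompt.
JUDGE_PROMPT = """Rate the relevance of the document to the query on a 0-3 scale:
0 = irrelevant, 1 = marginally relevant, 2 = relevant, 3 = perfectly relevant.
Query: {query}
Document: {document}
Answer with a single digit."""

def judge_relevance(query: str, document: str, call_llm) -> int:
    response = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    digits = [ch for ch in response if ch.isdigit()]
    if not digits:
        raise ValueError(f"unparseable judge output: {response!r}")
    return int(digits[0])
```

In practice you would average several judge calls per (query, document) pair and audit a sample against human labels before trusting the scores.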
- Download the slides or the sample Jupyter Notebook from the repository.
- Open them with a PDF viewer (e.g., Adobe Acrobat), PowerPoint, Google Slides, Visual Studio Code, or any compatible viewer.
- Use the slides for presentations, research, or internal analysis.
- Modify or extend the slides and sample Jupyter Notebook as needed for your specific use case.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (`feature-name` or `bugfix-name`).
- Add or update slide decks and Jupyter notebooks.
- Commit your changes and push the branch.
- Open a Pull Request with a detailed description.
This project is licensed under the MIT License. See the LICENSE file for more details.
For inquiries or support, open an issue or reach out to minha.hwang@gmail.com.