
LLM Evaluation / Preference Measurement and Alignment / Interpretability

Overview

This repository contains a collection of presentation slides I created on evaluating Large Language Models (LLMs), together with links to recent papers on search and LLM application evaluation. The slides outline an evaluation framework, performance metrics, and preference measurement methods.

Features

  • LLM evaluation presentation (including LLM 101)
  • Sentence Embedding Applications: Semantic Reformulation and Topic Modeling (including a sample Jupyter Notebook)
  • LLM model summary
  • Links to LLM evaluation frameworks and papers
  • Case studies and examples of model comparisons

LLM Evaluation and Preference Measurement (including LLM 101)

https://github.com/DrSquare/LLM_Evaluation/blob/main/LLM_Evaluation%20and%20Preference%20Measurement.pdf

Sentence Embedding Applications

https://github.com/DrSquare/LLM_Evaluation/blob/main/Sentence_Embeddings_Share_vF.ipynb
https://github.com/DrSquare/LLM_Evaluation/blob/main/Semantic_Sentence_Embedding.pdf
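
As a minimal illustration of the semantic reformulation idea covered in the notebook, the sketch below embeds a query and candidate reformulations with the sentence-transformers library and ranks them by cosine similarity. The model name and example queries are illustrative placeholders, not taken from the notebook itself.

```python
# Sentence-embedding sketch (assumes the sentence-transformers package;
# the model name and example queries are illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

query = "cheap flights to tokyo"
candidates = [
    "affordable airfare to Tokyo",
    "Tokyo travel guide",
    "low-cost plane tickets to Tokyo, Japan",
]

# Encode the query and candidate reformulations into dense vectors.
q_emb = model.encode(query, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)

# Rank candidates by cosine similarity to the original query.
scores = util.cos_sim(q_emb, c_embs)[0].tolist()
for cand, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```

The same encode-then-compare pattern underlies topic modeling with embeddings: cluster the sentence vectors instead of ranking them against a single query.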

LLM model summary

https://github.com/DrSquare/LLM_Evaluation/blob/main/LLM_Models.pdf

Case studies and examples of LLM model comparisons

  1. Chatbot Arena (LMSYS): Comparative Eval and Leaderboard (see the rating sketch after this list) https://lmarena.ai/ https://lmarena.ai/?leaderboard https://arxiv.org/abs/2403.04132 https://arxiv.org/abs/2504.20879
  2. Multi-agent evaluation: simulated environment with task completion https://github.com/TheAgentCompany/TheAgentCompany https://the-agent-company.com/#/leaderboard
  3. Perplexity: Evaluating Online LLMs and Toloka Perplexity Case Study https://www.perplexity.ai/hub/blog/introducing-pplx-online-llms https://toloka.ai/blog/perplexity-case/
  4. LLM Online A/B Testing Framework https://www.microsoft.com/en-us/research/articles/how-to-evaluate-llms-a-complete-metric-framework/
  5. RAG Evaluation https://www.linkedin.com/pulse/evaluating-rag-systems-comprehensive-approach-assessing-kakkar-esm9c/
  6. Evaluating Large Language Models: https://www.linkedin.com/posts/minha-hwang-7440771_ai-llm-machinelearning-activity-7297997796480032769-zAva?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  7. Machine Learning Meets Cognitive Science https://www.linkedin.com/posts/minha-hwang-7440771_tom-griffiths-on-using-machine-learning-and-activity-7296558609356701697-j2P9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  8. NOLIMA: Long-Context Evaluation Beyond Literal Matching https://www.linkedin.com/posts/minha-hwang-7440771_nolima-long-context-evaluation-beyond-activity-7295798550905425920-4Joh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  9. LLM Evaluation - Measurement Scale Options https://www.linkedin.com/posts/minha-hwang-7440771_llm-evaluation-measurement-scale-options-activity-7295417284582330368-PXS-?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  10. The Resurgence of Survey Research in the AI Era https://www.linkedin.com/posts/minha-hwang-7440771_ai-humandata-alignment-activity-7292901672811405312-cjZf?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  11. LLM as a judge: Search Engine Ranking Relevance (see the judging sketch after this list) https://www.linkedin.com/posts/minha-hwang-7440771_large-language-models-can-accurately-predict-activity-7287267842343813121-57is?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  12. Comparing Traditional and LLM-based Search for Consumer Choice https://www.linkedin.com/posts/minha-hwang-7440771_comparing-traditional-and-llm-based-search-activity-7287260733388529665-HmrK?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
  13. Interpretable User Satisfaction Estimation for Conversational Systems with LLMs https://www.linkedin.com/posts/minha-hwang-7440771_240312388-activity-7281121918601121792-7rRh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAv-mQBjDr9Jz8JMW5kF3_ogzmqhJHjGXo
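
To make item 1 concrete, here is a minimal sketch of Elo-style rating updates from pairwise "battle" outcomes, the basic mechanism behind comparative leaderboards. The K-factor, starting rating, and battle log are illustrative assumptions; Chatbot Arena itself fits a Bradley-Terry model over all battles rather than running sequential Elo updates.

```python
# Elo-style rating from pairwise LLM battles (illustrative sketch only;
# the K-factor, starting rating, and battle log are assumptions).
from collections import defaultdict

K = 32  # update step size (assumed, not LMSYS's value)
ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000

def expected(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(model_a, model_b, score_a):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, outcome for A).
battles = [("gpt-x", "llama-y", 1.0), ("llama-y", "claude-z", 0.5),
           ("claude-z", "gpt-x", 0.0)]
for a, b, s in battles:
    update(a, b, s)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.1f}")
```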
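Item 11's LLM-as-a-judge setup can be sketched as a prompt template plus a parser. The prompt wording, the 0-3 relevance scale, and the `call_llm` stub below are hypothetical placeholders; the actual paper's prompt and scale are not reproduced here.

```python
# LLM-as-a-judge sketch for search relevance labeling (hypothetical
# prompt and scale; call_llm is a stub for whichever LLM API you use).
import re

JUDGE_PROMPT = """You are a search relevance rater.
Query: {query}
Document: {document}
Rate the document's relevance to the query on a 0-3 scale
(0 = irrelevant, 3 = perfectly relevant). Answer with a single digit."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call for your provider.
    raise NotImplementedError("wire up your LLM client here")

def judge_relevance(query: str, document: str) -> int:
    """Ask the judge model for a graded relevance label and parse it."""
    reply = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    match = re.search(r"[0-3]", reply)  # tolerate extra words around the digit
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

# Usage (once call_llm is implemented):
# label = judge_relevance("best hiking boots", "Review of 2024 trail boots...")
```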

Usage

  1. Download the slides or the sample Jupyter Notebook from the repository.
  2. Open the files with Adobe Acrobat, PowerPoint, Google Slides, Visual Studio Code, or any compatible viewer.
  3. Use the slides for presentations, research, or internal analysis.
  4. Modify or extend the slides and the sample Jupyter Notebook as needed for your specific use case.

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch (feature-name or bugfix-name).
  3. Add or update slide decks or the Jupyter notebook.
  4. Commit your changes and push the branch.
  5. Open a Pull Request with a detailed description.

License

This project is licensed under the MIT License. See LICENSE for more details.

Contact

For inquiries or support, open an issue or reach out to minha.hwang@gmail.com.
