ALEX-nlp/Awesome-Paper-List-for-Auto-Evaluation

Introduction

This repository collects recently accepted papers related to Auto-Evaluation.

Dynamic benchmarks

Dynamic benchmarks aim to continuously update their test data, offering a fairer assessment.

20 ICLR 2024 Jifan Yu, Xiaozhi Wang, et al., “KoLA: Carefully benchmarking world knowledge of large language models” [HOME Page][PDF]

21 LREC 2024 Yantao Liu, Zijun Yao, et al., “Untangle the KNOT: Interweaving conflicting knowledge and reasoning skills in large language models” [HOME Page] [PDF]

22 ICLR 2025 Colin White, Samuel Dooley, et al., “Livebench: A challenging, contamination-free LLM benchmark” [HOME Page] [PDF]

23 arXiv 2024 Wei Tang, Yixin Cao, Yang Deng, et al., “EvoWiki: Evaluating LLMs on evolving knowledge” [HOME Page] [PDF]

24 arXiv 2024 Xiaobao Wu, Liangming Pan, et al., “AntiLeak-Bench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge” [HOME Page] [PDF]

47 ICLR 2025 Naman Jain, King Han, Alex Gu, et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code” [HOME Page] [PDF]

244 ECML 2023 Andrzej Dulny, Andreas Hotho, Anna Krause, “DynaBench: A benchmark dataset for learning dynamical systems from low-resolution data” [PDF]

245 ICML 2024 Wei-Lin Chiang, Lianmin Zheng, et al., “Chatbot arena: An open platform for evaluating LLMs by human preference” [HOME Page] [PDF]

246 NeurIPS 2023 Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation” [HOME Page] [PDF]

247 arXiv 2024 Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen, “Open-LLM-Leaderboard: From multi-choice to open-style questions for LLMs evaluation, benchmark, and arena” [HOME Page] [PDF]

248 NeurIPS 2023 Yuzhen Huang, Yuzhuo Bai, et al., “C-eval: A multi-level multi-discipline Chinese evaluation suite for foundation models” [HOME Page] [PDF]

Automated Dataset Curation

Automated dataset curation avoids human-annotated data, which requires substantial budgets and time to produce and is particularly vulnerable to rapid outdatedness and potential information leakage.

15 EMNLP 2023 Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen, “HaluEval: A large-scale hallucination evaluation benchmark for large language models” [HOME Page] [PDF]

22 ICLR 2025 Colin White, Samuel Dooley, et al., “Livebench: A challenging, contamination-free LLM benchmark” [HOME Page] [PDF]

24 arXiv 2024 Xiaobao Wu, Liangming Pan, et al., “AntiLeak-Bench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge” [HOME Page] [PDF]

29 ICLR 2024 Pan Lu, Hritik Bansal, et al., “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts” [HOME Page] [PDF]

33 ACL 2024 Chaoqun He, Renjie Luo, et al., “Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems”[HOME Page][PDF]

38 ICLR 2024 Carlos E Jimenez, John Yang, et al., “SWE-bench: Can language models resolve real-world GitHub issues?” [HOME Page] [PDF]

39 ICLR 2025 Terry Yue Zhuo, Minh Chien Vu, et al., “BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions” [HOME Page] [PDF]

47 ICLR 2025 Naman Jain, King Han, Alex Gu, et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code” [HOME Page] [PDF]

62 TMLR 2024 Aarohi Srivastava, Abhinav Rastogi, et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models” [HOME Page] [PDF]

90 ACL 2024 Chenxin An, Shansan Gong, et al., “L-eval: Instituting standardized evaluation for long context language models” [HOME Page][PDF]

91 ACL 2024 Yushi Bai, Xin Lv, Jiajie Zhang, et al., “LongBench: A bilingual, multitask benchmark for long context understanding” [HOME Page][PDF]

145 ICML 2024 Kaining Ying, Fanqing Meng, et al., “Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI” [HOME Page] [PDF]

148 NeurIPS 2024 Lin Chen, Jinsong Li, et al., “Are we on the right way for evaluating large vision-language models?” [HOME Page] [PDF]

149 ECCV 2024 Yuan Liu, Haodong Duan, et al., “Mmbench: Is your multi-modal model an all-around player?” [HOME Page][PDF]

158 CVPR 2024 Xiang Yue, Yuansheng Ni, et al., “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI” [HOME Page] [PDF]

248 NeurIPS 2023 Yuzhen Huang, Yuzhuo Bai, et al., “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models”[HOME Page][PDF]

249 EMNLP 2023 Wenhu Chen, Ming Yin, et al., “TheoremQA: A theorem-driven question answering dataset” [HOME Page][PDF]

250 ICML 2024 Xiaoxuan Wang, Ziniu Hu, et al., “SciBench: Evaluating college-level scientific problem-solving abilities of large language models” [HOME Page][PDF]

251 COLM 2024 Dingjie Song, Shunian Chen, et al., “MileBench: Benchmarking MLLMs in long context” [HOME Page] [PDF]

253 EACL 2023 Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers, “MTEB: Massive text embedding benchmark” [HOME Page][PDF]

254 ICLR 2025 Manuel Faysse, Hugues Sibille, et al., “Colpali: Efficient document retrieval with vision language models” [HOME Page][PDF]

255 ACL 2024 Jiahao Ying, Yixin Cao, et al., “Intuitive or dependent? Investigating LLMs’ behavior style to conflicting prompts” [HOME Page] [PDF]

256 ICLR 2025 Jiacheng Chen, Tianhao Liang, et al., “Mega-bench: Scaling multimodal evaluation to over 500 real-world tasks” [HOME Page] [PDF]

257 NeurIPS 2023 Wenxuan Zhang, Mahani Aljunied, et al., “M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models” [HOME Page][PDF]

258 ICLR 2024 Xingyao Wang, Zihan Wang, et al., “MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback” [HOME Page] [PDF]

260 ICLR 2025 Hongjin Su, Howard Yen, et al., “BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval” [HOME Page] [PDF]

263 EMNLP 2024 Yubo Ma, Zhibin Gou, et al., “SciAgent: Tool-augmented language models for scientific reasoning” [HOME Page][PDF]

264 NeurIPS 2024 Yubo Ma, Yuhang Zang, et al., “Mmlongbench-doc: Benchmarking long-context document understanding with visualizations” [HOME Page][PDF]

265 ICLR 2025 Shi Yu, Chaoyue Tang, et al., “VisRAG: Vision-based retrieval-augmented generation on multi-modality documents” [HOME Page] [PDF]

266 ACL 2024 Zhiqing Sun, Sheng Shen, et al., “Aligning large multimodal models with factually augmented RLHF” [HOME Page][PDF]

267 arXiv 2025 Niklas Muennighoff, Zitong Yang, et al., “s1: Simple test-time scaling” [HOME Page] [PDF]

269 arXiv 2023 Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, “Hierarchical multimodal transformers for multi-page DocVQA” [HOME Page] [PDF]

270 ICML 2024 Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su, “GPT-4V(ision) is a generalist web agent, if grounded” [HOME Page] [PDF]

271 ICLR 2025 John Yang, Carlos E Jimenez, et al., “SWE-Bench multimodal: Do AI systems generalize to visual software domains?” [HOME Page][PDF]

272 NeurIPS 2023 Xiang Deng, Yu Gu, Boyuan Zheng, et al., “Mind2web: Towards a generalist agent for the web” [HOME Page][PDF]

273 arXiv 2024 Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin, “VISA: Retrieval augmented generation with visual source attribution” [HOME Page] [PDF]

274 arXiv 2024 Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal, “M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding” [HOME Page] [PDF]

276 arXiv 2024 Zora Zhiruo Wang, Akari Asai, et al., “CodeRAG-Bench: Can retrieval augment code generation?” [HOME Page] [PDF]

278 arXiv 2024 Wangtao Sun, Chenxiang Zhang, et al., “Beyond instruction following: Evaluating inferential rule following of large language models” [HOME Page] [PDF]

279 arXiv 2024 Fei Wang, Xingyu Fu, et al., “MuirBench: A comprehensive benchmark for robust multi-image understanding” [HOME Page] [PDF]

280 EMNLP 2024 Chunyang Li, Hao Peng, et al., “MAVEN-FACT: A large-scale event factuality detection dataset”[HOME Page][PDF]

281 NeurIPS 2023 Kaiyu Yang, Aidan Swope, et al., “LeanDojo: Theorem proving with retrieval-augmented language models” [HOME Page][PDF]

282 NeurIPS 2023 Jungo Kasai, Keisuke Sakaguchi, et al., “Realtime QA: What’s the answer right now?” [HOME Page][PDF]

283 arXiv 2025 Shanghaoran Quan, Jiaxi Yang, et al., “Codeelo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings” [HOME Page] [PDF]

285 NeurIPS 2024 Weiyun Wang, Shuibo Zhang, et al., “Needle in a multimodal haystack”[HOME Page] [PDF]

286 NeurIPS 2023 Hugo Laurençon, Lucile Saulnier, et al., “Obelics: An open web-scale filtered dataset of interleaved image-text documents” [HOME Page][PDF]

287 EMNLP 2023 Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen, “Evaluating object hallucination in large vision-language models” [HOME Page][PDF]

288 ACL 2024 Noah Wang, Z.y. Peng, Haoran Que, et al., “RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models” [HOME Page][PDF]

289 NeurIPS 2023 Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem, “Camel: Communicative agents for ‘mind’ exploration of large language model society” [HOME Page] [PDF]

290 ACL 2024 Ge Bai, Jie Liu, Xingyuan Bu, et al., “MT-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues” [HOME Page] [PDF]

291 ICML 2024 Alex Gu, Baptiste Roziere, et al., “Cruxeval: A benchmark for code reasoning, understanding and execution” [HOME Page] [PDF]

292 ICLR 2024 Yujia Qin, Shihao Liang, et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs” [HOME Page][PDF]

293 COLM 2024 Abhika Mishra, Akari Asai, et al., “Fine-grained hallucination detection and editing for language models” [HOME Page] [PDF]

294 ICLR 2025 Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li, “RM-Bench: Benchmarking reward models of language models with subtlety and style” [HOME Page] [PDF]

295 CVPR 2025 Lei Li, Yuancheng Wei, et al., “VL-RewardBench: A challenging benchmark for vision-language generative reward models” [HOME Page] [PDF]

296 arXiv 2024 Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao, “RAG-RewardBench: Benchmarking reward models in retrieval augmented generation for preference alignment” [HOME Page] [PDF]

297 arXiv 2024 Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen, “Unraveling cross-modality knowledge conflicts in large vision-language models” [HOME Page] [PDF]

298 ICLR 2024 Can Xu, Qingfeng Sun, et al., “WizardLM: Empowering large pre-trained language models to follow complex instructions” [HOME Page][PDF]

299 NeurIPS 2024 Jiahao Ying, Yixin Cao, et al., “Automating dataset updates towards reliable and timely evaluation of large language models”[HOME Page][PDF]

301 NeurIPS 2023 Yushi Bai, Jiahao Ying, et al., “Benchmarking foundation models with language-model-as-an-examiner” [HOME Page][PDF]

Pipeline of Automated Dataset Curation

The pipeline of automated dataset curation covers design choices that strengthen automated curation, including a well-defined taxonomy, step decomposition, prompting, and verification.

145 ICML 2024 Kaining Ying, Fanqing Meng, et al., “Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI” [HOME Page] [PDF]

147 EMNLP 2024 Meiqi Chen, Yixin Cao, Yan Zhang, and Chaochao Lu, “Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective”[HOME Page][PDF]

263 EMNLP 2024 Yubo Ma, Zhibin Gou, Junheng Hao, et al., “SciAgent: Tool-augmented language models for scientific reasoning” [HOME Page][PDF]

273 arXiv 2024 Xueguang Ma, Shengyao Zhuang, et al., “VISA: Retrieval augmented generation with visual source attribution” [HOME Page] [PDF]

289 NeurIPS 2023 Guohao Li, Hasan Hammoud, et al., “Camel: Communicative agents for ‘mind’ exploration of large language model society” [HOME Page] [PDF]

292 ICLR 2024 Yujia Qin, Shihao Liang, et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs” [HOME Page][PDF]

296 arXiv 2024 Zhuoran Jin, Hongbang Yuan, et al., “RAG-RewardBench: Benchmarking reward models in retrieval augmented generation for preference alignment” [HOME Page] [PDF]

302 ICLR 2024 Xiao Liu, Hao Yu, et al., “AgentBench: Evaluating LLMs as agents” [HOME Page] [PDF]

303 CVPR 2024 Bohao Li, Yuying Ge, et al., “Seed-Bench: Benchmarking multimodal large language models” [HOME Page] [PDF]

304 EMNLP 2024 Yizhu Jiao, Ming Zhong, et al., “Instruct and extract: Instruction tuning for on-demand information extraction” [HOME Page][PDF]

305 EMNLP 2024 Philippe Laban, Alexander Fabbri, et al., “Summary of a haystack: A challenge to long-context LLMs and RAG systems” [HOME Page][PDF]

306 EMNLP 2024 Qingxiu Dong, Lei Li, Damai Dai, et al., “A survey on in-context learning” [HOME Page][PDF]

307 ICLR 2024 Xiang Yue, Xingwei Qu, Ge Zhang, et al., “MAmmoTH: Building math generalist models through hybrid instruction tuning” [HOME Page] [PDF]

308 NeurIPS 2024 Dan Zhang, Ziniu Hu, et al., “SciInstruct: a self-reflective instruction annotated dataset for training scientific language models” [HOME Page][PDF]

Evaluator

The Evaluator part covers methods and strategies for using LLMs as judges.

130 NeurIPS 2023 Lianmin Zheng, Wei-Lin Chiang, et al., “Judging LLM-as-a-judge with MT-bench and Chatbot Arena” [HOME Page] [PDF]

242 DASFAA 2024 Yu Li, Shenyu Zhang, Rui Wu, et al., “MATEval: A multi-agent discussion framework for advancing open-ended text evaluation” [HOME Page] [PDF]

301 NeurIPS 2023 Yushi Bai, Jiahao Ying, et al., “Benchmarking foundation models with language-model-as-an-examiner” [HOME Page][PDF]

306 EMNLP 2024 Qingxiu Dong, Lei Li, et al., “A survey on in-context learning” [HOME Page][PDF]

315 ACL 2024 Jinlan Fu, See-Kiong Ng, et al., “Gptscore: Evaluate as you desire” [HOME Page][PDF]

318 JMLR 2024 Hyung Won Chung, Le Hou, et al., “Scaling instruction-finetuned language models” [HOME Page][PDF]

319 ACL 2023 Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, and Ramakanth Pasunuru, “Complementary explanations for effective in-context learning” [HOME Page][PDF]

321 Eval4NLP 2023 Neema Kotonya, Saran Krishnasamy, et al., “Little giants: Exploring the potential of small LLMs as evaluation metrics in summarization in the Eval4NLP 2023 shared task” [HOME Page] [PDF]

322 arXiv 2023 Hosein Hasanbeig, Hiteshi Sharma, et al., “Allure: A systematic protocol for auditing and improving LLM-based evaluation of text using iterative in-context-learning” [HOME Page] [PDF]

323 ACL coling 2025 Mingyang Song, Mao Zheng, and Xuan Luo, “Can many-shot in-context learning help long-context LLM judges? see more, judge better!” [HOME Page] [PDF]

324 EMNLP 2023 Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment” [HOME Page][PDF]

325 EMNLP 2023 Cheng-Han Chiang and Hung-yi Lee, “A closer look into using large language models for automatic evaluation” [HOME Page][PDF]

326 EACL 2024 Terry Yue Zhuo, “Ice-score: Instructing large language models to evaluate code” [HOME Page][PDF]

327 NAACL 2024 Swarnadeep Saha, Omer Levy, et al., “Branch-solve-merge improves large language model evaluation and generation” [HOME Page] [PDF]

328 NAACL 2024 Hangfeng He, Hongming Zhang, and Dan Roth, “SocREval: Large language models with the socratic method for reference-free reasoning evaluation” [HOME Page] [PDF]

329 ICLR 2024 Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu, “Generative judge for evaluating alignment” [HOME Page] [PDF]

330 ACL 2024 Zhuohao Yu, Chang Gao, et al., “Kieval: A knowledge-grounded interactive evaluation framework for large language models” [HOME Page] [PDF]

331 arXiv 2023 Mingqi Gao, Jie Ruan, et al., “Human-like summarization evaluation with ChatGPT” [HOME Page] [PDF]

332 EMNLP 2024 Seungone Kim, Juyoung Suk, et al., “Prometheus 2: An open source language model specialized in evaluating other language models” [HOME Page] [PDF]

333 ACL 2023 Sameer Jain, Vaishakh Keshava, et al., “Multi-dimensional evaluation of text summarization with in-context learning” [HOME Page][PDF]

334 ACL 2024 Hwanjun Song, Hang Su, et al., “Finesure: Fine-grained summarization evaluation using LLMs” [HOME Page] [PDF]

335 ACL 2024 Yuxuan Liu, Tianchi Yang, et al., “Hd-eval: Aligning large language model evaluators through hierarchical criteria decomposition” [HOME Page] [PDF]

336 ACL 2024 Xinyu Hu, Mingqi Gao, et al., “Are LLM-based evaluators confusing NLG quality criteria?” [HOME Page][PDF]

337 LREC 2024 Yuxuan Liu, Tianchi Yang, et al., “Calibrating LLM-based evaluator” [HOME Page] [PDF]

338 EMNLP 2024 Yijiang River Dong, Tiancheng Hu, and Nigel Collier, “Can LLM be a personalized judge?” [HOME Page][PDF]

341 ICLR 2024 Yidong Wang, Zhuohao Yu, et al., “PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization” [HOME Page] [PDF]

342 arXiv 2024 Ruochen Zhao, Wenxuan Zhang, et al., “Auto arena of LLMs: Automating LLM evaluations with agent peer-battles and committee discussions” [HOME Page] [PDF]

343 arXiv 2024 Zhumin Chu, Qingyao Ai, et al., “PRE: A peer review based large language model evaluator” [HOME Page] [PDF]

344 ICLR 2025 Kun-Peng Ning, Shuo Yang, et al., “Pico: Peer review in LLMs based on the consistency optimization” [HOME Page] [PDF]

345 arXiv 2024 Bhrij Patel, Souradip Chakraborty, et al., “AIME: AI system optimization via multiple LLM evaluators” [HOME Page] [PDF]

346 EMNLP 2024 Yicheng Gao, Gonghan Xu, Zhe Wang, and Arman Cohan, “Bayesian calibration of win rate estimation with LLM evaluators” [HOME Page] [PDF]

347 arXiv 2024 Zhengyu Hu, Jieyu Zhang, et al., “Language model preference evaluation with multiple weak evaluators” [HOME Page] [PDF]

348 arXiv 2023 Xinghua Zhang, Bowen Yu, et al., “Wider and deeper LLM networks are fairer LLM evaluators” [HOME Page] [PDF]

349 arXiv 2023 Zhenran Xu, Senbao Shi, et al., “Towards reasoning in large language models via multi-agent peer review collaboration” [HOME Page] [PDF]

350 EMNLP 2024 Sirui Liang, Baoli Zhang, Jun Zhao, and Kang Liu, “ABSEval: An agent-based framework for script evaluation” [HOME Page][PDF]

351 arXiv 2024 Chaithanya Bandi, Abir Harrasse, “Adversarial multi-agent evaluation of large language models through iterative debates” [HOME Page] [PDF]

352 ICLR 2024 Chi-Min Chan, Weize Chen, et al., “ChatEval: Towards better LLM-based evaluators through multi-agent debate” [HOME Page][PDF]

353 TMLR 2024 Ruosen Li, Teerth Patel, Xinya Du, “PRD: Peer rank and discussion improve large language model based evaluations” [HOME Page] [PDF]

354 ICLR 2025 Jaehun Jung, Faeze Brahman, Yejin Choi, “Trust or escalate: LLM judges with provable guarantees for human agreement” [HOME Page] [PDF]

355 arXiv 2024 Hui Huang, Yingqi Qu, Xingyuan Bu, et al., “An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are not a general substitute for GPT-4” [HOME Page] [PDF]

356 HUCLLM 2024 Qian Pan, Zahra Ashktorab, Michael Desmond, et al., “Human-centered design recommendations for LLM-as-a-judge” [HOME Page] [PDF]

358 arXiv 2023 Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi, “Collaborative evaluation: Exploring the synergy of large language models and humans for open-ended generation evaluation” [HOME Page] [PDF]

359 UIST 2024 Shreya Shankar, J.D. Zamfirescu-Pereira, et al., “Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences” [HOME Page] [PDF]

361 AIES 2023 Charvi Rastogi, Marco Tulio Ribeiro, et al., “Supporting human-AI collaboration in auditing LLMs with LLMs” [PDF]

362 ACL 2024 Peiyi Wang, Lei Li, et al., “Large language models are not fair evaluators” [HOME Page][PDF]

363 arXiv 2023 Tianlu Wang, Ping Yu, et al., “Shepherd: A critic for language model generation” [HOME Page] [PDF]

364 EMNLP 2024 Tu Vu, Kalpesh Krishna, et al., “Foundational autoraters: Taming large language models for better automatic evaluation” [HOME Page] [PDF]

365 ICLR 2025 Lianghui Zhu, Xinggang Wang, Xinlong Wang, “Judgelm: Fine-tuned large language models are scalable judges” [HOME Page][PDF]

366 ICLR 2024 Seungone Kim, Jamin Shin, et al., “Prometheus: Inducing fine-grained evaluation capability in language models” [HOME Page][PDF]

367 TMLR 2024 Dongfu Jiang, Yishan Li, et al., “Tigerscore: Towards building explainable metric for all text generation tasks” [HOME Page] [PDF]

368 EMNLP 2023 Wenda Xu, Danqing Wang, et al., “INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback” [HOME Page][PDF]

369 EMNLP 2024 Yixiu Liu, Yuxiang Zheng, et al., “Safety-j: Evaluating safety with critique” [HOME Page][PDF]

371 PDLM 2025 Binjie Wang, Steffi Chern, et al., “Halu-j: Critique-based hallucination judge” [PDF]

372 EMNLP 2024 Xinyu Hu, Li Lin, et al., “Themis: A reference-free NLG evaluation language model with flexibility and interpretability” [HOME Page][PDF]

373 arXiv 2024 Tianhao Wu, Weizhe Yuan, et al., “Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge” [HOME Page] [PDF]
