A curated list of pioneering research papers, tools, and resources at the intersection of Large Language Models (LLMs) and Scientific Discovery.
Survey: From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery ([arXiv:2505.13259](https://arxiv.org/abs/2505.13259))
The survey delineates the evolving role of LLMs in science through a three-level autonomy framework:
- Level 1: LLM as Tool: LLMs augmenting human researchers for specific, well-defined tasks.
- Level 2: LLM as Analyst: LLMs exhibiting greater autonomy in processing complex information and offering insights.
- Level 3: LLM as Scientist: LLM-based systems autonomously conducting major research stages.
Below is a visual representation of this taxonomy:
We aim to provide a comprehensive overview for researchers, developers, and enthusiasts interested in this rapidly advancing field.
Level 1 (LLM as Tool): At this foundational level, LLMs function as tailored tools under direct human supervision, executing specific, well-defined tasks within a single stage of the scientific method. Their primary goal is to enhance researcher efficiency.
Automating literature search, retrieval, synthesis, structuring, and organization.
- SCIMON: Scientific Inspiration Machines Optimized for Novelty
- Wang et al. (2023.05)
- ResearchAgent: Iterative research idea generation over scientific literature with Large Language Models
- Baek et al. (2024.04)
- Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
- Deng et al. (2024.04)
- TKGT: Redefinition and A New Way of Text-to-Table Task Based on Real World Demands and Knowledge Graph Augmented LLMs
- Jiang et al. (2024.10)
- ArxivDIGESTables: Synthesizing scientific literature into tables using language models
- Newman et al. (2024.10)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol
- Wang et al. (2025.04)
- LitLLM: A Toolkit for Scientific Literature Review
- Agarwal et al. (2024.02)
- Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain
- Dennstädt et al. (2024.06)
- Science Hierarchography: Hierarchical Organization of Science Literature
- Gao et al. (2025.04)
Automated generation of novel research ideas, conceptual insights, and testable scientific hypotheses.
- SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
- Ghafarollahi et al. (2024.09)
- Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning
- Buehler (2024.03)
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
- Yang et al. (2024.10)
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
- Yang et al. (2023.09)
- Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models
- Xiong et al. (2024.11)
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
- Liu et al. (2025.03)
- AI Idea Bench 2025: AI Research Idea Generation Benchmark
- Qiu et al. (2025.04)
- IdeaBench: Benchmarking Large Language Models for Research Idea Generation
- Guo et al. (2024.11)
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
- Si et al. (2024.09)
- Learning to Generate Research Idea with Dynamic Control
- Li et al. (2024.12)
- LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context
- Ruan et al. (2024.12)
- Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas
- Hu et al. (2024.10)
- GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation
- Feng et al. (2025.03)
- Hypothesis Generation with Large Language Models
- Zhou et al. (2024.04)
- Harnessing the Power of Adversarial Prompting and Large Language Models for Robust Hypothesis Generation in Astronomy
- Ciuca et al. (2023.06)
- Large Language Models are Zero Shot Hypothesis Proposers
- Qi et al. (2023.11)
- Machine learning for hypothesis generation in biology and medicine: exploring the latent space of neuroscience and developmental bioelectricity
- O’Brien et al. (2023.07)
- Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
- Qi et al. (2024.07)
- LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation
- Afonja et al. (2024.10)
- Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination
- Radensky et al. (2024.09)
- HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance
- Vasu et al. (2025.06)
- Sparks of Science: Hypothesis Generation Using Structured Paper Data
- O'Neill et al. (2025.04)
LLMs assisting in experimental protocol planning, workflow design, and scientific code generation.
- BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
- O'Donoghue et al. (2023.10)
- Can Large Language Models Help Experimental Design for Causal Discovery?
- Li et al. (2025.03)
- Hierarchically Encapsulated Representation for Protocol Design in Self-Driving Labs
- Shi et al. (2025.04)
- SciCode: A Research Coding Benchmark Curated by Scientists
- Tian et al. (2024.07)
- Natural Language to Code Generation in Interactive Data Science Notebooks
- Yin et al. (2022.12)
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
- Lai et al. (2022.11)
- Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
- Kon et al. (2025.02)
LLMs assisting in data-driven analysis, tabular/chart reasoning, statistical reasoning, and model discovery.
- AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
- Belouadi et al. (2023.10)
- Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback
- Zadeh et al. (2024.10)
- ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
- Masry et al. (2022.03)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- Wang et al. (2024.06)
- ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
- Xia et al. (2024.02)
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
- Wang et al. (2024.01)
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
- Wu et al. (2024.08)
- Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs
- Deng et al. (2024.02)
LLMs providing feedback, verifying claims, replicating results, and generating reviews.
- CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?
- Ou et al. (2025.03)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
- Du et al. (2024.06)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews
- Tyser et al. (2024.08)
- Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks
- Zhou et al. (2024.05)
- ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
- Liu and Shah (2023.06)
- Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance
- Takagi et al. (2023.11)
- CycleResearcher: Improving Automated Research via Automated Review
- Weng et al. (2024.11)
- PaperBench: Evaluating AI’s Ability to Replicate AI Research
- Starace et al. (2025.04)
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
- Xiang et al. (2025.04)
- Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning
- Xu et al. (2025.04)
- Generative Adversarial Reviews: When LLMs Become the Critic
- Bougie & Watanabe (2024.12)
- Predicting Empirical AI Research Outcomes with Language Models
- Wen et al. (2025.06)
LLMs involved in iterative refinement of research hypotheses and strategic exploration.
- Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving
- Quan et al. (2024.05)
- Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents
- Li et al. (2024.10)
- Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees
- Rabby et al. (2025.03)
- XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision
- Chen et al. (2025.05)
Level 2 (LLM as Analyst): At this level, LLMs exhibit a greater degree of autonomy, functioning as passive agents capable of complex information processing, data modeling, and analytical reasoning with reduced human intervention.
Automated modeling of machine learning tasks, experiment design, and execution.
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
- Huang et al. (2023.10)
- MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
- Li et al. (2024.08)
- MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
- Chan et al. (2024.10)
- IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents
- Xue et al. (2025.02)
- CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
- Jansen et al. (2025.03)
- MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
- Zhang et al. (2025.04)
- RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
- Wijk et al. (2024.11)
- MLZero: A Multi-Agent System for End-to-end Machine Learning Automation
- Fang et al. (2025.05)
- AIDE: AI-Driven Exploration in the Space of Code
- Jiang et al. (2025.02)
- Language Modeling by Language Models
- Cheng et al. (2025.06)
- MLGym: A New Framework and Benchmark for Advancing AI Research Agents
- Nathani et al. (2025.02)
Automated data-driven analysis, statistical data modeling, and hypothesis validation.
- Automated Statistical Model Discovery with Language Models
- Li et al. (2024.02)
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
- Hu et al. (2024.01)
- DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning
- Guo et al. (2024.02)
- BLADE: Benchmarking Language Model Agents for Data-Driven Science
- Gu et al. (2024.08)
- DAgent: A Relational Database-Driven Data Analysis Report Generation Agent
- Xu et al. (2025.03)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
- Majumder et al. (2024.07)
- Large Language Models for Scientific Synthesis, Inference and Explanation
- Zheng et al. (2023.10)
- MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem
- Liu et al. (2025.05)
- DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
- Jing et al. (2024.09)
Identifying underlying equations from observational data (AI-driven symbolic regression).
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models
- Shojaee et al. (2024.04)
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
- Shojaee et al. (2025.04)
- Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents
- Koblischke et al. (2025.01)
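To make the task in this subsection concrete: equation discovery means recovering a symbolic law from raw observations. The following is a minimal, non-LLM baseline sketch (assuming clean, noise-free samples and a polynomial ground truth); it only illustrates the problem setup that benchmarks like LLM-SRBench formalize, not the method of any paper listed above.

```python
import numpy as np

def discover_law(x, y, max_degree=3, tol=1e-8):
    """Fit polynomials of increasing degree and return the simplest one
    whose residual is (near) zero: a toy stand-in for symbolic regression."""
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        residual = np.max(np.abs(np.polyval(coeffs, x) - y))
        if residual < tol:
            return degree, np.round(coeffs, 6)
    return None, None

# Hidden "law" generating the observations: y = 0.5*x^2 + 2*x
# (e.g. constant-acceleration kinematics, d = v*t + a*t^2 / 2).
x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x**2 + 2.0 * x

degree, coeffs = discover_law(x, y)
print(degree, coeffs)  # recovers a degree-2 law with coefficients ~ [0.5, 2.0, 0.0]
```

LLM-based approaches replace the fixed polynomial hypothesis class with model-proposed candidate expressions (often as executable programs), scored against the data in a similar generate-and-verify loop.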
Autonomous research workflows for natural science discovery (e.g., chemistry, biology, biomedicine).
- Coscientist: Autonomous Chemical Research with Large Language Models
- Boiko et al. (2023.10)
- Empowering biomedical discovery with AI agents
- Gao et al. (2024.09)
- From Intention To Implementation: Automating Biomedical Research via LLMs
- Luo et al. (2024.12)
- DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration
- Liu et al. (2024.11)
- ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
- Chen et al. (2024.10)
- ProtAgents: Protein discovery by combining physics and machine learning
- Ghafarollahi and Buehler (2024.02)
- Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs
- Chen et al. (2025.02)
- Towards an AI co-scientist
- Gottweis et al. (2025.02)
Benchmarks and frameworks evaluating diverse tasks from different stages of scientific discovery.
- DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
- Jansen et al. (2024.06)
- A Vision for Auto Research with LLM Agents
- Liu et al. (2025.04)
- CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
- Cui et al. (2025.03)
- EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants
- Cappello et al. (2025.02)
Level 3 (LLM as Scientist): At this level, LLM-based systems operate as active agents that orchestrate and navigate multiple stages of the scientific discovery process with considerable independence, often culminating in draft research papers.
- Agent Laboratory: Using LLM Agents as Research Assistants
- Schmidgall et al. (2025.01)
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- Lu et al. (2024.08)
- The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
- Yamada et al. (2025.04)
- AI-Researcher: Fully-Automated Scientific Discovery with LLM Agents
- Data Intelligence Lab (2025.03)
- Zochi Technical Report
- Intology AI (2025.03)
- Meet Carl: The First AI System To Produce Academically Peer-Reviewed Research
- Autoscience Institute (2025.03)
Contributions are welcome! If you have a paper, tool, or resource that fits into this taxonomy, please submit a pull request.
Please cite our paper if you find our survey helpful:
```bibtex
@misc{zheng2025automationautonomysurveylarge,
  title={From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery},
  author={Tianshi Zheng and Zheye Deng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Zihao Wang and Yangqiu Song},
  year={2025},
  eprint={2505.13259},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.13259},
}
```