HKUST-KnowComp/Awesome-LLM-Scientific-Discovery

Awesome LLM Scientific Discovery

A curated list of pioneering research papers, tools, and resources at the intersection of Large Language Models (LLMs) and Scientific Discovery.

Survey: From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery. (https://arxiv.org/abs/2505.13259)

The survey delineates the evolving role of LLMs in science through a three-level autonomy framework:

  • Level 1: LLM as Tool: LLMs augmenting human researchers for specific, well-defined tasks.
  • Level 2: LLM as Analyst: LLMs exhibiting greater autonomy in processing complex information and offering insights.
  • Level 3: LLM as Scientist: LLM-based systems autonomously conducting major research stages.

Below is a visual representation of this taxonomy:

[Figure: Taxonomy of LLM in Scientific Discovery]

We aim to provide a comprehensive overview for researchers, developers, and enthusiasts interested in this rapidly advancing field.

Level 1: LLM as Tool

At this foundational level, LLMs function as tailored tools under direct human supervision, designed to execute specific, well-defined tasks within a single stage of the scientific method. Their primary goal is to enhance researcher efficiency.

Literature Review and Information Gathering

Automating literature search, retrieval, synthesis, structuring, and organization.

  • SCIMON : Scientific Inspiration Machines Optimized for Novelty arXiv - Wang et al. (2023.05)
  • ResearchAgent: Iterative research idea generation over scientific literature with Large Language Models arXiv - Baek et al. (2024.04)
  • Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction arXiv - Deng et al. (2024.04)
  • TKGT: Redefinition and A New Way of text-to-table tasks based on real world demands and knowledge graphs augmented LLMs arXiv - Jiang et al. (2024.10)
  • ArxivDIGESTables: Synthesizing scientific literature into tables using language models arXiv - Newman et al. (2024.10)
  • Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol arXiv - Wang et al. (2025.04)
  • LitLLM: A Toolkit for Scientific Literature Review arXiv - Agarwal et al. (2024.02)
  • Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain DOI - Dennstädt et al. (2024.06)
  • Science Hierarchography: Hierarchical Organization of Science Literature arXiv - Gao et al. (2025.04)

Idea Generation and Hypothesis Formulation

Automated generation of novel research ideas, conceptual insights, and testable scientific hypotheses.

  • SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning arXiv - Ghafarollahi et al. (2024.09)
  • Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning arXiv - Buehler (2024.03)
  • MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses arXiv - Yang et al. (2024.10)
  • Large Language Models for Automated Open-domain Scientific Hypotheses Discovery arXiv - Yang et al. (2023.09)
  • Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models arXiv - Xiong et al. (2024.11)
  • ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition arXiv - Liu et al. (2025.03)
  • AI Idea Bench 2025: AI Research Idea Generation Benchmark arXiv - Qiu et al. (2025.04)
  • IdeaBench: Benchmarking Large Language Models for Research Idea Generation arXiv - Guo et al. (2024.11)
  • Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers arXiv - Si et al. (2024.09)
  • Learning to Generate Research Idea with Dynamic Control arXiv - Li et al. (2024.12)
  • LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context arXiv - Ruan et al. (2024.12)
  • Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas arXiv - Hu et al. (2024.10)
  • GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation arXiv - Feng et al. (2025.03)
  • Hypothesis Generation with Large Language Models arXiv - Zhou et al. (2024.04)
  • Harnessing the Power of Adversarial Prompting and Large Language Models for Robust Hypothesis Generation in Astronomy arXiv - Ciuca et al. (2023.06)
  • Large Language Models are Zero Shot Hypothesis Proposers arXiv - Qi et al. (2023.11)
  • Machine learning for hypothesis generation in biology and medicine: exploring the latent space of neuroscience and developmental bioelectricity DOI - O’Brien et al. (2023.07)
  • Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation arXiv - Qi et al. (2024.07)
  • LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation arXiv - Afonja et al. (2024.10)
  • Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination arXiv - Radensky et al. (2024.09)
  • HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance arXiv - Vasu et al. (2025.06)
  • Sparks of Science: Hypothesis Generation Using Structured Paper Data arXiv - O'Neill et al. (2025.04)

Experiment Planning and Execution

LLMs assisting in experimental protocol planning, workflow design, and scientific code generation.

  • BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology arXiv - O'Donoghue et al. (2023.10)
  • Can Large Language Models Help Experimental Design for Causal Discovery? arXiv - Li et al. (2025.03)
  • Hierarchically Encapsulated Representation for Protocol Design in Self-Driving Labs arXiv - Shi et al. (2025.04)
  • SciCode: A Research Coding Benchmark Curated by Scientists arXiv - Tian et al. (2024.07)
  • Natural Language to Code Generation in Interactive Data Science Notebooks arXiv - Yin et al. (2022.12)
  • DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation arXiv - Lai et al. (2022.11)
  • Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents arXiv - Kon et al. (2025.02)

Data Analysis and Organization

LLMs assisting in data-driven analysis, tabular/chart reasoning, statistical reasoning, and model discovery.

  • AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ arXiv - Belouadi et al. (2023.10)
  • Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback arXiv - Zadeh et al. (2024.10)
  • ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning arXiv - Masry et al. (2022.03)
  • CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs arXiv - Wang et al. (2024.06)
  • ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning arXiv - Xia et al. (2024.02)
  • Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding arXiv - Wang et al. (2024.01)
  • TableBench: A Comprehensive and Complex Benchmark for Table Question Answering arXiv - Wu et al. (2024.08)
  • Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs arXiv - Deng et al. (2024.02)

Conclusion and Hypothesis Validation

LLMs providing feedback, verifying claims, replicating results, and generating reviews.

  • CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? arXiv - Ou et al. (2025.03)
  • LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing arXiv - Du et al. (2024.06)
  • AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews arXiv - Tyser et al. (2024.08)
  • Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks Link - Zhou et al. (2024.05)
  • ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing arXiv - Liu and Shah (2023.06)
  • Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance arXiv - Takagi et al. (2023.11)
  • CycleResearcher: Improving Automated Research via Automated Review arXiv - Weng et al. (2024.11)
  • PaperBench: Evaluating AI’s Ability to Replicate AI Research arXiv - Starace et al. (2025.04)
  • SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers arXiv - Xiang et al. (2025.04)
  • Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning arXiv - Xu et al. (2025.04)
  • Generative Adversarial Reviews: When LLMs Become the Critic arXiv - Bougie & Watanabe (2024.12)
  • Predicting Empirical AI Research Outcomes with Language Models arXiv - Wen et al. (2025.06)

Iteration and Refinement

LLMs involved in iterative refinement of research hypotheses and strategic exploration.

  • Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving arXiv - Quan et al. (2024.05)
  • Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents arXiv - Li et al. (2024.10)
  • Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees arXiv - Rabby et al. (2025.03)
  • XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision arXiv - Chen et al. (2025.05)

Level 2: LLM as Analyst

LLMs exhibiting a greater degree of autonomy, functioning as passive agents capable of complex information processing, data modeling, and analytical reasoning with reduced human intervention.

Machine Learning Research

Automated modeling of machine learning tasks, experiment design, and execution.

  • MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation arXiv - Huang et al. (2023.10)
  • MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents arXiv - Li et al. (2024.08)
  • MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering arXiv - Chan et al. (2024.10)
  • IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents arXiv - Xue et al. (2025.02)
  • CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation arXiv - Jansen et al. (2025.03)
  • MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? arXiv - Zhang et al. (2025.04)
  • RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts arXiv - Wijk et al. (2024.11)
  • MLZero: A Multi-Agent System for End-to-end Machine Learning Automation arXiv - Fang et al. (2025.05)
  • AIDE: AI-Driven Exploration in the Space of Code arXiv - Jiang et al. (2025.02)
  • Language Modeling by Language Models arXiv - Cheng et al. (2025.06)
  • MLGym: A New Framework and Benchmark for Advancing AI Research Agents arXiv - Nathani et al. (2025.02)

Data Modeling and Analysis

Automated data-driven analysis, statistical data modeling, and hypothesis validation.

  • Automated Statistical Model Discovery with Language Models arXiv - Li et al. (2024.02)
  • InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks arXiv - Hu et al. (2024.01)
  • DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning arXiv - Guo et al. (2024.02)
  • BLADE: Benchmarking Language Model Agents for Data-Driven Science arXiv - Gu et al. (2024.08)
  • DAgent: A Relational Database-Driven Data Analysis Report Generation Agent arXiv - Xu et al. (2025.03)
  • DiscoveryBench: Towards Data-Driven Discovery with Large Language Models arXiv - Majumder et al. (2024.07)
  • Large Language Models for Scientific Synthesis, Inference and Explanation arXiv - Zheng et al. (2023.10)
  • MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem arXiv - Liu et al. (2025.05)
  • DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? arXiv - Jing et al. (2024.09)

Function Discovery

Identifying underlying equations from observational data (AI-driven symbolic regression).

  • LLM-SR: Scientific Equation Discovery via Programming with Large Language Models arXiv - Shojaee et al. (2024.04)
  • LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models arXiv - Shojaee et al. (2025.04)
  • Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents arXiv - Koblischke et al. (2025.01)

Natural Science Research

Autonomous research workflows for natural science discovery (e.g., chemistry, biology, biomedicine).

  • Coscientist: Autonomous Chemical Research with Large Language Models DOI - Boiko et al. (2023.10)
  • Empowering biomedical discovery with AI agents DOI - Gao et al. (2024.09)
  • From Intention To Implementation: Automating Biomedical Research via LLMs arXiv - Luo et al. (2024.12)
  • DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration arXiv - Liu et al. (2024.11)
  • ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery arXiv - Chen et al. (2024.10)
  • ProtAgents: Protein discovery by combining physics and machine learning arXiv - Ghafarollahi and Buehler (2024.02)
  • Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs arXiv - Chen et al. (2025.02)
  • Towards an AI co-scientist arXiv - Gottweis et al. (2025.02)

General Research

Benchmarks and frameworks evaluating diverse tasks across different stages of scientific discovery.

  • DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents arXiv - Jansen et al. (2024.06)
  • A Vision for Auto Research with LLM Agents arXiv - Liu et al. (2025.04)
  • CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning arXiv - Cui et al. (2025.03)
  • EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants arXiv - Cappello et al. (2025.02)

Survey Generation

Automated generation of full literature surveys.

  • AutoSurvey: Large Language Models Can Automatically Write Surveys arXiv - Wang et al. (2024.06)

Level 3: LLM as Scientist

LLM-based systems operating as active agents capable of orchestrating and navigating multiple stages of the scientific discovery process with considerable independence, often culminating in draft research papers.

  • Agent Laboratory: Using LLM Agents as Research Assistants arXiv - Schmidgall et al. (2025.01)
  • The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery arXiv - Lu et al. (2024.08)
  • The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search arXiv - Yamada et al. (2025.04)
  • AI-Researcher: Fully-Automated Scientific Discovery with LLM Agents GitHub - Data Intelligence Lab (2025.03)
  • Zochi Technical Report Link - Intology AI (2025.03)
  • Meet Carl: The First AI System To Produce Academically Peer-Reviewed Research Link - Autoscience Institute (2025.03)

Contributing

Contributions are welcome! If you have a paper, tool, or resource that fits into this taxonomy, please submit a pull request.


Citation

Please cite our paper if you find our survey helpful:

@misc{zheng2025automationautonomysurveylarge,
      title={From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery}, 
      author={Tianshi Zheng and Zheye Deng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Zihao Wang and Yangqiu Song},
      year={2025},
      eprint={2505.13259},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.13259}, 
}
