Data Gatherer is a Python library for automatically extracting dataset references from scientific publications. It processes full-text articles in HTML or XML format and uses both rule-based and LLM-based methods to identify and structure dataset citations.

Key features:
- Parses scientific articles from open-access sources like PubMed Central (PMC).
- Extracts dataset mentions from structured sections (e.g., Data Availability, Supplementary Material).
- Supports two main strategies:
  - Retrieve-Then-Read (RTR): First retrieves relevant sections using hand-crafted rules, then applies LLMs.
  - Full-Document Read (FDR): Applies LLMs to the full text without section filtering.
- Outputs structured results in JSON format.
- Includes support for known repositories (e.g., GEO, PRIDE, MassIVE) via a configurable ontology.

Typical use cases:
- Helping data curators and librarians identify datasets cited in publications.
- Supporting meta-analysis and secondary data discovery.
- Enabling dataset indexing and retrieval across the open-access literature.
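
The difference between the two strategies (RTR and FDR) can be illustrated with a minimal, standard-library-only sketch. This is not Data Gatherer's actual API: every name here (`REPOSITORY_PATTERNS`, `retrieve_sections`, `read`, `extract`) is hypothetical, the accession patterns are simplified assumptions, and the LLM "read" step is replaced by plain regex matching so the example runs without a model.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the configurable repository ontology:
# repository name -> simplified regex for its accession identifiers (assumed).
REPOSITORY_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "PRIDE": re.compile(r"\bPXD\d{6}\b"),
    "MassIVE": re.compile(r"\bMSV\d{9}\b"),
}

# Hand-crafted retrieval rule: section titles likely to mention datasets.
TARGET_TITLES = re.compile(r"data availability|supplementary material", re.IGNORECASE)


def retrieve_sections(article_xml):
    """RTR 'retrieve' step: keep only the text of sections whose title matches the rule."""
    root = ET.fromstring(article_xml)
    selected = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if TARGET_TITLES.search(title):
            selected.append(" ".join(sec.itertext()))
    return selected


def read(text):
    """Placeholder for the LLM 'read' step: regex matching instead of a model call."""
    found = []
    for repo, pattern in REPOSITORY_PATTERNS.items():
        for dataset_id in sorted(set(pattern.findall(text))):
            found.append({"repository": repo, "dataset_id": dataset_id})
    return found


def extract(article_xml, strategy="RTR"):
    """Run either strategy and return a JSON-serializable list of dataset references."""
    if strategy == "RTR":
        text = " ".join(retrieve_sections(article_xml))
    elif strategy == "FDR":
        text = " ".join(ET.fromstring(article_xml).itertext())
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return read(text)


# Toy article: one accession inside Data Availability, one only in Methods.
ARTICLE = """<article><body>
<sec><title>Methods</title><p>Raw spectra were deposited in PRIDE (PXD012345).</p></sec>
<sec><title>Data Availability</title><p>Expression data are available in GEO under GSE12345.</p></sec>
</body></article>"""

print(json.dumps(extract(ARTICLE, "RTR"), indent=2))
print(json.dumps(extract(ARTICLE, "FDR"), indent=2))
```

On this toy article, RTR reports only the GEO accession from the Data Availability section, while FDR also picks up the PRIDE accession mentioned in Methods. That is the trade-off in miniature: section filtering keeps the input to the reader small, but it can miss datasets cited outside the targeted sections.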