pipeline_new is a modular system for automated literature search, download, and parsing, primarily focused on scientific papers in materials science and sustainability.
It supports asynchronous operation, publisher-specific access methods, structured parsing, and organized database storage.
The system is designed to handle large-scale ingestion and structuring of research papers for downstream analysis, machine learning, and knowledge graph building.
```
/scripts/       # Python modules: search.py, download.py, parse.py (for external users)
/wiki/          # GitHub Wiki contains the full documentation
/data/scratch/  # Downloaded full-text articles (local storage during runs)
/docs/          # (optional) Additional notes or local copies of documentation
```
| Script | Purpose |
|---|---|
| `search.py` | Searches external APIs (Crossref, Dimensions, Lens) based on user-provided keywords. Requires command-line arguments for keywords and size. |
| `download.py` | Downloads full-text articles based on DOIs, using publisher APIs or manual sources. |
| `parse.py` | Parses downloaded files into structured sections and paragraphs, then inserts them into MongoDB. |
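Once `parse.py` has populated MongoDB, the structured text can be queried directly. Below is a minimal read-back sketch; the database, collection, and field names here are hypothetical placeholders, so check `parse.py` for the names actually used:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Hypothetical database/collection/field names, for illustration only;
# the real schema is defined in parse.py.
collection = client["papers"]["parsed_articles"]
doc = collection.find_one({"doi": "10.1000/example"})
if doc is not None:
    for section in doc.get("sections", []):
        print(section.get("title"), "-", len(section.get("paragraphs", [])), "paragraphs")
```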
Depending on your access, follow one of the two paths below.

**Path 1: Pre-installed environment (internal users)**

- **Directory:** The scripts are located directly under `/home/jupyter/Pipeline`.
- **Environment:** Activate the existing environment:

  ```bash
  conda activate pipeline_env
  ```

- **Running the Pipeline:**

  ```bash
  cd /home/jupyter/Pipeline

  # 1. Search for papers (specify keywords and size)
  python search.py --keywords "plastic" "ozone" "machine learning" --size 300

  # 2. Download papers
  python download.py

  # 3. Parse downloaded papers
  python parse.py
  ```

- No need to clone the repository or install dependencies.
**Path 2: Install from the repository (external users)**

- **Clone the Repository:**

  ```bash
  git clone https://github.com/YourUsername/pipeline_new.git
  cd pipeline_new
  ```

- **Create the Conda Environment:**

  ```bash
  conda env create -f environment.yaml
  ```

- **Activate the Environment:**

  ```bash
  conda activate pipeline_env
  ```

- **Install Additional Pip Packages:**

  ```bash
  pip install dimcli==1.4
  ```

- **Running the Pipeline:**

  ```bash
  cd scripts

  # 1. Search for papers (specify keywords and size)
  python search.py --keywords "plastic" "ozone" "machine learning" --size 300

  # 2. Download papers
  python download.py

  # 3. Parse downloaded papers
  python parse.py
  ```
- `search.py` requires the following arguments:
  - `--keywords`: One or more search terms.
  - `--size`: Number of DOIs to retrieve per source.
- Example:

  ```bash
  python search.py --keywords "solid state battery" "energy storage" --size 300
  ```
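For reference, here is a minimal sketch of how this interface could be declared with `argparse`. This is an assumption about the script's internals; the actual definitions in `search.py` may differ:

```python
import argparse

# Hypothetical sketch of search.py's CLI; the real script may differ.
parser = argparse.ArgumentParser(description="Search external APIs for DOIs.")
parser.add_argument("--keywords", nargs="+", required=True,
                    help="One or more search terms.")
parser.add_argument("--size", type=int, default=300,
                    help="Number of DOIs to retrieve per source.")
args = parser.parse_args()
print(args.keywords, args.size)
```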
- Always ensure the MongoDB server is running and accessible before starting the pipeline (see the connectivity check below).
- All downloads happen under the `/data/scratch/` directory.
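To verify MongoDB connectivity before a run, a quick check with `pymongo`, assuming a default local instance on `localhost:27017` (adjust the URI to match your setup):

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Assumes a default local MongoDB; change the URI if yours differs.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000)
try:
    client.admin.command("ping")  # Cheap round-trip to confirm the server is up
    print("MongoDB is reachable.")
except ConnectionFailure:
    print("MongoDB is not reachable; start the server before running the pipeline.")
```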
📚 Full usage instructions, pipeline flow, module guides, and best practices are available in the Wiki.
Start here: Overview
- Vineeth Venugopal
- vinven7@gmail.com
This project was developed to support data-driven materials discovery and automated scientific knowledge extraction.
Special thanks to all contributors and maintainers for expanding and improving the pipeline's functionality.