pipeline_new is a modular system for automated literature search, download, and parsing, primarily focused on scientific papers in materials science and sustainability.
It supports asynchronous operation, publisher-specific access methods, structured parsing, and organized database storage.
The system is designed to handle large-scale ingestion and structuring of research papers for downstream analysis, machine learning, and knowledge graph building.
```
/scripts/       # Python modules: search.py, download.py, parse.py (for external users)
/wiki/          # GitHub Wiki contains the full documentation
/data/scratch/  # Downloaded full-text articles (local storage during runs)
/docs/          # (optional) Additional notes or local copies of documentation
```
| Script | Purpose |
|---|---|
| `search.py` | Searches external APIs (Crossref, Dimensions, Lens) based on user-provided keywords. Requires command-line arguments for keywords and size. |
| `download.py` | Downloads full-text articles based on DOIs, using publisher APIs or manual sources. |
| `parse.py` | Parses downloaded files into structured sections and paragraphs, then inserts them into MongoDB. |
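Once `parse.py` has populated MongoDB, the structured text can be queried directly. Below is a minimal read-back sketch; the database, collection, and field names here are hypothetical placeholders, so check `parse.py` for the names actually used:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Hypothetical database/collection/field names, for illustration only;
# the real schema is defined in parse.py.
collection = client["papers"]["parsed_articles"]
doc = collection.find_one({"doi": "10.1000/example"})
if doc is not None:
    for section in doc.get("sections", []):
        print(section.get("title"), "-", len(section.get("paragraphs", [])), "paragraphs")
```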
Depending on your access, follow one of the two paths below.

**Path 1: Pre-installed environment (internal users)**

- **Directory:** The scripts are located directly under `/home/jupyter/Pipeline`.
- **Environment:** Activate the existing environment:

  ```bash
  conda activate pipeline_env
  ```

- **Running the Pipeline:**

  ```bash
  cd /home/jupyter/Pipeline

  # 1. Search for papers (specify keywords and size)
  python search.py --keywords "plastic" "ozone" "machine learning" --size 300

  # 2. Download papers
  python download.py

  # 3. Parse downloaded papers
  python parse.py
  ```

- No need to clone the repository or install dependencies.
**Path 2: Install from the repository (external users)**

- **Clone the Repository:**

  ```bash
  git clone https://github.com/YourUsername/pipeline_new.git
  cd pipeline_new
  ```

- **Create the Conda Environment:**

  ```bash
  conda env create -f environment.yaml
  ```

- **Activate the Environment:**

  ```bash
  conda activate pipeline_env
  ```

- **Install Additional Pip Packages:**

  ```bash
  pip install dimcli==1.4
  ```

- **Running the Pipeline:**

  ```bash
  cd scripts

  # 1. Search for papers (specify keywords and size)
  python search.py --keywords "plastic" "ozone" "machine learning" --size 300

  # 2. Download papers
  python download.py

  # 3. Parse downloaded papers
  python parse.py
  ```
- `search.py` requires the following arguments:
  - `--keywords`: One or more search terms.
  - `--size`: Number of DOIs to retrieve per source.
- Example:

  ```bash
  python search.py --keywords "solid state battery" "energy storage" --size 300
  ```
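For reference, here is a minimal sketch of how this interface could be declared with `argparse`. This is an assumption about the script's internals; the actual definitions in `search.py` may differ:

```python
import argparse

# Hypothetical sketch of search.py's CLI; the real script may differ.
parser = argparse.ArgumentParser(description="Search external APIs for DOIs.")
parser.add_argument("--keywords", nargs="+", required=True,
                    help="One or more search terms.")
parser.add_argument("--size", type=int, default=300,
                    help="Number of DOIs to retrieve per source.")
args = parser.parse_args()
print(args.keywords, args.size)
```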
- Always ensure the MongoDB server is running and accessible before starting the pipeline (see the connectivity check below).
- All downloads happen under the `/data/scratch/` directory.
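To verify MongoDB connectivity before a run, a quick check with `pymongo`, assuming a default local instance on `localhost:27017` (adjust the URI to match your setup):

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Assumes a default local MongoDB; change the URI if yours differs.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000)
try:
    client.admin.command("ping")  # Cheap round-trip to confirm the server is up
    print("MongoDB is reachable.")
except ConnectionFailure:
    print("MongoDB is not reachable; start the server before running the pipeline.")
```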
📚 Full usage instructions, pipeline flow, module guides, and best practices are available in the Wiki.
Start here: Overview
- Vineeth Venugopal
- vinven7@gmail.com
This project was developed to support data-driven materials discovery and automated scientific knowledge extraction.
Special thanks to all contributors and maintainers for expanding and improving the pipeline's functionality.