IND-Enzymes

PubMed Data Extraction and Network Generation for Thermophilic and Psychrophilic Organisms

This Python script automates the process of extracting PubMed IDs related to specific species, parsing the resulting JSON files to identify whether the organisms have thermophilic or psychrophilic characteristics, and generating a network visualization of the data.

Prerequisites

Install the following Python libraries:

- Biopython
- xmltodict
- pandas
- networkx
- matplotlib

The json module is part of the Python standard library and does not need to be installed. Install the rest using pip:

    pip install biopython xmltodict pandas matplotlib networkx
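One quick way to confirm that the prerequisites are importable before running the script is sketched below. The function name missing_modules is hypothetical and not part of the repository; note that Biopython is imported under the name Bio.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the libraries listed above (Biopython installs as "Bio").
    required = ["Bio", "xmltodict", "pandas", "networkx", "matplotlib"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```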

Usage

Part 1: Extract PubMed IDs

- Email Setup: Set your email in the script before using Entrez services:

      Entrez.email = "your_email@example.com"

- Input File: The script reads species names from ex.csv. This file must contain a column named "species".
- Run the script: The script searches PubMed for each species name, extracts the search results, and writes them to trail1.txt.
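The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the helper names read_species, search_pubmed and run, the retmax value, the pause length, and the tab-separated layout written to trail1.txt are all assumptions. Only Entrez.email, ex.csv, the "species" column and trail1.txt come from the description above. The sketch reads the CSV with the standard csv module rather than pandas to keep it small.

```python
import csv
import io
import time


def read_species(csv_text):
    """Return the non-empty values of the "species" column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = []
    for row in reader:
        name = (row.get("species") or "").strip()
        if name:
            names.append(name)
    return names


def search_pubmed(species, email, retmax=20):
    """Run one PubMed search for a species name and return the parsed record."""
    # Imported here so read_species() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=species, retmax=retmax)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def run(csv_path="ex.csv", out_path="trail1.txt", email="your_email@example.com"):
    """Search PubMed for every species in csv_path and write one line per species."""
    with open(csv_path, newline="") as fh:
        species_list = read_species(fh.read())
    with open(out_path, "w") as out:
        for species in species_list:
            record = search_pubmed(species, email)
            # Assumed output layout: species name, a tab, then comma-separated PubMed IDs.
            out.write(f"{species}\t{','.join(record['IdList'])}\n")
            time.sleep(0.4)  # pause between requests to respect NCBI rate limits
```

Calling run() performs live requests against NCBI, so it needs network access and a valid email address.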

Part 2: Parse JSON Files

- Input JSON File: The script reads from a JSON file (trail.json or psychroCGCB.json). This file must follow the structure returned by PubMed's E-utilities.
- Parsing Process: The script extracts specific fields such as "IdList", "TranslationSet", and "TranslationStack" to identify and categorize organisms based on the search results.
- Output: Results are written to psychroCGCBresult.txt.
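As an illustration of the parsing step, the sketch below pulls the three named fields out of one search result. It assumes the JSON was produced by converting the E-utilities XML response with xmltodict, so that the fields sit under an "eSearchResult" key with child elements "Id", "Translation" and "TermSet"; the actual files may be shaped differently. The function names as_list, summarise and load_result are hypothetical, and the sketch does not attempt the thermophilic/psychrophilic categorization itself.

```python
import json


def as_list(value):
    """xmltodict yields a bare item when an element occurs once; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarise(result):
    """Extract PubMed IDs, query translations and search terms from one result mapping."""
    id_list = result.get("IdList") or {}
    translation_set = result.get("TranslationSet") or {}
    translation_stack = result.get("TranslationStack") or {}
    return {
        "ids": as_list(id_list.get("Id")),
        "translations": [
            (t.get("From"), t.get("To"))
            for t in as_list(translation_set.get("Translation"))
        ],
        "terms": [ts.get("Term") for ts in as_list(translation_stack.get("TermSet"))],
    }


def load_result(path):
    """Read a JSON file and summarise its (assumed) top-level eSearchResult object."""
    with open(path) as fh:
        data = json.load(fh)
    return summarise(data.get("eSearchResult", data))
```

For example, load_result("trail.json") would return a dictionary with "ids", "translations" and "terms" keys that could then be written to psychroCGCBresult.txt.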

Part 3: Network Generation

- Input Files: The script reads two CSV files:
  - nodes.csv: Contains information about nodes such as species names, groups (thermophilic or psychrophilic), and node sizes (species count).
  - edges.csv: Contains information about edges, including the source species, target category, and edge weight.
- Network Visualization: The script generates a network graph using NetworkX, where:
  - Nodes represent species.
  - Edges connect species to thermophilic or psychrophilic categories.
  - Node colors and sizes represent groups and species count.
- Customization:
  - Node colors are set using a predefined color_map.
  - The graph layout and appearance can be adjusted using various parameters in options.
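A minimal sketch of this kind of graph is shown below. The column names ("name", "group", "size" in nodes.csv and "source", "target", "weight" in edges.csv), the colors in color_map, the size scaling factor, and the function names are all assumptions made for illustration; the script's real column names and options may differ.

```python
import csv

# Hypothetical colour assignment; the script's own color_map may differ.
color_map = {"thermophilic": "tab:red", "psychrophilic": "tab:blue"}


def read_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def node_style(nodes, order):
    """Return colours and sizes aligned with `order` (the graph's node order)."""
    by_name = {n["name"]: n for n in nodes}
    colors, sizes = [], []
    for name in order:
        info = by_name.get(name, {})
        # Nodes without a known group (e.g. category nodes) fall back to gray.
        colors.append(color_map.get(info.get("group"), "lightgray"))
        sizes.append(float(info.get("size", 1)) * 100)
    return colors, sizes


def draw_network(nodes, edges):
    """Build the species-to-category graph and display it."""
    # Imported here so the CSV helpers work without the plotting stack installed.
    import matplotlib.pyplot as plt
    import networkx as nx

    graph = nx.Graph()
    graph.add_nodes_from(n["name"] for n in nodes)
    for e in edges:
        graph.add_edge(e["source"], e["target"], weight=float(e["weight"]))

    colors, sizes = node_style(nodes, list(graph.nodes))
    options = {"node_color": colors, "node_size": sizes,
               "with_labels": True, "font_size": 8}
    nx.draw(graph, pos=nx.spring_layout(graph, seed=42), **options)
    plt.show()


if __name__ == "__main__":
    draw_network(read_rows("nodes.csv"), read_rows("edges.csv"))
```

Fixing the layout seed keeps node positions the same between runs, which makes it easier to compare the effect of changes to color_map or options.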

Outputs

Text Files:

- trail1.txt: Contains PubMed search results for each species.
- psychroCGCBresult.txt: Contains parsed data from JSON files.

Network Visualization:

A network graph is displayed, showing the relationships between species and their thermophilic/psychrophilic nature.

Notes

- Entrez API Usage: Make sure your email is set correctly and that you comply with NCBI’s usage policy.
- Input Files: Ensure that input files (ex.csv, nodes.csv, and edges.csv) are formatted correctly to match the script’s expectations.
- Network Graph: You may adjust the color_map dictionary and other network parameters to fit your visualization needs.

Troubleshooting

- Missing Modules: Install any missing modules using pip install.
- File Not Found Errors: Verify that all input files are in the correct directory or provide full paths to the files.
- API Limits: NCBI imposes rate limits on API requests. Ensure compliance with these limits to avoid temporary bans.
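One simple way to stay within the API limits is to space requests out. The sketch below is a generic throttle, not something taken from the repository; the class name Throttle and the default interval are assumptions, with the default chosen to stay under NCBI's documented limit of three requests per second for clients without an API key.

```python
import time


class Throttle:
    """Block so that successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=0.34):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, or None before the first call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling wait() on a shared Throttle instance immediately before each Entrez request keeps the request rate bounded regardless of how long each individual request takes.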
