Mine-N-Match (MNM): An NCBI Taxonomy and Sequence Data Mining Tool

This repository contains a set of Python functions designed to mine and process sequence data from the National Center for Biotechnology Information (NCBI) databases. It leverages the Biopython library for interacting with NCBI's Entrez API and other libraries such as pygbif, pandas, and progressbar for additional functionalities.

Description

The core functionality of this package revolves around fetching sequence data (both nucleotide and protein) from NCBI based on user-defined queries and a list of species. It includes robust error handling and the ability to correct species names using the Global Biodiversity Information Facility (GBIF) backbone taxonomy, and can automatically record species synonyms.
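
The name-correction step can be pictured with the sketch below. It is not the package's actual code: it assumes a dictionary shaped like a GBIF backbone species-match response (the kind of result pygbif returns for a name lookup), with the fields `matchType`, `canonicalName`, and `species`. The helper name `resolve_gbif_match` and the sample dictionaries are hypothetical and written by hand for illustration.

```python
def resolve_gbif_match(queried_name, match):
    """Return (accepted_name, synonyms) for one GBIF backbone match.

    `match` is assumed to look like a GBIF species-match response.
    """
    # No usable species-level match: keep the name the user supplied.
    if not match or match.get("matchType") in (None, "NONE", "HIGHERRANK"):
        return queried_name, []

    # `species` holds the accepted species name; fall back to the matched name.
    accepted = match.get("species") or match.get("canonicalName") or queried_name

    # Record the queried spelling as a synonym when GBIF points to another name.
    synonyms = [queried_name] if accepted != queried_name else []
    return accepted, synonyms


# Hand-written inputs for illustration only (not real GBIF output).
fuzzy = {"matchType": "FUZZY", "status": "ACCEPTED",
         "canonicalName": "Homo sapiens", "species": "Homo sapiens"}
print(resolve_gbif_match("Homo sapien", fuzzy))
print(resolve_gbif_match("Unknownus speciesus", {"matchType": "NONE"}))
```

Keeping the original spelling alongside the accepted name means a later NCBI search can try both, which matters when NCBI and GBIF disagree on the current name.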

Key Features:

ncbi_fetch_species(): Fetches species names from NCBI for a given higher-order taxonomic group and returns a dictionary containing taxonomic information.
ncbi_mine_seq_data(): Mines NCBI for sequence data based on a list of species and a query, saving results to CSV and FASTA files.
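
To make the "CSV and FASTA" output concrete, here is a minimal sketch of writing one set of records to both formats using only the standard library. The function name `write_results`, the column names, and the placeholder record are assumptions for illustration; the files that ncbi_mine_seq_data() actually writes may be laid out differently.

```python
import csv
import tempfile
import textwrap
from pathlib import Path


def write_results(records, out_dir, label):
    """Write records to <label>.csv and <label>.fasta inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{label}.csv"
    fasta_path = out_dir / f"{label}.fasta"

    # Hypothetical column names; the package's real CSV layout may differ.
    fields = ["accession", "species", "description", "sequence"]
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

    with fasta_path.open("w") as handle:
        for rec in records:
            # FASTA header line, then the sequence wrapped at 60 characters.
            handle.write(f">{rec['accession']} {rec['species']}\n")
            handle.write(textwrap.fill(rec["sequence"], width=60) + "\n")
    return csv_path, fasta_path


# Placeholder record (not real NCBI data), written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    records = [{"accession": "PLACEHOLDER_001", "species": "Bos taurus",
                "description": "placeholder record", "sequence": "ACGT" * 40}]
    csv_path, fasta_path = write_results(records, tmp, "demo_job")
    print(fasta_path.read_text())
```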

Installation

Clone the repository:

 git clone https://github.com/VisualPhysiologyDB/mine_n_match

Install dependencies (run all of the following commands from inside the repository directory):

A. Create a Conda environment for Mine-N-Match (make sure you have Conda installed)
```
conda create --name mnm_env python=3.9.17
```
THEN
```
conda activate mnm_env
```
B. Use the 'requirements.txt' file to download base package dependencies for MNM
```
pip install -r requirements.txt
```

Usage

Example: Here is a minimal usage example for the ncbi_mine_seq_data function. It uses the species_list variable, which is a list of species names, and the query variable, which is a search string for Entrez. ***NOTE - You can also open the example Jupyter notebook, 'mine_n_match_tester.ipynb', after you download the git repo and run this example there (that's what I recommend).

    # Import functions for mining NCBI
    from mine_ncbi_functions import ncbi_fetch_species, ncbi_mine_seq_data
    # Import json so we can load any existing
    import json
    # Email for when we query NCBI
    email = "example_email@ucsb.edu"  # Replace with your email

    # Example usage: fetch up to 500 species names for the class Mammalia
    taxa = "Mammalia"
    rank = "class"
    limit = 500
    report_dir = 'taxonomy_data'
    out_file = "mammalia_taxonomy"

    species_data = ncbi_fetch_species(email, report_dir=report_dir, out=out_file, taxa=taxa, rank=rank, limit=limit, verbose=False)

    # ncbi_mine_seq_data expects a list of species names.
    # Assumption: the keys of species_data are species names; adjust this line
    # if the dictionary returned by ncbi_fetch_species is structured differently.
    species_list = list(species_data.keys())

    # Entrez search string: opsin-related titles, minus partial or unrelated hits
    query = "(opsin[Title] OR rhodopsin[Title] OR OPN[Title] OR rh1[Title] OR rh2[Title] OR Rh1[Title] OR Rh2[Title]) NOT partial[Title] NOT voucher[All Fields] NOT kinase[All Fields] NOT kinase-like[All Fields] NOT similar[Title] NOT homolog[Title] NOT opsin-like[Title]"

    # Mine a three-species slice of the list as a quick test
    ncbi_query_df, query_report_dir = ncbi_mine_seq_data(email=email, job_label='ncbi_mammalia_opsins', out='ncbi_mammalia_opsins', species_list=species_list[140:143], taxa_dictionary=species_data, query=query)
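
Long Entrez search strings like the one above are easy to mistype, so it can help to assemble them from lists. The sketch below rebuilds the same opsin query and then shows one plausible way to restrict it to a single species using Entrez's `[Organism]` field. The helper names `build_query` and `restrict_to_species` are hypothetical, and how ncbi_mine_seq_data() combines the query with each species internally is an assumption here, not something this README documents.

```python
def build_query(title_terms, exclusions):
    """Build an Entrez search string from title terms and (term, field) exclusions."""
    included = " OR ".join(f"{term}[Title]" for term in title_terms)
    excluded = " ".join(f"NOT {term}[{field}]" for term, field in exclusions)
    return f"({included}) {excluded}".strip()


def restrict_to_species(query, names):
    """Limit a query to one species, given its name and any recorded synonyms."""
    organisms = " OR ".join(f'"{name}"[Organism]' for name in names)
    return f"({organisms}) AND ({query})"


title_terms = ["opsin", "rhodopsin", "OPN", "rh1", "rh2", "Rh1", "Rh2"]
exclusions = [
    ("partial", "Title"),
    ("voucher", "All Fields"),
    ("kinase", "All Fields"),
    ("kinase-like", "All Fields"),
    ("similar", "Title"),
    ("homolog", "Title"),
    ("opsin-like", "Title"),
]

query = build_query(title_terms, exclusions)
print(query)

# Illustration only: a species name plus one synonym.
print(restrict_to_species("opsin[Title]", ["Canis lupus familiaris", "Canis familiaris"]))
```

The resulting `query` string can be passed to ncbi_mine_seq_data() through its `query` argument exactly as in the example above.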

Notes

The functions include extensive error handling, especially for network-related issues when querying NCBI. If an error occurs, intermediate progress is often saved to files, allowing you to resume from where you left off.
The ncbi_mine_seq_data() function automatically creates a directory to store the results, labeled with the job name and current date/time.
Using an NCBI API key is highly recommended: it raises the number of allowed requests per second (NCBI permits 3 per second without a key and 10 per second with one) and helps avoid issues with NCBI's rate limiting.
The ncbi_fetch_species function can take some time depending on the search parameters, especially for a large taxonomic group.
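
The save-and-resume behaviour described in these notes can be sketched as follows. This is an illustration of the general pattern (retry with exponential backoff, plus a checkpoint file written after every species), not the package's implementation. The function name `mine_with_checkpoint`, the JSON checkpoint format, and the `fetch` callable are all assumptions; in real use `fetch` would wrap an NCBI request.

```python
import json
import tempfile
import time
from pathlib import Path


def mine_with_checkpoint(species_list, fetch, checkpoint, retries=3, base_delay=1.0):
    """Call fetch(species) for each species, saving progress after every one."""
    checkpoint = Path(checkpoint)
    # Resume: load results saved by an earlier, interrupted run.
    results = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}

    for species in species_list:
        if species in results:
            continue  # already fetched in a previous run
        for attempt in range(retries):
            try:
                results[species] = fetch(species)
                break
            except (OSError, RuntimeError):
                # OSError covers urllib's URLError and HTTPError.
                if attempt == retries - 1:
                    raise  # out of retries; earlier species are already saved
                # Wait base_delay, then 2x, then 4x ... before trying again.
                time.sleep(base_delay * 2 ** attempt)
        # Persist progress so a later crash loses at most one species.
        checkpoint.write_text(json.dumps(results))
    return results


def fake_fetch(species):
    """Stand-in for a real NCBI request; returns a placeholder record."""
    return [f"{species} record"]


with tempfile.TemporaryDirectory() as tmp:
    done = mine_with_checkpoint(["Bos taurus", "Mus musculus"], fake_fetch,
                                Path(tmp) / "progress.json")
    print(sorted(done))
```

Calling the function again with the same checkpoint path skips every species already present in the file, which is what makes resuming after a network failure cheap.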

Contributing

Contributions to this project are welcome. Please feel free to submit pull requests or open issues on the GitHub repository.

Contact

For questions or feedback, contact the author:

Seth A. Frazer - ORCID ID

sethfrazer@ucsb.edu

License

All data and code are covered under a CC0-1.0 License, in accordance with Open Source Initiative (OSI) policies.
