A collection of functions aimed at interfacing R and blast+ suite. The tidyverse package should be loaded first as the functions use some tidyverse packages as dependencies
Essential libraries for package functions.
library(tidyverse)
library(networkD3)
library(htmlwidgets)
library(webshot)
library(uuid)
library(data.table)
library(ggplot2)
library(DT)
library(knitr)
library(rmarkdown)
library(foreach)
library(doParallel)
library(dplyr)
library(tidyr)
The following is a documentation of functions included in this package. In order to view examples their usage, please refer to the manual section, where a detailed example of the workflow is provided. Most functions have to be run in a specific order, as the output of one is the input of another.
The make_blast_db
function is a wrapper for the makeblastdb
function from BLAST+.
It will generate all the files required for a BLAST database.
The infile
argument should specify the path to a FASTA file containing all the sequences
to be included in the database. The outfile
argument should specify the names of all the
database files to be generated. All database files will carry the same name but will differ
in extension.
dbtype
Default is nucl. The database type (prot, nucl)taxid
The taxonomy information file. (txt file expected).
make_blast_db(infile = "PATH/TO/FILE.FASTA",outfile="my_out_file")
Runs BLAST+ on query against a local database specified by the user.
btype
argument specifies which blast type would be used, the default isblastn
.dbase
path or name of the blast database without including any extensions.qry
path or name of the query file. Must be FASTA format.taxid
default is FALSE. FALSE means there is no file for taxonomy id. TRUE means that there is a file for taxonomy id that was used in make_blast_db function.report
default is TRUE. Generates an interactive report.ncores
default is two. Number of threads.numt
argument specifies the number of threads to be used, only work with UNIX based OS, default value is 1.
Returns a dataframe with blast/query results. Will output this information in the form of an interactive table in the report.
blastinr(btype = "blastn", dbase = "PATH/TO/DATABASE/FILES",
qry = "PATH/TO/FASTA/FILE", taxid = FALSE, report = TRUE, ncores = 2, numt = 1)
A function to retrieve hit sequences from blast search results from within R.
query_ids
vector which holds the ID of queries for which we want to retrieve their corresponding hit sequences.blast_results
parameter is the dataframe output of the blastinr function.NumHitseqs
default is 1. The number of hits returned.outfile
specifies the name and path of the output file.cut_seq
default is TRUE. TRUE will cut the hit sequence from start to end of match. FALSE returns the entire sequence.MultFiles
default is FALSE. if TRUE, will output one file of hit sequences for each query. FALSE places all query hits in one file.report
default is TRUE. Generates an interactive report.pipeline
default parameter adds the function to the pipeline reporter. Set to FALSE by default and TRUE in Run_blast pipeline function.
Writes FASTA files to outfile path and returns hit sequences to the console.
retrieve_hit_seqs(query_ids, blast_results, blastdb = "PATH/TO/FASTA/FILE",
NumHitseqs = 1, outfile= "PATH", cut_seq = TRUE,
MultFiles = FALSE, report = TRUE, pipeline = FALSE)
Can plot taxonomy or GO annotations. Receives metadata to annotate blast outputs.
df1
The dataframe that has the added metadata.df2
dataframe outputted from blastinr function.id_col
The column name that contains the ID to merge dataframes with.summarize_cols
A vector that contains the names of the columns to summarize.report
default is TRUE. Generates an interactive report.
Returns an interactive Sankey plot.
summarize_bl(df1, df2, id_col, summarize_cols, report = TRUE)
One of the key features of this package is its ability to record the user's interactions with all of the functions within it. Furthermore, it will include the outputs that were generated by those functions.
Function will not require any inputs.
generate_report()
To delete and clear the report.
Warning: The report cannot be retrieved after deletion. To make a new report while keeping the old one, make a copy and change the name. Then, use this function to reset the record keeping.
delete_report()
A basic workflow of the program is demonstrated below. For more information about the functions used and their parameters, refer to the function documentations above.
First, a local database must be made. In this example, spike proteins of different Coronavirus variants are used.
# Obtain sequences from GitHub
genomes_seqs <- readDNAStringSet(filepath = "https://raw.githubusercontent.com/idohatam/Biol-3315-files/main/SARS_MERS_coronavirus.raw_sequence.fasta")
# Write sequences into a FASTA file
writeXStringSet(genomes_seqs, file = "spike_protein_seqs_SARS.fasta", format = "fasta")
After obtaining the FASTA files that you wish to use as the database, use the make_blast_db function to create the database.
# create a spike protein database from the FASTA file.
make_blast_db("spike_protein_seqs_SARS.fasta",'prot','taxid_map_internalDS.txt')
--- 2024-10-26 17:54:00.140211 ---
[1] "Blast database successfully created." "Outfile name: spike_protein_seqs_SARS"
After make_blast_db function, there should be multiple files created either in the current directory or another directory specified to the function.
The program is now ready to run a query against a database. The code below obtains a sample dataset to run against the local database.
# obtain the sequences from GitHub
genomes_seqs <- readDNAStringSet(
filepath = "https://raw.githubusercontent.com/idohatam/Biol-3315-files/main/SARS_MERS_coronavirus.raw_sequence.fasta")
# write sequences into a FASTA file.
writeXStringSet(genomes_seqs, file = "genomes_seqs_SARS.fasta", format = "fasta")
Note that the query sequences are nucleotides. This is important as it determines the type of blast that will be run against the database.
Create a dataframe containing Gene Ontology (GO) data for later annotation.
go_df2 <- data.frame(
ID = c(11, 22, 33, 44, 55, 66, 77, 88, 99, 14, 24, 34),
MolecularFunction = c("Nucleic acid binding", "Nucleic acid binding", "Catalytic activity", "Oxidoreductase activity", "DNA binding", "DNA binding", "RNA binding", "Signal transducer activity", "DNA binding", "Signal transducer activity", "ATP binding", "Protein kinase activity"),
BiologicalProcess = c("Transcription", "Biological process", "Metabolic process", "Biochemical synthesis", "Translation", "Transcription", "Phosphorylation", "Signal transduction", "Biochemical synthesis", "Signal transduction", "Transport", "Phosphorylation"),
CellularComponent = c("Nucleus", "Cytosol", "Cytoplasm", "Mitochondrion", "Ribosome", "Nucleus", "Nucleolus", "Cytoplasm", "Nucleus", "Plasma membrane", "Cytoplasm", "Plasma membrane")
)
To run all the functions of BLASTinR with one command. If this is run, you have essentially completed all the steps that are shown after this command chunk.
Run_Blast(infile = "spike_protein_seqs_SARS.fasta", dbtype = "prot", taxids_file = "taxid_map_internalDS.txt", btype = "blastx",
qry = "genomes_seqs_SARS.fasta",
taxid = TRUE, ncores = 3, query_ids = c("Bat_coronavirus"), retrievSeqs_outfile = "prot_hit_OneFile", df1 = go_df2, id_col = "ID", summarize_cols = c("MolecularFunction", "BiologicalProcess","CellularComponent"))
However, If you prefer to execute each step individually, you can skip this pipeline command and carry on with the rest of the commands listed below.
Since the database is a protein database and the query file has nucleotide sequences, we will use blastx.
blast_output <- blstinr('blastx','spike_protein_seqs_SARS','genomes_seqs_SARS.fasta', TRUE)
The output dataframe:
# A tibble: 123 × 14
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids
<chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
1 Bat_corona… Bat_c… 100 1258 0 0 21578 25351 12 1269 0 2545 22
2 Bat_corona… SARS_… 97.4 1262 29 1 21578 25351 12 1273 0 2481 55
3 Bat_corona… SARS-… 97.0 1262 34 1 21578 25351 12 1273 0 2467 44
4 Bat_corona… Pango… 89.7 1258 124 3 21578 25351 13 1265 0 2293 33
5 Bat_corona… SARS_… 77.7 1246 258 6 21620 25351 28 1255 0 1988 66
6 Bat_corona… SARS_… 77.6 1246 259 6 21620 25351 28 1255 0 1988 77
7 Bat_corona… Camel… 34.7 1023 587 25 22331 25219 311 1312 1.82e-159 527 11
8 Bat_corona… MERS_… 34.6 1022 589 25 22331 25219 311 1312 7.70e-159 525 99
9 Bat_corona… MERS_… 24.2 91 50 3 16822 17049 78 164 7.9 e+ 0 23.9 99
10 Bat_corona… MERS_… 34.5 1023 589 25 22331 25219 311 1312 1.91e-158 524 88
# ℹ 113 more rows
# ℹ 1 more variable: Range <int>
# ℹ Use `print(n = ...)` to see more rows
# Retrieve ID vector
qry_ids <- c("Bat_coronavirus")
retrieve_hit_seqs(qry_ids, blast_output, "spike_protein_seqs_SARS", 6, "prot_hit_OneFile", TRUE, FALSE, FALSE)
The first sequence inside the "prot_hit_OneFile.fasta" output file.
>Bat_coronavirus__queryID:Bat_coronavirus_sstart:12_send:1269_Orientation:+
SSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFSNVTWFHAIHVSGTNGIKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPPGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTDSIVRFPNITNLCPFGEVFNATTFASVYAWNRKRISNCVADYSVLYNSTSFSTFKCYGVSPTKLNDLCFTNVYADSFVITGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSKHIDAKEGGNFNYLYRLFRKANLKPFERDISTEIYQAGSKPCNGQTGLNCYYPLYRYGFYPTDGVGHQPYRVVVLSFELLNAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNASNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGSCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIIMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT
Create a sankey plot summarizing the GO components that the input data is involved in.
summarize_bl(go_df2, blast_output, id_col = "ID", summarize_cols = c("MolecularFunction", "BiologicalProcess","CellularComponent"), report = FALSE)
Finally, we generate a report in an html file, detailing the user's use of functions, their inputs, and outputs.
To generate an html file, run the code below. The file will appear in your current directory.
generate_report()
Running the code below will delete the html file, and reset the report history.
delete_report()