This tool classifies paprica-designated edges and unique sequences (16S and 18S) using the RDP classifier.
Details of the paprica pipeline can be found at https://github.com/bowmanjeffs/paprica, and the tutorial is at https://www.polarmicrobes.org/analysis-with-paprica/.
ROPE works with the latest version of paprica as of January 2021.
Most of the Python-based dependencies are the same as for paprica, so if you are already running paprica you do not need to install them separately. The most important dependency for ROPE is RDPTools, which you can get from https://github.com/rdpstaff/RDPTools. Building RDPTools requires Apache Ant 1.9 (a Java build tool), which you can download from https://ant.apache.org/bindownload.cgi
git clone https://github.com/avishekdutta14/ROPE.git
cd ROPE
chmod a+x *py
chmod a+x *sh
git clone https://github.com/rdpstaff/RDPTools.git
Building RDPTools depends on 'make' and Ant 1.9. Detailed installation guidelines are available at https://github.com/rdpstaff/RDPTools
In every shell script (files ending in .sh), set the path to RDPTools/classifier.jar after -jar in the RDP classification step.
Also add the ROPE directory to the PATH environment variable:
export PATH=$PATH:/path/to/ROPE
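The classifier call whose -jar path has to be edited in the shell scripts can be sketched as follows. This is only an illustration under stated assumptions, not the contents of the ROPE scripts: the function name `build_rdp_command`, the example file names, and the `-Xmx1g` memory setting are hypothetical, while `classify`, `-o`, and `-t` are options of the RDP Classifier command line.

```python
def build_rdp_command(classifier_jar, fasta_in, out_file, train_props=None):
    """Return the argument list for an RDP Classifier 'classify' run.

    classifier_jar: path to RDPTools/classifier.jar (the value placed after -jar)
    train_props:    optional path passed after -t (a custom training set)
    """
    # -Xmx1g is an assumed memory limit; adjust it to your machine.
    cmd = ["java", "-Xmx1g", "-jar", classifier_jar, "classify", "-o", out_file]
    if train_props is not None:
        cmd += ["-t", train_props]
    # The input FASTA file is the final positional argument.
    cmd.append(fasta_in)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list avoids shell quoting problems when a path contains spaces.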
To run ROPE on the edges, copy all the .unique_seqs.csv files (output of paprica-run.sh) into a new folder and run the following script:
ROPE.sh
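Staging the input files in a fresh folder can also be scripted. The sketch below is a hypothetical helper, not part of ROPE: the function name `stage_unique_seqs` and the folder names are assumptions, and only the `.unique_seqs.csv` suffix comes from the paprica output described above.

```python
import shutil
from pathlib import Path


def stage_unique_seqs(source_dir, work_dir, suffix=".unique_seqs.csv"):
    """Copy every file ending in `suffix` from source_dir into work_dir.

    Returns the sorted list of copied file names.
    """
    src = Path(source_dir)
    dest = Path(work_dir)
    # Create the working folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(f"*{suffix}")):
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    return copied
```

After staging, change into the working folder and run ROPE.sh there.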
To run ROPE on the 16S unique sequences, copy the 16S .unique_tally.csv file (output of paprica-combine_results.py) into a new folder and run the following script:
ROPE_unique.sh
As of now, comparison_phylum_RP.py works only with the 16S unique output of ROPE. It requires the following files:
- seq_edge_map.csv (output of paprica)
- taxon_map.csv (output of paprica)
- taxa_map_ROPE_unique.csv (output of ROPE)
- unique_ID_tally.csv (output of ROPE)
The output of this script is comparison_phylum_RP.csv, which contains the unique sequences, the phylum-level affiliations from ROPE and paprica, the ROPE confidence value at the phylum level, and a comparison column (match/mis-match).
comparison_phylum_RP.py
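The match/mis-match column described above can be illustrated with a small sketch. It is not the code of comparison_phylum_RP.py: the function names, the dictionary keys, and the choice of a case-insensitive exact comparison are all assumptions made for illustration.

```python
def compare_phylum(rope_phylum, paprica_phylum):
    """Return 'match' when both phylum labels agree, otherwise 'mis-match'.

    Assumption: labels are compared case-insensitively after trimming
    whitespace, and an empty label never counts as a match.
    """
    a = rope_phylum.strip().lower()
    b = paprica_phylum.strip().lower()
    return "match" if a and a == b else "mis-match"


def comparison_rows(records):
    """Build one output row per unique sequence.

    records: iterable of (unique_id, rope_phylum, rope_confidence, paprica_phylum)
    The column names below are hypothetical.
    """
    rows = []
    for unique_id, rope, conf, paprica in records:
        rows.append({
            "unique_id": unique_id,
            "ROPE_phylum": rope,
            "ROPE_confidence": conf,
            "paprica_phylum": paprica,
            "comparison": compare_phylum(rope, paprica),
        })
    return rows
```

The resulting rows can be written with `csv.DictWriter`. Keeping the ROPE confidence next to the comparison makes it easy to check whether disagreements are concentrated among low-confidence classifications.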
For 18S sequences, download the latest version of the 18S classifier (file name: 18Sv4.1_mydata_trained.zip for version 4.1) from https://github.com/terrimporter/18SClassifier/releases
wget https://github.com/terrimporter/18SClassifier/releases/download/v4.1/18Sv4.1_mydata_trained.zip
unzip 18Sv4.1_mydata_trained.zip
Be sure to declare or modify the path for 18Sv4.1_mydata_trained.zip after -t in the RDP classification step of the 18S shell script.
To run ROPE on the 18S unique sequences, copy the 18S .unique_tally.csv file (output of paprica-combine_results.py) into a new folder and run the following script:
ROPE_unique_18S.sh
To compare ROPE and paprica output at the unique-sequence level (works with the output of ROPE_unique.sh and ROPE_unique_18S.sh):
Required files for 16S (change the filenames in the script accordingly):
- seq_edge_map.csv (output of paprica)
- taxon_map.csv (output of paprica)
- taxa_map_ROPE_unique.csv (output of ROPE)
- unique_ID_tally.csv (output of ROPE)
The output of this script is ROPE_paprica_comaprison_unique_16S.csv
Required files for 18S (change the filenames in the script accordingly):
- seq_edge_map.csv (output of paprica)
- taxon_map.csv (output of paprica)
- taxa_map_ROPE_unique_18S.csv (output of ROPE)
- unique_ID_tally.csv (output of ROPE)
The output of this script is ROPE_paprica_comaprison_18S.csv
ROPE.sh finds the most abundant sequence affiliated with each edge and writes it to a .fasta file, which is then classified using the RDP classifier. The output file, taxa_map_ROPE.csv, contains the taxonomy of each edge based on its most abundant affiliated ASV. The numerical value reported at each taxonomic rank is the confidence of the classification at that rank. Details of the classification algorithm and the confidence calculation are given in the Classifier help.
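Selecting the most abundant sequence per edge and writing it as FASTA can be sketched as below. This is a minimal illustration, not the ROPE implementation: the function names are hypothetical, the input is assumed to be already reduced to (sequence, edge, abundance) tuples, and how ROPE aggregates abundances across sample files is not specified here.

```python
def most_abundant_per_edge(rows):
    """Return {edge: (sequence, abundance)} keeping the top sequence per edge.

    rows: iterable of (sequence, edge, abundance) tuples.
    Ties keep the sequence seen first.
    """
    best = {}
    for seq, edge, abundance in rows:
        if edge not in best or abundance > best[edge][1]:
            best[edge] = (seq, abundance)
    return best


def to_fasta(best):
    """Format the per-edge winners as FASTA text, one record per edge."""
    lines = []
    for edge in sorted(best):
        lines.append(f">{edge}")      # header: the edge number
        lines.append(best[edge][0])   # body: the representative sequence
    return "\n".join(lines) + "\n"
```

Using the edge number as the FASTA header means each line of classifier output can be linked straight back to its edge.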
ROPE_unique.sh extracts the sequences from the unique_tally.csv file and writes them to a .fasta file, which is then classified using the RDP classifier. The output file, taxa_map_rdp_unique.csv, contains the taxonomy of each unique sequence. The numerical value reported at each taxonomic rank is the confidence of the classification at that rank. The script generates a unique ID for each unique sequence; the map file linking each sequence to its unique ID is unique_ID_tally.csv.
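The unique ID assignment and the sequence-to-ID map can be pictured with the following sketch. The ID format (`unique_1`, `unique_2`, ...), the function names, and the two-column map layout are assumptions for illustration; the actual IDs and columns written by ROPE may differ.

```python
def assign_unique_ids(sequences, prefix="unique_"):
    """Map each distinct sequence to an ID, numbered in order of first appearance."""
    mapping = {}
    for seq in sequences:
        if seq not in mapping:
            mapping[seq] = f"{prefix}{len(mapping) + 1}"
    return mapping


def ids_to_fasta(mapping):
    """FASTA text with the unique ID as header, so the ID survives classification."""
    return "".join(f">{uid}\n{seq}\n" for seq, uid in mapping.items())


def ids_to_map_csv(mapping):
    """CSV text linking each unique ID to its sequence (hypothetical column names)."""
    lines = ["unique_id,sequence"]
    lines += [f"{uid},{seq}" for seq, uid in mapping.items()]
    return "\n".join(lines) + "\n"
```

Short IDs are used as FASTA headers because full sequences make unwieldy record names; the map file is what allows the taxonomy to be joined back to the sequences afterwards.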
ROPE_unique_18S.sh extracts the sequences from the unique_tally.csv file and writes them to a .fasta file, which is then classified using the RDP classifier. The output file, taxa_map_ROPE_unique_18S.csv, contains the taxonomy of each unique sequence. The numerical value reported at each taxonomic rank is the confidence of the classification at that rank. The script generates a unique ID for each unique sequence; the map file linking each sequence to its unique ID is unique_ID_tally.csv.
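The per-rank confidence values come from the RDP classifier output. The sketch below reads one line of that output, assuming the tab-separated layout in which the sequence ID and an orientation column are followed by repeating (taxon name, rank, confidence) triples. The function names are hypothetical, and the layout should be checked against the output of your RDPTools version.

```python
def parse_rdp_line(line):
    """Split one classifier output line into (seq_id, [(rank, name, confidence), ...])."""
    fields = line.rstrip("\n").split("\t")
    seq_id = fields[0]
    # fields[1] is assumed to be the orientation column; triples start at index 2.
    triples = fields[2:]
    assignments = []
    # Step through complete triples only; a trailing partial triple is ignored.
    for i in range(0, len(triples) - 2, 3):
        name, rank, conf = triples[i:i + 3]
        assignments.append((rank, name.strip('"'), float(conf)))
    return seq_id, assignments


def confidence_at(assignments, rank):
    """Return (name, confidence) for the requested rank, or None if it is absent."""
    for r, name, conf in assignments:
        if r == rank:
            return name, conf
    return None
```

For example, `confidence_at(assignments, "phylum")` gives the phylum name and its confidence, which are the values compared with paprica at the phylum level.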
For ROPE: Dutta, A., Goldman, T., Keating, J., Burke, E., Williamson, N., Dirmeier, R., & Bowman, J. S. (2022). Machine Learning Predicts Biogeochemistry from Microbial Community Structure in a Complex Model System. Microbiology Spectrum, 10(1), e01909-21.
For paprica (for all analyses): Bowman, Jeff S., and Hugh W. Ducklow. "Microbial Communities Can Be Described by Metabolic Structure: A General Framework and Application to a Seasonally Variable, Depth-Stratified Microbial Community from the Coastal West Antarctic Peninsula." PloS one 10.8 (2015): e0135868.
For RDP classifier (for all analyses): Wang et al. (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73: 5261.
For SILVA (for using 18S only): Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig WG, Peplies J, Glöckner FO (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucl. Acids Res. 35:7188-7196
For 18S database (for using 18S only):