  > -Terry Pratchett, A Hat Full of Sky

  # blobtools
- Application for the visualisation of (draft) genome assemblies and general assembly QC using TAGC (Taxon-annotated Gc-Coverage) plots [Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf).
+ Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-annotated Gc-Coverage) plots [Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf).

  ## Requirements
-
- ```
  - Python 2.7+
- - Matplotlib 1.5
- - Docopt
- - NCBI Taxonomy (names.dmp and nodes.dmp) ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
- - Virtualenv (recommended), see http://docs.python-guide.org/en/latest/dev/virtualenvs/
- ```
+ - Matplotlib 1.5
+ - Docopt
+ - NCBI Taxonomy (names.dmp and nodes.dmp), <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz>
+ - Virtualenv (recommended), see [Tutorial](http://docs.python-guide.org/en/latest/dev/virtualenvs/)

- ## Installation
- - Recommended
+ ## Installation
+ - Recommended
  ```
- # install virtualenv
+ # install virtualenv
  pip install virtualenv

  # clone blobtools into folder
  git clone https://github.com/DRL/blobtools.git

- # create virtual environment for blobtools
+ # create virtual environment for blobtools
  cd blobtools/
  virtualenv blob_env

@@ -35,7 +32,7 @@ source blob_env/bin/activate
  (blob_env) $ pip install matplotlib

  # install docopt
- (blob_env) $ pip install docopt
+ (blob_env) $ pip install docopt

  # run
  (blob_env) $ ./blobtools -h
@@ -55,129 +52,222 @@ $ pip install docopt
  $ ./blobtools -h
  ```

- ## Doc
- ### blobtools
+ ## Doc
+ ### blobtools
  - main executable
  ```
  usage: blobtools <command> [<args>...] [--help]

  commands:
-     create        create a BlobDB
-     view          print BlobDB
-     plot          plot BlobDB as a blobplot
+     create        create a BlobDB
+     view          print BlobDB as a table
+     blobplot      plot BlobDB as a blobplot
+
+     covplot       compare BlobDB cov(s) to additional cov file
+     bam2cov       generate cov file from bam file
+     sumcov        sum coverage from multiple COV files

      -h --help     show this
  ```

- ### blobtools create
+ ### blobtools create
  - create a BlobDB JSON file
  ```
  usage: blobtools create -i FASTA [-y FASTATYPE] [-o OUTFILE] [--title TITLE]
-                         [-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
-                         [--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]
-                         [-t TAX...] [-r TAXRULE...]
-                         [-h|--help]
-
+                         [-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
+                         [--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]
+                         [-t TAX...] [-x TAXRULE...]
+                         [-h|--help]
+
  Options:
      -h --help                   show this
-     -i, --infile FASTA          FASTA file of assembly. Headers are split at whitespaces.
-     -y, --type FASTATYPE        Assembly program used to create FASTA. If specified,
-                                 coverage will be parsed from FASTA header.
+     -i, --infile FASTA          FASTA file of assembly. Headers are split at whitespaces.
+     -y, --type FASTATYPE        Assembly program used to create FASTA. If specified,
+                                 coverage will be parsed from FASTA header.
                                  (Parsing supported for 'spades', 'soap', 'velvet', 'abyss')
-     -t, --taxfile TAX...        Taxonomy file in format (qseqid\ttaxid\tbitscore)
+     -t, --taxfile TAX...        Taxonomy file in format (qseqid\ttaxid\tbitscore)
                                  (e.g. BLAST output "-outfmt '6 qseqid staxids bitscore'")
      -x, --taxrule <TAXRULE>...  Taxrule determines how taxonomy of blobs is computed [default: bestsum]
                                  "bestsum" : sum bitscore across all hits for each taxonomic rank
-                                 "bestsumorder" : sum bitscore across all hits for each taxonomic rank.
-                                     - If first <TAX> file supplies hits, bestsum is calculated.
-                                     - If no hit is found, the next <TAX> file is used.
+                                 "bestsumorder" : sum bitscore across all hits for each taxonomic rank.
+                                     - If first <TAX> file supplies hits, bestsum is calculated.
+                                     - If no hit is found, the next <TAX> file is used.
      --nodes <NODES>             NCBI nodes.dmp file. Not required if '--db'
-     --names <NAMES>             NCBI names.dmp file. Not required if '--db'
-     --db <NODESDB>              NodesDB file [default: data/nodesDB.txt].
-     -b, --bam <BAM>...          BAM file (requires samtools in $PATH)
-     -s, --sam <SAM>...          SAM file
-     -a, --cas <CAS>...          CAS file (requires clc_mapping_info in $PATH)
+     --names <NAMES>             NCBI names.dmp file. Not required if '--db'
+     --db <NODESDB>              NodesDB file [default: data/nodesDB.txt].
+     -b, --bam <BAM>...          BAM file(s) (requires samtools in $PATH)
+     -s, --sam <SAM>...          SAM file(s)
+     -a, --cas <CAS>...          CAS file(s) (requires clc_mapping_info in $PATH)
      -c, --cov <COV>...          TAB separated. (seqID\tcoverage)
-     -o, --out <OUT>             BlobDB output prefix
-     --title TITLE               Title of BlobDB [default: FASTA)
+     -o, --out <OUT>             BlobDB output prefix
+     --title TITLE               Title of BlobDB [default: output prefix]
  ```
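
The `bestsum` and `bestsumorder` taxrules described in the options above can be illustrated with a short Python sketch. This is not the blobtools implementation: the function names, the `(taxon, bitscore)` input shape, and the alphabetical tie-break are assumptions made for illustration, and the sketch handles a single taxonomic rank, whereas the help text applies the rule at each rank.

```python
from collections import defaultdict


def bestsum(hits):
    """Pick the taxon with the highest summed bitscore.

    hits: iterable of (taxon, bitscore) pairs for one sequence at one rank
    (hypothetical input shape). Returns (taxon, total) or None if no hits.
    """
    totals = defaultdict(float)
    for taxon, bitscore in hits:
        totals[taxon] += bitscore
    if not totals:
        return None
    # Sorting first means ties go to the alphabetically first taxon.
    # This tie-break is an arbitrary choice for the sketch.
    return max(sorted(totals.items()), key=lambda item: item[1])


def bestsumorder(hits_per_file):
    """Apply bestsum to the first tax file that supplies any hits.

    hits_per_file: list of hit lists, in the order the tax files were given.
    """
    for hits in hits_per_file:
        if hits:
            return bestsum(hits)
    return None


hits = [("Nematoda", 200.0), ("Proteobacteria", 150.0), ("Proteobacteria", 120.0)]
print(bestsum(hits))
print(bestsumorder([[], [("Arthropoda", 50.0)]]))
```

Note that under `bestsumorder` a weak hit in the first file wins over a strong hit in a later file, because later files are only consulted when earlier ones have no hit at all.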

- ### blobtools view
+ ### blobtools view
  - generate table output from a blobDB file
  ```
- usage: blobtools view -i <BLOBDB> [-r <TAXRULE>] [--rank <TAXRANK>...] [--hits]
+ usage: blobtools view -i <BLOBDB> [-x <TAXRULE>] [--rank <TAXRANK>...] [--hits]
                        [--list <LIST>] [--out <OUT>]
-                       [--h|--help]
-
+                       [-h|--help]
+
  Options:
      -h --help                   show this
-     -i, --input <BLOBDB>        BlobDB file (created with "blobtools forge")
+     -i, --input <BLOBDB>        BlobDB file (created with "blobtools create")
      -o, --out <OUT>             Output file [default: STDOUT]
-     -l, --list <LIST>           List of sequence names (comma-separated or file).
+     -l, --list <LIST>           List of sequence names (comma-separated or file).
                                  If comma-separated, no whitespaces allowed.
      -x, --taxrule <TAXRULE>     Taxrule used for computing taxonomy (supported: "bestsum", "bestsumorder")
                                  [default: bestsum]
-     -r, --rank <TAXRANK>...     Taxonomic rank(s) at which output will be written.
-                                 (supported: 'species', 'genus', 'family', 'order',
+     -r, --rank <TAXRANK>...     Taxonomic rank(s) at which output will be written.
+                                 (supported: 'species', 'genus', 'family', 'order',
                                  'phylum', 'superkingdom', 'all') [default: phylum]
      -b, --hits                  Displays taxonomic hits from tax files
  ```
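
The `--list` option accepts either a comma-separated string or a file. A minimal Python sketch of that dual behaviour follows. The helper name `parse_seq_list` is hypothetical, and the one-name-per-line file layout is an assumption, since the help text does not specify the file format.

```python
import os


def parse_seq_list(value):
    """Return sequence names from a comma-separated string or a file path."""
    if os.path.isfile(value):
        # Assumption: the file holds one sequence name per line.
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    # The help text states that comma-separated lists must not contain whitespace.
    if any(ch.isspace() for ch in value):
        raise ValueError("comma-separated list must not contain whitespace")
    return [name for name in value.split(",") if name]


print(parse_seq_list("contig_1,contig_2"))
```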

- ### blobtools plot
+ ### blobtools blobplot
  - generate a blobplot from a blobDB file
  ```
- usage: blobtools plot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s]
-                       [-r RANK] [-x TAXRULE] [--label GROUPS...]
-                       [-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title]
-                       [--colours FILE] [--include FILE] [--exclude FILE]
-                       [--format FORMAT] [--noblobs] [--noreads] [--refcov FILE]
-                       [-h|--help]
+ usage: blobtools blobplot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s]
+                           [-r RANK] [-x TAXRULE] [--label GROUPS...]
+                           [-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title]
+                           [--colours FILE] [--include FILE] [--exclude FILE]
+                           [--format FORMAT] [--noblobs] [--noreads]
+                           [--refcov FILE] [--catcolour FILE]
+                           [-h|--help]

  Options:
      -h --help                   show this
-     -i, --infile BLOBDB         BlobDB file
-     -p, --plotgroups INT        Number of (taxonomic) groups to plot, remaining
+     -i, --infile BLOBDB         BlobDB file (created with "blobtools create")
+     -p, --plotgroups INT        Number of (taxonomic) groups to plot, remaining
                                  groups are placed in 'other' [default: 7]
      -l, --length INT            Minimum sequence length considered for plotting [default: 100]
      -c, --cindex                Colour blobs by 'c index' [default: False]
      -n, --nohit                 Hide sequences without taxonomic annotation [default: False]
      -s, --noscale               Do not scale sequences by length [default: False]
      -o, --out PREFIX            Output prefix
-     -m, --multiplot             Multi-plot. Print plot after addition of each (taxonomic) group
+     -m, --multiplot             Multi-plot. Print plot after addition of each (taxonomic) group
                                  [default: False]
      --sort <ORDER>              Sort order for plotting [default: span]
                                    span : plot with decreasing span
-                                   count : plot with decreasing count
-     --hist <HIST>               Data for histograms [default: span]
+                                   count : plot with decreasing count
+     --hist <HIST>               Data for histograms [default: span]
                                    span : span-weighted histograms
                                    count : count histograms
      --title                     Add title of BlobDB to plot [default: False]
      -r, --rank RANK             Taxonomic rank used for colouring of blobs [default: phylum]
-                                 (Supported: species, genus, family, order, phylum, superkingdom)
-     -x, --taxrule TAXRULE       Taxrule which has been used for computing taxonomy
+                                 (Supported: species, genus, family, order, phylum, superkingdom)
+     -x, --taxrule TAXRULE       Taxrule which has been used for computing taxonomy
                                  (Supported: bestsum, bestsumorder) [default: bestsum]
-     --label GROUPS...           Relabel (taxonomic) groups (not 'all' or 'other'),
+     --label GROUPS...           Relabel (taxonomic) groups (not 'all' or 'other'),
                                  e.g. "Bacteria=Actinobacteria,Proteobacteria"
      --colours COLOURFILE        File containing colours for (taxonomic) groups
      --exclude GROUPS...         Place these (taxonomic) groups in 'other',
                                  e.g. "Actinobacteria,Proteobacteria"
-     --format FORMAT             Figure format for plot (png, pdf, eps, jpeg,
+     --format FORMAT             Figure format for plot (png, pdf, eps, jpeg,
                                  ps, svg, svgz, tiff) [default: png]
      --noblobs                   Omit blobplot [default: False]
      --noreads                   Omit plot of reads mapping [default: False]
-     --refcov FILE               File containing number of "total" and "mapped" reads
-                                 per coverage file. (e.g.: bam0,900,100). If provided, info
-                                 will be used in read coverage plot(s).
+     --refcov FILE               File containing number of "total" and "mapped" reads
+                                 per coverage file. (e.g.: bam0,900,100). If provided, info
+                                 will be used in read coverage plot(s).
+     --catcolour FILE            Colour plot based on categories from FILE
+                                 (format : "seq category").
  ```
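
The `--refcov` description gives only one example line (`bam0,900,100`). The Python sketch below shows one plausible way to read such a file. It assumes the column order is coverage-file name, total reads, mapped reads, following the order in which the help text names them. That order and the helper name `parse_refcov` are assumptions, not confirmed blobtools behaviour.

```python
def parse_refcov(text):
    """Parse lines of the form 'name,total,mapped' into a dict.

    Assumed column order: coverage-file label, total reads, mapped reads.
    """
    refcov = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, total, mapped = line.split(",")
        refcov[name] = {"total": int(total), "mapped": int(mapped)}
    return refcov


print(parse_refcov("bam0,900,100\n"))
```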
  ## Additional features

  ### blobtools bam2cov
  - extract base-coverage for each contig from BAM file
  ```
- usage: blobtools bam2cov -i FASTA -b BAM [-h|--help]
-
+ usage: blobtools bam2cov -i FASTA -b BAM [-h|--help]
+
  Options:
      -h --help                   show this
-     -i, --infile FASTA          FASTA file of assembly. Headers are split at whitespaces.
+     -i, --infile FASTA          FASTA file of assembly. Headers are split at whitespaces.
      -b, --bam <BAM>             BAM file (requires samtools in $PATH)
  ```
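
The COV format used by `create -c` is given in the options as tab-separated `seqID\tcoverage`, and `sumcov` is described as summing coverage from multiple COV files. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, lines starting with `#` are skipped as comments (an assumption), and a sequence missing from one file is treated as contributing zero coverage (also an assumption).

```python
from collections import OrderedDict


def parse_cov(text):
    """Parse tab-separated 'seqID<TAB>coverage' lines into an ordered mapping."""
    cov = OrderedDict()
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no data.
        if not line or line.startswith("#"):
            continue
        seq_id, coverage = line.split("\t")[:2]
        cov[seq_id] = float(coverage)
    return cov


def sum_cov(*cov_tables):
    """Sum coverage per sequence across any number of parsed COV tables."""
    total = OrderedDict()
    for table in cov_tables:
        for seq_id, coverage in table.items():
            total[seq_id] = total.get(seq_id, 0.0) + coverage
    return total


lib_a = parse_cov("contig_1\t10.5\ncontig_2\t3\n")
lib_b = parse_cov("contig_1\t4.5\n")
print(dict(sum_cov(lib_a, lib_b)))
```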
+ ### blobtools covplot
+ - plots blobDB cov(s) vs additional cov file (only works at superkingdom level at the moment)
+ ```
+ usage: blobtools covplot -i BLOBDB -c COV [-p INT] [-l INT] [-n] [-s]
+                          [--xlabel XLABEL] [--ylabel YLABEL]
+                          [--log] [--xmax FLOAT] [--ymax FLOAT]
+                          [-r RANK] [-x TAXRULE] [-o PREFIX] [-m] [--title]
+                          [--sort ORDER] [--hist HIST] [--format FORMAT]
+                          [-h|--help]
+
+ Options:
+     -h --help                   show this
+     -i, --infile BLOBDB         BlobDB file
+     -c, --cov COV               COV file used for y-axis
+
+     --xlabel XLABEL             Label for x-axis [default: BlobDB_cov]
+     --ylabel YLABEL             Label for y-axis [default: CovFile_cov]
+     --log                       Plot log-scale axes
+     --xmax FLOAT                Maximum values for x-axis [default: 1e10]
+     --ymax FLOAT                Maximum values for y-axis [default: 1e10]
+
+     -p, --plotgroups INT        Number of (taxonomic) groups to plot, remaining
+                                 groups are placed in 'other' [default: 7]
+     -r, --rank RANK             Taxonomic rank used for colouring of blobs [default: phylum]
+     -x, --taxrule TAXRULE       Taxrule which has been used for computing taxonomy
+                                 (Supported: bestsum, bestsumorder) [default: bestsum]
+     --sort <ORDER>              Sort order for plotting [default: span]
+                                   span : plot with decreasing span
+                                   count : plot with decreasing count
+     --hist <HIST>               Data for histograms [default: span]
+                                   span : span-weighted histograms
+                                   count : count histograms
+
+     --title                     Add title of BlobDB to plot [default: False]
+     -l, --length INT            Minimum sequence length considered for plotting [default: 100]
+     -n, --nohit                 Hide sequences without taxonomic annotation [default: False]
+     -s, --noscale               Do not scale sequences by length [default: False]
+     -o, --out PREFIX            Output prefix
+     -m, --multiplot             Multi-plot. Print plot after addition of each (taxonomic) group
+                                 [default: False]
+     --format FORMAT             Figure format for plot (png, pdf, eps, jpeg,
+                                 ps, svg, svgz, tiff) [default: png]
+ ```
+ ## Tips & Tricks
+ - Recommended BLASTn search against NCBI nt
+ ```
+ blastn \
+ -task megablast \
+ -query assembly.fna \
+ -db nt \
+ -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
+ -culling_limit 5 \
+ -num_threads 62 \
+ -evalue 1e-25 \
+ -out assembly.vs.nt.cul5.1e25.megablast.out
+ ```
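
The `-outfmt` string in the BLASTn command above puts `qseqid`, `staxids` and `bitscore` in the first three columns, which matches the `qseqid\ttaxid\tbitscore` layout that `blobtools create -t` expects. The Python sketch below shows how those three columns could be read from such output. The function name is hypothetical, and keeping only the first taxid when BLAST reports several separated by `;` is an assumption of this sketch, not documented blobtools behaviour.

```python
def blast_to_taxfile_rows(blast_text):
    """Extract (qseqid, taxid, bitscore) from tabular BLAST output.

    Assumes the first three tab-separated columns are qseqid, staxids, bitscore.
    """
    rows = []
    for line in blast_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        qseqid, staxids, bitscore = fields[0], fields[1], fields[2]
        # staxids may hold several IDs separated by ';'.
        # Keeping only the first one is an assumption of this sketch.
        taxid = staxids.split(";")[0]
        rows.append((qseqid, taxid, float(bitscore)))
    return rows


example = "contig_1\t6239;6238\t250.0\tcontig_1\tsubject_1\t99.0\n"
print(blast_to_taxfile_rows(example))
```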
+ - Converting [Diamond](https://github.com/bbuchfink/diamond/) blastx output for use in 'blobtools create': [daa_to_tagc.pl](https://github.com/GDKO/CGP-scripts/blob/master/scripts/daa_to_tagc.pl)
+
+ - Filtering Reads (requires [samtools](http://www.htslib.org/))
+ ```
+ # 1) Generate index of contigs
+ samtools faidx ASSEMBLY.fna
+
+ # 2) Subset index using list of contigs of interest (list.txt)
+ grep -w -f list.txt ASSEMBLY.fna.fai > list.fai
+
+ # 3) Filter unmapped reads
+ samtools view -bS -f12 FILE.bam > FILE.u_u.bam
+ samtools bam2fq FILE.u_u.bam | gzip > FILE.u_u.ilv.fq.gz
+
+ # 4A) Filter pairs where both reads map to list of contigs
+ samtools view -t list.fai -bS -F12 FILE.bam > FILE.m_m.bam
+ samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
+
+ # 4B) Filter pairs where both reads map
+ samtools view -bS -F12 FILE.bam > FILE.m_m.bam
+ samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
+
+ # 5) Filter pairs where one read of a pair maps (use -t list.fai if necessary)
+ samtools view -bS -f8 -F4 FILE.bam > FILE.m_u.bam
+ samtools view -bS -f4 -F8 FILE.bam > FILE.u_m.bam
+ samtools merge -n FILE.one_mapped.bam FILE.m_u.bam FILE.u_m.bam
+ samtools sort -n -T FILE.temp -O bam FILE.one_mapped.bam > FILE.one_mapped.bam.sorted;
+ mv FILE.one_mapped.bam.sorted FILE.one_mapped.bam
+ samtools bam2fq FILE.one_mapped.bam | gzip > FILE.one_mapped.ilv.fq.gz
+ ```
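
The `-f` and `-F` values in the filtering steps above are SAM FLAG bit masks: bit 4 marks the read itself as unmapped and bit 8 marks its mate as unmapped. `samtools view -f INT` keeps records with all bits of INT set, and `-F INT` drops records with any bit of INT set. The Python sketch below reproduces that selection logic so the four output classes (`u_u`, `m_m`, `m_u`, `u_m`) are easy to check. The helper names are hypothetical.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate is unmapped


def passes(flag, require=0, exclude=0):
    """Mimic 'samtools view -f require -F exclude' for a single FLAG value."""
    return (flag & require) == require and (flag & exclude) == 0


def classify(flag):
    """Name the output file class a record would land in."""
    if passes(flag, require=READ_UNMAPPED | MATE_UNMAPPED):    # -f12
        return "u_u"
    if passes(flag, exclude=READ_UNMAPPED | MATE_UNMAPPED):    # -F12
        return "m_m"
    if passes(flag, require=MATE_UNMAPPED, exclude=READ_UNMAPPED):  # -f8 -F4
        return "m_u"
    return "u_m"                                               # -f4 -F8


for flag in (77, 99, 73, 133):
    print(flag, classify(flag))
```

Every FLAG value falls into exactly one of the four classes, which is why steps 3, 4 and 5 together partition the records of a paired-end BAM file.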