Commit 719d960

Update README.md
1 parent 730ef6a commit 719d960

README.md

Lines changed: 158 additions & 68 deletions
@@ -3,28 +3,25 @@
33
> -Terry Pratchett, A Hat Full of Sky
44
55
# blobtools
6-
Application for the visualisation of (draft) genome assemblies and general assembly QC using TAGC (Taxon-annotated Gc-Coverage) plots [Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf).
6+
Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-Annotated GC-Coverage) plots ([Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf)).
77

88
## Requirements
9-
10-
```
119
- Python 2.7+
12-
- Matplotlib 1.5
13-
- Docopt
14-
- NCBI Taxonomy (names.dmp and nodes.dmp) ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
15-
- Virtualenv (recommended), see http://docs.python-guide.org/en/latest/dev/virtualenvs/
16-
```
10+
- Matplotlib 1.5
11+
- Docopt
12+
- NCBI Taxonomy (names.dmp and nodes.dmp), <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz>
13+
- Virtualenv (recommended), see [Tutorial](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
1714

18-
## Installation
19-
- Recommended
15+
## Installation
16+
- Recommended
2017
```
21-
# install virtualenv
18+
# install virtualenv
2219
pip install virtualenv
2320
2421
# clone blobtools into folder
2522
git clone https://github.com/DRL/blobtools.git
2623
27-
# create virtual environment for blobtools
24+
# create virtual environment for blobtools
2825
cd blobtools/
2926
virtualenv blob_env
3027
@@ -35,7 +32,7 @@ source blob_env/bin/activate
3532
(blob_env) $ pip install matplotlib
3633
3734
# install docopt
38-
(blob_env) $ pip install docopt
35+
(blob_env) $ pip install docopt
3936
4037
# run
4138
(blob_env) $ ./blobtools -h
@@ -55,129 +52,222 @@ $ pip install docopt
5552
$ ./blobtools -h
5653
```
5754

58-
## Doc
59-
### blobtools
55+
## Doc
56+
### blobtools
6057
- main executable
6158
```
6259
usage: blobtools <command> [<args>...] [--help]
6360
6461
commands:
65-
create create a BlobDB
66-
view print BlobDB
67-
plot plot BlobDB as a blobplot
62+
create create a BlobDB
63+
view print BlobDB as a table
64+
blobplot plot BlobDB as a blobplot
65+
66+
covplot compare BlobDB cov(s) to additional cov file
67+
bam2cov generate cov file from bam file
68+
sumcov sum coverage from multiple COV files
6869
6970
-h --help show this
7071
```
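As a rough illustration of what summing coverage across COV files involves, here is a minimal Python sketch. It is not the `sumcov` implementation; it assumes the two-column `seqID<TAB>coverage` layout described for `--cov`, and the function name `sum_cov` and the skipping of `#` comment lines are assumptions made for this example.

```python
from collections import defaultdict


def sum_cov(cov_files):
    """Sum per-sequence coverage over several COV files.

    Each element of cov_files is an iterable of text lines
    (an open file handle or a list of strings).
    """
    total = defaultdict(float)
    for lines in cov_files:
        for line in lines:
            line = line.strip()
            # Assumption: blank lines and '#' comment lines carry no data.
            if not line or line.startswith("#"):
                continue
            seq_id, cov = line.split("\t")[:2]
            total[seq_id] += float(cov)
    return dict(total)


if __name__ == "__main__":
    lib_a = ["contig_1\t10.5", "contig_2\t3"]
    lib_b = ["contig_1\t4.5"]
    print(sum_cov([lib_a, lib_b]))
```

Sequences missing from one file simply contribute nothing for that file, so the result covers the union of all sequence IDs.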
7172

72-
### blobtools create
73+
### blobtools create
7374
- create a BlobDB JSON file
7475
```
7576
usage: blobtools create -i FASTA [-y FASTATYPE] [-o OUTFILE] [--title TITLE]
76-
[-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
77-
[--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]
78-
[-t TAX...] [-r TAXRULE...]
79-
[-h|--help]
80-
77+
[-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
78+
[--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]
79+
[-t TAX...] [-x TAXRULE...]
80+
[-h|--help]
81+
8182
Options:
8283
-h --help show this
83-
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
84-
-y, --type FASTATYPE Assembly program used to create FASTA. If specified,
85-
coverage will be parsed from FASTA header.
84+
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
85+
-y, --type FASTATYPE Assembly program used to create FASTA. If specified,
86+
coverage will be parsed from FASTA header.
8687
(Parsing supported for 'spades', 'soap', 'velvet', 'abyss')
87-
-t, --taxfile TAX... Taxonomy file in format (qseqid\ttaxid\tbitscore)
88+
-t, --taxfile TAX... Taxonomy file in format (qseqid\ttaxid\tbitscore)
8889
(e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")
8990
-x, --taxrule <TAXRULE>... Taxrule determines how taxonomy of blobs is computed [default: bestsum]
9091
"bestsum" : sum bitscore across all hits for each taxonomic rank
91-
"bestsumorder" : sum bitscore across all hits for each taxonomic rank.
92-
- If first <TAX> file supplies hits, bestsum is calculated.
93-
- If no hit is found, the next <TAX> file is used.
92+
"bestsumorder" : sum bitscore across all hits for each taxonomic rank.
93+
- If first <TAX> file supplies hits, bestsum is calculated.
94+
- If no hit is found, the next <TAX> file is used.
9495
--nodes <NODES> NCBI nodes.dmp file. Not required if '--db'
95-
--names <NAMES> NCBI names.dmp file. Not required if '--db'
96-
--db <NODESDB> NodesDB file [default: data/nodesDB.txt].
97-
-b, --bam <BAM>... BAM file (requires samtools in $PATH)
98-
-s, --sam <SAM>... SAM file
99-
-a, --cas <CAS>... CAS file (requires clc_mapping_info in $PATH)
96+
--names <NAMES> NCBI names.dmp file. Not required if '--db'
97+
--db <NODESDB> NodesDB file [default: data/nodesDB.txt].
98+
-b, --bam <BAM>... BAM file(s) (requires samtools in $PATH)
99+
-s, --sam <SAM>... SAM file(s)
100+
-a, --cas <CAS>... CAS file(s) (requires clc_mapping_info in $PATH)
100101
-c, --cov <COV>... TAB separated. (seqID\tcoverage)
101-
-o, --out <OUT> BlobDB output prefix
102-
--title TITLE Title of BlobDB [default: FASTA)
102+
-o, --out <OUT> BlobDB output prefix
103+
--title TITLE Title of BlobDB [default: output prefix]
103104
```
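The difference between the two taxrules described above can be sketched in Python. This is a simplified illustration, not the blobtools code: it sums bitscores per taxid only and does not walk the NCBI taxonomy to each rank, it assumes "bestsum" pools hits from all tax files, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_hits(lines):
    """Parse 'qseqid<TAB>taxid<TAB>bitscore' lines into
    {qseqid: {taxid: summed bitscore}}."""
    hits = defaultdict(lambda: defaultdict(float))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, taxid, bitscore = line.split("\t")[:3]
        hits[qseqid][taxid] += float(bitscore)
    return hits


def bestsum(hits_per_file, seq_id):
    """Pool hits from all tax files; return the taxid with the
    highest summed bitscore, or None if there is no hit."""
    totals = defaultdict(float)
    for hits in hits_per_file:
        for taxid, score in hits.get(seq_id, {}).items():
            totals[taxid] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


def bestsumorder(hits_per_file, seq_id):
    """Use only the first tax file (in the order given) that has
    hits for seq_id, then apply bestsum to that file alone."""
    for hits in hits_per_file:
        if seq_id in hits:
            return bestsum([hits], seq_id)
    return None


if __name__ == "__main__":
    tax_a = parse_hits(["contig_1\t6239\t200", "contig_1\t9606\t150"])
    tax_b = parse_hits(["contig_1\t9606\t100", "contig_2\t562\t300"])
    print(bestsum([tax_a, tax_b], "contig_1"))
    print(bestsumorder([tax_a, tax_b], "contig_1"))
```

With these toy hits, pooling both files favours taxid 9606 for `contig_1` (250 vs 200), while the ordered rule stops at the first file and picks 6239. The order of the `-t` files therefore matters only for "bestsumorder".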
104105

105-
### blobtools view
106+
### blobtools view
106107
- generate table output from a blobDB file
107108
```
108-
usage: blobtools view -i <BLOBDB> [-r <TAXRULE>] [--rank <TAXRANK>...] [--hits]
109+
usage: blobtools view -i <BLOBDB> [-x <TAXRULE>] [--rank <TAXRANK>...] [--hits]
109110
[--list <LIST>] [--out <OUT>]
110-
[--h|--help]
111-
111+
[--h|--help]
112+
112113
Options:
113114
--h --help show this
114-
-i, --input <BLOBDB> BlobDB file (created with "blobtools forge")
115+
-i, --input <BLOBDB> BlobDB file (created with "blobtools create")
115116
-o, --out <OUT> Output file [default: STDOUT]
116-
-l, --list <LIST> List of sequence names (comma-separated or file).
117+
-l, --list <LIST> List of sequence names (comma-separated or file).
117118
If comma-separated, no whitespaces allowed.
118119
-x, --taxrule <TAXRULE> Taxrule used for computing taxonomy (supported: "bestsum", "bestsumorder")
119120
[default: bestsum]
120-
-r, --rank <TAXRANK>... Taxonomic rank(s) at which output will be written.
121-
(supported: 'species', 'genus', 'family', 'order',
121+
-r, --rank <TAXRANK>... Taxonomic rank(s) at which output will be written.
122+
(supported: 'species', 'genus', 'family', 'order',
122123
'phylum', 'superkingdom', 'all') [default: phylum]
123124
-b, --hits Displays taxonomic hits from tax files
124125
```
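The `--list` option accepts either a comma-separated string or a file. One plausible way to handle both forms is sketched below in Python; the helper name `parse_list` and the rule "treat the value as a file if a file of that name exists" are assumptions for illustration, not taken from the blobtools source.

```python
import os


def parse_list(value):
    """Return sequence names from a file (one per line) or from a
    comma-separated string without whitespace."""
    if os.path.isfile(value):
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return value.split(",")


if __name__ == "__main__":
    print(parse_list("contig_1,contig_2,contig_3"))
```

Because the comma-separated form is split verbatim, any whitespace would end up inside the sequence names, which is why the option text forbids it.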
125126

126-
### blobtools plot
127+
### blobtools blobplot
127128
- generate a blobplot from a blobDB file
128129
```
129-
usage: blobtools plot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s]
130-
[-r RANK] [-x TAXRULE] [--label GROUPS...]
131-
[-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title]
132-
[--colours FILE] [--include FILE] [--exclude FILE]
133-
[--format FORMAT] [--noblobs] [--noreads] [--refcov FILE]
134-
[-h|--help]
130+
usage: blobtools blobplot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s]
131+
[-r RANK] [-x TAXRULE] [--label GROUPS...]
132+
[-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title]
133+
[--colours FILE] [--include FILE] [--exclude FILE]
134+
[--format FORMAT] [--noblobs] [--noreads]
135+
[--refcov FILE] [--catcolour FILE]
136+
[-h|--help]
135137
136138
Options:
137139
-h --help show this
138-
-i, --infile BLOBDB BlobDB file
139-
-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining
140+
-i, --infile BLOBDB BlobDB file (created with "blobtools create")
141+
-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining
140142
groups are placed in 'other' [default: 7]
141143
-l, --length INT Minimum sequence length considered for plotting [default: 100]
142144
-c, --cindex Colour blobs by 'c index' [default: False]
143145
-n, --nohit Hide sequences without taxonomic annotation [default: False]
144146
-s, --noscale Do not scale sequences by length [default: False]
145147
-o, --out PREFIX Output prefix
146-
-m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group
148+
-m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group
147149
[default: False]
148150
--sort <ORDER> Sort order for plotting [default: span]
149151
span : plot with decreasing span
150-
count : plot with decreasing count
151-
--hist <HIST> Data for histograms [default: span]
152+
count : plot with decreasing count
153+
--hist <HIST> Data for histograms [default: span]
152154
span : span-weighted histograms
153155
count : count histograms
154156
--title Add title of BlobDB to plot [default: False]
155157
-r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum]
156-
(Supported: species, genus, family, order, phylum, superkingdom)
157-
-x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy
158+
(Supported: species, genus, family, order, phylum, superkingdom)
159+
-x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy
158160
(Supported: bestsum, bestsumorder) [default: bestsum]
159-
--label GROUPS... Relabel (taxonomic) groups (not 'all' or 'other'),
161+
--label GROUPS... Relabel (taxonomic) groups (not 'all' or 'other'),
160162
e.g. "Bacteria=Actinobacteria,Proteobacteria"
161163
--colours COLOURFILE File containing colours for (taxonomic) groups
162164
--exclude GROUPS.. Place these (taxonomic) groups in 'other',
163165
e.g. "Actinobacteria,Proteobacteria"
164-
--format FORMAT Figure format for plot (png, pdf, eps, jpeg,
166+
--format FORMAT Figure format for plot (png, pdf, eps, jpeg,
165167
ps, svg, svgz, tiff) [default: png]
166168
--noblobs Omit blobplot [default: False]
167169
--noreads Omit plot of reads mapping [default: False]
168-
--refcov FILE File containing number of "total" and "mapped" reads
169-
per coverage file. (e.g.: bam0,900,100). If provided, info
170-
will be used in read coverage plot(s).
170+
--refcov FILE File containing number of "total" and "mapped" reads
171+
per coverage file. (e.g.: bam0,900,100). If provided, info
172+
will be used in read coverage plot(s).
173+
--catcolour FILE Colour plot based on categories from FILE
174+
(format : "seq category").
171175
```
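To make the `--label` syntax concrete, the Python sketch below turns arguments such as "Bacteria=Actinobacteria,Proteobacteria" into a lookup from original group to new label. The function name `parse_labels` is hypothetical and the parsing is an assumption based only on the example in the option text.

```python
def parse_labels(label_args):
    """Map each original group name to the label it should be shown under.

    label_args: list of strings of the form 'NewLabel=GroupA,GroupB'.
    """
    relabel = {}
    for arg in label_args:
        new_label, _, members = arg.partition("=")
        for group in members.split(","):
            group = group.strip()
            if group:
                relabel[group] = new_label.strip()
    return relabel


if __name__ == "__main__":
    mapping = parse_labels(["Bacteria=Actinobacteria,Proteobacteria"])
    # Groups without an entry keep their own name.
    for group in ["Actinobacteria", "Nematoda"]:
        print(group, "->", mapping.get(group, group))
```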
172176
## Additional features
173177

174178
### blobtools bam2cov
175179
- extract base-coverage for each contig from BAM file
176180
```
177-
usage: blobtools bam2cov -i FASTA -b BAM [-h|--help]
178-
181+
usage: blobtools bam2cov -i FASTA -b BAM [-h|--help]
182+
179183
Options:
180184
-h --help show this
181-
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
185+
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
182186
-b, --bam <BAM> BAM file (requires samtools in $PATH)
183187
```
188+
### blobtools covplot
189+
- plots blobDB cov(s) vs additional cov file (only works at superkingdom level at the moment)
190+
```
191+
usage: blobtools covplot -i BLOBDB -c COV [-p INT] [-l INT] [-n] [-s]
192+
[--xlabel XLABEL] [--ylabel YLABEL]
193+
[--log] [--xmax FLOAT] [--ymax FLOAT]
194+
[-r RANK] [-x TAXRULE] [-o PREFIX] [-m] [--title]
195+
[--sort ORDER] [--hist HIST] [--format FORMAT]
196+
[-h|--help]
197+
198+
Options:
199+
-h --help show this
200+
-i, --infile BLOBDB BlobDB file
201+
-c, --cov COV COV file used for y-axis
202+
203+
--xlabel XLABEL Label for x-axis [default: BlobDB_cov]
204+
--ylabel YLABEL Label for y-axis [default: CovFile_cov]
205+
--log Plot log-scale axes
206+
--xmax FLOAT Maximum values for x-axis [default: 1e10]
207+
--ymax FLOAT Maximum values for y-axis [default: 1e10]
208+
209+
-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining
210+
groups are placed in 'other' [default: 7]
211+
-r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum]
212+
-x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy
213+
(Supported: bestsum, bestsumorder) [default: bestsum]
214+
--sort <ORDER> Sort order for plotting [default: span]
215+
span : plot with decreasing span
216+
count : plot with decreasing count
217+
--hist <HIST> Data for histograms [default: span]
218+
span : span-weighted histograms
219+
count : count histograms
220+
221+
--title Add title of BlobDB to plot [default: False]
222+
-l, --length INT Minimum sequence length considered for plotting [default: 100]
223+
-n, --nohit Hide sequences without taxonomic annotation [default: False]
224+
-s, --noscale Do not scale sequences by length [default: False]
225+
-o, --out PREFIX Output prefix
226+
-m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group
227+
[default: False]
228+
--format FORMAT Figure format for plot (png, pdf, eps, jpeg,
229+
ps, svg, svgz, tiff) [default: png]
230+
```
231+
## Tips & Tricks
232+
- Recommended BLASTn search against NCBI nt
233+
```
234+
blastn \
235+
-task megablast \
236+
-query assembly.fna \
237+
-db nt \
238+
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
239+
-culling_limit 5 \
240+
-num_threads 62 \
241+
-evalue 1e-25 \
242+
-out assembly.vs.nt.cul5.1e25.megablast.out
243+
```
244+
- Converting [Diamond](https://github.com/bbuchfink/diamond/) blastx output for use in 'blobtools create' : [daa_to_tagc.pl](https://github.com/GDKO/CGP-scripts/blob/master/scripts/daa_to_tagc.pl)
245+
246+
- Filtering Reads (requires [samtools](http://www.htslib.org/))
247+
```
248+
# 1) Generate index of contigs
249+
samtools faidx ASSEMBLY.fna
250+
251+
# 2) Subset index using list of contigs of interest (list.txt)
252+
grep -w -f list.txt ASSEMBLY.fna.fai > list.fai
253+
254+
# 3) Filter pairs where both reads are unmapped
255+
samtools view -bS -f12 FILE.bam > FILE.u_u.bam
256+
samtools bam2fq FILE.u_u.bam | gzip > FILE.u_u.ilv.fq.gz
257+
258+
# 4A) Filter pairs where both reads map to list of contigs
259+
samtools view -t list.fai -bS -F12 FILE.bam > FILE.m_m.bam
260+
samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
261+
262+
# 4B) Filter pairs where both reads map
263+
samtools view -bS -F12 FILE.bam > FILE.m_m.bam
264+
samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
265+
266+
# 5) Filter pairs where one read of a pair maps (use -t list.fai if necessary)
267+
samtools view -bS -f8 -F4 FILE.bam > FILE.m_u.bam
268+
samtools view -bS -f4 -F8 FILE.bam > FILE.u_m.bam
269+
samtools merge -n FILE.one_mapped.bam FILE.m_u.bam FILE.u_m.bam
270+
samtools sort -n -T FILE.temp -O bam FILE.one_mapped.bam > FILE.one_mapped.bam.sorted
271+
mv FILE.one_mapped.bam.sorted FILE.one_mapped.bam
272+
samtools bam2fq FILE.one_mapped.bam | gzip > FILE.one_mapped.ilv.fq.gz
273+
```
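The `-f` and `-F` values used above select on two SAM flag bits: 4 (read unmapped) and 8 (mate unmapped). `samtools view -f INT` keeps reads with all of those bits set, and `-F INT` drops reads with any of them set. The Python sketch below mirrors that logic so the four categories can be checked by hand; the function names and category labels (taken from the output file names above) are chosen for illustration.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate is unmapped


def passes(flag, require=0, exclude=0):
    """Mimic 'samtools view -f require -F exclude' for one flag value."""
    return (flag & require) == require and (flag & exclude) == 0


def classify(flag):
    """Assign a paired read to one of the categories used above."""
    both = READ_UNMAPPED | MATE_UNMAPPED  # 12
    if passes(flag, require=both):
        return "u_u"  # -f12: read and mate unmapped
    if passes(flag, exclude=both):
        return "m_m"  # -F12: read and mate mapped
    if passes(flag, require=MATE_UNMAPPED, exclude=READ_UNMAPPED):
        return "m_u"  # -f8 -F4: read mapped, mate unmapped
    return "u_m"      # -f4 -F8: read unmapped, mate mapped


if __name__ == "__main__":
    for flag in (77, 99, 73, 133):
        print(flag, classify(flag))
```

Every paired read falls into exactly one of the four categories, so the files produced in steps 3, 4B and 5 together partition the reads in `FILE.bam`.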
