|
1 |
| -> Once we were blobs in the sea... |
2 |
| -
|
3 |
| -> -Terry Pratchett, A Hat Full of Sky |
4 |
| -
|
5 | 1 | # blobtools
|
6 |
| -Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-annotated Gc-Coverage) plots [Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf). |
7 |
| - |
8 |
| -## Requirements |
9 |
| -- Python 2.7+ |
10 |
| -- Matplotlib 1.5 |
11 |
| -- Docopt |
12 |
| -- NCBI Taxonomy (names.dmp and nodes.dmp), <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz> |
13 |
| -- Virtualenv (recommended), see [Tutorial](http://docs.python-guide.org/en/latest/dev/virtualenvs/) |
14 |
| - |
15 |
| -## Installation |
16 |
| -- Recommended |
17 |
| -``` |
18 |
| -# install virtualenv |
19 |
| -pip install virtualenv |
20 |
| -
|
21 |
| -# clone blobtools into folder |
22 |
| -git clone https://github.com/DRL/blobtools.git |
23 |
| -
|
24 |
| -# create virtual environment for blobtools |
25 |
| -cd blobtools/ |
26 |
| -virtualenv blob_env |
27 |
| -
|
28 |
| -# activate virtual environment |
29 |
| -source blob_env/bin/activate |
30 |
| -
|
31 |
| -# install matplotlib |
32 |
| -(blob_env) $ pip install matplotlib |
33 |
| -
|
34 |
| -# install docopt |
35 |
| -(blob_env) $ pip install docopt |
36 |
| -
|
37 |
| -# run |
38 |
| -(blob_env) $ ./blobtools -h |
39 |
| -``` |
40 |
| -- Basic |
41 |
| -``` |
42 |
| -# clone blobtools into folder |
43 |
| -$ git clone https://github.com/DRL/blobtools.git |
44 |
| -
|
45 |
| -# install matplotlib |
46 |
| -$ pip install matplotlib |
47 |
| -
|
48 |
| -# install docopt |
49 |
| -$ pip install docopt |
50 |
| -
|
51 |
| -# run |
52 |
| -$ ./blobtools -h |
53 |
| -``` |
54 |
| - |
55 |
| -## Doc |
56 |
| -### blobtools |
57 |
| -- main executable |
58 |
| -``` |
59 |
| -usage: blobtools <command> [<args>...] [--help] |
60 |
| -
|
61 |
| -commands: |
62 |
| - create create a BlobDB |
63 |
| - view print BlobDB as a table |
64 |
| - blobplot plot BlobDB as a blobplot |
65 |
| -
|
66 |
| - covplot compare BlobDB cov(s) to additional cov file |
67 |
| - bam2cov generate cov file from bam file |
68 |
| - sumcov sum coverage from multiple COV files |
69 |
| -
|
70 |
| --h --help show this |
71 |
| -``` |
72 |
| - |
73 |
| -### blobtools create |
74 |
| -- create a BlobDB JSON file |
75 |
| -``` |
76 |
| -usage: blobtools create -i FASTA [-y FASTATYPE] [-o OUTFILE] [--title TITLE] |
77 |
| - [-b BAM...] [-s SAM...] [-a CAS...] [-c COV...] |
78 |
| - [--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>] |
79 |
| - [-t TAX...] [-x TAXRULE...] |
80 |
| - [-h|--help] |
81 |
| -
|
82 |
| - Options: |
83 |
| - -h --help show this |
84 |
| - -i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces. |
85 |
| - -y, --type FASTATYPE Assembly program used to create FASTA. If specified, |
86 |
| - coverage will be parsed from FASTA header. |
87 |
| - (Parsing supported for 'spades', 'soap', 'velvet', 'abyss', 'platanus') |
88 |
| - -t, --taxfile TAX... Taxonomy file in format (qseqid\ttaxid\tbitscore) |
89 |
| - (e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'") |
90 |
| - -x, --taxrule <TAXRULE>... Taxrule determines how taxonomy of blobs is computed [default: bestsum] |
91 |
| - "bestsum" : sum bitscore across all hits for each taxonomic rank |
92 |
| - "bestsumorder" : sum bitscore across all hits for each taxonomic rank. |
93 |
| - - If first <TAX> file supplies hits, bestsum is calculated. |
94 |
| - - If no hit is found, the next <TAX> file is used. |
95 |
| - --nodes <NODES> NCBI nodes.dmp file. Not required if '--db' |
96 |
| - --names <NAMES> NCBI names.dmp file. Not required if '--db' |
97 |
| - --db <NODESDB> NodesDB file [default: data/nodesDB.txt]. |
98 |
| - -b, --bam <BAM>... BAM file(s) (requires samtools in $PATH) |
99 |
| - -s, --sam <SAM>... SAM file(s) |
100 |
| - -a, --cas <CAS>... CAS file(s) (requires clc_mapping_info in $PATH) |
101 |
| - -c, --cov <COV>... TAB separated. (seqID\tcoverage) |
102 |
| - -o, --out <OUT> BlobDB output prefix |
103 |
| - --title TITLE Title of BlobDB [default: output prefix] |
104 |
| -``` |
105 |
| - |
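The two taxrules described above can be illustrated with a minimal Python sketch. It is not the blobtools implementation: the function names, the `taxid_to_group` lookup (standing in for a walk through nodes.dmp/names.dmp to a chosen rank), and the `"no-hit"` and `"undef"` placeholder labels are assumptions made for illustration. Only the first taxid is used when a field holds several separated by `;`.

```python
from collections import defaultdict


def parse_hits(lines):
    """Parse 'qseqid<TAB>taxid<TAB>bitscore' lines into {qseqid: [(taxid, bitscore)]}."""
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        qseqid, taxid, bitscore = line.split("\t")[:3]
        # Assumption: keep only the first taxid of a ';'-separated list.
        hits[qseqid].append((taxid.split(";")[0], float(bitscore)))
    return dict(hits)


def bestsum(seq_hits, taxid_to_group):
    """Sum bitscores per taxonomic group; return the group with the highest sum."""
    totals = defaultdict(float)
    for taxid, bitscore in seq_hits:
        totals[taxid_to_group.get(taxid, "undef")] += bitscore
    if not totals:
        return "no-hit"  # placeholder label (assumption)
    return max(totals, key=totals.get)


def bestsumorder(seq_id, hit_tables, taxid_to_group):
    """Apply bestsum to the first hit table (in file order) that has hits for seq_id."""
    for hits in hit_tables:
        if seq_id in hits:
            return bestsum(hits[seq_id], taxid_to_group)
    return "no-hit"


# Hypothetical example data: two tax files and a taxid -> phylum lookup.
first = parse_hits(["c1\t100\t50", "c1\t200\t30", "c1\t201\t40"])
second = parse_hits(["c2\t100\t10"])
groups = {"100": "Nematoda", "200": "Proteobacteria", "201": "Proteobacteria"}
print(bestsum(first["c1"], groups))
print(bestsumorder("c2", [first, second], groups))
```

In this sketch `bestsum` works on a single list of hits; to sum across several tax files, the caller would concatenate the hit lists for a sequence first.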
106 |
| -### blobtools view |
107 |
| -- generate table output from a blobDB file |
108 |
| -``` |
109 |
| -usage: blobtools view -i <BLOBDB> [-x <TAXRULE>] [--rank <TAXRANK>...] [--hits] |
110 |
| - [--list <LIST>] [--out <OUT>] |
111 |
| - [-h|--help] |
112 |
| -
|
113 |
| - Options: |
114 |
| - -h --help show this |
115 |
| - -i, --input <BLOBDB> BlobDB file (created with "blobtools create") |
116 |
| - -o, --out <OUT> Output file [default: STDOUT] |
117 |
| - -l, --list <LIST> List of sequence names (comma-separated or file). |
118 |
| - If comma-separated, no whitespaces allowed. |
119 |
| - -x, --taxrule <TAXRULE> Taxrule used for computing taxonomy (supported: "bestsum", "bestsumorder") |
120 |
| - [default: bestsum] |
121 |
| - -r, --rank <TAXRANK>... Taxonomic rank(s) at which output will be written. |
122 |
| - (supported: 'species', 'genus', 'family', 'order', |
123 |
| - 'phylum', 'superkingdom', 'all') [default: phylum] |
124 |
| - -b, --hits Displays taxonomic hits from tax files |
125 |
| -``` |
126 |
| - |
127 |
| -### blobtools blobplot |
128 |
| -- generate a blobplot from a blobDB file |
129 |
| -``` |
130 |
| -usage: blobtools blobplot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s] |
131 |
| - [-r RANK] [-x TAXRULE] [--label GROUPS...] |
132 |
| - [-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title] |
133 |
| - [--colours FILE] [--include FILE] [--exclude FILE] |
134 |
| - [--format FORMAT] [--noblobs] [--noreads] |
135 |
| - [--refcov FILE] [--catcolour FILE] |
136 |
| - [-h|--help] |
137 |
| -
|
138 |
| - Options: |
139 |
| - -h --help show this |
140 |
| - -i, --infile BLOBDB BlobDB file (created with "blobtools create") |
141 |
| - -p, --plotgroups INT Number of (taxonomic) groups to plot, remaining |
142 |
| - groups are placed in 'other' [default: 7] |
143 |
| - -l, --length INT Minimum sequence length considered for plotting [default: 100] |
144 |
| - -c, --cindex Colour blobs by 'c index' [default: False] |
145 |
| - -n, --nohit Hide sequences without taxonomic annotation [default: False] |
146 |
| - -s, --noscale Do not scale sequences by length [default: False] |
147 |
| - -o, --out PREFIX Output prefix |
148 |
| - -m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group |
149 |
| - [default: False] |
150 |
| - --sort <ORDER> Sort order for plotting [default: span] |
151 |
| - span : plot with decreasing span |
152 |
| - count : plot with decreasing count |
153 |
| - --hist <HIST> Data for histograms [default: span] |
154 |
| - span : span-weighted histograms |
155 |
| - count : count histograms |
156 |
| - --title Add title of BlobDB to plot [default: False] |
157 |
| - -r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum] |
158 |
| - (Supported: species, genus, family, order, phylum, superkingdom) |
159 |
| - -x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy |
160 |
| - (Supported: bestsum, bestsumorder) [default: bestsum] |
161 |
| - --label GROUPS... Relabel (taxonomic) groups (not 'all' or 'other'), |
162 |
| - e.g. "Bacteria=Actinobacteria,Proteobacteria" |
163 |
| - --colours COLOURFILE File containing colours for (taxonomic) groups |
164 |
| - --exclude GROUPS.. Place these (taxonomic) groups in 'other', |
165 |
| - e.g. "Actinobacteria,Proteobacteria" |
166 |
| - --format FORMAT Figure format for plot (png, pdf, eps, jpeg, |
167 |
| - ps, svg, svgz, tiff) [default: png] |
168 |
| - --noblobs Omit blobplot [default: False] |
169 |
| - --noreads Omit plot of reads mapping [default: False] |
170 |
| - --refcov FILE File containing number of "total" and "mapped" reads |
171 |
| - per coverage file. (e.g.: bam0,900,100). If provided, info |
172 |
| - will be used in read coverage plot(s). |
173 |
| - --catcolour FILE Colour plot based on categories from FILE |
174 |
| - (format : "seq category"). |
175 |
| -``` |
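The interaction between `--plotgroups` and `--sort` described above can be sketched in Python as follows. This is an illustration only, under the assumption that the groups kept are the top N by the chosen sort metric (span or count); `top_groups` is a hypothetical helper, not a blobtools function.

```python
from collections import Counter


def top_groups(seqs, n=7, sort="span"):
    """Keep the n largest groups and merge the remainder into 'other'.

    seqs: iterable of (group, length) pairs, one per sequence.
    sort: 'span' ranks groups by total length, 'count' by number of sequences.
    """
    totals = Counter()
    for group, length in seqs:
        totals[group] += length if sort == "span" else 1
    # Groups ranked by decreasing total; the first n are plotted by name.
    keep = {group for group, _ in totals.most_common(n)}
    merged = Counter()
    for group, value in totals.items():
        merged[group if group in keep else "other"] += value
    return dict(merged)


# Hypothetical example: five sequences in four groups.
example = [("A", 100), ("B", 50), ("C", 10), ("C", 10), ("D", 5)]
print(top_groups(example, n=2))
print(top_groups(example, n=1, sort="count"))
```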
176 |
| -## Additional features |
177 |
| - |
178 |
| -### blobtools bam2cov |
179 |
| -- extract base-coverage for each contig from BAM file |
180 |
| -``` |
181 |
| -usage: blobtools bam2cov -i FASTA -b BAM [--mq MQ] [--no_base_cov] |
182 |
| - [-h|--help] |
183 |
| -
|
184 |
| - Options: |
185 |
| - -h --help show this |
186 |
| - -i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces. |
187 |
| - -b, --bam <BAM> BAM file (requires samtools in $PATH) |
188 |
| - --mq <MQ> minimum Mapping Quality (MQ) [default: 1] |
189 |
| - --no_base_cov only parse read coverage (faster, but ... |
190 |
| - can only be used for "blobtools blobplot --noblobs") |
191 |
| -``` |
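To make the COV format concrete, here is a minimal Python sketch that sums coverage per sequence across several COV files, in the spirit of `blobtools sumcov`. It assumes the two-column, tab-separated layout (seqID, coverage) given for `blobtools create -c`, and it skips lines starting with `#` as an assumption about possible header lines; files written by `bam2cov` may contain additional columns. The function names are illustrative.

```python
import io
from collections import defaultdict


def read_cov(handle):
    """Read 'seqID<TAB>coverage' lines from an open file into a dict."""
    cov = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment/header lines
            continue
        seq_id, value = line.split("\t")[:2]
        cov[seq_id] = float(value)
    return cov


def sum_cov(handles):
    """Sum coverage per sequence ID across several COV files."""
    total = defaultdict(float)
    for handle in handles:
        for seq_id, value in read_cov(handle).items():
            total[seq_id] += value
    return dict(total)


# Hypothetical in-memory files standing in for two COV files on disk.
lib_a = io.StringIO("c1\t10.5\nc2\t3\n")
lib_b = io.StringIO("# header\nc1\t1.5\n")
summed = sum_cov([lib_a, lib_b])
print(summed)
```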
192 |
| -### blobtools covplot |
193 |
| -- plots blobDB cov(s) vs additional cov file (only works at superkingdom level at the moment) |
194 |
| -``` |
195 |
| -usage: blobtools covplot -i BLOBDB -c COV [-p INT] [-l INT] [-n] [-s] |
196 |
| - [--xlabel XLABEL] [--ylabel YLABEL] |
197 |
| - [--log] [--xmax FLOAT] [--ymax FLOAT] |
198 |
| - [-r RANK] [-x TAXRULE] [-o PREFIX] [-m] [--title] |
199 |
| - [--sort ORDER] [--hist HIST] [--format FORMAT] |
200 |
| - [-h|--help] |
201 |
| -
|
202 |
| - Options: |
203 |
| - -h --help show this |
204 |
| - -i, --infile BLOBDB BlobDB file |
205 |
| - -c, --cov COV COV file used for y-axis |
206 |
| -
|
207 |
| - --xlabel XLABEL Label for x-axis [default: BlobDB_cov] |
208 |
| - --ylabel YLABEL Label for y-axis [default: CovFile_cov] |
209 |
| - --log Plot log-scale axes |
210 |
| - --xmax FLOAT Maximum values for x-axis [default: 1e10] |
211 |
| - --ymax FLOAT Maximum values for y-axis [default: 1e10] |
212 |
| -
|
213 |
| - -p, --plotgroups INT Number of (taxonomic) groups to plot, remaining |
214 |
| - groups are placed in 'other' [default: 7] |
215 |
| - -r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum] |
216 |
| - -x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy |
217 |
| - (Supported: bestsum, bestsumorder) [default: bestsum] |
218 |
| - --sort <ORDER> Sort order for plotting [default: span] |
219 |
| - span : plot with decreasing span |
220 |
| - count : plot with decreasing count |
221 |
| - --hist <HIST> Data for histograms [default: span] |
222 |
| - span : span-weighted histograms |
223 |
| - count : count histograms |
224 |
| -
|
225 |
| - --title Add title of BlobDB to plot [default: False] |
226 |
| - -l, --length INT Minimum sequence length considered for plotting [default: 100] |
227 |
| - -n, --nohit Hide sequences without taxonomic annotation [default: False] |
228 |
| - -s, --noscale Do not scale sequences by length [default: False] |
229 |
| - -o, --out PREFIX Output prefix |
230 |
| - -m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group |
231 |
| - [default: False] |
232 |
| - --format FORMAT Figure format for plot (png, pdf, eps, jpeg, |
233 |
| - ps, svg, svgz, tiff) [default: png] |
234 |
| -``` |
235 |
| -## Tips & Tricks |
236 |
| -- Recommended BLASTn search against NCBI nt |
237 |
| -``` |
238 |
| -blastn \ |
239 |
| --task megablast \ |
240 |
| --query assembly.fna \ |
241 |
| --db nt \ |
242 |
| --outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \ |
243 |
| --culling_limit 5 \ |
244 |
| --num_threads 62 \ |
245 |
| --evalue 1e-25 \ |
246 |
| --out assembly.vs.nt.cul5.1e25.megablast.out |
247 |
| -``` |
248 |
| -- Converting [Diamond](https://github.com/bbuchfink/diamond/) blastx output for use in 'blobtools create' : [daa_to_tagc.pl](https://github.com/GDKO/CGP-scripts/blob/master/scripts/daa_to_tagc.pl) |
249 |
| - |
250 |
| -- Filtering Reads (requires [samtools](http://www.htslib.org/)) |
251 |
| -``` |
252 |
| -# 1) Generate index of contigs |
253 |
| -samtools faidx ASSEMBLY.fna |
254 |
| -
|
255 |
| -# 2) Subset index using list of contigs of interest (list.txt) |
256 |
| -grep -w -f list.txt ASSEMBLY.fna.fai > list.fai |
257 |
| -
|
258 |
| -# 3) Filter unmapped reads |
259 |
| -samtools view -bS -f12 FILE.bam > FILE.u_u.bam |
260 |
| -samtools bam2fq FILE.u_u.bam | gzip > FILE.u_u.ilv.fq.gz |
261 |
| -
|
262 |
| -# 4A) Filter pairs where both reads map to list of contigs |
263 |
| -samtools view -t list.fai -bS -F12 FILE.bam > FILE.m_m.bam |
264 |
| -samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz |
265 |
| -
|
266 |
| -# 4B) Filter pairs where both reads map |
267 |
| -samtools view -bS -F12 FILE.bam > FILE.m_m.bam |
268 |
| -samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz |
| 2 | +Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-annotated Gc-Coverage) plots |
269 | 3 |
|
270 |
| -# 5) Filter pairs where one read of a pair maps (use -t list.fai if necessary) |
271 |
| -samtools view -bS -f8 -F4 FILE.bam > FILE.m_u.bam |
272 |
| -samtools view -bS -f4 -F8 FILE.bam > FILE.u_m.bam |
273 |
| -samtools merge -n FILE.one_mapped.bam FILE.m_u.bam FILE.u_m.bam |
274 |
| -samtools sort -n -T FILE.temp -O bam FILE.one_mapped.bam > FILE.one_mapped.bam.sorted; |
275 |
| -mv FILE.one_mapped.bam.sorted FILE.one_mapped.bam |
276 |
| -samtools bam2fq FILE.one_mapped.bam | gzip > FILE.one_mapped.ilv.fq.gz |
277 |
| -``` |
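The `-f` and `-F` values in the filtering steps above are bitmasks over the SAM FLAG field: bit 4 marks an unmapped read and bit 8 marks an unmapped mate. `-f` keeps records with all given bits set, `-F` keeps records with none of them set. A minimal Python sketch of how these filters partition read pairs follows; the function names and category labels mirror the file suffixes used above and are illustrative, not part of samtools or blobtools.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate of this read is unmapped


def passes(flag, require=0, exclude=0):
    """Mimic 'samtools view -f require -F exclude' for a single FLAG value."""
    return (flag & require) == require and (flag & exclude) == 0


def classify(flag):
    """Assign a paired read to one of the categories used in the steps above."""
    both = READ_UNMAPPED | MATE_UNMAPPED  # 12
    if passes(flag, require=both):        # -f12: read and mate unmapped
        return "u_u"
    if passes(flag, exclude=both):        # -F12: read and mate mapped
        return "m_m"
    if passes(flag, require=MATE_UNMAPPED, exclude=READ_UNMAPPED):  # -f8 -F4
        return "m_u"
    return "u_m"                          # -f4 -F8: read unmapped, mate mapped


for flag in (77, 99, 73, 133):
    print(flag, classify(flag))
```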
| 4 | +For the documentation, please refer to https://blobtools.readme.io/ |