Commit d7d55c4

Updated README
1 parent 94446f3 commit d7d55c4

1 file changed

+2
-275
lines changed

README.md

Lines changed: 2 additions & 275 deletions
@@ -1,277 +1,4 @@
1-
> Once we were blobs in the sea...
2-
3-
> -Terry Pratchett, A Hat Full of Sky
4-
5 1
# blobtools
6-
Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-annotated Gc-Coverage) plots [Kumar et al. 2012](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/pdf/fgene-04-00237.pdf).
7-
8-
## Requirements
9-
- Python 2.7+
10-
- Matplotlib 1.5
11-
- Docopt
12-
- NCBI Taxonomy (names.dmp and nodes.dmp), <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz>
13-
- Virtualenv (recommended), see [Tutorial](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
14-
15-
## Installation
16-
- Recommended
17-
```
18-
# install virtualenv
19-
pip install virtualenv
20-
21-
# clone blobtools into folder
22-
git clone https://github.com/DRL/blobtools.git
23-
24-
# create virtual environment for blobtools
25-
cd blobtools/
26-
virtualenv blob_env
27-
28-
# activate virtual environment
29-
source blob_env/bin/activate
30-
31-
# install matplotlib
32-
(blob_env) $ pip install matplotlib
33-
34-
# install docopt
35-
(blob_env) $ pip install docopt
36-
37-
# run
38-
(blob_env) $ ./blobtools -h
39-
```
40-
- Basic
41-
```
42-
# clone blobtools into folder
43-
$ git clone https://github.com/DRL/blobtools.git
44-
45-
# install matplotlib
46-
$ pip install matplotlib
47-
48-
# install docopt
49-
$ pip install docopt
50-
51-
# run
52-
$ ./blobtools -h
53-
```
54-
55-
## Doc
56-
### blobtools
57-
- main executable
58-
```
59-
usage: blobtools <command> [<args>...] [--help]
60-
61-
commands:
62-
create create a BlobDB
63-
view print BlobDB as a table
64-
blobplot plot BlobDB as a blobplot
65-
66-
covplot compare BlobDB cov(s) to additional cov file
67-
bam2cov generate cov file from bam file
68-
sumcov sum coverage from multiple COV files
69-
70-
-h --help show this
71-
```
72-
73-
### blobtools create
74-
- create a BlobDB JSON file
75-
```
76-
usage: blobtools create -i FASTA [-y FASTATYPE] [-o OUTFILE] [--title TITLE]
77-
[-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]
78-
[--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]
79-
[-t TAX...] [-x TAXRULE...]
80-
[-h|--help]
81-
82-
Options:
83-
-h --help show this
84-
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
85-
-y, --type FASTATYPE Assembly program used to create FASTA. If specified,
86-
coverage will be parsed from FASTA header.
87-
(Parsing supported for 'spades', 'soap', 'velvet', 'abyss', 'platanus')
88-
-t, --taxfile TAX... Taxonomy file in format (qseqid\ttaxid\tbitscore)
89-
(e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")
90-
-x, --taxrule <TAXRULE>... Taxrule determines how taxonomy of blobs is computed [default: bestsum]
91-
"bestsum" : sum bitscore across all hits for each taxonomic rank
92-
"bestsumorder" : sum bitscore across all hits for each taxonomic rank.
93-
- If first <TAX> file supplies hits, bestsum is calculated.
94-
- If no hit is found, the next <TAX> file is used.
95-
--nodes <NODES> NCBI nodes.dmp file. Not required if '--db'
96-
--names <NAMES> NCBI names.dmp file. Not required if '--db'
97-
--db <NODESDB> NodesDB file [default: data/nodesDB.txt].
98-
-b, --bam <BAM>... BAM file(s) (requires samtools in $PATH)
99-
-s, --sam <SAM>... SAM file(s)
100-
-a, --cas <CAS>... CAS file(s) (requires clc_mapping_info in $PATH)
101-
-c, --cov <COV>... TAB separated. (seqID\tcoverage)
102-
-o, --out <OUT> BlobDB output prefix
103-
--title TITLE Title of BlobDB [default: output prefix]
104-
```
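
The `bestsum` and `bestsumorder` taxrules described in the help text above can be illustrated with a minimal sketch. This is not the blobtools implementation: the function names, the `taxid_to_taxon` lookup (a hypothetical dict standing in for the lineage resolved from nodes.dmp/names.dmp at one rank), the `"undef"` and `"no-hit"` placeholder labels, and the tie handling are all assumptions made for illustration.

```python
from collections import defaultdict


def bestsum(hits, taxid_to_taxon):
    """Sum bitscores per taxon at one rank and return the best (taxon, score).

    hits: list of (taxid, bitscore) tuples for a single sequence.
    taxid_to_taxon: hypothetical lookup from taxid to the taxon name at the
    rank of interest (e.g. phylum).
    """
    totals = defaultdict(float)
    for taxid, bitscore in hits:
        # Unknown taxids are pooled under a placeholder label (assumption).
        totals[taxid_to_taxon.get(taxid, "undef")] += bitscore
    if not totals:
        return ("no-hit", 0.0)
    # Pick the taxon with the highest summed bitscore.
    return max(totals.items(), key=lambda item: item[1])


def bestsumorder(hits_per_file, taxid_to_taxon):
    """Apply bestsum to the first TAX file that supplies hits for the sequence.

    hits_per_file: list of hit lists, one per TAX file, in command-line order.
    """
    for hits in hits_per_file:
        if hits:
            return bestsum(hits, taxid_to_taxon)
    return ("no-hit", 0.0)


if __name__ == "__main__":
    phylum = {6239: "Nematoda", 562: "Proteobacteria", 287: "Proteobacteria"}
    hits = [(6239, 200.0), (562, 150.0), (287, 100.0)]
    print(bestsum(hits, phylum))
    print(bestsumorder([[(6239, 50.0)], hits], phylum))
```

In this sketch, two weaker hits to the same phylum can outweigh a single stronger hit to another, which is the point of summing. With `bestsumorder`, a later TAX file is only consulted when the earlier ones supply no hit for that sequence.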
105-
106-
### blobtools view
107-
- generate table output from a blobDB file
108-
```
109-
usage: blobtools view -i <BLOBDB> [-x <TAXRULE>] [--rank <TAXRANK>...] [--hits]
110-
[--list <LIST>] [--out <OUT>]
111-
[-h|--help]
111-
112-
Options:
113-
-h --help show this
115-
-i, --input <BLOBDB> BlobDB file (created with "blobtools create")
116-
-o, --out <OUT> Output file [default: STDOUT]
117-
-l, --list <LIST> List of sequence names (comma-separated or file).
118-
If comma-separated, no whitespaces allowed.
119-
-x, --taxrule <TAXRULE> Taxrule used for computing taxonomy (supported: "bestsum", "bestsumorder")
120-
[default: bestsum]
121-
-r, --rank <TAXRANK>... Taxonomic rank(s) at which output will be written.
122-
(supported: 'species', 'genus', 'family', 'order',
123-
'phylum', 'superkingdom', 'all') [default: phylum]
124-
-b, --hits Displays taxonomic hits from tax files
125-
```
126-
127-
### blobtools blobplot
128-
- generate a blobplot from a blobDB file
129-
```
130-
usage: blobtools blobplot -i BLOBDB [-p INT] [-l INT] [-c] [-n] [-s]
131-
[-r RANK] [-x TAXRULE] [--label GROUPS...]
132-
[-o PREFIX] [-m] [--sort ORDER] [--hist HIST] [--title]
133-
[--colours FILE] [--include FILE] [--exclude FILE]
134-
[--format FORMAT] [--noblobs] [--noreads]
135-
[--refcov FILE] [--catcolour FILE]
136-
[-h|--help]
137-
138-
Options:
139-
-h --help show this
140-
-i, --infile BLOBDB BlobDB file (created with "blobtools create")
141-
-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining
142-
groups are placed in 'other' [default: 7]
143-
-l, --length INT Minimum sequence length considered for plotting [default: 100]
144-
-c, --cindex Colour blobs by 'c index' [default: False]
145-
-n, --nohit Hide sequences without taxonomic annotation [default: False]
146-
-s, --noscale Do not scale sequences by length [default: False]
147-
-o, --out PREFIX Output prefix
148-
-m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group
149-
[default: False]
150-
--sort <ORDER> Sort order for plotting [default: span]
151-
span : plot with decreasing span
152-
count : plot with decreasing count
153-
--hist <HIST> Data for histograms [default: span]
154-
span : span-weighted histograms
155-
count : count histograms
156-
--title Add title of BlobDB to plot [default: False]
157-
-r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum]
158-
(Supported: species, genus, family, order, phylum, superkingdom)
159-
-x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy
160-
(Supported: bestsum, bestsumorder) [default: bestsum]
161-
--label GROUPS... Relabel (taxonomic) groups (not 'all' or 'other'),
162-
e.g. "Bacteria=Actinobacteria,Proteobacteria"
163-
--colours COLOURFILE File containing colours for (taxonomic) groups
164-
--exclude GROUPS.. Place these (taxonomic) groups in 'other',
165-
e.g. "Actinobacteria,Proteobacteria"
166-
--format FORMAT Figure format for plot (png, pdf, eps, jpeg,
167-
ps, svg, svgz, tiff) [default: png]
168-
--noblobs Omit blobplot [default: False]
169-
--noreads Omit plot of reads mapping [default: False]
170-
--refcov FILE File containing number of "total" and "mapped" reads
171-
per coverage file. (e.g.: bam0,900,100). If provided, info
172-
will be used in read coverage plot(s).
173-
--catcolour FILE Colour plot based on categories from FILE
174-
(format : "seq category").
175-
```
176-
## Additional features
177-
178-
### blobtools bam2cov
179-
- extract base-coverage for each contig from BAM file
180-
```
181-
usage: blobtools bam2cov -i FASTA -b BAM [--mq MQ] [--no_base_cov]
182-
[-h|--help]
183-
184-
Options:
185-
-h --help show this
186-
-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.
187-
-b, --bam <BAM> BAM file (requires samtools in $PATH)
188-
--mq <MQ> minimum Mapping Quality (MQ) [default: 1]
189-
--no_base_cov only parse read coverage (faster, but ...
190-
can only be used for "blobtools blobplot --noblobs")
191-
```
192-
### blobtools covplot
193-
- plots blobDB cov(s) vs additional cov file (only works at superkingdom level at the moment)
194-
```
195-
usage: blobtools covplot -i BLOBDB -c COV [-p INT] [-l INT] [-n] [-s]
196-
[--xlabel XLABEL] [--ylabel YLABEL]
197-
[--log] [--xmax FLOAT] [--ymax FLOAT]
198-
[-r RANK] [-x TAXRULE] [-o PREFIX] [-m] [--title]
199-
[--sort ORDER] [--hist HIST] [--format FORMAT]
200-
[-h|--help]
201-
202-
Options:
203-
-h --help show this
204-
-i, --infile BLOBDB BlobDB file
205-
-c, --cov COV COV file used for y-axis
206-
207-
--xlabel XLABEL Label for x-axis [default: BlobDB_cov]
208-
--ylabel YLABEL Label for y-axis [default: CovFile_cov]
209-
--log Plot log-scale axes
210-
--xmax FLOAT Maximum values for x-axis [default: 1e10]
211-
--ymax FLOAT Maximum values for y-axis [default: 1e10]
212-
213-
-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining
214-
groups are placed in 'other' [default: 7]
215-
-r, --rank RANK Taxonomic rank used for colouring of blobs [default: phylum]
216-
-x, --taxrule TAXRULE Taxrule which has been used for computing taxonomy
217-
(Supported: bestsum, bestsumorder) [default: bestsum]
218-
--sort <ORDER> Sort order for plotting [default: span]
219-
span : plot with decreasing span
220-
count : plot with decreasing count
221-
--hist <HIST> Data for histograms [default: span]
222-
span : span-weighted histograms
223-
count : count histograms
224-
225-
--title Add title of BlobDB to plot [default: False]
226-
-l, --length INT Minimum sequence length considered for plotting [default: 100]
227-
-n, --nohit Hide sequences without taxonomic annotation [default: False]
228-
-s, --noscale Do not scale sequences by length [default: False]
229-
-o, --out PREFIX Output prefix
230-
-m, --multiplot Multi-plot. Print plot after addition of each (taxonomic) group
231-
[default: False]
232-
--format FORMAT Figure format for plot (png, pdf, eps, jpeg,
233-
ps, svg, svgz, tiff) [default: png]
234-
```
235-
## Tips & Tricks
236-
- Recommended BLASTn search against NCBI nt
237-
```
238-
blastn \
239-
-task megablast \
240-
-query assembly.fna \
241-
-db nt \
242-
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
243-
-culling_limit 5 \
244-
-num_threads 62 \
245-
-evalue 1e-25 \
246-
-out assembly.vs.nt.cul5.1e25.megablast.out
247-
```
248-
- Converting [Diamond](https://github.com/bbuchfink/diamond/) blastx output for use in 'blobtools create' : [daa_to_tagc.pl](https://github.com/GDKO/CGP-scripts/blob/master/scripts/daa_to_tagc.pl)
249-
250-
- Filtering Reads (requires [samtools](http://www.htslib.org/))
251-
```
252-
# 1) Generate index of contigs
253-
samtools faidx ASSEMBLY.fna
254-
255-
# 2) Subset index using list of contigs of interest (list.txt)
256-
grep -w -f list.txt ASSEMBLY.fna.fai > list.fai
257-
258-
# 3) Filter unmapped reads
259-
samtools view -bS -f12 FILE.bam > FILE.u_u.bam
260-
samtools bam2fq FILE.u_u.bam | gzip > FILE.u_u.ilv.fq.gz
261-
262-
# 4A) Filter pairs where both reads map to list of contigs
263-
samtools view -t list.fai -bS -F12 FILE.bam > FILE.m_m.bam
264-
samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
265-
266-
# 4B) Filter pairs where both reads map
267-
samtools view -bS -F12 FILE.bam > FILE.m_m.bam
268-
samtools bam2fq FILE.m_m.bam | gzip > FILE.m_m.ilv.fq.gz
2+
Application for the visualisation of (draft) genome assemblies using TAGC (Taxon-annotated Gc-Coverage) plots
269 3

270-
# 5) Filter pairs where one read of a pair maps (use -t list.fai if necessary)
271-
samtools view -bS -f8 -F4 FILE.bam > FILE.m_u.bam
272-
samtools view -bS -f4 -F8 FILE.bam > FILE.u_m.bam
273-
samtools merge -n FILE.one_mapped.bam FILE.m_u.bam FILE.u_m.bam
274-
samtools sort -n -T FILE.temp -O bam FILE.one_mapped.bam > FILE.one_mapped.bam.sorted;
275-
mv FILE.one_mapped.bam.sorted FILE.one_mapped.bam
276-
samtools bam2fq FILE.one_mapped.bam | gzip > FILE.one_mapped.ilv.fq.gz
277-
```
4+
For the documentation, please refer to https://blobtools.readme.io/
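
The read-filtering recipe in the removed "Tips & Tricks" section depends on how `samtools view` treats `-f` (all listed flag bits must be set) and `-F` (none of the listed bits may be set). The sketch below mirrors that logic in Python for the two SAM flag bits involved, 0x4 (read unmapped) and 0x8 (mate unmapped). The function names and the `u_u`/`m_m`/`m_u`/`u_m` labels are borrowed from the output file names in the recipe and are illustrative only.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate of this read is unmapped


def passes(flag, require=0, exclude=0):
    """Mimic 'samtools view -f require -F exclude' for a single SAM flag."""
    return (flag & require) == require and (flag & exclude) == 0


def classify_pair(flag):
    """Assign a read to one of the four categories used in the recipe."""
    both = READ_UNMAPPED | MATE_UNMAPPED  # 12
    if passes(flag, require=both):
        return "u_u"  # -f12: read and mate both unmapped
    if passes(flag, exclude=both):
        return "m_m"  # -F12: read and mate both mapped
    if passes(flag, require=MATE_UNMAPPED, exclude=READ_UNMAPPED):
        return "m_u"  # -f8 -F4: read mapped, mate unmapped
    return "u_m"      # -f4 -F8: read unmapped, mate mapped


if __name__ == "__main__":
    for flag in (77, 99, 73, 133):
        print(flag, classify_pair(flag))
```

Every paired read falls into exactly one of the four categories, which is why steps 3, 4B and 5 of the recipe together partition the BAM file.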
