Releases: FelixKrueger/Bismark
v0.25.1 - tolerate + symbol for UMIs for bclconvert deduplication
Allowing the + sign as a valid symbol when considering UMIs in --bclconvert mode (more details)
v0.25.0 - new options and minor fixes
Bismark
- now using 4 cores for merging multiple BAM files (more details #707)
- fixed a corner case when reads were aligned in FastA mode with --parallel and in addition either --ambiguous and/or --unmapped (see #723)
deduplicate_bismark
- added check to see if the UMI appears to be in the middle of the readID, e.g. if added by bcl-convert (prompted in #699). Also added new option --bclconvert to use this internal UMI instead of the one at the end. Also allowing the + symbol now for dual-indexed runs
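To illustrate the difference between a UMI at the end of the read ID and one in the middle, here is a minimal Python sketch. It is not the deduplicate_bismark implementation. The read ID layout (colon-separated fields, with the first part of the ID ending at a space or underscore), the function name extract_umi and the example read IDs are all assumptions made for illustration.

```python
import re

# Assumption: a UMI consists of A/C/G/T/N, optionally two parts joined by '+'
# (as seen in dual-indexed runs).
UMI_PATTERN = re.compile(r"^[ACGTN]+(\+[ACGTN]+)?$")


def extract_umi(read_id, internal=False):
    """Return the UMI from a (hypothetical) colon-separated read ID.

    internal=False: take the last colon-separated field of the whole ID.
    internal=True:  take the last colon-separated field of the first part
                    of the ID (everything before the first space/underscore).
    """
    if internal:
        first_part = re.split(r"[ _]", read_id, maxsplit=1)[0]
        candidate = first_part.split(":")[-1]
    else:
        candidate = read_id.split(":")[-1]
    if not UMI_PATTERN.match(candidate):
        raise ValueError(f"no plausible UMI found in read ID: {read_id}")
    return candidate


# Hypothetical read IDs, for illustration only
internal_id = "A00:1:FC:1:1101:1000:2000:ACGT+TTGA_1:N:0:GATC"
end_id = "A00:1:FC:1:1101:1000:2000_1:N:0:ACGTAC"
print(extract_umi(internal_id, internal=True))
print(extract_umi(end_id))
```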
bismark2bedGraph
- fixed a bug in non-CpG methylation call for CHH context (more details #647)
coverage2cytosine
- Expanded option --ff into --ffs to extract four, five, and six nucleotide contexts to enable hexamer context analyses. More details here: #717
filter_non_conversion
- changed shebang line to use env
bismark2report
- better handling of division-by-zero errors (see more here)
Version 0.24.2
Just a few fixes, also added two flavours of scripts for merging coverage files (e.g. for when R1 and R2 had been run in single-end mode)
Bismark
- removed an exit 0 that would terminate runs after processing a single (set of) input file(s).
deduplicate_bismark
- Changed the path to Samtools to custom variable (#609)
coverage2cytosine
- set the read threshold to 1 (if it was 0) for --gc_context, as intended and mentioned in the help text. Fixes #621
monolithic beast no more
- Added entirely new documentation website, built using Material for Mkdocs. Thanks to @ewels for a fantastic (late-night) effort to break up and restructure what had become a fairly unwieldy monolithic beast of markdown document...
- Added docs for cytosine context summary, useful for GpC methylation or filtering for specific C context (e.g. CpA)
- Updated docs for the dovetailing
Bismark
- Warning messages for closing ambiguous and unmapped file handles now only occur when these options were specified (see here)
0.24.0 - long read support with minimap2
Bismark
- Added new option --strandID which reports the alignment strand identity for paired-end, non-directional libraries, e.g. YS:Z:CTOT. This information may be difficult to obtain if third party tools interfered with the read ordering (admittedly there is a fine balance of read reporting position, FLAG, Read 1 and Genome conversion state to make it work in the first place. More information can be found in this thread).
- runs with --parallel/--multicore > 1 specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed (more #494)
- runs with --parallel/--multicore > 1 as well as --unmapped and/or --ambiguous specified will no longer produce potentially corrupt FastQ files (more #495)
- Added option --mm2/--minimap2 to use minimap2 as the underlying aligner. The minimap2 alignment modes include Oxford Nanopore, PacBio and accurate short reads. In its current implementation, minimap2 can be invoked in one of the following ways:
  - --mm2_nanopore: Sets preset settings for Oxford Nanopore vs. reference mapping '-x map-ont' [default]
  - --mm2_pacbio: Sets preset settings for PacBio vs. reference mapping '-x map-pb'
  - --mm2_short_reads: Sets preset settings for accurate short reads '-x sr'
- added option --mm2_maximum_length <int> to set a maximum length cutoff, which might be required for very long reads exceeding the maximum number of CIGAR operations tolerated by the BAM formatted reads (>65535). The default is 10,000 bp.
Other options that are currently set within Bismark include '-a' (SAM output), '--MD' (MD tag), '--secondary=no'.
Prompted by fairly slow alignment speeds with the minimap2 default settings, we set out to improve the performance of the alignment process by tweaking several different parameters.
Speed optimisation: optimisation of minimap2 parameters
k-mer size
Due to the reduced DNA alphabet, the minimap2 default k-mer size of 15 leads to substantially higher alignment times. Based on our tests we settled on a new default of ‘-k 20’.
minibatch size
The minimap2 default minibatch size of 500 million bp means that a substantial amount of data is aligned and held in memory before additional alignment threads can be started. Reducing the minibatch size to 250K bp seemed to be a good compromise (‘-K 250K’).
minimap2 multi-threading
minimap2 alignments may utilize multiple cores for each alignment process; we found that ‘-t 2’ offered a good speed-up, while allowing additional resources had diminishing returns.
Bismark multi-threading
We also tested the potential of using additional resources for Bismark itself (--parallel), which appeared to result in a speed-up of the alignment process as expected; however this comes at the cost of requiring additional system resources.
As a result of these tests, we changed the default settings for minimap2 alignment parameters to ‘-t 2 -k 20 -K 250K’.
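To make the combined settings concrete, the following Python sketch assembles a minimap2 argument list from the options named above ('-a', '--MD', '--secondary=no', a '-x' preset, and '-t 2 -k 20 -K 250K'). This only illustrates how the flags fit together. The function name, the argument order and the file names are assumptions, and the command Bismark actually constructs internally may differ.

```python
def build_minimap2_command(reference, reads, preset="map-ont",
                           threads=2, kmer=20, minibatch="250K"):
    """Assemble a minimap2 argument list (hypothetical helper, not Bismark code)."""
    return [
        "minimap2",
        "-x", preset,          # map-ont, map-pb or sr
        "-a",                  # SAM output
        "--MD",                # write the MD tag
        "--secondary=no",      # no secondary alignments
        "-t", str(threads),    # threads per alignment process
        "-k", str(kmer),       # k-mer size
        "-K", minibatch,       # minibatch size
        reference,
        reads,
    ]


# Hypothetical file names, for illustration only
cmd = build_minimap2_command("genome_CT_conversion.fa", "reads_C_to_T.fastq")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run() if minimap2 is installed, and keeping it as a list avoids shell quoting issues.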
methylation_consistency
- Added new option --chh to use cytosines in CHH instead of CpG context to enable some troubleshooting and method development
bismark2report
- The CHH/CHG labels for the Cytosine Methylation after Extraction plot now appear in the correct order
bismark_methylation_extractor
- removed a print statement that would flood STDOUT/the logfile if --merge_non_CG (but not --comprehensive) had been selected
- runs with --parallel/--multicore specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed
- changed the option -o/--output to -o/--output_dir for consistency reasons...
bismark_genome_preparation
- Added option --mm2/--minimap2. The genome indexing process (bismark_genome_preparation) writes out a minimap2 index to the genome folder, using the optimized k-mer size of -k 20 (see comments for bismark itself). This pre-generated minimap2 index takes precedence over indexing options that would otherwise happen as part of the alignment procedure.
deduplicate_bismark
- when using an output filename -o customname the deduplication report will also be derived from customname.
Added a sentence to the Docs that Genozip 14 and above supports Bismark BAM files (with a substantial gain in compression).
fix auto-detection
filter_non_conversion
- fixed global setting of --paired or --single mode. Auto-detection now works by only looking at the @PG ID:Bismark line of the SAM header.
methylation_consistency
- Auto-detection now works by only looking at the @PG ID:Bismark line of the SAM header.
coverage2cytosine
- Swapped the columns for count methylated and count unmethylated for the context summary report to match the header line.
v0.23.0
Bismark Release v0.23.0
- Migrated CI tests from Travis to Github Actions
deduplicate_bismark
- the command deduplicate_bismark --barcode *bam now works again. Previously the output file names were accidentally all derived from the first supplied file in --barcode (= UMI) mode (it had been fixed for normal files in 0.22.2).
- Changed the way the library auto-detection works to only look at the @PG ID:Bismark line of the SAM header (to only look for the Bismark command)
bismark_methylation_extractor / bismark2bedGraph
- Added a new option --ucsc to bismark2bedGraph and bismark_methylation_extractor that will produce a UCSC-ready bedGraph file if the genome version used came from Ensembl. This option (i) prefixes chromosome names with 'chr', and (ii) changes the mitochondrial chromosome from 'MT' to 'chrM'. In addition, it will also write out a new file ending in .chromosome_sizes.txt for easier use of bedGraphToBigWig. More here.
- Changed the way the library auto-detection works to only look at the @PG ID:Bismark line of the SAM header.
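The two renaming rules of --ucsc described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in these notes, not the code used by bismark2bedGraph. The function names and the example bedGraph line are assumptions.

```python
def to_ucsc_name(chrom):
    """Convert an Ensembl-style chromosome name to a UCSC-style name.

    Applies only the two rules described in the release notes:
    (i) prefix with 'chr', (ii) 'MT' becomes 'chrM'.
    """
    if chrom == "MT":
        return "chrM"
    return "chr" + chrom


def convert_bedgraph_line(line):
    """Rename the chromosome (first tab-separated column) of one bedGraph line."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = to_ucsc_name(fields[0])
    return "\t".join(fields)


# Hypothetical bedGraph line: chromosome, start, end, methylation percentage
print(convert_bedgraph_line("MT\t100\t101\t75.0"))
```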
coverage2cytosine
- Added a new output file for all cytosine context methylation totals. More information here: #321.
- Added new option --drach/--m6A. Most m6A sites are found in the conserved sequence motif DRACH (where D=G/A/U, R=G/A, H=A/U/C), and if bound by anti-m6A antibody, it causes the reverse transcriptase to introduce C to T transitions at the cytosine which follows A in the DRACH motif. This option also sets a coverage threshold of 1 unless specified explicitly. This is a very specialised option and should only be used by experimentalists looking at m6A methylation (where the C to T transition acts as a proxy of m6A).
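As a small illustration of the DRACH definition above, the following Python sketch locates the cytosine of every DRACH occurrence in an RNA sequence. It is not the coverage2cytosine implementation. The function name and the example sequence are assumptions, and the sequence is written in the RNA alphabet (U instead of T) to match the definition given in these notes.

```python
import re

# D = G/A/U, R = G/A, then A, C, and H = A/U/C.
# The lookahead allows overlapping occurrences to be reported as well.
DRACH = re.compile(r"(?=([GAU][GA]AC[AUC]))")


def find_drach_cytosines(sequence):
    """Return 0-based positions of the C in every DRACH motif of `sequence`."""
    # The cytosine is the fourth base of the motif, i.e. match start + 3
    return [match.start() + 3 for match in DRACH.finditer(sequence.upper())]


# Hypothetical example sequence containing two motifs (GGACA and AGACU)
print(find_drach_cytosines("GGACAGACU"))  # [3, 7]
```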
bismark2summary
- Samples with absolutely 0 methylation calls in some context are now excluded from the graphical HTML output (as they break rendering the entire summary graph section). These samples and their statistics do still appear in the file bismark_summary_report.txt. More information here: #315.
v0.22.3
Bismark
- Accepted pull request to fix the MAPQ score calculation in local mode.
methylation_consistency
- Added a new script to assess the concordance of methylation calls. See more here: https://github.com/FelixKrueger/Bismark/tree/master/Docs#x-concordance-of-methylation-calls-across-bisulfite-reads
0.22.2
- Added FAQ document for questions that keep coming up. Will be populated over time.
Bismark
- the option --non_bs_mm is now only allowed in end-to-end mode
- Fixed the calculation of non-bisulfite mismatches for paired-end data, which happened correctly only when R2 had an InDel (see here)
- When the option -u is used in conjunction with --parallel, only -u sequences are now written to the temporary subset files for each spawn of Bismark (previously, the entire file was split for --parallel, but then only a small subset of those files was used for -u, which resulted in very long runs even for a small number of analysed sequences)
deduplicate_bismark
- the command deduplicate_bismark *bam now works again. Previously the output file names were accidentally all derived from the first supplied file.
coverage2cytosine
- Added new option --coverage_threshold INT. Positions have to be covered by at least INT calls (irrespective of their methylation state) before they get reported. For NOMe-seq, the minimum threshold is automatically set to 1 unless specified explicitly. Setting a coverage threshold does not work in conjunction with --merge_CpGs (as all genomic CpGs are required for this). Default: 0 (i.e. all genomic positions get reported)
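The threshold rule can be sketched as follows: a position is reported when its methylated and unmethylated calls together reach the threshold. This is a minimal Python illustration of the rule as worded above, not the coverage2cytosine code. The function name and the example tuples are assumptions.

```python
def filter_by_coverage(positions, threshold=0):
    """Keep positions covered by at least `threshold` calls.

    `positions` is an iterable of (name, count_methylated, count_unmethylated).
    Coverage is the sum of both counts, irrespective of methylation state.
    """
    return [
        (name, meth, unmeth)
        for name, meth, unmeth in positions
        if meth + unmeth >= threshold
    ]


# Hypothetical positions: (label, methylated calls, unmethylated calls)
example = [("chr1:100", 0, 0), ("chr1:200", 3, 1), ("chr1:300", 0, 1)]
print(filter_by_coverage(example))               # default 0: everything is kept
print(filter_by_coverage(example, threshold=1))  # drops the uncovered position
```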
bismark2report
- added seconds to the timestamp report statement (which caused a warning on certain, but not all, platforms)
bismark2summary
- Now reads splitting reports even for non-deduplicated files (such as RRBS).
Essential Easter Performance Release [EEPR]
Bismark
- Hot-fixed (read: removed) the cause of delay during the MD:Z: field computation for reads containing a deletion (which was roughly equal to 1 second per read). Apologies, I did it again...
- Changed the default --score_min function for HISAT2 in --local mode back to a linear function (instead of using the logarithmic model that is employed by Bowtie 2). The default is now --score_min L,0,-0.2 for both end-to-end (default) and --local mode. It should be mentioned that we currently don't understand how exactly the scoring mode in HISAT2 works (even though the scores appear to be all negative with a maximum value of 0), so this might change somewhat in the future. See here for more info.
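For readers unfamiliar with the function notation, here is a short Python sketch that evaluates a --score_min specification for a given read length, using the function types documented for Bowtie 2 (C = constant, L = linear, S = square root, G = natural log). Whether HISAT2 applies the value in exactly the same way is, as noted above, not fully understood, so treat this as an illustration of the notation only. The function names are assumptions.

```python
import math


def parse_score_min(spec):
    """Split a specification such as 'L,0,-0.2' into (type, constant, coefficient)."""
    function_type, constant, coefficient = spec.split(",")
    return function_type, float(constant), float(coefficient)


def min_score(spec, read_length):
    """Evaluate f(x) = constant + coefficient * g(x) for read length x."""
    function_type, constant, coefficient = parse_score_min(spec)
    if function_type == "C":
        return constant
    if function_type == "L":
        return constant + coefficient * read_length
    if function_type == "S":
        return constant + coefficient * math.sqrt(read_length)
    if function_type == "G":
        return constant + coefficient * math.log(read_length)
    raise ValueError(f"unknown function type: {function_type}")


# Linear default from these notes: the minimum score scales with read length
print(min_score("L,0,-0.2", 100))
# Logarithmic function of the kind Bowtie 2 uses in --local mode (G,20,8)
print(min_score("G,20,8", 100))
```

With L,0,-0.2 a 100 bp read needs an alignment score of at least about -20, which fits the observation that scores are negative with a maximum of 0.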