-
Notifications
You must be signed in to change notification settings - Fork 29
Large deletion calling and heteroplasmy estimation
This page describes the additional modules dedicated to detecting and quantifying mtDNA deletions from sequencing data. We've most thoroughly evaluated our computational approach using the mscATAC-seq approach, but the general approach can be applied in other settings, such as exome sequencing data. Overall, three main execution steps are required: 1) deletion detection; 2) single-cell preparation (optional but suggested); 3) quantification of deletion heteroplasmy.
We note that we've explicitly validated this tool using mtscATAC-seq data from patients with KSS, CPEO, and Pearson Syndrome, where all deletions occur between Oh and OL. However, through simulated data, we verified that deletion junctions spanning the ends of the reference genome can also be detected and quantified with no issues.
The first step is to identify deletions. The mgatk-del-find
executable takes in a .bam
file (presumably containing many single-cells for a single-cell assay) and returns the most likely positions that are participating in a deletion. An example of this functionality is available in tests
from this repository. Execute the command like so:
mgatk-del-find -i pearsonbam/CACCACTAGGAGGCGA-1.qc.bam
This will produce five output files in the same file path as the original .bam
file, including three plots and two raw data tables. First, let's look at the output data table:
head mgatkdel_find.clip.tsv
position coverage clip_count SA
16569 83 82 6
13095 208 62 20
1 65 61 5
6073 182 54 17
10934 131 41 0
3571 172 37 0
301 164 31 0
13753 230 22 0
296 216 21 0
This file shows the number of reads that cover each position as well as the number of reads that are clipped at the particular position, sorted by the number of clipped reads. Unsurprisingly, positions 1 and 16569 come near the top, which is to be expected as the mtDNA is circular and is linked at these two positions. Next, the SA column shows the number of times that clipped position is supported by a secondary alignment. For true deletions, the other half of the alignment will align to chrM, which is recorded by this column. While this table is good for parsing and downstream analyses, we include a summary visualization in the *.clipped_viz.pdf
file to rapidly determine potential mtDNA deletion points:
Here, the plot shows the positions of the mtDNA genome along the x-axis and the number of clipped reads where a clip occurs at the particular position. The top ten most clipped loci are shown with the position annotated (excluding the circular genome junction in the d-loop). After evaluating several different datasets for putative mtDNA deletions, were observed recurrent positions that are nominated, which are shown in grey. In other words, we've identified several positions that correspond to a "filter list" that we de-emphasize with this color scheme. From this visualization, we can see that 6073-13095
is the most likely deletion for this sample.
We can verify that the purported clips correspond to a loss of coverage in the putative deleted area by looking at the *.coverage_viz.pdf
file:
So, these images strongly implicate the chrM:6073-13095
region as an mtDNA deletion in this sample. When we combine these altogether to make a call:
Note that these "split alignments" can be sourced via this file: (head mgatkdel_find.SA.tsv
) where each row is a different read that shows the spliced alignment (clip) on either column.
out1 out2
16569 1
16569 1
16569 1
16569 1
16569 1
16569 1
1684 685
1684 685
652 1721
To quantify heteroplasmy in individual samples/cells for specific deletions, we utilize the mgatk-del
module, which is shown in Step 3. The input that module is a folder of quality-controlled .bam
files, where it is assumed that one cell is present in each .bam
file, and we further recommend removing PCR duplicates. If your sample comes from a 10x mscATAC-seq run, we generate these files as an intermediate, which can be accessible using the regular mgatk
execution (see here in the wiki for more information). Critically, one must throw the -qc
or --keep-qc-bams
flag to save these intermediate files as they are ordinary deleted in the standard execution. An example execution is shown below:
mgatk bcall -i path_to_full_10x_run.bam -o regular_mgatk_call -n regular_mgatk_call -c 4 \
-g rCRS -qc -bt CB -b barcode/path_to_full_10x_runbarcodes.tsv
In brief, this produces all of the standard output for an mgatk
execution but further results in a folder (regular_mgatk_call/qc_bam
) that can be used as the input for the final tool.
For clarity, this step is optional as one may prepare .bam
files in other workflows, but for genotyping with mscATAC-seq
assay, we recommend executing the mgatk
command similar to that shown above.
Finally, after the mtDNA deletions are identified (Step 1) and the input files are pre-processed (Step 2), we can execute mgatk-del
. In brief, the tool requires an input (-i
) of a folder of .bam
files and the left (-lc
) and right (-rc
) coordinates of putative mtDNA deletions for quantification. For our example on this page (in tests), we recommend the following execution:
mgatk-del -i pearsonbam -lc 6073 -rc 13095
By specifying the output folder (regular_mgatk_call
) to be the same as what was declared in Step 2, the .deletion_heteroplasmy.tsv
file will be present in the /final/
folder with the other genotyping files. We recommend this for convenience in organization, but the -o
argument can be any output folder destination.
Here, mgatk-del
produces only a single output file, which is shown here:
cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv
cell_id heteroplasmy reads_del reads_wt reads_all total_clipped_bases deletion version
CTAACTTAGAGCCACA-1 42.24 68 93 161 1718 del6073-13095 improved
CTAACTTAGAGCCACA-1 42.24 68 93 161 NA del6073-13095 naive
GCCTAGGCAGTTCGGC-1 51.2 64 61 125 1375 del6073-13095 improved
GCCTAGGCAGTTCGGC-1 51.59 65 61 126 NA del6073-13095 naive
CACCACTAGGAGGCGA-1 34.05 63 122 185 1412 del6073-13095 improved
CACCACTAGGAGGCGA-1 34.05 63 122 185 NA del6073-13095 naive
Each line shows a unique cell/deletion combination and includes all relevant information about the deletion heteroplasmy, including the number of reads supporting a "wild-type" allele and the number supporting a deletion allele. The heteroplasmy is defined as the ratio of reads_del
to reads_all
multiplied by 100%
. Finally, after additional simulations, we report an "improved" heteroplasmy tab that provides further stringency for estimation, particularly of low effect (rare) deletions.
We can also specify the quantification of multiple deletions by supplying multiple coordinates to the -lc
and -rc
flags like so:
mgatk-del -i pearsonbam -z -lc 6073,6000 -rc 13095,14000
This is particularly useful for complex single-cell experiment settings or when evaluating many different potential deletions. We can then examine the output:
cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv | grep improved
CTAACTTAGAGCCACA-1 42.24 68 93 161 1718 del6073-13095 improved
CTAACTTAGAGCCACA-1 0 0 171 171 0 del6000-14000 improved
GCCTAGGCAGTTCGGC-1 51.2 64 61 125 1375 del6073-13095 improved
GCCTAGGCAGTTCGGC-1 0 0 165 165 0 del6000-14000 improved
CACCACTAGGAGGCGA-1 34.05 63 122 185 1412 del6073-13095 improved
CACCACTAGGAGGCGA-1 0 0 208 208 0 del6000-14000 improved
Again, we see that each line is a unique cell/deletion combination with all appropriate summary statistics.
Here, we verify that the heteroplasmy is identically zero for all cells for the del6000-14000
, which is expected as these positions had no evidence of contributing to a deletion from the analyses upstream (we note however that coverage ased estimates would suggest that the deletion would be supported. We emphasize that the deletions are parsed from -lc
and -rc
such that the order of the arguments for both parameters correspond to one deletion (e.g. a chrM:6073-14000 deletion was not considered since 6073 was first in the -lc
comma-separated string but 14000 was second in the -rc
string). Thus, a parameterization of -lc 6073,6073 -rc 13095,14000
would be required to compute both of the chrM:6073-13095 and chrM:6073-14000 heteroplasmies.
- Caleb Lareau
- Jeffrey Verboon
- Kopal Garg
Please raise an issue here