-
Notifications
You must be signed in to change notification settings - Fork 29
Increasing coverage from 10x processing
Caleb Lareau
We consistently observe low-coverage regions in certain mtDNA regions due to homology with nuclear DNA. In the setting of scATAC-seq, we expect to observe significantly more reads truly generating from the mtDNA than the corresponding nuclear segment. Thus, the idea is to hard-mask regions of the nuclear genome that share consistent homology with the mtDNA, which in effect forces observed reads to the mtDNA when aligned.
Several common reference genomes have been provided already, but a workflow for how to
generate your own is discussed in the mitoblacklist repo.
The pre-existing blacklist .bed
files can be easily downloaded from there.
One can follow this code using the reference data from CellRanger 1.0.0 to produce a hard-mask modified reference genome.
One exception is that I couldn't figure out how to make the .flat
and .gdx
files,
nor could I figure out where they were important in the 10X protocol, but the pipeline
evidently does need them. So, I just keep them from the old version.
mv genome.fa old_genome.fa
mv genome.fa.flat old_genome.fa.flat
mv genome.fa.gdx old_genome.fa.gdx
bedtools maskfasta -fi old_genome.fa -bed mito_blacklist.bed -fo genome.fa
# Cleanup old files
rm old_genome.fa
rm genome.fa.*
# Make new reference genomes
bwa index genome.fa
samtools faidx genome.fa
# Put the flat files back; these won't be hard-masked, but they aren't being used by the aligner
mv old_genome.fa.flat genome.fa.flat
mv old_genome.fa.gdx genome.fa.gdx
# Update the timestamp to pass the CellRanger preflight checks
touch -d "now" genome.fa.flat
touch -d "now" genome.fa.gdx
Update: in version 1.2.0+ of CellRanger-ATAC, the ability to make a full custom reference
genome is possible.
Thus, a more natural way of generating this blacklist-compatible reference is to 1) generate the hard-masked .fa
file and
then 2) execute the cellranger-atac mkref
command as suggested.
To do this, format the GRCh38_mask.config
with a path to the hard masked .fasta file (from the output of bedtools maskfasta
) as shown below:
{
organism: "human" # same as before
genome: ["GRCh38"] # same as before
input_fasta: ["/path/to/hardmasked_GRCh38/assembly_hardmask.fa"] # updated from bedtools maskfasta with masked regions
input_gtf: ["/path/to/gencode/annotation.gtf"] # same as before
non_nuclear_contigs: ["chrM"] # same as before
input_motifs: "/path/to/jaspar/motifs.pfm" # same as before
}
This json
formatted file coordinates all key components needed to build a reference. Then, with version 1.2.0+ of CellRanger-ATAC, run:
cellranger-atac mkref --config=GRCh38_mask.config
to generate the full reference path.
Please raise an issue here