=================
The snakemake analysis workflow for bioinformatics analysis, including
- ATAC-seq/Cut&Tag/ChIP-seq
- DNA-seq(WGS and WES)
- Assembly
- RNA-seq
- HiC
To-Do list will be updated soon...
One-Click global environment configuration
sh setup.sh
involves
- replacing the value of the root_dir parameter, which represents the project path.
- subsequently, all dependency packages are installed.
- finally, the file base.py from the utils folder is copied to the installed path of the
snakemake-wrapper-utils
package.
In the global configuration file, named config.yaml, both the pipeline
and outdir
parameters require manual configuration.
- the
pipeline
parameter is used to specify the execution workflow. - the
outdir
parameter is employed to define the path for saving the output results.
In the global configuration file, parameters will be overridden by the identically named parameters in the workflow configuration file, indicating that the workflow configuration file takes precedence with higher priority.
workflow include:
- ATAC-seq: config file is ATAC-seq.yaml
- DNA-seq: config file is DNA-seq.yaml
- Assembly: config file is ATAC-seq.yaml
- RNA-seq: config file is ATAC-seq.yaml
example files in directory example
, you need to modify the path of the files sample_info.json
and sample_list.txt
- sample_list.txt: using script
generate_setting.py
to generate or manally create
python generate_setting.py path/to/fastq.gz
sample | fastq1 | fastq2(optional) | type(optional) | experiment(optional) |
---|---|---|---|---|
sp1 | path/to/sp1.R1.fq.gz | path/to/sp1.R2.fq.gz | control | chip |
sp2 | path/to/sp2.R1.fq.gz | path/to/sp2.R2.fq.gz | control | chip |
sp3 | path/to/sp3.R1.fq.gz | path/to/sp1.R2.fq.gz | test | atac |
sp4 | path/to/sp4.R1.fq.gz | path/to/sp2.R2.fq.gz | test | atac |
- sample_info.json: paired samples
{
"sp3": "sp1",
"sp4": "sp2"
}
# run in local
snakemake -s path/to/Snakefile --use-conda -c4
# or run in the slurm task management system
snakemake -s path/to/Snakefile --profile path/to/config/slurm
If you use the slurm task management system, you can write the parameters of sbatch
command as the value of SBATCH_DEFAULTS
in the settings.json file. such as
{
"SBATCH_DEFAULTS": "--nodelist=node1",
"CLUSTER_NAME": "",
"CLUSTER_CONFIG": ""
}
- fastp: A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance.
- trimmomatic: A flexible read trimming tool for Illumina NGS data.
- cutadapt: Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
- trim_galore: Trim Galore is a wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.
- bwa: BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
- bwa-mem2: The tool bwa-mem2 is the next version of the bwa-mem algorithm in bwa. It produces alignment identical to bwa and is ~1.3-3.1x faster depending on the use-case, dataset and the running machine.
- bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
- hisat2: HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome).
- star: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license.
- minimap2: Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.
- Picard
markduplicates
: Picard is implemented using the HTSJDK Java library HTSJDK to support accessing file formats that are commonly used for high-throughput sequencing data such as SAM and VCF. - sambamba: Sambamba is a high performance, highly parallel, robust and fast tool (and library), written in the D programming language, for working with SAM and BAM files.
- samtools: mpileup and other tools for handling SAM, BAM, CRAM.
- bedtools: The swiss army knife for genome arithmetic.
- deeptools: deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.
- macs2: Model-based Analysis of ChIP-Seq (MACS), for identifying transcript factor binding sites.
- homer: HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and ChIP-Seq analysis.
Referring to the QC methods used in the ENCODE ATAC-seq pipeline
- Total reads
- Mapped reads
-
Fraction of reads in peaks (FRiP): Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)
-
Fraction of reads in annotated regions
-
Reads count distribution of chromatin
- ChIP-seq Standards
PBC1 | PBC2 | Bottlenecking level | NRF | Complexity |
---|---|---|---|---|
< 0.5 | < 1 | Severe | < 0.5 | Concerning |
0.5 ≤ PBC1 < 0.8 | 1 ≤ PBC2 < 3 | Moderate | 0.5 ≤ NRF < 0.8 | Acceptable |
0.8 ≤ PBC1 < 0.9 | 3 ≤ PBC2 < 10 | Mild | 0.8 ≤ NRF < 0.9 | Compliant |
≥ 0.9 | ≥ 10 | None | > 0.9 | Ideal |
- ATAC-seq Standards
PBC1 | PBC2 | Bottlenecking level | NRF | Complexity |
---|---|---|---|---|
< 0.7 | < 1 | Severe | < 0.7 | Concerning |
0.7 ≤ PBC1 ≤ 0.9 | 1 ≤ PBC2 ≤ 3 | Moderate | 0.7 ≤ NRF ≤ 0.9 | Acceptable |
≥ 0.9 | > 3 | None | > 0.9 | Ideal |
- PCR Bottlenecking Coefficient 1 (PBC1)
M1: number of genomic locations where exactly one read maps uniquely MDISTINCT: number of distinct genomic locations to which some read maps uniquely
- PCR Bottlenecking Coefficient 2 (PBC2)
M1: number of genomic locations where only one read maps uniquely M2: number of genomic locations where two reads map uniquely
- Non-Redundant Fraction (NRF) – Number of distinct uniquely mapping reads (i.e. after removing duplicates) / Total number of reads.
- GATK: GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.
- vep: VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.
- delly: Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome.
- hifiadapterfit: Convert .bam to .fastq and remove reads with remnant PacBio adapter sequences.
- jellyfish: Jellyfish is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. Jellyfish can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.
- genomescope: Fast genome analysis from unassembled short reads.
- genomescope2: Reference-free profiling of polyploid genomes.
- gfatools: gfatools is a set of tools for manipulating sequence graphs in the GFA or the rGFA format. It has implemented parsing, subgraph and conversion to FASTA/BED. More functionality may be added in future.
- seqkit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation.
- quast: Genome assembly evaluation tool.
- merqury: Evaluate genome assemblies with k-mers and more.
- meryl: A genomic k-mer counter (and sequence utility) with nice features.
- busco: Assessing genomic data quality and beyond.
- hifiasm: Hifiasm is a fast haplotype-resolved de novo assembler initially designed for PacBio HiFi reads. Its latest release could support the telomere-to-telomere assembly by utilizing ultralong Oxford Nanopore reads. Hifiasm produces arguably the best single-sample telomere-to-telomere assemblies combing HiFi, ultralong and Hi-C reads, and it is one of the best haplotype-resolved assemblers for the trio-binning assembly given parental short reads. For a human genome, hifiasm can produce the telomere-to-telomere assembly in one day.
- juicer: Juicer is a platform for analyzing kilobase resolution Hi-C data. In this distribution, we include the pipeline for generating Hi-C maps from fastq raw data files and command line tools for feature annotation on the Hi-C maps.
- 3d-dna: 3D de novo assembly (3D DNA) pipeline(The /tmp directory requires a large space. If not, you need to specify
--tempdir
where parallel is used in the code).
- ...
- rsem: Accurate quantification of gene and isoform expression from RNA-Seq data
- salmon: Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
- kallisto: Near-optimal RNA-Seq quantification
- Subread
featurecounts
: The Subread software package is a tool kit for processing next-gen sequencing data. It includes Subread aligner, Subjunc exon-exon junction detector and featureCounts read summarization program.
FPKM: The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity.
where
$q_i$ are raw read or fragment counts,$l_i$ is feature (i.e., gene or transcript) length, and$\sum_{j}q_j$ corresponds to the total number of mapped reads or fragments.
TPM: The transcripts per million calculation is similar to FPKM, but the difference is that all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values are used as an indicator of size.
where
$q_i$ denotes reads mapped to transcript,$l_i$ is the transcript length, and$\sum_j(\frac{q_j}{l_j})$ corresponds to the sum of mapped reads to transcript normalized by transcript length.
- DESeq2: Differential gene expression analysis based on the negative binomial distribution.
- edgeR: Empirical Analysis of Digital Gene Expression Data in R.
- limma: Linear Models for Microarray Data.
- Running in conda environment, all of the packages and software no need to install mannually.
- Please set software parameters in the configuration file.