Commit d47357d

update README
1 parent 1d30443 commit d47357d

3 files changed: +29 -12 lines changed

README.md

Lines changed: 25 additions & 9 deletions
@@ -1,4 +1,4 @@
-A tool to GENerate COnsensus REads.
+A fast tool to remove sequencing duplications and eliminate sequencing errors by generating consensus reads.
 * [What's gencore](#whats-gencore)
 * [A quick example](#a-quick-example)
 * [Download, compile and install](#get-gencore)
@@ -10,24 +10,29 @@ A tool to GENerate COnsensus REads.
 * [Read/cite gencore paper](#citation)
 
 # what's gencore?
-`gencore` is a tool to generate consensus reads from next-generation sequencing (NGS) data. It groups the reads derived from the same original DNA template, merges them and generates a consensus read, which contains much less errors than the original reads.
+`gencore` is a tool for fast and powerful deduplication of next-generation sequencing (NGS) data. It is much faster and uses much less memory than Picard and other tools. It generates informative reports in both HTML and JSON formats. It is based on an algorithm for `generating consensus reads`, which is why it is named `gencore`.
 
-This tool groups the reads of same origin by their mapping positions and unique molecular identifiers (UMI). It can run with or without UMI. If your FASTQ data has UMI integrated, you can use [fastp](https://github.com/OpenGene/fastp) to shift the UMI to read query names, and use `gencore` to generate consensus reads.
+Basically, `gencore` groups the reads derived from the same original DNA template and merges them into a consensus read, which contains far fewer errors than the original reads.
+
+`gencore` supports data with unique molecular identifiers (UMI). If your FASTQ data has UMI integrated, you can use [fastp](https://github.com/OpenGene/fastp) to shift the UMI to read query names, and use `gencore` to generate consensus reads.
 
 This tool can eliminate the errors introduced by library preparation and sequencing processes, and consequently reduce the false positives for downstream variant calling. This tool can also be used to remove duplicated reads. Since it generates consensus reads from duplicated reads, it outputs much cleaner data than a conventional duplicate remover. ***Due to these advantages, it is especially useful for processing ultra-deep sequencing data for cancer samples.***
 
 `gencore` accepts a sorted BAM/SAM with its corresponding reference fasta as input, and outputs an unsorted BAM/SAM.
 
-# Take a quick glance of the informative report
+# take a quick glance of the informative report
 * Sample HTML report: http://opengene.org/gencore/gencore.html
 * Sample JSON report: http://opengene.org/gencore/gencore.json
 
-# Try gencore to generate above reports
+# try gencore to generate above reports
 * BAM file for testing: http://opengene.org/gencore/input.sorted.bam
 * BED file for testing: http://opengene.org/gencore/test.bed
-* Ref file for testing: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta
-* Command for testing: `gencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed`
-* Then check the `gencore.html` and `gencore.json` in the working directory
+* Reference genome file: [ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta](ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta)
+* Command for testing:
+```shell
+gencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed --coverage_sampling=50000
+```
+* After the processing is finished, check the `gencore.html` and `gencore.json` in the working directory. The option `--coverage_sampling=50000` overrides the default setting (coverage_sampling=10000) to generate smaller report files by reducing the coverage sampling rate.
 
 # quick examples
 The simplest way
@@ -38,7 +43,7 @@ With a BED file to specify the capturing regions
 ```shell
 gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed
 ```
-Only output reads with >=2 supporting reads (useful for denoising by generating consensus reads with only duplicated reads)
+Only output the fragments with >=2 supporting reads (useful for aggressive denoising)
 ```shell
 gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed -s 2
 ```
@@ -79,6 +84,17 @@ As described above, gencore can eliminate the errors introduced by library prepa
 
 ***This is the image showing the result of gencore-processed BAM. It becomes much cleaner. Cheers!***
 
+# QC result reported by gencore
+gencore also performs some quality control while deduplicating and generating consensus reads. It reports mapping rate, duplication rate, mismatch rate and some statistical results. In particular, gencore reports the coverage statistics of the input BAM file at genome scale, and in capturing regions (if a BED file is specified).
+
+gencore reports the results in both HTML and JSON formats for manual checking and downstream analysis. See the examples of the interactive [HTML](http://opengene.org/gencore/gencore.html) report and the [JSON](http://opengene.org/gencore/gencore.json) report.
+
+## coverage statistics in genome scale
+![image](http://www.opengene.org/gencore/coverage-genome.png)
+
+## coverage statistics in capturing regions
+![image](http://www.opengene.org/gencore/coverage-bed.png)
+
 
 # how it works
 important steps:
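
As a rough illustration of the grouping-and-merging idea that the README text in this diff describes, the following C++ sketch groups reads by mapping position and UMI and takes a per-position majority vote within each group. It is not gencore's actual algorithm: the `Read` struct, the function names, the tie-breaking rule and the equal-length assumption are all hypothetical simplifications, and `minSupport` only loosely mirrors the `-s` option shown in the quick examples.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified read record (real tools work on BAM records).
struct Read {
    int contig;         // reference sequence index
    int position;       // leftmost mapping position
    std::string umi;    // empty when the data carries no UMI
    std::string bases;  // read sequence
};

// Per-position majority vote over reads of equal length.
// Ties go to the base that appears first in "ACGTN" (an arbitrary choice).
std::string majorityConsensus(const std::vector<std::string>& seqs) {
    if (seqs.empty()) return "";
    const std::string alphabet = "ACGTN";
    std::string consensus(seqs.front().size(), 'N');
    for (std::size_t i = 0; i < consensus.size(); ++i) {
        std::array<int, 5> counts{};
        for (const std::string& s : seqs) {
            assert(s.size() == consensus.size());  // simplifying assumption
            std::size_t idx = alphabet.find(s[i]);
            if (idx == std::string::npos) idx = 4;  // unknown symbols count as N
            ++counts[idx];
        }
        std::size_t best = 0;
        for (std::size_t b = 1; b < counts.size(); ++b)
            if (counts[b] > counts[best]) best = b;
        consensus[i] = alphabet[best];
    }
    return consensus;
}

// Group reads by (contig, position, UMI) and emit one consensus per group
// that has at least minSupport supporting reads.
std::vector<std::string> consensusReads(const std::vector<Read>& reads,
                                        std::size_t minSupport) {
    std::map<std::tuple<int, int, std::string>, std::vector<std::string>> groups;
    for (const Read& r : reads)
        groups[std::make_tuple(r.contig, r.position, r.umi)].push_back(r.bases);

    std::vector<std::string> out;
    for (const auto& kv : groups)
        if (kv.second.size() >= minSupport)
            out.push_back(majorityConsensus(kv.second));
    return out;
}
```

With `minSupport` set to 1 every group yields a read (plain deduplication); with 2, groups backed by a single read are dropped, which corresponds to the more aggressive denoising mode mentioned in the README.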

src/bed.cpp

Lines changed: 0 additions & 1 deletion
@@ -79,7 +79,6 @@ void Bed::statDepth(int tid, int start, int len) {
 }
 
 void Bed::reportJSON(ofstream& ofs) {
-    ofs << "," << endl;
     ofs << "\t\t\"coverage_bed\":{" << endl;
     for(int c=0; c<mContigRegions.size();c++) {
         string contig(mOptions->bamHeader->target_name[c]);

src/stats.cpp

Lines changed: 4 additions & 2 deletions
@@ -159,11 +159,13 @@ void Stats::reportJSON(ofstream& ofs) {
         ofs << ",";
         ofs << endl;
     }
-    ofs << "\t\t}" << endl;
+    ofs << "\t\t}";
 
     if(mOptions->hasBedFile) {
+        ofs << "," << endl;
         mBedStats->reportJSON(ofs);
-    }
+    } else
+        ofs << endl;
 }
 
 void Stats::print() {
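
The two source changes above move the JSON separator from the callee to the caller: the BED section no longer writes a leading comma, and the caller writes `,` only when a BED section actually follows. A minimal, self-contained sketch of that pattern is shown below. The function names, the `"coverage"` key and the empty object bodies are hypothetical stand-ins, not the real `Stats`/`Bed` members; only the `"coverage_bed"` key is taken from the diff.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for Bed::reportJSON: writes only its own object,
// with no leading comma.
void writeCoverageBed(std::ostream& os) {
    os << "\t\t\"coverage_bed\":{" << '\n';
    os << "\t\t}" << '\n';
}

// Hypothetical stand-in for the tail of Stats::reportJSON.
void writeCoverageSections(std::ostream& os, bool hasBedFile) {
    os << "\t\t\"coverage\":{" << '\n';   // assumed name for the preceding section
    os << "\t\t}";                         // no newline yet: a comma may follow
    if (hasBedFile) {
        os << "," << '\n';                 // the caller owns the separator
        writeCoverageBed(os);
    } else {
        os << '\n';
    }
}

// Convenience wrapper that renders the fragment to a string.
std::string render(bool hasBedFile) {
    std::ostringstream oss;
    writeCoverageSections(oss, hasBedFile);
    return oss.str();
}
```

Keeping the separator in the caller means each section writer can be used in any position without knowing whether something precedes it, and the comma ends up on the same line as the closing brace instead of on a line of its own.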
