update README

sfchen · sfchen · commit d47357d60cc8 · 2019-06-28T13:59:53.000+08:00
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-A tool to GENerate COnsensus REads.
+A fast tool to remove sequencing duplications and eliminate sequencing errors by generating consensus reads.
 * [What's gencore](#whats-gencore)
 * [A quick example](#a-quick-example)
 * [Download, compile and install](#get-gencore)
@@ -10,24 +10,29 @@ A tool to GENerate COnsensus REads.
 * [Read/cite gencore paper](#citation)
 
 # what's gencore?
-`gencore` is a tool to generate consensus reads from next-generation sequencing (NGS) data. It groups the reads derived from the same original DNA template, merges them and generates a consensus read, which contains much less errors than the original reads.
+`gencore` is a tool for fast and powerful deduplication for next-generation sequencing (NGS) data. It is much faster and uses much less memory than Picard and other tools. It generates very informative reports in both HTML and JSON formats. It's based on an algorithm for `generating consensus reads`, and that's why it's named `gencore`.
 
-This tool groups the reads of same origin by their mapping positions and unique molecular identifiers (UMI). It can run with or without UMI. If your FASTQ data has UMI integrated, you can use [fastp](https://github.com/OpenGene/fastp) to shift the UMI to read query names, and use `gencore` to generate consensus reads.
+Basically, `gencore` groups the reads derived from the same original DNA template and merges them into a consensus read, which contains far fewer errors than the original reads.
+
+`gencore` supports the data with unique molecular identifiers (UMI). If your FASTQ data has UMI integrated, you can use [fastp](https://github.com/OpenGene/fastp) to shift the UMI to read query names, and use `gencore` to generate consensus reads.
 
 This tool can eliminate the errors introduced by library preparation and sequencing processes, and consequently reduce the false positives for downstream variant calling. This tool can also be used to remove duplicated reads. Since it generates consensus reads from duplicated reads, it outputs much cleaner data than a conventional duplication remover. ***Due to these advantages, it is especially useful for processing ultra-deep sequencing data for cancer samples.***
 
 `gencore` accepts a sorted BAM/SAM with its corresponding reference fasta as input, and outputs an unsorted BAM/SAM.
 
-# Take a quick glance of the informative report
+# take a quick glance of the informative report
 * Sample HTML report: http://opengene.org/gencore/gencore.html
 * Sample JSON report: http://opengene.org/gencore/gencore.json
 
-# Try gencore to generate above reports
+# try gencore to generate above reports
 * BAM file for testing: http://opengene.org/gencore/input.sorted.bam
 * BED file for testing: http://opengene.org/gencore/test.bed
-* Ref file for testing: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta
-* Command for testing: `gencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed`
-* Then check the `gencore.html` and `gencore.json` in the working directory
+* Reference genome file: [ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta](ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta)
+* Command for testing: 
+```shell
+gencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed --coverage_sampling=50000
+```
+* After processing finishes, check `gencore.html` and `gencore.json` in the working directory. The option `--coverage_sampling=50000` overrides the default setting (`coverage_sampling=10000`) to generate smaller report files by reducing the coverage sampling rate.
 
 # quick examples
 The simplest way
@@ -38,7 +43,7 @@ With a BED file to specify the capturing regions
 ```shell
 gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed
 ```
-Only output reads with >=2 supporting reads (useful for denoising by generating consensus reads with only duplicated reads)
+Only output fragments with >=2 supporting reads (useful for aggressive denoising)
 ```shell
 gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed -s 2
 ```
@@ -79,6 +84,17 @@ As described above, gencore can eliminate the errors introduced by library prepa
 
 ***This is the image showing the result of gencore processed BAM. It becomes much cleaner. Cheers!***
 
+# QC result reported by gencore
+gencore also performs quality control while deduplicating and generating consensus reads. It reports the mapping rate, duplication rate, mismatch rate and other statistics. In particular, gencore reports coverage statistics for the input BAM file at genome scale and in capturing regions (if a BED file is specified).
+
+gencore reports the results in both HTML and JSON formats for manual checking and downstream analysis. See the examples of the interactive [HTML](http://opengene.org/gencore/gencore.html) report and the [JSON](http://opengene.org/gencore/gencore.json) report.
+
+## coverage statistics in genome scale
+![image](http://www.opengene.org/gencore/coverage-genome.png) 
+
+## coverage statistics in capturing regions
+![image](http://www.opengene.org/gencore/coverage-bed.png) 
+
 
 # how it works
 important steps:
diff --git a/src/bed.cpp b/src/bed.cpp
@@ -79,7 +79,6 @@ void Bed::statDepth(int tid, int start, int len) {
 }
 
 void Bed::reportJSON(ofstream& ofs) {
-	ofs << "," << endl;
 	ofs << "\t\t\"coverage_bed\":{" << endl;
 	for(int c=0; c<mContigRegions.size();c++) {
 		string contig(mOptions->bamHeader->target_name[c]);
diff --git a/src/stats.cpp b/src/stats.cpp
@@ -159,11 +159,13 @@ void Stats::reportJSON(ofstream& ofs) {
 			ofs << ",";
         ofs << endl;
 	}
-	ofs << "\t\t}" << endl;
+	ofs << "\t\t}";
 
 	if(mOptions->hasBedFile) {
+		ofs << "," << endl;
 		mBedStats->reportJSON(ofs);
-	}
+	} else
+		ofs << endl;
 }
 
 void Stats::print() {
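
The `src/stats.cpp` and `src/bed.cpp` hunks above move the `","` separator from the BED writer to its caller, so the optional `coverage_bed` section is preceded by a comma only when it is actually emitted. A minimal sketch of that pattern follows. The names `writeBedSection` and `renderCoverage` and the sample values are hypothetical stand-ins, not gencore's real functions or output, and the sketch closes the enclosing object itself so that the result is complete JSON.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for Bed::reportJSON: it writes only its own
// section, with no leading comma and no trailing newline.
void writeBedSection(std::ostream& os) {
    os << "\t\t\"coverage_bed\":{" << std::endl;
    os << "\t\t\t\"chr1\":[1,2,3]" << std::endl;  // made-up sample values
    os << "\t\t}";
}

// Hypothetical stand-in for the tail of Stats::reportJSON: the caller
// decides whether a separator is needed before the optional section.
std::string renderCoverage(bool hasBedFile) {
    std::ostringstream os;
    os << "{" << std::endl;
    os << "\t\t\"coverage\":{" << std::endl;
    os << "\t\t\t\"chr1\":[4,5,6]" << std::endl;  // made-up sample values
    os << "\t\t}";                   // no newline yet: a comma may follow
    if (hasBedFile) {
        os << "," << std::endl;      // separator is owned by the caller
        writeBedSection(os);
    }
    os << std::endl << "}" << std::endl;
    return os.str();
}
```

Calling `renderCoverage(true)` yields both sections separated by a single comma, while `renderCoverage(false)` yields only the `coverage` section with no dangling separator. Keeping the comma in the caller means the section writer does not need to know whether anything precedes it.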
