Commit 6c3f609

Readme update
1 parent 26e09ae commit 6c3f609

1 file changed: +24 -18 lines changed

README.md

Lines changed: 24 additions & 18 deletions
@@ -1,6 +1,5 @@
-=====================================================================
-MiGEC: Molecular Identifier Group-based Error Correction pipeline
-=====================================================================
+MiGEC: Molecular Identifier Group-based Error Correction pipeline
+===================================================================
 
 This pipeline provides several useful tools for analysis of immune repertoire sequencing data. The pipeline utilizes unique nucleotide tags (UMIs) in order to filter experimental errors from resulting sequences. Those tags are attached to molecules before sequencing library preparation and allow to backtrack the original sequence of molecule. This pipeline is applicable for Illumina MiSeq and HiSeq 2500 reads. Sequencing libraries targeting CDR3 locus of immune receptor genes with high over-sequencing, i.e. ones that have at least 10 reads (optimally 30+ reads) per each starting molecule, should be used.
 
@@ -18,20 +17,20 @@ or simply download a standalone jar and execute
 
 >$java -cp migec.jar Checkout
 
-NOTE: The data from 454 platform should be used with caution, as it contains homopolymer errors which (in present framework) result in reads dropped during consensus assembly. The 454 platform has a relatively low read yield, so additional read dropping could result in over-sequencing level below required threshold. If you still wish to give it a try, we would recommend filtering off all short reads and repairing indels with Coral (http://www.cs.helsinki.fi/u/lmsalmel/coral/), the latter should be run with options ```-mr 2 -mm 1000 -g 3```.
+NOTE: The data from 454 platform should be used with caution, as it contains homopolymer errors which (in present framework) result in reads dropped during consensus assembly. The 454 platform has a relatively low read yield, so additional read dropping could result in over-sequencing level below required threshold. If you still wish to give it a try, we would recommend filtering off all short reads and repairing indels with Coral (<http://www.cs.helsinki.fi/u/lmsalmel/coral/>), the latter should be run with options ```-mr 2 -mm 1000 -g 3```.
 
 STANDARD PIPELINE
-=================
+-----------------
+
+### 1. Checkout
 
-1. Checkout
-==============================
 Description: A script to perform de-multiplexing and UMI tag extraction
 
 Standard usage:
->$java -cp migec.jar Checkout -cu barcodes.txt R1.fastq.gz R2.fastq.gz ./checkout/
+>$java -cp migec.jar Checkout -cute barcodes.txt R1.fastq.gz R2.fastq.gz ./checkout/
 
 For unpaired library:
->$java -cp migec.jar Checkout -cu barcodes.txt R.fastq.gz - ./checkout/
+>$java -cp migec.jar Checkout -cute barcodes.txt R.fastq.gz - ./checkout/
 
 barcodes.txt format is the following,
 >SAMPLE-ID (tab) MASTER-ADAPTER-SEQUENCE (tab) SLAVE-ADAPTER-SEQUENCE
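
Reviewer's sketch (not part of the commit): the tab-separated `barcodes.txt` layout described in the hunk above can be sanity-checked before running Checkout. The sample IDs and adapter values below are hypothetical placeholders, not real sequences, and the check assumes that all three columns must be present, which the README excerpt does not confirm.

```shell
# Write a small barcodes file in the layout the README describes:
# SAMPLE-ID <tab> MASTER-ADAPTER-SEQUENCE <tab> SLAVE-ADAPTER-SEQUENCE
# All values are hypothetical placeholders for illustration only.
barcodes_file="$(mktemp)"
printf 'SAMPLE1\tMASTER_ADAPTER_1\tSLAVE_ADAPTER_1\n' >  "$barcodes_file"
printf 'SAMPLE2\tMASTER_ADAPTER_2\tSLAVE_ADAPTER_2\n' >> "$barcodes_file"

# Count lines that do not have exactly three non-empty tab-separated fields.
# Assumption: all three columns are mandatory.
bad_lines="$(awk -F '\t' \
  'NF != 3 || $1 == "" || $2 == "" || $3 == "" { n++ } END { print n + 0 }' \
  "$barcodes_file")"

echo "malformed lines: $bad_lines"
```

A file that passes this check could then be supplied as the `barcodes.txt` argument in the Checkout commands shown in the diff.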
@@ -45,6 +44,14 @@ will search for AAGGTT seed exact match, then for the remaining adapter sequence
 
 Additional parameters:
 
+```-c``` compressed output (gzip compression).
+
+```-t``` trim adapter sequence from output.
+
+```-e``` also remove trails of template-switching (poly-G) for the case when UMI-containing adapter is added using reverse-transcription (cDNA libraries).
+
+Barcode search parameters:
+
 ```-o``` could speed up if reads are oriented (i.e. master adapter should be in R1).
 
 ```-r``` will apply a custom RC mask. By default it assumes Illumina reads with mates on different strands, so it reverse-complements read with slave adapter so that output reads will be on master strand.
@@ -53,9 +60,8 @@ Additional parameters:
 
 
 
+### 2. Histogram
 
-2. Histogram
-==============================
 Description: A script to generate over-sequencing statistics
 
 Standard usage:
@@ -66,8 +72,8 @@ Will generate several files, the one important for basic data processing is ./ch
 
 
 
-3. Assemble
-==============================
+### 3. Assemble
+
 Description: A script to perform UMI-guided assembly
 
 Standard usage:
@@ -98,8 +104,8 @@ To inspect the effect of such single-mismatch erroneous UMI sub-variants see "co
 
 
 
-4. CdrBlast
-===============================
+### 4. CdrBlast
+
 Description: A script to extract CDR3 sequences
 
 Standard usage (assuming library contains T-cell Receptor Alpha Chain sequences)
@@ -115,15 +121,15 @@ For raw data:
 
 NOTE:
 
-1) NCBI-BLAST+ package required. Could be directly installed on Linux using a command like $sudo apt-get ncbi-blast+ or downloaded and installed from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
+1) NCBI-BLAST+ package required. Could be directly installed on Linux using a command like $sudo apt-get ncbi-blast+ or downloaded and installed directly from here: <ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/>
 
 2) Both raw and assembled data should be processed to apply the last step of filtration.
 
 
 
 
-5. FilterCdrBlastResults
-============================================
+### 5. FilterCdrBlastResults
+
 Description: A script to filter erroneous CDR3 sequences produced due to hot-spot PCR and NGS errors
 
 Standard usage:
