Output after typing this command should include the flag \fn{-DHAVE\_SAMTOOLS} if the linking is successful. If compiled successfully, the executable file is available
in \textbf{preseq/}.

-If a BAM file is used as input without first having run \begingroup\fontsize{9pt}{11pt}\selectfont\fn{\$ make all SAMTOOLS\_DIR=/loc/of/SAMTools}\endgroup, then the following error will occur: \begingroup\fontsize{9pt}{12pt}\selectfont\fn{terminate called after throwing an instance of 'std::string'}\endgroup.
+If a BAM file is used as input without successful linking to
+SAMTools, then the following error will occur:
+\begingroup\fontsize{9pt}{12pt}\selectfont\fn{terminate called after throwing an instance of 'std::string'}\endgroup.

\newpage

-\section{Using preseq}
+\section{Using \textbf{preseq}}
\label{sec:usage}

\paragraph{Basic usage}~\\~\\[-.2cm]
\label{sub:basic}
-To generate the complexity plot of a genomic
+To generate the complexity curve of a genomic
library from a read file in BED or BAM format or a duplicate count file,
coverage is highly variable and uncertain function
of sequencing depth. Some regions may be missing
due to locus dropout or preferentially amplified during
-MDA (multiple displacement amplification).
+whole genome amplification.
\fn{gc\_extrap} allows the level genomic coverage from deep
sequencing to be predicted based on an initial sample.
The input file format need to be a mapped read (MR) or BED,
@@ -198,7 +215,9 @@ \section{File Format}
mapped fragments are counted. This means that both ends
of a disconcordantly mapped read will each be counted separately.
If a large number of reads are disconcordant, then
-the default single end should be used. In this case only the mapping
+the default single end should be used or the disconcordantly
+mapped reads removed prior to running \textbf{preseq}.
+In this case only the mapping
location of the first mate
is used as the unique molecular identifier~\cite{kivioja2011counting}.
@@ -223,7 +242,9 @@ \section{File Format}
\end{alltt}\endgroup
More complicated unique molecular identifiers
can be used, such as mapping position plus a random barcode,
-but are too complicated to detail in this manual. For questions with such usage, please contact us at \href{mailto:tdaley@usc.edu}{\nolinkurl{tdaley@usc.edu}}
+but are too complicated to detail in this manual.
+For questions with such usage, please contact us at
\paragraph{Mapped read format for \fn{gc\_extrap}}~\\~\\[-.2cm]
@@ -247,9 +268,8 @@ \section{Detailed usage}
\label{sec:complexityplot}

\fn{c\_curve} is used to compute the
-expected complexity curve of a mapped read file by
-subsampling smaller experiments without replacement
-and counting the distinct reads.
+expected complexity curve of a mapped read file
+with a hypergeometric formula~\cite{heck1975explicit}.
Output is a text file with two
columns. The first gives the total number
of reads and the second the corresponding number
@@ -265,6 +285,8 @@ \section{Detailed usage}
\item[\begingroup\fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
\end{description}

+\newpage
+
\paragraph{lc\_extrap}~\\~\\[-.2cm]
\label{sec:librarycomplexity}
@@ -297,8 +319,11 @@ \section{Detailed usage}
\item[\begingroup\fontsize{9pt}{12pt}\selectfont-H, -hist\endgroup] Input is a text file of the observed histogram
\item[\begingroup\fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
\item[\begingroup\fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate yield without bootstrapping for confidence intervals
+\item[\begingroup\fontsize{9pt}{12pt}\selectfont-D, -defects\endgroup] Defects mode, estimates the complexity curve without checking for instabilities in the curve. Should only be used on datasets that fail estimation without defects.
\end{description}

+\newpage
+
\paragraph{gc\_extrap}~\\~\\[-.2cm]
\label{sec:genomiccoverage}
@@ -333,6 +358,8 @@ \section{Detailed usage}
\item[\begingroup\fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate genomic coverage without bootstrapping for confidence intervals
The following command will give output of the same format as the above examples.\begingroup\fontsize{9pt}{12pt}\selectfont\begin{alltt} $./preseq lc_extrap -o future_yield.txt -H histogram.txt \end{alltt}\endgroup
+The following command will give output of the same format as the above examples.
Similarly, both \fn{lc\_extrap} and \fn{c\_curve} allow the option to input read counts (text file should contain ONLY the observed counts in a single column). For example, if a dataset had the following counts histogram:
Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode): \begingroup\fontsize{9pt}{12pt}\selectfont\begin{alltt} $./preseq lc_extrap -o future_yield.txt -V counts.txt \end{alltt}\endgroup
+Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode):
This section provides a more detailed example using data from different experiments to illustrate how preseq might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372).
+This section provides a more detailed example using data from different experiments to illustrate how \textbf{preseq} might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372).
These libraries help show what would be considered a relatively poor library and a relatively good library, as well as compare the complexity curves obtained from running \fn{c\_curve} and \fn{lc\_extrap}, to show how \fn{lc\_extrap} would help in the decision to sequence further. The black diagonal line represents an ideal library, in which every read is a distinct read (though this cannot be achieved in reality). The full experiments were down sampled at 5\% to obtain a mock initial experiment of the libraries, as shown here, where we have the complexity curves of the initial experiments generated by \fn{c\_curve}:
~\newline
@@ -853,7 +886,7 @@ \subsection*{Estimating and analyzing TCR$\beta$ richness}
\section{FAQ}

-\Que{When compiling the preseq binary, I receive the error
+\Que{When compiling the \textbf{preseq} binary, I receive the error

\fn{fatal error: gsl/gsl\_cdf.h: No such file or directory
}
@@ -864,7 +897,7 @@ \section{FAQ}
-\Que{When compiling the preseq binary, I receive the error
+\Que{When compiling the \textbf{preseq} binary, I receive the error

\fn{Undefined symbols for architecture x86\_64: ~\\
\tab"\_packInt16", referenced from:~\\
@@ -883,14 +916,14 @@ \section{FAQ}
-\Que{I compile the preseq binary but receive the error
+\Que{I compile the \textbf{preseq} binary but receive the error

\fn{terminate called after throwing an instance of 'std::string'}
}

\Ans{This error is typically called because either the flag -B was not included to
specify bam input or because the linking to SAMTools was not included when
-compiling preseq. To ensure that the linking was done properly, check for the flag
+compiling \textbf{preseq}. To ensure that the linking was done properly, check for the flag
\fn{-DHAVE\_SAMTOOLS}.}

\Que{When running \fn{lc\_extrap}, I receive the error
@@ -950,12 +983,12 @@ \section{FAQ}
\vspace{5mm}
If none of these solutions worked, please email us at