Commit 9580631

committed
corrections in manual
1 parent 19b8739 commit 9580631

3 files changed

+96
-52
lines changed

docs/biblio.bib

Lines changed: 11 additions & 0 deletions
@@ -1,3 +1,14 @@
1+
@article{heck1975explicit,
2+
title={Explicit Calculation of the Rarefaction Diversity Measurement and the Determination of Sufficient Sample Size},
3+
author={Heck, Jr, Kenneth L and van Belle, Gerald and Simberloff, Daniel},
4+
journal={Ecology},
5+
volume={56},
6+
number={6},
7+
pages={1459--1461},
8+
year={1975},
9+
publisher={JSTOR}
10+
}
11+
112
@article{willis2015inference,
213
title={Inference for changes in biodiversity},
314
author={Willis, Amy and Bunge, John and Whitman, Thea},

docs/manual.pdf

1.94 KB
Binary file not shown.

docs/manual.tex

Lines changed: 85 additions & 52 deletions
@@ -26,8 +26,8 @@
2626
\titleformat*{\paragraph}{\large\bfseries}
2727

2828

29-
\title{The preseq Manual}
30-
\author{Timothy Daley \and Victoria Helus \and Andrew Smith }
29+
\title{The \textbf{preseq} Manual}
30+
\author{Timothy Daley \and Victoria Helus \and Chao Deng \and Andrew Smith }
3131

3232
\begin{document}
3333
\maketitle
@@ -42,39 +42,45 @@ \section{Quick Start}
4242

4343

4444

45-
The \textbf{preseq} package is aimed at predicting
46-
the yield of distinct reads from a genomic library
47-
from an initial sequencing experiment. The estimates
45+
The \textbf{preseq} package aims to help researchers
46+
design and optimize sequencing experiments by using
47+
population sampling models to infer properties of the
48+
population or the behavior under deeper sampling based
49+
upon a small initial sequencing experiment. The estimates
4850
can then be used to examine the utility of further
4951
sequencing, optimize the sequencing depth,
5052
or to screen multiple libraries to avoid low complexity
5153
samples.~\\[-.2cm]
5254

53-
\noindent The three main programs are \fn{c\_curve}, \fn{lc\_extrap},
54-
and \fn{gc\_extrap}.
55-
\fn{c\_curve} samples reads without replacement from the
56-
given mapped sequenced read file or duplicate count file to estimate the yield
57-
of the experiment and the subsampled experiments. These estimates
58-
are used construct the complexity
59-
curve of the experiment. \fn{lc\_extrap} uses rational function approximations
55+
\noindent The four main programs are \fn{c\_curve},
56+
\fn{lc\_extrap}, \fn{gc\_extrap}, and \fn{bound\_pop}.
57+
\fn{c\_curve} interpolates the expected complexity
58+
curve based upon a hypergeometric formula and
59+
is primarily used to check predictions from
60+
\fn{lc\_extrap} and \fn{gc\_extrap}.
61+
\fn{lc\_extrap} uses rational function approximations
6062
of Good \& Toulmin's~\cite{good1956number} non-parametric
61-
empirical Bayes estimator to predict the yield
63+
empirical Bayes estimator to predict the library complexity
6264
of future experiments, in essence looking into the future
63-
for hypothetical experiments. \fn{lc\_extrap} is used to predict
64-
the yield and then \fn{c\_curve} can be used to check the yield
65-
from the larger experiment.
65+
for hypothetical experiments.
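Note on the estimator named above: the following is a minimal sketch of Good & Toulmin's estimator in its original power-series form, not code taken from the preseq source. The function name and the histogram representation (a dict mapping j to n_j, the number of distinct reads observed exactly j times) are illustrative assumptions. Here t is the amount of additional sequencing expressed as a fraction of the initial experiment, and the expected number of new distinct reads is the alternating sum of t^j n_j.

```python
def good_toulmin_new_distinct(counts_hist, t):
    """Expected number of NEW distinct reads after extending the experiment.

    counts_hist: dict {j: n_j}, n_j = number of distinct reads seen exactly j times
                 (hypothetical representation, chosen for this sketch).
    t: additional sequencing as a fraction of the initial experiment size.
    """
    # Alternating power series: sum over j of (-1)^(j+1) * t^j * n_j
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in counts_hist.items())


if __name__ == "__main__":
    hist = {1: 10, 2: 3}  # toy data: 10 singletons, 3 reads seen twice
    for t in (0.0, 0.5, 1.0):
        print(t, good_toulmin_new_distinct(hist, t))
```

The raw series is generally regarded as unstable once t grows beyond 1, because the terms t^j n_j alternate in sign and grow in magnitude. That is the motivation for the rational function approximations the manual attributes to \fn{lc\_extrap}; those approximations are not reproduced in this sketch.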
6666

67-
\fn{gc\_extrap} uses rational function approximations
68-
to Good \& Toulmin's estimator to predict the genomic
69-
coverage, i.e. the number of bases covered at least once,
67+
\fn{gc\_extrap} uses an approach similar to that of \fn{lc\_extrap}
68+
to predict the genome coverage,
69+
i.e. the number of bases covered at least once,
7070
from deeper sequencing in a single cell or low input sequencing
7171
experiment based on the observed coverage counts.
72-
The option is available to predict the coverage based on binned
72+
An option is available to predict the coverage based on binned
7373
coverage counts to speed up the estimates.
7474
\fn{gc\_extrap} requires mapped read or bed format
7575
input, so the tool \fn{bam2mr} is provided to convert
7676
BAM format reads to mapped read format.
7777

78+
\fn{bound\_pop} uses a non-parametric moment-based
79+
approach to conservatively estimate the total number
80+
of classes in the population that is sampled,
81+
also called the species richness.
82+
83+
7884
\newpage
7985

8086
\section{Installation}
@@ -83,7 +89,8 @@ \section{Installation}
8389
\paragraph{Download}
8490
\label{sub:download}~\\~\\[-.2cm]
8591
\raggedright{\textbf{preseq} is available at }
86-
\url{http://smithlab.cmb.usc.edu/software/}.
92+
\url{http://smithlabresearch.org/software/preseq/}
93+
or \url{https://github.com/smithlabcode/preseq}.
8794

8895

8996
\paragraph{System Requirements}
@@ -92,56 +99,66 @@ \section{Installation}
9299
\textbf{preseq} runs on Unix-type systems
93100
with GNU Scientific Library (GSL), available
94101
at ~\url{http://www.gnu.org/software/gsl/}.
95-
If the input file is in BAM format, SAMTools is
96-
required, available at ~\url{http://samtools.sourceforge.net/}.
97-
If the input is
98-
a text file of counts in a single column or is
102+
If the input file is in BAM format, the SAMTools
103+
API is required but is included in all binaries and
104+
source code.
105+
If the input is a text file of counts in a single column or is
99106
in BED format,
100107
SAMTools is not required.
101108
It has been tested on Linux and
102109
Mac OS-X.
103110

104111
\paragraph{Installation}~\\~\\[-.2cm]
105112
\label{sub:install}
106-
Download the source code and decompress
107-
it with
113+
If the source code was downloaded from the Smithlab
114+
website, the first step is to decompress it using the
115+
command
108116
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
109117
$ tar -jxvf preseq.tar.bz2
110118
\end{alltt} \endgroup
119+
To download the source code from GitHub, use
120+
the command
121+
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
122+
$ git clone --recursive https://github.com/smithlabcode/preseq.git
123+
\end{alltt} \endgroup
111124
%
112-
Enter the \textbf{preseq/} directory and run
125+
In both cases, enter the \textbf{preseq/} directory and run
113126
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
114127
$ make all
115128
\end{alltt}\endgroup
129+
to compile all the code.
116130

117-
The input file may possibly be in BAM format. If the root directory
118-
of SAMTools is \$SAMTools, instead run
131+
To link to a SAMTools API that is not
132+
included with the source code, and that
133+
is located at \$SAMTools, instead run
119134
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
120135
$ make all SAMTOOLS_DIR=$SAMTools
121136
\end{alltt}\endgroup
122137
Output after typing this command should include the flag \fn{-DHAVE\_SAMTOOLS} if the linking is successful. If compiled successfully, the executable file is available
123138
in \textbf{preseq/}.
124139

125-
If a BAM file is used as input without first having run \begingroup \fontsize{9pt}{11pt}\selectfont \fn{\$ make all SAMTOOLS\_DIR=/loc/of/SAMTools}\endgroup, then the following error will occur: \begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup.
140+
If a BAM file is used as input without successful linking to
141+
SAMTools, then the following error will occur:
142+
\begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup.
126143

127144
\newpage
128145

129-
\section{Using preseq}
146+
\section{Using \textbf{preseq}}
130147
\label{sec:usage}
131148

132149
\paragraph{Basic usage}~\\~\\[-.2cm]
133150
\label{sub:basic}
134-
To generate the complexity plot of a genomic
151+
To generate the complexity curve of a genomic
135152
library from a read file in BED or BAM format or a duplicate count file,
136153
use the function \fn{c\_curve}. Use
137154
\fn{-o} to specify the output name.
138155
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
139156
$ ./preseq c_curve -o complexity_output.txt input.bed
140157
\end{alltt}\endgroup
141158

142-
To estimate the future yield
143-
of a genomic library
144-
using an initial experiment in BED format,
159+
To predict the complexity curve
160+
of a sequencing library
161+
using an initial experiment in BED format,
145162
use the function \fn{lc\_extrap}.
146163
The required options are \fn{-o} to specify
147164
the output of the yield estimates and
@@ -159,7 +176,7 @@ \section{Using preseq}
159176
coverage is a highly variable and uncertain function
160177
of sequencing depth. Some regions may be missing
161178
due to locus dropout or preferentially amplified during
162-
MDA (multiple displacement amplification).
179+
whole genome amplification.
163180
\fn{gc\_extrap} allows the level of genomic coverage from deep
164181
sequencing to be predicted based on an initial sample.
165182
The input file format needs to be a mapped read (MR) or BED,
@@ -198,7 +215,9 @@ \section{File Format}
198215
mapped fragments are counted. This means that both ends
199216
of a discordantly mapped read will each be counted separately.
200217
If a large number of reads are discordant, then
201-
the default single end should be used. In this case only the mapping
218+
the default single end should be used or the discordantly
219+
mapped reads removed prior to running \textbf{preseq}.
220+
In this case only the mapping
202221
location of the first mate
203222
is used as the unique molecular identifier~\cite{kivioja2011counting}.
204223

@@ -223,7 +242,9 @@ \section{File Format}
223242
\end{alltt}\endgroup
224243
More complicated unique molecular identifiers
225244
can be used, such as mapping position plus a random barcode,
226-
but are too complicated to detail in this manual. For questions with such usage, please contact us at \href{mailto:tdaley@usc.edu}{\nolinkurl{tdaley@usc.edu}}
245+
but are too complicated to detail in this manual.
246+
For questions about such usage, please contact us at
247+
\href{mailto:tdaley@usc.edu}{\nolinkurl{tdaley@usc.edu}}
227248

228249
\paragraph{Mapped read format for \fn{gc\_extrap}}~\\~\\[-.2cm]
229250

@@ -247,9 +268,8 @@ \section{Detailed usage}
247268
\label{sec:complexityplot}
248269

249270
\fn{c\_curve} is used to compute the
250-
expected complexity curve of a mapped read file by
251-
subsampling smaller experiments without replacement
252-
and counting the distinct reads.
271+
expected complexity curve of a mapped read file
272+
with a hypergeometric formula~\cite{heck1975explicit}.
253273
Output is a text file with two
254274
columns. The first gives the total number
255275
of reads and the second the corresponding number
@@ -265,6 +285,8 @@ \section{Detailed usage}
265285
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
266286
\end{description}
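Note on the hypergeometric formula cited for \fn{c\_curve}: the following is a minimal sketch of the classical rarefaction expectation, offered as an illustration and not as preseq's implementation. The function names and the histogram representation (a dict mapping j to n_j, the number of distinct reads observed exactly j times) are assumptions made for this example. With N total reads and D distinct reads, the expected number of distinct reads in a subsample of n reads drawn without replacement is D minus the sum over j of n_j C(N-j, n) / C(N, n).

```python
from math import exp, lgamma


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def expected_distinct(counts_hist, n):
    """Expected distinct reads in a subsample of n reads (without replacement).

    counts_hist: dict {j: n_j}, n_j = number of distinct reads seen exactly j times
                 (hypothetical representation, chosen for this sketch).
    """
    total = sum(j * n_j for j, n_j in counts_hist.items())  # N, total reads
    distinct = sum(counts_hist.values())                    # D, distinct reads
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the total read count")

    log_denom = log_binom(total, n)
    expected_missing = 0.0
    for j, n_j in counts_hist.items():
        # A read seen j times is absent from the subsample with probability
        # C(N - j, n) / C(N, n); that probability is zero when N - j < n.
        if total - j >= n:
            expected_missing += n_j * exp(log_binom(total - j, n) - log_denom)
    return distinct - expected_missing


if __name__ == "__main__":
    hist = {1: 2, 2: 1}  # toy data: reads A, B seen once; read C seen twice
    for n in range(5):
        print(n, round(expected_distinct(hist, n), 4))
```

For the toy histogram, subsampling 2 of the 4 reads gives 11/6 (about 1.8333) expected distinct reads, which matches enumerating all six possible pairs by hand. Working in log space keeps the binomial ratios finite for read counts in the millions.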
267287

288+
\newpage
289+
268290
\paragraph{lc\_extrap}~\\~\\[-.2cm]
269291
\label{sec:librarycomplexity}
270292

@@ -297,8 +319,11 @@ \section{Detailed usage}
297319
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-H, -hist\endgroup] Input is a text file of the observed histogram
298320
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
299321
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate yield without bootstrapping for confidence intervals
322+
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-D, -defects\endgroup] Defects mode, estimates the complexity curve without checking for instabilities in the curve. Should only be used on datasets that fail estimation without defects mode.
300323
\end{description}
301324

325+
\newpage
326+
302327
\paragraph{gc\_extrap}~\\~\\[-.2cm]
303328
\label{sec:genomiccoverage}
304329

@@ -333,6 +358,8 @@ \section{Detailed usage}
333358
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate genomic coverage without bootstrapping for confidence intervals
334359
\end{description}
335360

361+
\newpage
362+
336363
\paragraph{bound\_pop}~\\~\\[-.2cm]
337364
\label{sec:lib_size}
338365

@@ -468,7 +495,10 @@ \section{lc\_extrap Examples}
468495
10 146334
469496
\end{alltt}\endgroup
470497

471-
The following command will give output of the same format as the above examples.\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -H histogram.txt \end{alltt}\endgroup
498+
The following command will give output of the same format as the above examples.
499+
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
500+
$ ./preseq lc_extrap -o future_yield.txt -H histogram.txt
501+
\end{alltt}\endgroup
472502

473503
Similarly, both \fn{lc\_extrap} and \fn{c\_curve} allow the option to input read counts (text file should contain ONLY the observed counts in a single column). For example, if a dataset had the following counts histogram:
474504

@@ -490,7 +520,10 @@ \section{lc\_extrap Examples}
490520
1
491521
\end{alltt}\endgroup
492522

493-
Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode): \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -V counts.txt \end{alltt}\endgroup
523+
Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode):
524+
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
525+
$ ./preseq lc_extrap -o future_yield.txt -V counts.txt
526+
\end{alltt}\endgroup
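Note on the two text inputs: the relationship between a single-column counts file (the \fn{-V} input) and a counts histogram (the \fn{-H} input) can be sketched as below. This is an illustration only. The helper names are hypothetical, and the two-column, tab-separated layout written by the formatter (count, then number of distinct reads with that count) is an assumption about the histogram file that should be checked against the File Format section of the manual.

```python
from collections import Counter


def counts_to_histogram(counts):
    """Turn per-read duplicate counts into sorted (j, n_j) pairs.

    counts: iterable of positive ints, one per distinct read
            (the single column of a counts file).
    """
    hist = Counter(counts)
    return sorted(hist.items())


def format_histogram(hist):
    """Render (j, n_j) pairs as text, one tab-separated pair per line (assumed layout)."""
    return "\n".join(f"{j}\t{n_j}" for j, n_j in hist)


if __name__ == "__main__":
    observed = [1, 2, 1, 3, 1, 2]  # toy counts, one per distinct read
    print(format_histogram(counts_to_histogram(observed)))
```

Both files carry the same information, so either can be passed to \fn{lc\_extrap} or \fn{c\_curve} with the matching flag.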
494527

495528
\newpage
496529

@@ -665,11 +698,11 @@ \section{bound\_pop Example}
665698

666699
\newpage
667700

668-
\section{preseq Application Examples}
701+
\section{\textbf{preseq} Application Examples}
669702

670703
\subsection*{Screening multiple libraries}
671704
\label{sec:multlib}
672-
This section provides a more detailed example using data from different experiments to illustrate how preseq might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372).
705+
This section provides a more detailed example using data from different experiments to illustrate how \textbf{preseq} might be applied. To avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue sequencing. The data in this example come from a study (accession number SRA061610) of single sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing}; the example focuses on three libraries from different experiments in the study (SRX205369, SRX205370, SRX205372).
673706

674707
These libraries help show what would be considered a relatively poor library and a relatively good library, as well as compare the complexity curves obtained from running \fn{c\_curve} and \fn{lc\_extrap}, to show how \fn{lc\_extrap} would help in the decision to sequence further. The black diagonal line represents an ideal library, in which every read is a distinct read (though this cannot be achieved in reality). The full experiments were down sampled at 5\% to obtain a mock initial experiment of the libraries, as shown here, where we have the complexity curves of the initial experiments generated by \fn{c\_curve}:
675708
~\newline
@@ -853,7 +886,7 @@ \subsection*{Estimating and analyzing TCR$\beta$ richness}
853886

854887
\section{FAQ}
855888

856-
\Que{When compiling the preseq binary, I receive the error
889+
\Que{When compiling the \textbf{preseq} binary, I receive the error
857890

858891
\fn{fatal error: gsl/gsl\_cdf.h: No such file or directory
859892
}
@@ -864,7 +897,7 @@ \section{FAQ}
864897

865898

866899

867-
\Que{When compiling the preseq binary, I receive the error
900+
\Que{When compiling the \textbf{preseq} binary, I receive the error
868901

869902
\fn{Undefined symbols for architecture x86\_64: ~\\
870903
\tab"\_packInt16", referenced from:~\\
@@ -883,14 +916,14 @@ \section{FAQ}
883916

884917

885918

886-
\Que{I compile the preseq binary but receive the error
919+
\Que{I compile the \textbf{preseq} binary but receive the error
887920

888921
\fn{terminate called after throwing an instance of 'std::string'}
889922
}
890923

891924
\Ans{This error is typically raised because either the flag -B was not included to
892925
specify bam input or because the linking to SAMTools was not included when
893-
compiling preseq. To ensure that the linking was done properly, check for the flag
926+
compiling \textbf{preseq}. To ensure that the linking was done properly, check for the flag
894927
\fn{-DHAVE\_SAMTOOLS}.}
895928

896929
\Que{When running \fn{lc\_extrap}, I receive the error
@@ -950,12 +983,12 @@ \section{FAQ}
950983
\vspace{5mm}
951984
If none of these solutions worked, please email us at
952985
\href{mailto:tdaley@usc.edu}{\nolinkurl{tdaley@usc.edu}}
953-
and please include the standard output from running preseq in
986+
and please include the standard output from running \textbf{preseq} in
954987
verbose mode (specifically the duplicate counts histogram) so
955988
that we can look into the problem and rectify problems in future
956989
versions. Also, feel free to email us with any other questions or
957990
concerns.
958-
The preseq software is still under development so we would appreciate any
991+
The \textbf{preseq} software is still under development so we would appreciate any
959992
advice, comments, or notification of any possible bugs. Thanks!
960993

961994
\newpage
