Statistics of protein library construction
BIOINFORMATICS APPLICATIONS NOTE
Vol. 21 no. 15 2005, pages 3314–3315
doi:10.1093/bioinformatics/bti516
Sequence analysis
Andrew E. Firth¹,∗ and Wayne M. Patrick²
¹ Department of Biochemistry, University of Otago, PO Box 56, Dunedin, New Zealand and
² Center for Fundamental and Applied Molecular Evolution, Emory University, Atlanta, GA 30322, USA
Received on April 14, 2005; revised on May 21, 2005; accepted on May 23, 2005
Advance Access publication June 2, 2005
INTRODUCTION
Directed evolution is a powerful strategy for generating new proteins
with desirable properties. Central to the technique is the generation of large sequence libraries. There are a number of methods for
generating molecular diversity in these libraries (reviewed by Lutz
and Patrick, 2004). However, to maximize the chances of finding a
desired and rare improved variant, it is important to understand the
statistics of library construction.
Previously, we introduced a suite of algorithms for calculating
library statistics for a variety of protocols (Patrick et al., 2003). Since then, the equations
and programs have been used a number of times (e.g. Hughes et al.,
2005). However, the programs were a little unwieldy and required
compiling by the user. In this short paper we present an improved
and easy-to-use web interface, which will return a variety of library statistics and graphics for user-defined library sizes, mutation
rates, sequence lengths, etc. These statistics may be used to direct
experimental design (e.g. to determine what library size is required
to sample a given amount of diversity, or to optimize the mutation
rate to maximize diversity) and to interpret results (e.g. by estimating
how many distinct sequences are represented in a given library).
We note that more detailed models of some of the processes
involved in library construction have been published (reviewed by
Moore and Maranas, 2004). However, these models are not generally
accessible to most laboratory researchers, can be CPU-intensive, and
are less widely applicable than the generic tools that we present here.
The web interface is available at http://guinevere.otago.ac.nz/stats.html.
Users are referred to our original paper (Patrick et al., 2003)
for experimental details, usage examples and a few caveats. Users
interested in the mathematics behind the programs are invited to read
the mathematical notes on our website.
∗ To whom correspondence should be addressed.
In the remainder of this short paper, we introduce the three main
programs, GLUE, PEDEL and DRIVeR, and list situations in which
they may be useful.
EQUALLY PROBABLE VARIANTS
The simplest program, GLUE, is broadly applicable to any protocol where all possible variants are equally likely to occur in
the library. Examples include oligonucleotide-directed randomization, MAX randomization, synthetic shuffling, DHR, ADO and
SISDC.
Given the total number of possible variants, GLUE may be used
to calculate (1) the expected number of distinct variants represented
in a given library, (2) the library size required to sample a given
fraction of the variants or (3) the library size required to have a
given probability of sampling all possible variants. For example, if
there are 1 million possible variants (e.g. an oligonucleotide-directed
randomization involving four NNK codons allows 32⁴ = 1 048 576
variants), GLUE shows that a library of ∼3 million transformants will
be ∼95% complete, while a library of ∼17 million transformants has
a ∼95% probability of being 100% complete.
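The three GLUE calculations can be illustrated with a minimal Python sketch. It assumes the standard Poisson approximation for sampling with replacement from V equally likely variants; the function names are hypothetical, and the formulas are an illustration of the idea rather than the exact code behind the web interface.

```python
import math


def expected_distinct(num_variants, library_size):
    """Expected number of distinct variants in a library drawn from
    num_variants equally likely variants (Poisson approximation)."""
    return num_variants * -math.expm1(-library_size / num_variants)


def size_for_completeness(num_variants, fraction):
    """Library size at which the expected fraction of variants
    represented equals `fraction` (0 < fraction < 1)."""
    return -num_variants * math.log1p(-fraction)


def size_for_full_coverage(num_variants, probability):
    """Library size giving `probability` that every variant is present,
    treating the number of missing variants as Poisson distributed."""
    return -num_variants * math.log(-math.log(probability) / num_variants)


if __name__ == "__main__":
    v = 1_000_000  # "1 million possible variants" from the text
    print(32 ** 4)                                  # four NNK codons
    print(expected_distinct(v, 3_000_000) / v)      # fraction sampled
    print(size_for_completeness(v, 0.95))           # size for 95% complete
    print(size_for_full_coverage(v, 0.95))          # size for 95% chance of 100%
```

Under these assumptions, 1 million variants give a completeness of about 0.95 at 3 million transformants and a required size of roughly 17 million for a 95% chance of full coverage, in line with the figures quoted above.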
ERROR-PRONE PCR (epPCR)
In this protocol, random base substitutions are introduced into a parent sequence. Although most recent examples of directed evolution
use epPCR in conjunction with recombination-based strategies such
as DNA shuffling, it is still commonly encountered as a means of
generating random diversity at any position in a gene.
The program PEDEL can be used to calculate the expected number
of distinct variants present in a library, given the library size, mean
substitution rate and parent sequence length. On the web page, the
user may produce plots of the expected number of distinct daughter
sequences as a function of library size and substitution rate. The user
can also produce statistics and plots for the total number of variants
with exactly x mutations, the expected size of the sub-library comprising those sequences with exactly x mutations, the completeness
of each sub-library, and the redundancy of each sub-library.
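The sub-library quantities described above can be sketched in Python. The sketch makes two assumptions that are not spelled out in this note: the number of substitutions per daughter sequence is Poisson distributed, and each substituted position is equally likely to become any of the three other bases. The function names are hypothetical, and this is an approximation in the spirit of PEDEL, not its actual implementation.

```python
import math


def sublibrary_stats(library_size, seq_length, mean_rate, max_mutations=30):
    """Return a list of (x, possible, sub_size, distinct) tuples, one per
    number of substitutions x from 0 to max_mutations."""
    rows = []
    for x in range(max_mutations + 1):
        # Number of possible variants with exactly x substitutions:
        # choose x positions, each changed to one of 3 other bases.
        possible = math.comb(seq_length, x) * 3 ** x
        # Expected size of the sub-library with exactly x substitutions
        # (Poisson assumption on the number of substitutions).
        poisson = math.exp(-mean_rate) * mean_rate ** x / math.factorial(x)
        sub_size = library_size * poisson
        # Expected number of distinct sequences in that sub-library,
        # treating its variants as equally likely.
        distinct = possible * -math.expm1(-sub_size / possible)
        rows.append((x, possible, sub_size, distinct))
    return rows


def expected_distinct_total(library_size, seq_length, mean_rate):
    """Sum the expected distinct sequences over all sub-libraries."""
    rows = sublibrary_stats(library_size, seq_length, mean_rate)
    return sum(distinct for _, _, _, distinct in rows)


if __name__ == "__main__":
    # Library of 10^7 clones, 600 nt parent, mean rate of 2 substitutions.
    for x, possible, sub_size, distinct in sublibrary_stats(1e7, 600, 2.0, 6):
        completeness = distinct / possible
        print(x, f"{sub_size:.3g}", f"{distinct:.3g}", f"{completeness:.3g}")
    print(f"{expected_distinct_total(1e7, 600, 2.0):.3g}")
```

Completeness of a sub-library is the ratio of distinct to possible variants, and redundancy can be read off as the sub-library size minus the distinct count. Truncating the sum at 30 substitutions is a convenience that is harmless for low mean rates and would need raising for heavily mutated libraries.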
For example, given a library of 10⁷ clones, a parent sequence
length of 600 nt, and a mean substitution rate of 2 bases per daughter
sequence, PEDEL calculates that the library is expected to contain
∼4.5 × 10⁶ distinct sequences. These comprise ∼1.3, ∼1.8, ∼0.9,
∼0.4 and ∼0.1 million distinct sequences with, respectively, exactly
2, 3, 4, 5 and 6 mutations, together with the parent sequence, the
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email:
ABSTRACT
Summary: We have investigated the statistics associated with constructing and sampling large protein-encoding libraries. Using fairly
simple statistics, we have written algorithms for estimating the diversity
in libraries generated by the most commonly used protocols, including
error-prone PCR, DNA shuffling, StEP PCR, oligonucleotide-directed
randomization, MAX randomization, synthetic shuffling, DHR, ADO
and SISDC.
Availability: Web interface and C++ source code available at
http://guinevere.otago.ac.nz/stats.html
Contact:
Supplementary information: Complete mathematical notes, model
assumptions and justification, users’ guide and worked examples at
above website.
tend to remain linked in daughter sequences, resulting in reduced
library diversity.
Given the library size, parent sequence length, mean crossover rate and the positions of the variable nucleotides (or amino
acids), DRIVeR calculates the expected number of distinct daughter
sequences in the library. On the web page, the user may also produce plots of the expected number of distinct daughter sequences
as a function of library size and crossover rate. For example, for a
sequence length of 1425 nt, nine variable nucleotides at positions
250, 274, 375, 650, 655, 757, 763, 982 and 991, a library of size
1600, and a mean crossover rate of 10 crossovers per sequence, the
expected number of distinct sequences in the library is 161 (out of
512 possible variants).
DRIVeR uses a generic Poisson model for crossover positions. The
parent sequences are assumed to be highly homologous. For parent
sequences that are homologous at the amino acid level but divergent
at the nucleotide level, crossovers preferentially occur in regions
with greater nucleotide sequence similarity. This bias is not reflected
in the DRIVeR model which, nevertheless, provides a useful upper
bound on library diversity.
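A simplified version of such a Poisson crossover model can be sketched in Python. The sketch assumes two parents that differ at every variable position, crossovers placed uniformly along the sequence with a Poisson-distributed count, and either parent being equally likely at the first variable position. These are illustrative assumptions, the function name is hypothetical, and the result is not expected to reproduce DRIVeR's output exactly.

```python
import math
from itertools import product


def expected_distinct_shuffled(library_size, seq_length, mean_crossovers,
                               positions):
    """Expected number of distinct daughter sequences for two shuffled
    parents that differ at each of the given variable positions."""
    positions = sorted(positions)
    # Probability that adjacent variable positions come from different
    # parents, i.e. that an odd number of crossovers falls between them.
    switch = []
    for left, right in zip(positions, positions[1:]):
        mean_in_gap = mean_crossovers * (right - left) / seq_length
        switch.append(0.5 * -math.expm1(-2.0 * mean_in_gap))

    total = 0.0
    # Each pattern records, gap by gap, whether the parent switches.
    for pattern in product((False, True), repeat=len(switch)):
        prob = 0.5  # either parent at the first variable position
        for q, switched in zip(switch, pattern):
            prob *= q if switched else 1.0 - q
        # Two variants (one per starting parent) share this probability;
        # each is present with probability about 1 - exp(-L * prob).
        total += 2.0 * -math.expm1(-library_size * prob)
    return total


if __name__ == "__main__":
    variable = [250, 274, 375, 650, 655, 757, 763, 982, 991]
    print(expected_distinct_shuffled(1600, 1425, 10, variable))
```

The sketch makes the linkage effect visible: closely spaced positions such as 650 and 655 rarely end up separated by a crossover, so variants that split them are sampled far less often than variants that keep them together.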
ACKNOWLEDGEMENTS
This work was fund (...truncated)