User‐friendly algorithms for estimating completeness and diversity in randomized protein‐encoding libraries
Wayne M. Patrick
1
2
Andrew E. Firth
0
1
Jonathan M. Blackburn
1
2
0
Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA, UK
1
Protein Engineering 16(6), © Oxford University Press
2
Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA
Wayne M. Patrick and Andrew E. Firth contributed equally to this work.
3 To whom correspondence should be addressed. E-mail:
Directed evolution of proteins depends on the production of molecular diversity by random mutagenesis. While a number of methods have been developed for introducing this diversity, the best ways to sample it are not always clear. Here we used simple statistics to analyse completeness and diversity in randomized libraries generated by oligonucleotide-directed mutagenesis, error-prone polymerase chain reaction (epPCR) and in vitro recombination of highly homologous sequences. For oligonucleotide-directed mutagenesis, we derive equations to estimate how complete a given library is expected to be and also to predict the size of library required to give a fixed probability of being 100% complete. We describe the statistical bases for computer programs which estimate the number of distinct variants represented in epPCR and shuffled libraries, dubbed PEDEL and DRIVeR, respectively. These programs allow the user to calculate (rather than guess) the diversity represented in a given library and also provide empirical guidelines for maximizing this diversity. PEDEL and DRIVeR are available at www.bio.cam.ac.uk/~blackburn/stats.html.
Introduction
In the field of protein engineering, mimicking Darwinian
evolution in vitro has emerged as a powerful means of
generating proteins displaying novel properties and functions.
The cornerstone of all directed evolution protocols is the
production of molecular diversity by random mutagenesis and
a number of methods have been developed to introduce this
diversity into protein-encoding genes. The most adaptable and
widespread of these are based on the polymerase chain reaction
(PCR) and include: oligonucleotide-directed random
mutagenesis (Hermes et al., 1989), error-prone PCR (epPCR) (Cadwell
and Joyce, 1992) and the in vitro recombination protocols DNA
shuffling (Stemmer, 1994a,b) and staggered extension process
(StEP) (Zhao et al., 1998). There are many recent examples in
which improved proteins have been identified in large libraries
of variants generated by one or more of these techniques
[reviewed by Brakmann and Taylor (Brakmann, 2001; Taylor
et al., 2001)].
An assumption underlying all directed evolution
experiments is that the amount of molecular diversity theoretically
possible is enormous compared with our ability to generate and
screen it. Even a small protein of 100 amino acids can be
encoded by $4^{300} \approx 10^{181}$ possible DNA sequences, a number
vastly larger than the number of atoms in the observable
Universe (~$10^{80}$), let alone the biggest protein-encoding
libraries accessible in the laboratory [$10^{12}$–$10^{15}$ using in vitro
selection methods such as mRNA display (Roberts and Ja,
1999)]. Increasingly it is acknowledged that quantitative,
predictive models for the processes underlying randomized
library construction will be useful in targeting and interpreting
that diversity which we are able to generate experimentally
[reviewed by Voigt et al. (Voigt et al., 2001a)]. Recent studies
have included in silico modelling of epPCR and the generation
of crossovers in DNA shuffling (Moore and Maranas, 2000;
Moore et al., 2001) and the construction of computational
prescreens both to identify the regions of proteins most likely to
yield beneficial mutations on randomization (Voigt et al.,
2001b) and also to predict the fragments or schemas of proteins
able to be recombined with minimal disruption of overall
three-dimensional structure (Voigt et al., 2002).
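The size comparison made earlier in this paragraph (the number of DNA sequences encoding a 100-residue protein versus the largest accessible libraries) can be checked on a log scale. The following is a minimal sketch, not part of the published programs; the function names are hypothetical and it assumes three bases per codon and four possible bases at every position.

```python
import math

def log10_dna_sequences(n_amino_acids, bases_per_codon=3, n_bases=4):
    """Return log10 of the number of DNA sequences that could encode
    a protein of the given length (hypothetical helper)."""
    n_positions = n_amino_acids * bases_per_codon
    return n_positions * math.log10(n_bases)

def log10_fraction_sampled(log10_library_size, log10_space):
    """Return log10 of the fraction of sequence space that a library
    of the given size could cover at best (every clone distinct)."""
    return log10_library_size - log10_space

# 100 amino acids, as in the text; 10^15 is the upper library size quoted there.
space = log10_dna_sequences(100)
print(f"sequence space ~ 10^{space:.1f}")
print(f"best-case fraction covered by a 10^15 library ~ 10^{log10_fraction_sampled(15, space):.1f}")
```

Working in logarithms avoids overflow and makes the gap between library size and sequence space explicit as a difference of exponents.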
While these analyses hint at the insights to be gained from a
quantitative approach to directed evolution, they are too
complex to be generally applicable for the laboratory
researcher. Moreover, the number of mutations or crossovers
required, or even optimal, to effect a given functional change
remains elusive. In this paper, we argue that the likelihood of
finding a variant with improved properties in a given library is
maximized when that library is maximally diverse. We used
simple statistics to derive a series of widely applicable
equations and computer algorithms for estimating the number
of unique sequence variants in libraries constructed by
randomized oligonucleotide mutagenesis, epPCR and in vitro
recombination. Generally, applying these algorithms provides
mathematical support for the previously empirical guidelines
which have evolved for generating randomized libraries in
which diversity is maximized and unwanted degeneracy is
minimized, although some new strategies for library
construction also become apparent.
Materials and methods
GLUE, PEDEL and DRIVeR are a suite of programs for
calculating library statistics. They have been written in Fortran
77 and can be downloaded from
www.bio.cam.ac.uk/~blackburn/stats.html. Supplementary information at
this URL includes comprehensive program notes and a PDF
file describing the mathematics underlying the programs in full
detail.
Results
Oligonucleotide-directed random mutagenesis
Incorporating randomized codons into one of the primers in a
PCR mix allows the generation of molecular diversity at
specific locations in a gene. Intuitively, we know that
randomizing a greater number of codons reduces the likelihood
of sampling all possible random variants. Here we derive
simple equations for estimating how many variants a given
library will actually contain and how large a library needs to be
in order to give a fixed probability (e.g. 95%) that all possible
sequence variants will be represented.
Consider a library containing L clones,
constructed by randomizing M codons (N = 3M base pairs),
in which all possible sequence variants $v_i$ are equally probable.
Since the variants are equally probable, the mean number of
occurrences of any one variant $v_i$ in the library is given by
$\lambda = L/V$ (where V is the total number of possible sequence variants).
For $\lambda \ll L$ (e.g. V > 10), the actual number of occurrences of
any variant $v_i$ is essentially independent of the number of
occurrences of any other variant $v_j$ and can therefore be well
approximated by the Poisson distribution [see Feller for details
(Feller, 1968)]:
$$P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!} = \frac{e^{-L/V}(L/V)^{x}}{x!}$$
where P(x) denotes the probability that the variant $v_i$ occurs
exactly x times in the library. The probability that $v_i$ occurs at
least once is $1 - P(0) = 1 - e^{-\lambda} = 1 - e^{-L/V}$. Hence the expected
number of distinct variants in the library is
$$C = V\left(1 - e^{-L/V}\right)$$
and the fractional completeness of the library is given by
$F = C/V = 1 - e^{-L/V}$.
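The expressions for the expected number of distinct variants and the fractional completeness are simple to evaluate numerically. The sketch below is illustrative only and is not the authors' Fortran code; the function names are hypothetical. It also includes the exact expectation for uniform sampling with replacement, V(1 - (1 - 1/V)^L), as a check on the Poisson approximation, and the algebraic inversion of F = 1 - e^(-L/V) to obtain the library size giving a target expected completeness.

```python
import math

def expected_distinct(L, V):
    """Poisson estimate of distinct variants: C = V * (1 - exp(-L/V))."""
    return -V * math.expm1(-L / V)

def fractional_completeness(L, V):
    """Expected fraction of all V variants present: F = 1 - exp(-L/V)."""
    return -math.expm1(-L / V)

def expected_distinct_exact(L, V):
    """Exact expectation when L clones are drawn uniformly with
    replacement from V variants: V * (1 - (1 - 1/V)**L)."""
    return -V * math.expm1(L * math.log1p(-1.0 / V))

def library_size_for_fraction(F, V):
    """Invert F = 1 - exp(-L/V) for L; requires 0 <= F < 1.
    This targets an expected fraction F, not a probability of
    the library being 100% complete."""
    return -V * math.log1p(-F)

# Hypothetical values chosen only for illustration.
V = 1000   # number of possible variants
L = 3000   # number of clones in the library
print(expected_distinct(L, V), expected_distinct_exact(L, V))
print(fractional_completeness(L, V))
print(library_size_for_fraction(0.95, V))
```

Using `expm1` and `log1p` keeps the results accurate when L/V is very small or V is very large, where a direct `1 - exp(...)` would lose precision.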
As an example, let us assume that we mutate four codons in a
gene using NNS (N = A/C/G/T; S = C/G) codons in the
randomization protocol. Because there are 32 possible NNS
codons, it fol (...truncated)