User‐friendly algorithms for estimating completeness and diversity in randomized protein‐encoding libraries

Protein Engineering Design and Selection, Jun 2003

Directed evolution of proteins depends on the production of molecular diversity by random mutagenesis. While a number of methods have been developed for introducing this diversity, the best ways to sample it are not always clear. Here we used simple statistics to analyse completeness and diversity in randomized libraries generated by oligonucleotide‐directed mutagenesis, error‐prone polymerase chain reaction (epPCR) and in vitro recombination of highly homologous sequences. For oligonucleotide‐directed mutagenesis, we derive equations to estimate how complete a given library is expected to be and also to predict the size of library required to give a fixed probability of being 100% complete. We describe the statistical bases for computer programs which estimate the number of distinct variants represented in epPCR and shuffled libraries, dubbed PEDEL and DRIVeR, respectively. These programs allow the user to calculate (rather than guess) the diversity represented in a given library and also provide empirical guidelines for maximizing this diversity. PEDEL and DRIVeR are available at www.bio.cam.ac.uk/~blackburn/stats.html.

Wayne M. Patrick, Andrew E. Firth and Jonathan M. Blackburn

Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA, UK
Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA

Protein Engineering 16(6), © Oxford University Press

Wayne M. Patrick and Andrew E. Firth contributed equally to this work.

Introduction

In the field of protein engineering, mimicking Darwinian evolution in vitro has emerged as a powerful means of generating proteins displaying novel properties and functions. The cornerstone of all directed evolution protocols is the production of molecular diversity by random mutagenesis and a number of methods have been developed to introduce this diversity into protein-encoding genes.
The most adaptable and widespread of these are based on the polymerase chain reaction (PCR) and include: oligonucleotide-directed random mutagenesis (Hermes et al., 1989), error-prone PCR (epPCR) (Cadwell and Joyce, 1992) and the in vitro recombination protocols DNA shuffling (Stemmer, 1994a,b) and staggered extension process (StEP) (Zhao et al., 1998). There are many recent examples in which improved proteins have been identified in large libraries of variants generated by one or more of these techniques [reviewed by Brakmann and Taylor (Brakmann, 2001; Taylor et al., 2001)].

An assumption underlying all directed evolution experiments is that the amount of molecular diversity theoretically possible is enormous compared with our ability to generate and screen it. Even a small protein of 100 amino acids can be encoded by $4^{300} \approx 10^{181}$ possible DNA sequences, a number vastly larger than the number of atoms in the observable Universe ($\sim 10^{80}$), let alone the biggest protein-encoding libraries accessible in the laboratory [$10^{12}$ to $10^{15}$ using in vitro selection methods such as mRNA display (Roberts and Ja, 1999)].

Increasingly it is acknowledged that quantitative, predictive models for the processes underlying randomized library construction will be useful in targeting and interpreting that diversity which we are able to generate experimentally [reviewed by Voigt et al. (Voigt et al., 2001a)]. Recent studies have included in silico modelling of epPCR and the generation of crossovers in DNA shuffling (Moore and Maranas, 2000; Moore et al., 2001) and the construction of computational prescreens both to identify the regions of proteins most likely to yield beneficial mutations on randomization (Voigt et al., 2001b) and also to predict the fragments or schemas of proteins able to be recombined with minimal disruption of overall three-dimensional structure (Voigt et al., 2002).
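The sequence-space comparison above is easy to check numerically. The following is a minimal sketch, not part of the published programs; the function names are hypothetical and chosen only for illustration. It works in log10 units to compare the number of 300-nucleotide sequences with the upper library size quoted in the text.

```python
import math


def log10_dna_sequences(num_amino_acids, bases_per_codon=3, num_bases=4):
    """Return log10 of the number of DNA sequences of the coding length.

    A protein of num_amino_acids residues is encoded by
    num_amino_acids * bases_per_codon nucleotides, each of which can be
    one of num_bases bases.
    """
    num_positions = num_amino_acids * bases_per_codon
    return num_positions * math.log10(num_bases)


def log10_fraction_sampled(log10_library_size, num_amino_acids):
    """Return log10 of (library size / number of possible DNA sequences)."""
    return log10_library_size - log10_dna_sequences(num_amino_acids)


if __name__ == "__main__":
    space = log10_dna_sequences(100)
    print(f"log10 of sequence space for 100 residues: {space:.1f}")
    # Upper end of the library sizes quoted in the text (10^15 clones).
    print(f"log10 of fraction sampled: {log10_fraction_sampled(15, 100):.1f}")
```

Working with logarithms keeps the comparison readable; Python integers could also represent 4**300 exactly, but the exponent is the quantity of interest here.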
While these analyses hint at the insights to be gained from a quantitative approach to directed evolution, they are too complex to be generally applicable for the laboratory researcher. Moreover, the number of mutations or crossovers required, or even optimal, to effect a given functional change remains elusive. In this paper, we argue that the likelihood of finding a variant with improved properties in a given library is maximized when that library is maximally diverse. We used simple statistics to derive a series of widely-applicable equations and computer algorithms for estimating the number of unique sequence variants in libraries constructed by randomized oligonucleotide mutagenesis, epPCR and in vitro recombination. Generally, applying these algorithms provides mathematical support for the previously empirical guidelines which have evolved for generating randomized libraries in which diversity is maximized and unwanted degeneracy is minimized, although some new strategies for library construction also become apparent.

Materials and methods

GLUE, PEDEL and DRIVeR are a suite of programs for calculating library statistics. They have been written in Fortran 77 and are available to be downloaded from www.bio.cam.ac.uk/~blackburn/stats.html. Supplementary information at this URL includes comprehensive program notes and a PDF file describing the mathematics underlying the programs in full detail.

Results

Oligonucleotide-directed random mutagenesis

Incorporating randomized codons into one of the primers in a PCR mix allows the generation of molecular diversity at specific locations in a gene. Intuitively, we know that randomizing a greater number of codons reduces the likelihood of sampling all possible random variants. Here we derive simple equations for estimating how many variants a given library will actually contain and how large a library needs to be in order to give a fixed probability (e.g. 95%) that all possible sequence variants will be represented.
Consider a library containing a number of clones $L$, constructed by randomizing $M$ codons or $N = 3M$ base pairs, in which all possible sequence variants $v_i$ are equally probable. Since the variants are equally probable, the mean number of occurrences of any one variant $v_i$ in the library is given by $\lambda = L/V$ (where $V$ is the total number of possible sequence variants). For $\lambda \ll L$ (e.g. $V > 10$), the actual number of occurrences of any variant $v_i$ is essentially independent of the number of occurrences of any other variant $v_j$ and can therefore be well approximated by the Poisson distribution [see Feller for details (Feller, 1968)]:

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!} = \frac{(L/V)^{x} e^{-L/V}}{x!}$$

where $P(x)$ denotes the probability that the variant $v_i$ occurs exactly $x$ times in the library. The probability that $v_i$ occurs at least once is $1 - P(0) = 1 - e^{-\lambda} = 1 - e^{-L/V}$. Hence the expected number of distinct variants in the library is

$$C = V \left(1 - e^{-L/V}\right)$$

and the fractional completeness of the library is given by $F = C/V = 1 - e^{-L/V}$.

As an example, let us assume that we mutate four codons in a gene using NNS (N = A/C/G/T; S = C/G) codons in the randomization protocol. Because there are 32 possible NNS codons, it fol (...truncated)
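The completeness equations above can be evaluated directly. The sketch below is illustrative only and is not the authors' GLUE program (which is written in Fortran 77); the function names are hypothetical. It computes C and F from L and V, and inverts F = 1 - e^(-L/V) to give the library size needed for a target fractional completeness. The value V = 32^4 used in the example assumes four NNS codons counted at the DNA level, which is an assumption about how the truncated example continues.

```python
import math


def fractional_completeness(library_size, num_variants):
    """F = 1 - exp(-L/V), under the Poisson approximation in the text."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-library_size / num_variants)


def expected_distinct_variants(library_size, num_variants):
    """C = V * (1 - exp(-L/V)), the expected number of distinct variants."""
    return num_variants * fractional_completeness(library_size, num_variants)


def library_size_for_completeness(target_fraction, num_variants):
    """Invert F = 1 - exp(-L/V) to obtain L = -V * ln(1 - F)."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    return -num_variants * math.log1p(-target_fraction)


if __name__ == "__main__":
    # Assumption: four NNS codons, counted as 32**4 equiprobable DNA variants.
    V = 32 ** 4
    for L in (V // 10, V, 3 * V):
        F = fractional_completeness(L, V)
        C = expected_distinct_variants(L, V)
        print(f"L = {L:>9,d}  F = {F:.3f}  C = {C:,.0f}")
    L95 = library_size_for_completeness(0.95, V)
    print(f"Clones needed for F = 0.95: {L95:,.0f}")
```

Note that this inversion targets an expected fraction of variants present, which is a different question from the probability that the library is 100% complete; a library of about three times V gives F = 0.95 under this approximation.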


This is a preview of a remote PDF: https://peds.oxfordjournals.org/content/16/6/451.full.pdf
Article home page: http://peds.oxfordjournals.org/content/16/6/451.abstract

Wayne M. Patrick, Andrew E. Firth, Jonathan M. Blackburn. User‐friendly algorithms for estimating completeness and diversity in randomized protein‐encoding libraries, Protein Engineering Design and Selection, 2003, pp. 451-457, 16/6, DOI: 10.1093/protein/gzg057