GLUE-IT and PEDEL-AA: new programmes for analyzing protein diversity in randomized libraries
Andrew E. Firth (1) and Wayne M. Patrick (0)
(0) Institute of Molecular Biosciences, Massey University, Auckland 0745, New Zealand
(1) BioSciences Institute, University College Cork, Cork, Ireland
There are many methods for introducing random mutations into nucleic acid sequences. Previously, we described a suite of programmes for estimating the completeness and diversity of randomized DNA libraries generated by a number of these protocols. Our programmes suggested some empirical guidelines for library design; however, no information was provided regarding library diversity at the protein (rather than DNA) level. We have now updated our web server, enabling analysis of translated libraries constructed by site-saturation mutagenesis and error-prone PCR (epPCR). We introduce GLUE-Including Translation (GLUE-IT), which finds the expected amino acid completeness of libraries in which up to six codons have been independently varied (according to any user-specified randomization scheme). We provide two tools for assisting with experimental design: CodonCalculator, for assessing amino acids corresponding to given randomized codons; and AA-Calculator, for finding degenerate codons that encode user-specified sets of amino acids. We also present PEDEL-AA, which calculates amino acid statistics for libraries generated by epPCR. Input includes the parent sequence, overall mutation rate, library size, indel rates and a nucleotide mutation matrix. Output includes amino acid completeness and diversity statistics, and the number and length distribution of sequences truncated by premature termination codons. The web interfaces are available at http://guinevere.otago.ac.nz/stats.html.
INTRODUCTION
In the past 15 years, directed evolution has developed
into a broadly applicable strategy for generating new
biomolecules with desirable properties, for probing protein
structure and function, and for addressing fundamental
questions in molecular evolution. In this approach,
random mutagenesis is used to produce a large and diverse
library of nucleic acid sequences, which is subsequently
interrogated for rare, improved variants. Myriad protocols
have been developed to produce the necessary molecular
diversity (1–3). However, our ability to generate and screen
randomized libraries is dwarfed by the amount of
molecular diversity contained in protein sequence space.
Even for a small, 100-residue protein, there are more
potential amino acid sequences than there are atoms in the
observable Universe (4).
Increasingly, it is recognized that high-quality libraries
are critical to the success of directed evolution experiments
(5,6). Previously, we argued that the likelihood of finding
a variant with a desired function in a randomized library
is maximized when the library is maximally diverse (7).
To the experimentalist, this corresponds to a library
containing as few redundant sequences (including copies
of the unmutated parental gene) and as many full-length
sequences (lacking premature termination codons) as
possible. To aid in the design of maximally diverse
libraries, we developed a suite of user-friendly programmes
for estimating the completeness and diversity that they
contain (4,8). These programmes were limited to
estimating library diversity at the nucleic acid level, and provided
no explicit information regarding the translated products
of the randomized genes. In this article, we describe an
expanded web server, which enables the analysis of protein
diversity in randomized libraries that have been generated
by site-saturation mutagenesis and error-prone PCR
(epPCR). The nucleotide programmes GLUE (for
randomization techniques where all DNA sequence variants
are equally likely), PEDEL (Programme for Estimating
Diversity in Error-prone PCR Libraries) and DRIVeR
(Diversity Resulting from In Vitro Recombination) are
still maintained on the website, and have been described
previously (4,8).
One of our previous programmes, GLUE, is broadly
applicable to any protocol where all gene variants have
an equal probability of occurring in a library. The most
commonly used example is site-saturation mutagenesis
(also referred to as oligonucleotide-directed
randomization), in which randomized bases are incorporated into
one or more of the primers in a PCR, allowing the
generation of diversity at specific sites in an amplified gene.
Other techniques that result in equally probable daughter
variants (at the DNA level) include MAX randomization
(9) and versions of DNA shuffling that utilize designed
oligonucleotides (10–12). GLUE is also a useful estimator
of the diversity in libraries generated by incremental
truncation strategies, such as Expression of Soluble Proteins
by Random Incremental Truncation (ESPRIT) (13), in
which variants are close to being equally probable (14).
We now introduce GLUE-Including Translation
(GLUE-IT), which outputs the expected amino acid-level
diversity in any site-saturation mutagenesis library
with up to six variable codons. The user specifies the fully
or partly randomized scheme used for each of the variable
codons, and the size of the library that they have
constructed (or, more often, the number of clones that they
plan to screen).
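As an illustration of this kind of calculation, the following Python sketch estimates the expected number of distinct full-length amino acid variants in a library. It is not the GLUE-IT implementation. It assumes that each clone is drawn independently, that a variant with per-clone probability p is present in a library of L clones with probability 1 - (1 - p)^L, and that variants containing a stop codon are not counted. The function name `expected_distinct_variants` and the input format (one table of amino acid to codon count per variable codon, with `*` for stop) are hypothetical.

```python
from collections import Counter

# Codon counts per amino acid for the NNK scheme (K = G or T): 32 codons,
# 20 amino acids and one stop codon (TAG). Written out by hand for this example.
NNK = {
    "A": 2, "R": 3, "N": 1, "D": 1, "C": 1, "Q": 1, "E": 1, "G": 2,
    "H": 1, "I": 1, "L": 3, "K": 1, "M": 1, "F": 1, "P": 2, "S": 3,
    "T": 2, "W": 1, "Y": 1, "V": 2, "*": 1,
}


def expected_distinct_variants(position_counts, library_size):
    """Expected number of distinct stop-free amino acid variants.

    position_counts: one dict per variable codon, mapping each amino acid
        (or "*" for stop) to the number of codons that encode it.
    library_size: number of clones, L.
    """
    # groups maps "number of DNA sequences encoding a variant" to
    # "number of amino acid variants encoded that many ways".
    groups = Counter({1: 1})
    total_dna_variants = 1
    for counts in position_counts:
        total_dna_variants *= sum(counts.values())
        extended = Counter()
        for ways, n_variants in groups.items():
            for amino_acid, n_codons in counts.items():
                if amino_acid == "*":
                    continue  # assumption: truncated variants are excluded
                extended[ways * n_codons] += n_variants
        groups = extended

    # Sum P(variant present) = 1 - (1 - p)^L over all variants, one group at a time.
    return sum(
        n_variants * (1.0 - (1.0 - ways / total_dna_variants) ** library_size)
        for ways, n_variants in groups.items()
    )


if __name__ == "__main__":
    # Three NNK codons screened with 10,000 clones.
    print(expected_distinct_variants([NNK, NNK, NNK], 10_000))
```

Grouping variants by the number of DNA sequences that encode them keeps the sum short, because all variants in a group share the same probability of being present; it avoids enumerating every amino acid sequence individually.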
We provide two tools (CodonCalculator and
AA-Calculator) to assist in choosing an appropriate
randomization scheme for library construction. On
specifying a fully or partly randomized codon, XYZ,
CodonCalculator will output the possible amino acid variants and
the number of times that each is encoded. AA-Calculator
performs the opposite function: the user can specify a
desired set of amino acids, and AA-Calculator will find the
degenerate codon(s) that are optimal for encoding them.
Up to 50 degenerate codons are listed, ranked according
to the fraction of the XYZ-specified codons that code
for the desired amino acids. AA-Calculator therefore offers
a user-friendly alternative to downloading and executing
the LibDesign algorithm (15), and provides users with a
replacement for the Combinatorial Codons programme
(16), which (as far as we are aware) is no longer
available online.
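The two calculations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the web server: the function names `codon_calculator` and `aa_calculator` are hypothetical, the standard genetic code is assumed, and the requirement that a candidate codon must encode every desired amino acid is our assumption about how the search is constrained.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codon_calculator(degenerate_codon):
    """Count how many times each amino acid (or stop) is encoded by codon XYZ."""
    expanded = product(*(IUPAC[base] for base in degenerate_codon.upper()))
    return Counter(GENETIC_CODE["".join(codon)] for codon in expanded)


def aa_calculator(desired, max_results=50):
    """Rank degenerate codons by the fraction of their codons that hit `desired`.

    Assumption: only codons that encode every desired amino acid are kept.
    """
    desired = set(desired)
    ranked = []
    for x, y, z in product(IUPAC, repeat=3):  # all 15^3 degenerate codons
        counts = codon_calculator(x + y + z)
        if not desired <= set(counts):
            continue
        on_target = sum(counts[aa] for aa in desired)
        ranked.append((on_target / sum(counts.values()), x + y + z))
    # Highest on-target fraction first; ties broken alphabetically.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [(codon, fraction) for fraction, codon in ranked[:max_results]]


if __name__ == "__main__":
    print(codon_calculator("NNK"))
    for codon, fraction in aa_calculator("FLIMV")[:5]:
        print(codon, fraction)
```

An exhaustive search over all 3375 degenerate codons is cheap enough that no pruning is needed, which is why the sketch simply enumerates them.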
On entering the randomization scheme and library size,
GLUE-IT will output a summary of statistics, including
the number of possible DNA and amino acid variants that
are encoded by each randomized codon and the total
number of possible amino acid variants in the library. The
probability of a particular variant $v_i$ being present in the
library is $1 - (1 - p_i)^L$, where $p_i$ is the probability of any
particular variant in the library being $v_i$, and $L$ is the
library size. In the case of six fully randomized (NNN)
codons, there are $20^6 = 6.4 \times 10^7$ possible variants. To
quickly calculate the expected number of distinct variants
in the library, $C = \sum_{v_i} \left[1 - (1 - p_i)^L\right]$, variants are grouped
according to the number of ways in which they can be
encoded. Each individual amino acid can be encoded by
between one and six equiprobable codons, so for six
randomized codons the (...truncated)