Computational generation and screening of RNA motifs in large nucleotide sequence pools
Namhee Kim (1), Joseph A. Izzo (1), Shereef Elmetwaly (1), Hin Hark Gan (1), Tamar Schlick (0,1)
(0) Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10021, USA
(1) Department of Chemistry, New York University, 100 Washington Square East, New York, NY 10003
Although identification of active motifs in large random sequence pools is central to RNA in vitro selection, no systematic computational equivalent of this process has yet been developed. We develop a computational approach that combines target pool generation, motif scanning and motif screening using secondary structure analysis for applications to 10^12-10^14-sequence pools; large pool sizes are made possible using program redesign and supercomputing resources. We use the new protocol to search for aptamer and ribozyme motifs in pools up to experimental pool size (10^14 sequences). We show that motif scanning, structure matching and flanking sequence analysis, respectively, reduce the initial sequence pool by 6-8, 1-2 and 1 orders of magnitude, consistent with the rare occurrence of active motifs in random pools. The final yields match the theoretical yields from probability theory for simple motifs and overestimate experimental yields, which constitute lower bounds, for aptamers because screening analyses beyond secondary structure information are not considered systematically. We also show that designed pools using our nucleotide transition probability matrices can produce higher yields for RNA ligase motifs than random pools. Our methods for generating, analyzing and designing large pools can help improve RNA design via simulation of aspects of in vitro selection.
-
RNA in vitro selection is a sensitive experimental
technology for detecting rare active motifs in random pools of up
to 10^16 sequences (1-3). The versatility of the method has
led to numerous nucleic acid molecules binding targets
(aptamers) as diverse as organic molecules, antibiotics,
proteins and whole viruses (3,4). Importantly, in vitro
selection experiments have enabled discovery of new classes
of RNA enzymes (ribozymes) and have ramifications for
biomolecular engineering, including the design of
allosteric ribozymes and aptamer-based biosensors (5-7),
and aptamers capable of inhibiting protein function for
functional genomics (8,9). Many aptamers and ribozymes
have also been developed for therapeutic applications
(10,11), such as aptamers inhibiting the TAR RNA
element of HIV-1 (12) and the human vascular endothelial
growth factor in cancer (13). See examples in Table 1.
In vitro selection of RNAs involves three essential steps:
synthesize a large sequence pool, screen the sequence pool
for aptamers or ribozymes and verify active RNA
candidates using functional assays. Initially, a DNA pool is
chemically synthesized, amplified by PCR and then
transcribed to generate the RNA pool. Ligand-binding
RNAs are detected using, for example, column
chromatography, where target ligands are bound. The
ligand-bound RNAs are selected and then reverse-transcribed
and amplified by PCR for further selection rounds (3).
Ribozymes are selected using various strategies, including
attaching chemical tags to RNAs (3). The entire pool
generation and selection process can be laborious, and
complications arise when searching for specific motifs:
selection biases may occur because detection
strategies may favor some classes of active motifs, and false
positives may require further experimental tests (14).
These technical difficulties could be ameliorated by a
systematic computational method for modeling the
process of pool generation and selection of active motifs.
More importantly, modeling could guide fruitful
experimental efforts and discourage less productive search
avenues through analysis and engineering of sequence
pools for target motifs. Reliable simulation models could
also be used to corroborate experimental results and help
to identify technical experimental problems. Ultimately,
modeling and simulation could elucidate the
physicochemical factors that dictate the presence of active RNAs in
sequence pools and relate sequence to structure and
function.
[Table 1. Surviving column header: Sequence length (nt). Footnote a: Motif yield is the number of reported active RNAs per 10^9 sequences. This value can be biased by experimental details (e.g. RNA selection strategies, threshold values for binding constants and reaction rates).]
A major challenge in computational modeling of in vitro
selection is the enormous size of sequence pools (~10^15
molecules), roughly eight orders of magnitude larger
than the human genome (~10^9 nt) for 100-nt sequence
pools. Modeling of pool generation and screening for
active RNAs requires computation of the RNAs' primary,
secondary and tertiary structures, as well as ligand
interactions. Computations involving such large pool sizes
demand the use of both novel approaches and large-scale
computing resources.
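The size comparison above can be checked with a few lines of arithmetic. This is only a sketch using the rounded order-of-magnitude figures quoted in the text (about 10^15 molecules, 100-nt sequences, about 10^9 nt for the human genome); the variable names are ours.

```python
import math

# Rounded figures quoted in the text (order-of-magnitude values only).
pool_molecules = 1e15       # approximate number of molecules in a pool
sequence_length_nt = 100    # nucleotides per sequence
human_genome_nt = 1e9       # approximate size of the human genome in nt

# Total nucleotides in the pool, and how many powers of ten
# separate it from the genome.
total_pool_nt = pool_molecules * sequence_length_nt
orders_larger = math.log10(total_pool_nt / human_genome_nt)

print(f"total nucleotides in pool: {total_pool_nt:.0e}")
print(f"orders of magnitude above the genome: {orders_larger:.1f}")
```

The "eight orders of magnitude" figure therefore refers to total nucleotide content (about 10^17 nt in the pool) rather than to the number of molecules.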
Already, various mathematical approaches have been
reported for modeling aspects of in vitro selection
(15,16). Waterman and coworkers developed a
mathematical model for in vitro selection and amplification by
relating motif selection probabilities and protein binding
constants (15). Levine and Nilsen-Hamilton (16)
quantified the convergence of in vitro selection by
providing upper and lower bounds on the number of
rounds required to enrich the pool with a specified set of
binding affinities by using an approach originally
developed by Irvine et al. (17).
Knight et al. (18) combined approximate probabilistic
analyses with a secondary structure folding algorithm that
estimates motif probability; they used this approach to
predict the frequencies of an isoleucine aptamer and
hammerhead ribozyme in random pools by folding a large
number of sequences using computing clusters. Their
investigation showed that certain regions of the composition
space are enriched with these motifs, and that their
computed yields are consistent with reported experimental
results. Recently, in an approach designed for RNA
microarray applications (19), random pools of 10^8
sequences have also been screened for RNAs binding
specific targets using a 3D folding algorithm and a
docking program.
The distribution of RNA motifs in nucleotide sequences
has also been investigated by the Cedergren (20) and
Schlick (21) groups using motif scanning programs such
as RNAMOT (22) and RNAMotif (23). These studies
highlighted the over- and under-representation of
specific RNA motifs in randomized sequences; our
additional studies using RNA graphs also led to a similar
conclusion (24). The Cedergren group identified motif
hits without structure folding, whereas the Schlick group
used folding and thermodynamic criteria to filter the
candidates. The present work extends these tools and
develops new methods to handle the voluminous data
associated with large sequence pools.
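To make the idea of scanning a random pool concrete, the sketch below counts sequence-only hits of a motif in a small random pool and compares the hit fraction with a simple probability estimate. It is an illustration under stated assumptions, not the RNAMOT or RNAMotif algorithm and not the protocol of this work: the motif is hypothetical, nucleotides are drawn uniformly, no folding or thermodynamic filtering is applied, and the estimate treats overlapping windows as independent.

```python
import random
import re

# Hypothetical motif, for illustration only: one string of allowed
# nucleotides per position (e.g. "AG" means A or G at that position).
HYPOTHETICAL_MOTIF = ["G", "G", "ACGU", "AG", "A", "C", "C"]

def motif_regex(motif):
    """Build a regular expression with one character class per position."""
    return re.compile("".join(f"[{allowed}]" for allowed in motif))

def random_pool(n_sequences, length, rng):
    """Generate sequences with uniform, independent nucleotides."""
    return ["".join(rng.choices("ACGU", k=length)) for _ in range(n_sequences)]

def hit_fraction(pool, motif):
    """Fraction of sequences containing at least one motif match."""
    regex = motif_regex(motif)
    hits = sum(1 for seq in pool if regex.search(seq))
    return hits / len(pool)

def expected_fraction(motif, length):
    """Estimate P(at least one match), treating windows as independent."""
    p_window = 1.0
    for allowed in motif:
        p_window *= len(allowed) / 4.0
    n_windows = length - len(motif) + 1
    return 1.0 - (1.0 - p_window) ** n_windows

rng = random.Random(0)  # fixed seed so the run is repeatable
pool = random_pool(20000, 60, rng)
observed = hit_fraction(pool, HYPOTHETICAL_MOTIF)
predicted = expected_fraction(HYPOTHETICAL_MOTIF, 60)
print(f"observed {observed:.4f}, predicted {predicted:.4f}")
```

A pool of 20 000 sequences is many orders of magnitude smaller than the pools discussed here; the point is only that, for a simple sequence motif, the scanned hit fraction should track the probability estimate. Structure-based filtering would reduce the hit count further.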
Recently, we have developed a mathematical tool for
generating pools by nucleotide transition probability
matrices (or mixi (...truncated)