Computational generation and screening of RNA motifs in large nucleotide sequence pools

https://nar.oxfordjournals.org/content/38/13/e139.full.pdf

Namhee Kim [1], Joseph A. Izzo [1], Shereef Elmetwaly [1], Hin Hark Gan [1], Tamar Schlick [0, 1]

[0] Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10021, USA
[1] Department of Chemistry, New York University, 100 Washington Square East, New York, NY 10003

Although identification of active motifs in large random sequence pools is central to RNA in vitro selection, no systematic computational equivalent of this process has yet been developed. We develop a computational approach that combines target pool generation, motif scanning and motif screening using secondary structure analysis, for application to pools of 10^12 to 10^14 sequences; such large pool sizes are made possible by program redesign and supercomputing resources. We use the new protocol to search for aptamer and ribozyme motifs in pools up to experimental pool size (10^14 sequences). We show that motif scanning, structure matching and flanking sequence analysis reduce the initial sequence pool by 6-8, 1-2 and 1 orders of magnitude, respectively, consistent with the rare occurrence of active motifs in random pools. For simple motifs, the final yields match the theoretical yields from probability theory; for aptamers, they overestimate experimental yields (which constitute lower bounds) because screening analyses beyond secondary structure information are not considered systematically. We also show that pools designed using our nucleotide transition probability matrices can produce higher yields for RNA ligase motifs than random pools. Our methods for generating, analyzing and designing large pools can help improve RNA design via simulation of aspects of in vitro selection.

RNA in vitro selection is a sensitive experimental technology for detecting rare active motifs in random pools of up to 10^16 sequences (1-3).
The versatility of the method has led to numerous nucleic acid molecules that bind targets (aptamers) as diverse as organic molecules, antibiotics, proteins and whole viruses (3,4). Importantly, in vitro selection experiments have enabled the discovery of new classes of RNA enzymes (ribozymes) and have ramifications for biomolecular engineering, including the design of allosteric ribozymes and aptamer-based biosensors (5-7), and of aptamers capable of inhibiting protein function for functional genomics (8,9). Many aptamers and ribozymes have also been developed for therapeutic applications (10,11), such as aptamers inhibiting the TAR RNA element of HIV-1 (12) and the human vascular endothelial growth factor in cancer (13). See examples in Table 1.

In vitro selection of RNAs involves three essential steps: synthesizing a large sequence pool, screening the pool for aptamers or ribozymes, and verifying active RNA candidates using functional assays. Initially, a DNA pool is chemically synthesized, amplified by PCR and then transcribed to generate the RNA pool. Ligand-binding RNAs are detected using, for example, column chromatography, where target ligands are bound. The ligand-bound RNAs are selected and then reverse-transcribed and amplified by PCR for further selection rounds (3). Ribozymes are selected using various strategies, including attaching chemical tags to RNAs (3). The entire pool generation and selection process can be laborious, and complications arise when searching for specific motifs: selection biases may occur because detection strategies may favor some classes of active motifs, and false positives may require further experimental tests (14). These technical difficulties could be ameliorated by a systematic computational method for modeling the process of pool generation and selection of active motifs.
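As a toy illustration of what modeling pool generation and motif selection can look like, the sketch below builds a small uniform random pool, scans it for a fixed sequence motif, and compares the hit count with a first-order probability estimate. It is not the protocol described in this article, which relies on secondary structure analysis and far larger pools. The function names, the motif GGAUCC and the pool dimensions are hypothetical choices made only for illustration.

```python
import random

NUCLEOTIDES = "ACGU"


def random_pool(n_sequences, length, rng):
    """Generate a pool of random RNA sequences.

    Assumption: every position is drawn independently and uniformly
    from A, C, G, U (an unbiased random pool).
    """
    return ["".join(rng.choices(NUCLEOTIDES, k=length))
            for _ in range(n_sequences)]


def count_hits(pool, motif):
    """Count sequences that contain the motif at least once."""
    return sum(motif in seq for seq in pool)


def expected_hits(n_sequences, length, motif_length):
    """First-order estimate of the number of motif-containing sequences.

    Uses N * (L - k + 1) / 4**k, i.e. the number of start positions times
    the probability of matching k fixed nucleotides. This slightly
    overcounts sequences with several hits, so it is only reasonable
    when hits are rare.
    """
    return n_sequences * (length - motif_length + 1) / 4 ** motif_length


# Hypothetical demo: 20,000 sequences of 30 nt, scanned for a 6-nt motif.
rng = random.Random(2010)      # fixed seed so the run is repeatable
pool = random_pool(20_000, 30, rng)
motif = "GGAUCC"               # illustrative motif, not taken from the article
print("observed hits:", count_hits(pool, motif))
print("estimated hits:", round(expected_hits(len(pool), 30, len(motif)), 1))
```

Pools of experimental size cannot be enumerated in memory this way, so a sketch like this only helps to check the arithmetic on a small scale.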
More importantly, modeling could guide fruitful experimental efforts and discourage less productive search avenues through analysis and engineering of sequence pools for target motifs. Reliable simulation models could also be used to corroborate experimental results and help to identify technical experimental problems. Ultimately, modeling and simulation could elucidate the physicochemical factors that dictate the presence of active RNAs in sequence pools and relate sequence to structure and function.

[Table 1 is not reproduced here. Surviving column header: Sequence length (nt). Footnote a: Motif yield is the number of reported active RNAs per 10^9 sequences. This value can be biased by experimental details (e.g. RNA selection strategies, threshold values for binding constants and reaction rates).]

A major challenge in computational modeling of in vitro selection is the enormous size of sequence pools (~10^15 molecules), which for 100-nt sequences amounts to roughly eight orders of magnitude more nucleotides than the human genome (~10^9 nt). Modeling pool generation and screening for active RNAs requires computing the primary, secondary and tertiary structures of RNAs, as well as their ligand interactions. Computations involving such large pool sizes demand both novel approaches and large-scale computing resources.

Various mathematical approaches have already been reported for modeling aspects of in vitro selection (15,16). Waterman and coworkers developed a mathematical model for in vitro selection and amplification by relating motif selection probabilities and protein binding constants (15). Levine and Nilsen-Hamilton (16) quantified the convergence of in vitro selection by providing upper and lower bounds on the number of rounds required to enrich the pool with a specified set of binding affinities, using an approach originally developed by Irvine et al. (17). Knight et al.
(18) combined approximate probabilistic analyses with a secondary structure folding algorithm to estimate motif probability; they used this approach to predict the frequencies of an isoleucine aptamer and a hammerhead ribozyme in random pools by folding a large number of sequences on computing clusters. Their investigation showed that certain regions of the composition space are enriched with these motifs, and that their computed yields are consistent with reported experimental results. Recently, in an approach designed for RNA microarray applications (19), random pools of 10^8 sequences have also been screened for RNAs binding specific targets using a 3D folding algorithm and a docking program.

The distribution of RNA motifs in nucleotide sequences has also been investigated by the Cedergren (20) and Schlick (21) groups using motif scanning programs such as RNAMOT (22) and RNAMotif (23). These studies highlighted the over- and under-representation of specific RNA motifs in randomized sequences; our additional studies using RNA graphs led to a similar conclusion (24). The Cedergren group identified motif hits without structure folding, whereas the Schlick group used folding and thermodynamic criteria to filter the candidates. The present work extends these tools and develops new methods to handle the voluminous data associated with large sequence pools. Recently, we have developed a mathematical tool for generating pools by nucleotide transition probability matrices (or mixi (...truncated)