Dependence among Sites in RNA Evolution

Aug 2006

Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed.

Article PDF cannot be displayed. You can download it here:

https://mbe.oxfordjournals.org/content/23/8/1525.full.pdf

Dependence among Sites in RNA Evolution

Jiaye Yu 0 1 Jeffrey L. Thorne 0 1 0 The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions , please 1 Bioinformatics Research Center, North Carolina State University Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed. - secondary structure. We discuss the strengths and weaknesses of our approach as well as some of its potential applications. Materials and Methods Parameterization We introduce a Markovian model for evolution of RNA-encoding regions with constraints due to secondary structure. In terms of both parameterization and statistical inference, this study closely parallels the protein evolution work of Robinson et al. (2003). Because we do not use the assumption of independent changes among sites, we define an instantaneous rate matrix R that specifies rates of change from each possible sequence to each other possible sequence. All sequences are assumed to have length N, and we allow no more than one position of a sequence to change in a particular instant. This means that a rate of change Rij from sequence i to sequence j equals 0 if i and j differ at more than one position. As a result, the rate matrix R is sparse. In each matrix row i, the diagonal entry is negative, and the entries for the 3N neighboring sequences j that are different from i at exactly one position are positive. The remaining 4N (3N 1 1) entries in row i must be 0. The values of the 3N entries in each row of R that can be positive depend on how the corresponding substitutions affect the approximate free energy. The biologically plausible expectation is that a rate should be relatively high if the substitution improves the stability of RNA secondary structure. Likewise, if the secondary structure becomes less stable due to a specific substitution event, then the rate should be low. To assess the stability of a sequence folded into a specific secondary structure, we approximate the free energy of the sequence. This approximate free energy will be denoted E(i), and details for how it is calculated in our implementation will be provided below. Otherwise, our parameterization is similar to that of widely used nucleotide models. We include parameters pA, pG, pC, and pT (pA 1 pG 1 pC 1 pT 5 1 and these parameters are all positive) so that mutation rates to the 4 nucleotide types need not be equal. Here, we are studying the evolution of DNA sequences that encode RNAs. Although all thymines are transcribed to uracils, evolution occurs at the DNA level, and we therefore use pT rather than pU. The parameter j differentiates between transitions and transversions. The instantaneous substitution rate Rij is set to 0 if sequences i and j differ at more than one site. For cases where i and j differ by exactly one site that has type h (h 2 fA, C, G, Tg) in j, the rate matrix entries are where u is a rate-scaling factor. When the parameter s is zero, secondary structure does not affect the substitution rates, and our model reduces to the widely used HKY independent site model (Hasegawa et al. 1985). The biologically reasonable s values are positive because these mean that low free energy of RNA secondary structure is favored by evolution. It is formally possible to have a negative value for s. This would indicate that higher free energies and unstable secondary structures are favored by evolution. This scenario is biologically implausible for most situations. A partial validation of our approach would be to determine whether data support positive values for s. Stationary Probabilities of Sequences For this model, the stationary probability of a sequence i with length N is where im is the nucleotide type at position m of sequence i and where kn is the nth position of a sequence k that has n total residues. Here, h represents all the parameters in this dependence model. Inspection verifies that detailed balance is fulfilled (p(ijh)Rij 5 p(jjh)Rji for all possible i and j), and therefore, the model is time reversible. The denominator of the equation for p(ijh) is the summation over all possible sequences with length N. When s 5 0, the denominator is 1, and the stationary distribution becomes the product of nucleotide frequencies. If s is substantially above 0, the stationary distribution is concentrated among sequences with a particularly good fit between sequence and secondary structure. Likewise, if s is substantially below 0, the stationary distribution is characterized by a relatively small number of sequences with particularly high free energies. The dependence of this stationary distribution on s means that s can be estimated from a single sequence. Therefore, it is not necessary to have a data set with 2 or more related sequences to get information about the impact of RNA secondary structure on evolution. Sequence Path Density For conventional models of sequence change, the dimension of the substitution rate matrix is manageable (e.g., a codon-based model with a 61 3 61 rate matrix), and matrix exponentiation can be applied to calculate a transition probability. With the dependence structure adopted here, the rate matrix R is 4N 3 4N, and it is not feasible to exponentiate R unless N is extremely small. To overcome this high dimensionality, we employ a sequence path approach (Jensen and Pedersen 2000; Pedersen an (...truncated)


This is a preview of a remote PDF: https://mbe.oxfordjournals.org/content/23/8/1525.full.pdf
Article home page: http://mbe.oxfordjournals.org/content/23/8/1525.abstract

Jiaye Yu, Jeffrey L. Thorne. Dependence among Sites in RNA Evolution, 2006, pp. 1525-1537, 23/8, DOI: 10.1093/molbev/msl015