Dependence among Sites in RNA Evolution
Jiaye Yu
0
1
Jeffrey L. Thorne
0
1
0
The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions
,
please
1
Bioinformatics Research Center, North Carolina State University
Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed.
-
secondary structure. We discuss the strengths and weaknesses
of our approach as well as some of its potential applications.
Materials and Methods
Parameterization
We introduce a Markovian model for evolution of
RNA-encoding regions with constraints due to secondary
structure. In terms of both parameterization and statistical
inference, this study closely parallels the protein evolution
work of Robinson et al. (2003). Because we do not use the
assumption of independent changes among sites, we define
an instantaneous rate matrix R that specifies rates of change
from each possible sequence to each other possible
sequence. All sequences are assumed to have length N,
and we allow no more than one position of a sequence
to change in a particular instant. This means that a rate
of change Rij from sequence i to sequence j equals 0 if
i and j differ at more than one position. As a result, the
rate matrix R is sparse. In each matrix row i, the diagonal
entry is negative, and the entries for the 3N neighboring
sequences j that are different from i at exactly one position
are positive. The remaining 4N (3N 1 1) entries in row
i must be 0.
The values of the 3N entries in each row of R that can
be positive depend on how the corresponding substitutions
affect the approximate free energy. The biologically
plausible expectation is that a rate should be relatively high if
the substitution improves the stability of RNA secondary
structure. Likewise, if the secondary structure becomes less
stable due to a specific substitution event, then the rate
should be low. To assess the stability of a sequence folded
into a specific secondary structure, we approximate the free
energy of the sequence. This approximate free energy will
be denoted E(i), and details for how it is calculated in our
implementation will be provided below.
Otherwise, our parameterization is similar to that of
widely used nucleotide models. We include parameters
pA, pG, pC, and pT (pA 1 pG 1 pC 1 pT 5 1 and these
parameters are all positive) so that mutation rates to the
4 nucleotide types need not be equal. Here, we are studying
the evolution of DNA sequences that encode RNAs.
Although all thymines are transcribed to uracils, evolution
occurs at the DNA level, and we therefore use pT rather than
pU. The parameter j differentiates between transitions and
transversions. The instantaneous substitution rate Rij is set
to 0 if sequences i and j differ at more than one site. For
cases where i and j differ by exactly one site that has type
h (h 2 fA, C, G, Tg) in j, the rate matrix entries are
where u is a rate-scaling factor.
When the parameter s is zero, secondary structure does
not affect the substitution rates, and our model reduces to
the widely used HKY independent site model (Hasegawa
et al. 1985). The biologically reasonable s values are
positive because these mean that low free energy of RNA
secondary structure is favored by evolution. It is formally
possible to have a negative value for s. This would indicate
that higher free energies and unstable secondary structures
are favored by evolution. This scenario is biologically
implausible for most situations. A partial validation of our
approach would be to determine whether data support positive
values for s.
Stationary Probabilities of Sequences
For this model, the stationary probability of a sequence
i with length N is
where im is the nucleotide type at position m of sequence
i and where kn is the nth position of a sequence k that has
n total residues. Here, h represents all the parameters in this
dependence model. Inspection verifies that detailed balance
is fulfilled (p(ijh)Rij 5 p(jjh)Rji for all possible i and j), and
therefore, the model is time reversible. The denominator of
the equation for p(ijh) is the summation over all possible
sequences with length N. When s 5 0, the denominator
is 1, and the stationary distribution becomes the product
of nucleotide frequencies. If s is substantially above 0,
the stationary distribution is concentrated among sequences
with a particularly good fit between sequence and
secondary structure. Likewise, if s is substantially below 0, the
stationary distribution is characterized by a relatively small
number of sequences with particularly high free energies.
The dependence of this stationary distribution on s means
that s can be estimated from a single sequence. Therefore, it
is not necessary to have a data set with 2 or more related
sequences to get information about the impact of RNA
secondary structure on evolution.
Sequence Path Density
For conventional models of sequence change, the
dimension of the substitution rate matrix is manageable (e.g.,
a codon-based model with a 61 3 61 rate matrix), and
matrix exponentiation can be applied to calculate a transition
probability. With the dependence structure adopted here,
the rate matrix R is 4N 3 4N, and it is not feasible to
exponentiate R unless N is extremely small. To overcome this
high dimensionality, we employ a sequence path approach
(Jensen and Pedersen 2000; Pedersen an (...truncated)