Predicting genes expressed via −1 and +1 frameshifts
Sanghoon Moon (2), Yanga Byun (2), Hong-Jin Kim (1,2), Sunjoo Jeong (0,2), Kyungsook Han (2)
0 Department of Molecular Biology, Dankook University, Seoul 140-714, Korea
1 College of Pharmacy, Chung-Ang University, Seoul 156-756, Korea
2 School of Computer Science and Engineering, Inha University, Inchon 402-751, Korea
Computational identification of ribosomal frameshift sites in genomic sequences is difficult due to their diverse nature, yet it provides useful information for understanding the underlying mechanisms and discovering new genes. We have developed an algorithm that searches entire genomic or mRNA sequences for frameshifting sites, and implemented it as a web-based program called FSFinder (Frameshift Signal Finder). The current version of FSFinder is capable of finding −1 frameshift sites on heptamer sequences X XXY YYZ, and +1 frameshift sites for two genes: protein chain release factor B (prfB) and ornithine decarboxylase antizyme (oaz). We tested FSFinder on 190 genomic and partial DNA sequences from a number of organisms and found that it predicted frameshift sites efficiently and with greater sensitivity and specificity than existing approaches. Its sensitivity is improved because it considers many known components of a frameshifting cassette and searches for these components on both the + and − strands; its specificity is increased because it focuses on overlapping regions of open reading frames and prioritizes candidate frameshift sites. FSFinder is useful for discovering unknown genes that utilize alternative decoding, as well as for analyzing frameshift sites. It is freely accessible at http://wilab.inha.ac.kr/FSFinder/.
INTRODUCTION
Programmed ribosomal frameshifting is involved in the
expression of certain genes in a wide range of organisms
such as viruses, bacteria and eukaryotes including humans
(1–5). In this process, the ribosome switches to an alternative
frame at a specific site in response to special signals in the
messenger RNA (4). Programmed frameshifting plays a
significant role in morphogenesis, autogenous control and in
producing alternative enzymatic activities (6).
The most common frameshift is a −1 frameshift, in which
the ribosome slips a single nucleotide in the upstream
direction. The major elements of −1 frameshifting consist
of a slippery site, where the ribosome changes reading frames,
and a stimulatory RNA structure such as a pseudoknot or a
stem–loop located a few nucleotides downstream (4,6–9). It is
generally accepted that ribosomes pause at −1 frameshift sites, but
Kontos et al. (7) report that pausing is not sufficient to mediate
frameshifting. Most slippery sites consist of a heptameric
sequence of the form X XXY YYZ in the incoming
0-frame (10), but there are other slippery sequences that do
not conform to this motif (5). The slippery heptamer is
separated from the stimulatory structure by a sequence of
5–9 nt, the so-called spacer (3,8). The length of the spacer
is known to influence the efficiency of frameshifting.
Frameshifts typically produce fusion proteins in which the
N- and C-terminal domains are encoded by overlapping open
reading frames (ORFs) (9), as shown in Figure 1.
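The heptamer motif described above can be illustrated with a minimal Python sketch. It is not FSFinder's implementation: the function name `find_slippery_heptamers` and its `frame` parameter are hypothetical, the input is assumed to be an RNA or DNA string (T is treated as U), and only the generic X XXY YYZ pattern is checked, with no restriction on which nucleotides X, Y and Z may be.

```python
def find_slippery_heptamers(seq, frame=0):
    """Return (position, heptamer) pairs matching X XXY YYZ.

    position is the 0-based index of the first X. A hit is reported only
    when the codons XXY and YYZ lie in the incoming 0-frame, whose codons
    are assumed to start at indices frame, frame + 3, frame + 6, ...
    """
    rna = seq.upper().replace("T", "U")  # accept DNA input as well
    hits = []
    for i in range(len(rna) - 6):
        h = rna[i:i + 7]
        if any(c not in "ACGU" for c in h):
            continue  # skip windows with ambiguous symbols
        xxx_ok = h[0] == h[1] == h[2]      # three identical X
        yyy_ok = h[3] == h[4] == h[5]      # three identical Y
        in_frame = (i + 1 - frame) % 3 == 0  # codon XXY starts at i + 1
        if xxx_ok and yyy_ok and in_frame:
            hits.append((i, h))
    return hits
```

For example, `find_slippery_heptamers("GCUUUUUUAGGG")` reports the heptamer UUUUUUA at index 2, because the codons UUU and UUA start at indices 3 and 6 of the 0-frame. A real predictor would additionally look for a stimulatory structure downstream of the spacer, which this sketch does not attempt.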
+1 frameshifts are much less common than −1 frameshifts
but have been observed in diverse organisms (6). Escherichia
coli prfB encoding release factor 2 (RF2) is a well-known gene
that utilizes +1 frameshifting (11,12). In RF2 frameshifting,
a Shine–Dalgarno (SD) sequence is often observed upstream
of a slippery sequence, normally CUU UGA C and in a single
known case CUU UAA C (12). Several +1 frameshift sites
have also been recognized in eukaryotic mRNA. For example,
the expression of mammalian antizyme 1 (AZ1) requires a +1
frameshift, and the frameshift signal consists of a slippery
sequence and two stimulatory elements: a sequence of
unknown function, upstream of the slippery sequence, and a
pseudoknot (13).
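The RF2-type signal described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of any cited program: the function name `find_prfb_candidates` is hypothetical, the default SD motif (`AGGAGG`, the textbook consensus) and the 20 nt upstream window are assumed values that do not come from the text, and the reading frame of the match is not checked.

```python
import re

# Slippery sequences named in the text: CUU UGA C (usual case) and
# CUU UAA C (single known case). The lookahead allows overlapping matches.
PRFB_SLIPPERY = re.compile(r"(?=(CUUU[GA]AC))")

def find_prfb_candidates(seq, sd_motif="AGGAGG", window=20):
    """List candidate RF2-type +1 frameshift sites in seq.

    sd_motif and window are illustrative assumptions: each hit is flagged
    according to whether sd_motif occurs within `window` nt upstream.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA input as well
    results = []
    for m in PRFB_SLIPPERY.finditer(rna):
        start = m.start()
        upstream = rna[max(0, start - window):start]
        results.append({
            "position": start,            # 0-based index of the first C
            "site": m.group(1),           # matched slippery sequence
            "sd_upstream": sd_motif in upstream,
        })
    return results
```

Calling `find_prfb_candidates("AAAGGAGGUAUCUUUGACUA")` yields one candidate at index 11 with `sd_upstream` set, whereas a bare `CTTTAAC` input yields a candidate without upstream SD support. Both the motif and the window would need tuning against known prfB sequences before any practical use.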
Computational identification of frameshift sites from
genomic sequences is difficult since the sequence requirements
for frameshifting cassettes are diverse and highly dependent
on the organism. Several computational approaches have been
attempted, but only a few are publicly available. The model for
eukaryotic −1 frameshifting developed by Bekaert et al. (8)
only considers H-type pseudoknots as stimulatory structures
and misses many frameshift sites with other stimulatory
structures. Hammell et al. (9) developed a program to identify −1
frameshift sites in prokaryotic and eukaryotic DNA sequences,
but the sensitivity of their approach is low; it misses many
frameshift sites because it only considers downstream
pseudoknots, and its definition of a pseudoknot is too restrictive.
For example, their approach does not locate the frameshift
sites in Rous sarcoma virus (RSV), because loops 1 and 2
of the pseudoknot are larger than permitted by their approach.
FreqAnalysis, developed by Shah et al. (14), can
identify simple novel slippery sequences, but it does not take into
consideration the existence of stimulators. A semi-automated
approach by Ivanov et al. (13) finds a gene where antizyme
frameshifting is expected to occur and then identifies the
frameshift. While this approach has been shown to be
successful for identifying ornithine decarboxylase antizyme
(oaz) frameshifting, it is not generally applicable to other genes. There are also
computational approaches that identify frameshifting errors
in sequencing when the reference protein sequences are
available (15–17).
In this paper, we present an algorithm for locating −1 and
+1 frameshift sites of certain types in genomic or mRNA
sequences. The algorithm is intended to find −1 frameshift
sites of X XXY YYZ type in viruses, bacteria and eukaryotes,
and considers pseudoknots as well as simple stem–loops as
downstream stimulatory structures. It also allows the user to
change the stem and loop sizes from their default values. +1
frameshift signals, by contrast, are highly diverse among organisms.
Therefore, the algorithm currently finds only those frameshift
sites that are conserved among many species, namely
frameshift sites used in genes encoding protein chain release factor
B (prfB) and ornithine decarboxylase antizyme (oaz). The
algorithm has been implemented as a web-based application
program called FSFinder (Frameshift Signal Finder), and is
accessible at http://wilab.inha.ac.kr/FSFinder/.
COMPUTATIONAL MODEL
Components of frameshift signals
We have modified the computational model for −1 frameshift
signals of Hammell et al. (9) to improve its sensitivity and
selectivity. Sequences of three codons (9 nt) in a genomic
sequence are first examined for possible slippery sequences
of the form X XXY YYZ. In this sequence X and Z can be any
nucleotide, and Y can be A or U (in Hammell's model, Z is
either A, U or C). If a slippery sequence is identified, FSFinder
searches for a downstream structure by sliding 4–11 nt along
the spacer. Figure 2 shows a programmed −1 frameshift site
with a pseudoknot as stimulatory structure. The pseudoknot (...truncated)