rSNP_Guide, a database system for analysis of transcription factor binding to target sequences: application to SNPs and site-directed mutations
Julia V. Ponomarenko, Tatyana I. Merkulova, Gennady V. Vasiliev, Zoya B. Levashova, Galina V. Orlova, Sergey V. Lavryushev, Oleg N. Fokin, Mikhail P. Ponomarenko, Anatoly S. Frolov and Akinori Sarai
Institute of Cytology and Genetics, 10 Lavrentyev Avenue, Novosibirsk, 630090, Russia
The Institute of Physical and Chemical Research, RIKEN, 3-1-1 Koyadai, Tsukuba, Japan
rSNP_Guide is a novel curated database system for analysis of transcription factor (TF) binding to target sequences in regulatory gene regions altered by mutations. It accumulates experimental data on naturally occurring site variants in regulatory gene regions and on site-directed mutations. The database system also contains web tools for SNP analysis, i.e., an active applet that applies weight matrices to predict the regulatory site candidates altered by a mutation. The current version of rSNP_Guide is supplemented by six sub-databases: (i) rSNP_DB, on DNA-protein interaction caused by mutation; (ii) SYSTEM, on experimental systems; (iii) rSNP_BIB, on citations to original publications; (iv) SAMPLES, on experimentally identified sequences of known regulatory sites; (v) MATRIX, on weight matrices of known TF sites; (vi) rSNP_Report, on characteristic examples of successful rSNP_Tools implementation. These databases are useful for the analysis of natural SNPs and site-directed mutations. The databases are available through the Web at http://wwwmgs.bionet.nsc.ru/mgs/systems/rsnp/.
Application of Single Nucleotide Polymorphism (SNP) analysis
to the human genome is currently among the greatest challenges
presented by the human genome sequence initiative (1). This
novel research field permits exploration of the influence of
specific sequence alterations on disease susceptibility, drug
resistance/sensitivity and ultimately health care. The number
of experimentally detected SNPs is growing tremendously.
Currently the HGMD database (2) contains more than 10 000
SNPs that alter codon translation, more than 1000 that affect
splice sites, and fewer than 200 that influence gene regulatory
regions. In the databases dbSNP (3), HGBASE (4), ALFRED
(5) and OMIM (6), SNPs in regulatory and coding regions are
represented in a similar ratio. Evidently, functional alterations of
highly conserved codons and splice sites, which change
protein structure and function, are detected more easily than
alterations of less conserved regulatory regions such as promoters, enhancers,
silencers, introns, etc. (7). Recent experiments (8–10) have
shown that regulatory SNPs may manifest in several ways,
including: (i) alteration of function of a site important for
normal regulation; (ii) a difference in affinity of protein
binding at such a site; or (iii) acquired function of a site not
normally participating in proper regulation. Thus, as has been
shown experimentally (9,10), the influence of an SNP cannot
be predicted reliably merely by inspecting the local region for
potential regulatory elements similar to known
sequences.
Although SNP analysis is only now being applied to regulatory
regions, it is being developed using experimental findings in
the databases TRANSFAC (11), TRRD (12), COMPEL (13),
ACTIVITY (14) and others, which accumulate information not
only about naturally occurring site variants, but also resulting
from intentional (site-directed) mutagenesis. Among the latter
artificial variants, site-directed mutagenesis altering several
nucleotides is more informative for SNP analysis of regulatory
DNA regions than deletions, insertions or hybrid constructs.
Since disease penetrance may be affected not only by the
presence or absence of a transcription factor (TF) binding site
in a regulatory region, but also by quantitative alterations of
binding efficiency [e.g., erythroid-specific DNA-binding
protein(s) affinity alterations cause -thalassemia; 15], the data
on sequence-activity relationships are informative for SNP
analysis of regulatory regions. We anticipate that further
development of the present database will actually have prescriptive
value for specific applications in disease.
From this perspective, our web-resource rSNP_Guide integrates
experimental data on natural SNPs with sequence variations
generated artificially. The core of this resource is the database
rSNP_DB. It compiles data on alterations in DNA binding by
nuclear proteins observed due to natural and experimental
sequence variations. This information is represented in a
simple format adapted for computer analysis. rSNP_DB is
supplemented by four databases: (i) SYSTEM, experimental
conditions; (ii) rSNP_BIB, references to original publications;
(iii) SAMPLES, multiple alignments of known TF site
sequences; and (iv) MATRIX, weight matrices for TF site
recognition. To apply the information stored in these databases
to SNP analysis of DNA regulatory regions, we have developed
the JavaScript applet rSNP_Tools. We have tested
rSNP_Tools on a series of examples representing both
naturally occurring mutations and relevant artificial constructs.
These test results are documented in the rSNP_Report database
and are helpful for analysis of SNPs and mutagenesis.
rSNP_Guide is available through the Web at
http://wwwmgs.bionet.nsc.ru/mgs/programs/rsnp/.
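The general idea of using a weight matrix to compare two alleles can be illustrated with a minimal sketch. It does not reproduce rSNP_Tools: it is written in Python rather than JavaScript, and it assumes a log-odds matrix built from aligned site counts with a uniform background and a pseudocount, none of which is specified in the text. The count table, the example sequences and the function names (log_odds_matrix, best_hit, compare_alleles) are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical counts from 10 aligned 6-bp sites (one dict per site position).
# Illustrative values only; NOT taken from the SAMPLES or MATRIX databases.
COUNTS = [
    {"A": 7, "C": 1, "G": 1, "T": 1},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 7, "C": 0, "G": 3, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    Assumes a uniform background frequency and an additive pseudocount.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def best_hit(matrix, sequence):
    """Return (score, offset) of the best-scoring window on the given strand.

    Only A, C, G and T are accepted; the reverse strand is not scanned.
    """
    sequence = sequence.upper()
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((score, offset))
    return max(hits)


def compare_alleles(matrix, reference, variant):
    """Score both alleles and report the change in the best site score."""
    ref_score, ref_offset = best_hit(matrix, reference)
    var_score, var_offset = best_hit(matrix, variant)
    return {
        "reference_score": ref_score,
        "reference_offset": ref_offset,
        "variant_score": var_score,
        "variant_offset": var_offset,
        "delta": var_score - ref_score,
    }


if __name__ == "__main__":
    weights = log_odds_matrix(COUNTS)
    # Hypothetical alleles differing by one substitution inside the site.
    print(compare_alleles(weights, "CCAGATAAGG", "CCAGCTAAGG"))
```

A negative delta suggests that the substitution weakens the best candidate site, and a positive delta suggests a strengthened or newly created candidate. As noted above, such a score is only a prediction; the effect of an SNP on protein binding has to be confirmed experimentally.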
DATA REPRESENTATION
A graphical representation of the rSNP_Guide components and
sources of information is given in Figure 1. In this figure, the
arrows link the components of the rSNP_Guide and related
web resources. Initial information on the naturally occurring
mutations is extracted from original publications and the
databases HGMD (2), dbSNP (3), HGBASE (4), ALFRED (5) and
OMIM (6), whereas the site-directed mutagenesis data are
taken from TRANSFAC (11), TRRD (12), COMPEL (13) and
ACTIVITY (14). Using the original publications (rSNP_BIB),
we document the experimental conditions (SYSTEM). Taking
into account experimental conditions, the data on alterations in
nuclear protein binding to DNA with point mutations are
accumulated in rSNP_DB. Next, typical examples of the
rSNP_DB entries are chosen and investigated using the
JavaScript applet rSNP_Tools, which uses SAMPLES and
MATRIX (16). Finally, the results are stored in the database
rSNP_Report.
Each entry of the core database, rSNP_DB, contains the
information on DNA-protein interaction alterations caused by
mutation. Each entry has 16 descriptive fields (Fig. 2).
The field names are color-coded. If a user clicks a field
name, the Help function is activated in a separate window,
which contains information about formatting the data, examples,
etc. With the keywords, the database can be queried using SRS
(17).
The second database, SYSTEM, accumulates
data on experimental systems. Each entry has nine descriptive
fields. As in rSNP_DB, each field is supported
by the Help function. The detailed description of the SYSTEM
format is given in (14). The third database (...truncated)