Understanding human disease mutations through the use of interspecific genetic variation
Mark P. Miller
Sudhir Kumar
Department of Biology, Arizona State University, Tempe, AZ 85287-1501, USA
Data on replacement mutations in genes of disease patients exist in a variety of online resources. In addition, genome sequencing projects and individual gene sequencing efforts have led to the identification of disease gene homologs in diverse metazoan species. The availability of these two types of information provides unique opportunities to investigate factors that are important in the development of genetically based disease by contrasting long- and short-term molecular evolutionary patterns. Therefore, we conducted an analysis of disease-associated human genetic variation for seven disease genes: the cystic fibrosis transmembrane conductance regulator, glucose-6-phosphate dehydrogenase, the neural cell adhesion molecule L1, phenylalanine hydroxylase, paired box 6, the X-linked retinoschisis gene and TSC2/tuberin. Our analyses indicate that disease mutations show definite patterns when examined from an evolutionary perspective. Human replacement mutations resulting in disease are overabundant at amino acid positions most conserved throughout the long-term history of metazoans. In contrast, human polymorphic replacement mutations and silent mutations are randomly distributed across sites with respect to the level of conservation of amino acid sites within genes. Furthermore, disease-causing amino acid changes are of types usually not observed among species. Using Grantham's chemical difference matrix, we find that amino acid changes observed in disease patients are far more radical than the variation found among species and in non-diseased humans. Overall, our results demonstrate the usefulness of evolutionary analyses for understanding patterns of human disease mutations and underscore the biomedical significance of sequence data currently being generated from various model organism genome sequencing projects.
One central purpose of genome sequencing projects is to effect
a better understanding of the genetics of disease and provide
assistance with the identification of disease-associated genes
(1–3). However, many human mutation databases containing
genetic variation found in disease patients already exist, and
new databases and database entries are rapidly accumulating
(4,5). Concomitant analysis of these two types of information
provides unique opportunities to identify intrinsic attributes of
disease-associated human genetic variation, leading to a better
understanding of the relationship between mutations and the
development of disease phenotypes.
Information contained in the alignments of homologous
disease-associated genes has long been recognized as an
important factor for understanding contemporary deleterious
genetic variation in humans (4,6). For example, in a given set
of homologous genes, a large fraction of amino acid sites will
be conserved even among distantly related species that
diverged hundreds of millions of years ago. Variations that
arose at such positions throughout evolutionary history have
evidently been under strong purifying selection and eliminated
from populations, suggesting that the existing amino acid
residues at invariant positions are critical for proper gene function.
Thus, information from interspecific alignments can indicate
amino acid residues in gene products that are likely to produce
disease if mutated in humans. Likewise, some positions in
protein sequences vary among species, and such variable sites
may indicate positions that are under less severe selective
constraints. These variable positions suggest sites where
residue changes can be tolerated by natural selection and
provide insights into the types of amino acids that can be freely
exchanged without negatively impacting protein function.
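The distinction between conserved and variable positions can be made concrete with a minimal sketch that counts the distinct residues in each column of an alignment. The function name `classify_sites`, the toy sequences and the treatment of gaps are assumptions made for illustration; they are not taken from the analyses reported here.

```python
def classify_sites(alignment):
    """Return the number of distinct residues in each alignment column.

    `alignment` is a list of equal-length aligned protein sequences.
    Gap characters ("-") are ignored (an assumption of this sketch).
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    counts = []
    for column in zip(*alignment):
        residues = {aa for aa in column if aa != "-"}
        counts.append(len(residues))
    return counts


# Hypothetical four-species alignment of a five-residue fragment.
toy_alignment = [
    "MKVLA",
    "MKILA",
    "MRVLS",
    "MKALT",
]

distinct = classify_sites(toy_alignment)
# 1-based positions at which every species carries the same residue.
invariant_sites = [i + 1 for i, n in enumerate(distinct) if n == 1]
print(distinct)         # [1, 2, 3, 1, 3]
print(invariant_sites)  # [1, 4]
```

In this toy case, positions 1 and 4 would be treated as invariant, whereas positions 3 and 5 each tolerate three different residues.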
Since the logic of these statements is often used by
researchers to indicate the potential for an observed amino acid
change to produce disease in humans (6–10), we conducted a
study to directly evaluate the extent to which interspecific sequence
alignments reveal common attributes of the deleterious
mutations observed in humans. We performed three types of
analyses using disease mutation data and homologous gene
sequences from seven disease-associated genes (Table 1 and
Fig. 1): cystic fibrosis transmembrane conductance regulator
(CFTR), glucose-6-phosphate dehydrogenase (G6PD), neural
cell adhesion molecule L1 (L1CAM), phenylalanine hydroxylase
(PAH), paired box 6 (PAX6), the X-linked retinoschisis gene
and a gene associated with tuberous sclerosis (TSC2). First, we
determined the association between the prevalence of disease
mutations and the extent to which corresponding amino acid
sites in other species have been conserved throughout the
evolutionary history of metazoans. Second, we compared the
frequency of a given type of amino acid change in disease
patients to frequencies obtained from interspecific
comparisons. Finally, we compared the chemical property differences
of amino acid changes seen among species and non-diseased
humans with those observed in disease patients.
Table 1 (table body not recovered)
Column heading: Number of mutations analyzed
(disease/polymorphic/silent)a
a Disease mutations refer to those amino acid changes that produce a disease
phenotype. Polymorphic mutations are amino acid changes that are presumably
not disease related. Silent mutations are DNA sequence changes that do not
alter the encoded amino acid.
b The database analyzed contained 48 type I mutations that result in chronic
non-spherocytic hemolytic anemia and 62 less severe type II, III or IV
mutations.
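The first of these analyses can be illustrated with a short sketch in which the expected number of mutations in each class of sites is proportional to the number of sites in that class. The chi-square goodness-of-fit statistic used below is one common choice and is an assumption of this sketch, as are the function names and all of the counts, which are invented for illustration.

```python
def expected_counts(site_counts, n_mutations):
    """Expected mutations per site class if mutations fall at random.

    Each class receives a share of the mutations proportional to the
    number of sites it contains.
    """
    total_sites = sum(site_counts)
    return [n_mutations * c / total_sites for c in site_counts]


def chi_square_statistic(observed, expected):
    """Pearson chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical gene with 100 aligned sites in three variability classes:
# invariant, two residues observed, three or more residues observed.
site_counts = [50, 30, 20]
# Hypothetical disease mutations observed in each class (40 in total).
observed = [30, 6, 4]

expected = expected_counts(site_counts, sum(observed))
statistic = chi_square_statistic(observed, expected)
print(expected)   # [20.0, 12.0, 8.0]
print(statistic)  # 10.0
```

With three classes the statistic has two degrees of freedom, so a value of 10.0 exceeds the 5% critical value of 5.991. In this invented example the null hypothesis of random distribution would be rejected, with the excess of mutations falling in the invariant class.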
RESULTS AND DISCUSSION
The association of disease mutations and evolutionarily
conserved amino acid residues
A null hypothesis describing the distribution of human genetic
variation among amino acid sites in a gene can be generated
assuming that point mutations occur randomly throughout that
gene. If a set of mutations found in a population is
representative of the random mutational process, then the number of
mutations observed at a given type of site in a gene should be
proportional to the frequency with which sites of that type
appear in a sequence. Using information from interspecific
comparisons, we tested the null hypothesis that
disease-associated replacement mutations are randomly distributed among
different classes of amino acid sites which were determined
based on their variability among extant metazoans. This
analysis permits a direct assessment of statements suggesting
that disease mutations are more common at evolutionarily
conserved residues. If we do not reject the null hypothesis of
random association for a set of disease mutations, then there is no
evidence that mutations at conserved sites are more important than those
at variable sites for the development of the disease phenotype. In
contrast, analyses will illustrate the importance of replacement
mutations at conserved sites if the null hypothesis is rejected
due to an overabundance of disease mutations at conserved
positions and a (...truncated)