The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes
Gussow et al. Genome Biology
Ayal B. Gussow 0 2
Slavé Petrovski 0 1
Quanli Wang 0
Andrew S. Allen 3
David B. Goldstein 0
0 Institute for Genomic Medicine, Columbia University , New York, NY , USA
1 Department of Medicine, The University of Melbourne, Austin Health and Royal Melbourne Hospital , Melbourne, VIC , Australia
2 Program in Computational Biology and Bioinformatics, Duke University , Durham, NC , USA
3 Department of Biostatistics and Bioinformatics, Duke University , Durham, NC , USA
Ranking human genes based on their tolerance to functional genetic variation can greatly facilitate patient genome interpretation. It is well established, however, that different parts of proteins can have different functions, suggesting that it will ultimately be more informative to focus attention on functionally distinct portions of genes. Here we evaluate the intolerance of genic sub-regions using two biological sub-region classifications. We show that the intolerance scores of these sub-regions significantly correlate with reported pathogenic mutations. This observation extends the utility of intolerance scores to indicating where pathogenic mutations are most likely to fall within genes.
RVIS; Intolerance; subRVIS; subGERP; Domains; Exons; Pathogenic
Background
We previously introduced the Residual Variation
Intolerance Score (RVIS) [1], a framework that ranks
protein-coding genes based on their intolerance to
functional variation by comparing the overall number
of observed variants in a gene to the number of observed
common functional variants. The basic idea behind this
approach is the same as that behind approaches using
phylogenetic conservation, which rank genes by the degree
to which they are evolutionarily conserved, except that it
uses standing human genetic variation to identify genes in
which functional variation is strongly selected against and
thus likely to be deleterious. This approach proved
successful in prioritizing genes most likely to result in
Mendelian disease [1]. Using the gene as the unit of
analysis, however, fails to capture the fact that pathogenic
mutations often cluster in particular parts of genes.
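The comparison described above can be illustrated with a minimal sketch. It assumes, following our reading of the framework in [1], that the score for each gene is a studentized residual from a simple linear regression of the common functional variant count on the total variant count. The function name, the use of internally studentized residuals, and the toy counts are illustrative assumptions, not the published implementation.

```python
from math import sqrt

def studentized_residuals(total_variants, common_functional):
    """Internally studentized residuals of an ordinary least-squares fit
    of common functional variant counts (y) on total variant counts (x).
    Hypothetical helper for illustration only."""
    x, y = list(total_variants), list(common_functional)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Raw residual: observed minus expected common functional variants.
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    sigma2 = sum(r * r for r in resid) / (n - 2)
    # Leverage of each observation in simple linear regression.
    leverage = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [r / sqrt(sigma2 * (1.0 - h)) for r, h in zip(resid, leverage)]

# Toy data (made up): five genes; the last carries fewer common
# functional variants than its total variant count would predict.
total = [10, 20, 30, 40, 50]
common = [2, 4, 6, 8, 3]
scores = studentized_residuals(total, common)
```

Under this sketch, a negative score marks a gene with fewer common functional variants than expected given its total variation, which is the sense in which a gene is called intolerant.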
While there are many approaches that assess various
characteristics of variants [2–4], which can in turn be used
to try to determine whether a variant is likely to be
pathogenic, current approaches to the problem of
localizing pathogenic variants within sub-regions of a gene
rely heavily on conservation to define important
boundaries. The reasoning is that more conserved
regions within a gene are more likely to contain pathogenic
variants. Another option for defining genic sub-regions is to
use functional information about the corresponding
protein from databases of manually annotated proteins,
such as Swiss-Prot [5]. In fact, some variant-level
predictors, such as MutationTaster [2], take these data into
account when they are available. However, while an
approach focused on parts of proteins would ideally use
divisions that correspond to functionally distinct parts of
proteins, this information is not yet comprehensively available.
Here, we take a first step toward an approach that divides
the gene into sub-regions and ranks the resulting sub-regions
by their intolerance to functional variation. We use two
divisions as surrogates for functionally distinct parts of
the protein. The first is a division into protein domains,
defined by sequence homology to known conserved
domains. The second is a division into exons, reflecting
that a gene can encode different isoforms of the protein
using different exonic configurations.
For the protein domain division, we annotated each gene’s
protein domains based on the Conserved Domain Database
(CDD) [6], a collection of conserved domain sequences.
The coding region of each gene was aligned to the CDD.
The final domain coordinates for each gene were defined as
the regions within the gene that aligned to the CDD and
the unaligned regions between each CDD alignment.
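The construction of domain coordinates described above can be sketched as follows. This is a minimal illustration under stated assumptions: alignments are given as non-overlapping, 0-based half-open intervals on the coding sequence, and any unaligned stretch before the first or after the last alignment is also kept as its own sub-region (the text specifies only the regions between alignments). The function name and the example coordinates are hypothetical.

```python
def split_into_subregions(coding_length, aligned):
    """Partition a coding sequence into aligned and unaligned sub-regions.

    aligned: iterable of (start, end, domain_name) tuples, assumed to be
    non-overlapping, 0-based and half-open. Unaligned sub-regions are
    returned with a domain name of None.
    """
    regions = []
    cursor = 0  # first position not yet assigned to a sub-region
    for start, end, name in sorted(aligned):
        if start > cursor:
            # Unaligned stretch preceding this alignment.
            regions.append((cursor, start, None))
        regions.append((start, end, name))
        cursor = end
    if cursor < coding_length:
        # Assumption: trailing unaligned sequence is kept as a sub-region.
        regions.append((cursor, coding_length, None))
    return regions

# Hypothetical gene with a 300-position coding region and two alignments.
example = split_into_subregions(300, [(200, 260, "domB"), (50, 120, "domA")])
```

Each returned tuple can then serve as a unit of analysis in place of the whole gene.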
This table contains the AIC comparisons between different sets of predictors. All models contain the mutation rate as a covariate (Methods). Entries labeled ‘base’ indicate models using only the mutation rate and no other predictors. P is the probability that the model with the larger AIC minimizes the information loss from the model with the lower AIC
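The quantity P in the table note can be illustrated with a short sketch. It assumes that P is the standard relative likelihood of two models given their AIC values, exp((AIC_min − AIC_max)/2); the note does not state the formula, so this is an assumption, and the function name is hypothetical.

```python
from math import exp

def relative_likelihood(aic_a, aic_b):
    """Relative likelihood of the higher-AIC model versus the lower-AIC
    model: exp((AIC_min - AIC_max) / 2). Assumed form of P."""
    aic_min, aic_max = min(aic_a, aic_b), max(aic_a, aic_b)
    return exp((aic_min - aic_max) / 2.0)

# Made-up AIC values: a difference of 10 units between two models.
p = relative_likelihood(1520.0, 1510.0)
```

Under this assumption, a small P means the model with the larger AIC is unlikely to be the one that minimizes information loss, so the added predictors in the lower-AIC model are supported.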
Following this, we sought to create a ranking of the
resulting sub-regions that would reflect their intolerance
to functional variation. One common approach is
to rank stretches of sequence by their phylogenetic
conservation [7]. However, relying on conservation alone
can fail to capture human-specific constraint. Thus, we
used the RVIS approach introduced in [1] to rank these
regions solely based on human polymorphism data. We
therefore generated the RVIS as (...truncated)