The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes

Genome Biology, Jan 2016

Ranking human genes based on their tolerance to functional genetic variation can greatly facilitate patient genome interpretation. It is well established, however, that different parts of proteins can have different functions, suggesting that it will ultimately be more informative to focus attention on functionally distinct portions of genes. Here we evaluate the intolerance of genic sub-regions using two biological sub-region classifications. We show that the intolerance scores of these sub-regions significantly correlate with reported pathogenic mutations. This observation extends the utility of intolerance scores to indicating where pathogenic mutations are most likely to fall within genes.

Ayal B. Gussow (0, 2), Slavé Petrovski (0, 1), Quanli Wang (0), Andrew S. Allen (3), David B. Goldstein (0)

0 Institute for Genomic Medicine, Columbia University, New York, NY, USA
1 Department of Medicine, The University of Melbourne, Austin Health and Royal Melbourne Hospital, Melbourne, VIC, Australia
2 Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, USA
3 Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA

Keywords: RVIS; Intolerance; subRVIS; subGERP; Domains; Exons; Pathogenic

Background

We previously introduced the Residual Variation Intolerance Score (RVIS) [1], a framework that ranks protein-coding genes based on their intolerance to functional variation, by comparing the overall number of observed variants in a gene to the observed common functional variants.
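
One way to picture the comparison of total variants with common functional variants is as a regression residual, which is what the name of the score suggests. The sketch below is only an illustration under that assumption: the function name `residual_scores`, the use of ordinary least squares, and the scaling of residuals by their standard deviation are all hypothetical choices made here, and the toy counts are invented.

```python
from statistics import mean, pstdev


def residual_scores(total_variants, common_functional):
    """Regress common functional counts on total counts, one pair per gene.

    Returns each gene's residual divided by the standard deviation of all
    residuals. Assumption for illustration only: a simple least-squares fit.
    """
    if len(total_variants) != len(common_functional):
        raise ValueError("both lists need one entry per gene")
    if len(total_variants) < 3:
        raise ValueError("need at least three genes to fit a line")

    x_bar = mean(total_variants)
    y_bar = mean(common_functional)
    sxx = sum((x - x_bar) ** 2 for x in total_variants)
    if sxx == 0:
        raise ValueError("total variant counts must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(total_variants, common_functional))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual = observed common functional count minus the fitted count.
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(total_variants, common_functional)]
    spread = pstdev(residuals)
    if spread == 0:
        return [0.0 for _ in residuals]
    return [r / spread for r in residuals]


# Toy counts for four hypothetical genes (not real data).
totals = [10, 20, 30, 40]
common = [2, 4, 9, 8]
scores = residual_scores(totals, common)
```

Under this assumed setup, a gene with fewer common functional variants than the fitted line predicts gets a negative score, which would mark it as less tolerant of functional variation than genes with similar total variation.
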
The basic idea behind this approach is the same as that behind approaches using phylogenetic conservation that rank genes by the degree to which they are evolutionarily conserved, except that it uses standing human genetic variation to identify genes in which functional variation is strongly selected against and thus likely to be deleterious. This approach proved successful in prioritizing genes most likely to result in Mendelian disease [1]. Using the gene as the unit of analysis, however, fails to represent the reality that pathogenic mutations often cluster in particular parts of genes.

While there are many approaches that assess various characteristics of variants [2–4], which can in turn be used to try to determine whether a variant is likely to be pathogenic, current approaches to the problem of localizing pathogenic variants within sub-regions of a gene rely heavily on conservation to define important boundaries. The reasoning is that more conserved regions within a gene are more likely to contain pathogenic variants. Another option for defining genic sub-regions is to use the functional information about the corresponding protein from databases of manually annotated proteins, such as Swiss-Prot [5]. In fact, some variant-level predictors, such as MutationTaster [2], take these data into account when they are available. However, while an approach that focused on parts of proteins would ideally use divisions that correspond to functionally distinct parts of proteins, this information is not yet comprehensively available.

Here, we take a first step toward an approach that divides the gene into sub-regions and ranks the resulting sub-regions by their intolerance to functional variation. We use two divisions as surrogates for functionally distinct parts of the protein. The first is a division into protein domains, defined by sequence homology to known conserved domains.
The second is a division into exons, reflecting that a gene can encode different isoforms of the protein using different exonic configurations.

For the protein domain division, we annotated each gene’s protein domains based on the Conserved Domain Database (CDD) [6], a collection of conserved domain sequences. The coding region of each gene was aligned to the CDD. The final domain coordinates for each gene were defined as the regions within the gene that aligned to the CDD and the unaligned regions between each CDD alignment.

This table contains the AIC comparisons between different sets of predictors. All models contain the mutation rate as a covariate (Methods). Entries labeled ‘base’ indicate models using only the mutation rate and no other predictors. P is the probability that the model with the larger AIC minimizes the information loss from the model with the lower AIC.

Following this, we sought to create a ranking of the resulting sub-regions that would reflect their intolerance to functional variation. One common approach is to rank stretches of sequence by their phylogenetic conservation [7]. However, relying on conservation alone can fail to capture human-specific constraint. Thus, we used the RVIS approach introduced in [1] to rank these regions solely based on human polymorphism data. We therefore generated the RVIS as (...truncated)
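
The domain-coordinate definition given earlier for the CDD division (aligned regions plus the unaligned stretches between alignments) can be sketched as a simple interval partition. This is a minimal illustration, not the authors' pipeline: the function name `partition_coding_region`, the half-open 0-based coordinates, the requirement that alignments do not overlap, and the choice to also emit unaligned stretches before the first and after the last alignment are all assumptions made here, and the domain names are placeholders.

```python
def partition_coding_region(cds_length, aligned):
    """Split [0, cds_length) into aligned domains and unaligned stretches.

    aligned: list of (start, end, name) tuples with half-open coordinates.
    Returns a list of (start, end, name) tuples covering the whole coding
    region in order; unaligned stretches carry the name None.
    """
    regions = []
    cursor = 0  # first position not yet assigned to a region
    for start, end, name in sorted(aligned, key=lambda a: (a[0], a[1])):
        if start >= end or start < cursor or end > cds_length:
            raise ValueError(f"invalid or overlapping alignment: {name}")
        if start > cursor:
            # Gap between the previous alignment (or the start) and this one.
            regions.append((cursor, start, None))
        regions.append((start, end, name))
        cursor = end
    if cursor < cds_length:
        # Unaligned tail after the last alignment (an assumption; see above).
        regions.append((cursor, cds_length, None))
    return regions


# Hypothetical gene with a 100-position coding region and two alignments.
example = partition_coding_region(100, [(40, 60, "B"), (10, 30, "A")])
```

Each resulting region, aligned or not, could then be scored separately, which matches the idea of ranking sub-regions rather than whole genes.
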


This is a preview of a remote PDF: http://genomebiology.com/content/pdf/s13059-016-0869-4.pdf

Ayal Gussow, Slavé Petrovski, Quanli Wang, Andrew Allen, David Goldstein. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes, Genome Biology, 2016, 17:9, DOI: 10.1186/s13059-016-0869-4