Evolutionary inaccuracy of pairwise structural alignments

BIOINFORMATICS ORIGINAL PAPER
Structural bioinformatics
Vol. 28 no. 9 2012, pages 1209–1215
doi:10.1093/bioinformatics/bts103
Advance Access publication March 6, 2012

M. I. Sadowski∗ and W. R. Taylor
Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK
∗ To whom correspondence should be addressed.

Associate Editor: Alfonso Valencia

Received on August 23, 2011; revised on January 13, 2012; accepted on February 24, 2012

ABSTRACT

Motivation: Structural alignment methods are widely used to generate gold standard alignments for improving multiple sequence alignments and transferring functional annotations, as well as for assigning structural distances between proteins. However, the correctness of the alignments generated by these methods is difficult to assess objectively since little is known about the exact evolutionary history of most proteins. Since homology is an equivalence relation, an upper bound on alignment quality can be found by assessing the consistency of alignments. Measuring the consistency of current methods of structure alignment and determining the causes of inconsistencies can, therefore, provide information on the quality of current methods and suggest possibilities for further improvement.

Results: We analyze the self-consistency of seven widely used structural alignment methods (SAP, TM-align, Fr-TM-align, MAMMOTH, DALI, CE and FATCAT) on a diverse, non-redundant set of 1863 domains from the SCOP database and demonstrate that even for relatively similar proteins the degree of inconsistency of the alignments on a residue level is high (30%). We further show that levels of consistency vary substantially between methods, with two methods (SAP and Fr-TM-align) producing more consistent alignments than the rest. Inconsistency is found to be higher near gaps and for proteins of low structural complexity, as well as for helices. The ability of the methods to identify good structural alignments is also assessed using geometric measures, for which FATCAT (flexible mode) is found to be the best performer despite being highly inconsistent. We conclude that there is substantial scope for improving the consistency of structural alignment methods.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 INTRODUCTION

Despite its apparent simplicity, the problem of aligning pairs of protein structures has attracted a significant level of research effort. Methods vary in the details of their objective function, problem representation, null model of comparison statistics and approaches to searching alignment space (Alesker et al., 1996; Birzele et al., 2007; Chen and Crippen, 2005; Holm and Sander, 1993; Kifer et al., 2011; Kolodny et al., 2005; Lackner et al., 2000; Novosad et al., 2010; Pandit and Skolnick, 2008; Shibberu and Holder, 2011; Shindyalov and Bourne, 1998; Taylor, 1999; Vesterstrom and Taylor, 2006; Zhang and Skolnick, 2004), reviewed in Carugo (2007). Common variations include the use of flexible alignment (Mosca et al., 2008; Rocha et al., 2009; Salem et al., 2010; Shatsky et al., 2004; Ye and Godzik, 2003) or using fragments and topological filters for initial alignments to improve quality and speed (Budowski-Tal et al., 2010; Gibrat et al., 1996; Krissinel and Henrick, 2004; Veeramalai et al., 2009).

Two previous benchmarks of pairwise structure alignment methods have been published in the last decade (Kolodny et al., 2005; Mayr et al., 2007). These considered the degree to which the methods tested find a good solution as judged by geometric criteria (Kolodny et al., 2005) and the agreement of the aligned residues with a set of manually curated ‘gold standard’ alignments (Mayr et al., 2007). Both studies are important contributions but a recent study which covers current methods is lacking, as is a study which considers both geometric performance and ability to find homologous relationships between positions simultaneously.

We would ideally like to assess the ability of these aligners to find homologous relationships as well as geometric similarities for a large number of proteins, but this is problematic since for distantly related proteins we rarely know how individual positions are related. As an alternative to using gold standards and limiting the size of the dataset, we propose to make use of the fundamental property that homology is transitive: if A and B are homologous and B and C are homologous, then A and C must also be homologous. Homology, therefore, establishes a set of equivalence classes over the residues in sets of related protein structures (symmetry and self-identity being obvious properties), and the more closely a structural alignment method approaches this situation the better its performance. The exception to this occurs only if a set of residues are related by a star phylogeny, for example where a gene duplication has resulted in duplicate internal structures such as are found in repeat proteins (Taylor and Sadowski, 2010a).

In this study, we use this idea to compare the most widely used methods for pairwise structural alignment, in addition to considering alignment accuracy relative to other annotation sources: DSSP structural classes (Kabsch and Sander, 1983) and solvent accessibilities. Additionally, following Kolodny et al. (2005) we consider the quality of the scores implemented by the methods with respect to external annotations [SCOP folds (Andreeva et al., 2008), GO annotations (Morais et al., 2011) and topological distances (Hollup et al., 2011; Sadowski and Taylor, 2010b)] and several geometric scores.

Seven methods were chosen: the choice was based on their free availability for general academic use, their importance for publicly available resources or their wide use as judged by citation counts. We were interested in examining not only the relative performance of the methods as judged on several criteria but also the relationship between geometric accuracy and the accuracy of homology assignment. We find that the different assessment methods highlight different strengths and weaknesses of each of the methods, although TM-align and its newer sibling Fr-TM-align generally perform very well overall. We note that for the FATCAT method, flexible alignment increases geometric accuracy (...truncated)
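The transitivity argument in the Introduction can be illustrated with a minimal sketch. It assumes each pairwise alignment is stored as a mapping from residue indices in one protein to residue indices in the other; this representation, the function names and the toy data are hypothetical and are not taken from the paper, whose exact consistency measure may differ. For a triplet of proteins A, B and C, a residue of A that is chained through B to a residue of C should be aligned to that same residue of C in the direct A to C alignment.

```python
from typing import Dict, Tuple

# Hypothetical representation: residue index in the first protein
# mapped to the aligned residue index in the second protein.
Alignment = Dict[int, int]


def triplet_consistency(ab: Alignment, bc: Alignment, ac: Alignment) -> Tuple[int, int]:
    """Count residues of A whose chained image A->B->C agrees with A->C.

    Returns (consistent, total), where total is the number of residues
    of A that can be chained through B to some residue of C.
    """
    consistent = 0
    total = 0
    for i, j in ab.items():
        k = bc.get(j)
        if k is None:
            # The B residue is unaligned in B-C, so transitivity
            # makes no prediction for this residue of A.
            continue
        total += 1
        # A residue of A left unaligned in A-C counts as inconsistent here.
        if ac.get(i) == k:
            consistent += 1
    return consistent, total


def consistency_fraction(ab: Alignment, bc: Alignment, ac: Alignment) -> float:
    """Fraction of chainable residues that satisfy transitivity (NaN if none)."""
    consistent, total = triplet_consistency(ab, bc, ac)
    return consistent / total if total else float("nan")


# Toy data (invented for illustration): residue 2 of A chains to residue 3
# of C via B, but the direct A-C alignment pairs it with residue 4.
ab = {0: 0, 1: 1, 2: 2, 3: 4}
bc = {0: 1, 1: 2, 2: 3, 4: 5}
ac = {0: 1, 1: 2, 2: 4, 3: 5}
print(consistency_fraction(ab, bc, ac))  # 0.75
```

Under these assumptions, a perfectly transitive set of alignments would give a fraction of 1.0 for every triplet, and lower values indicate residues where at least one of the three pairwise alignments cannot reflect the true homology.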