Evolutionary inaccuracy of pairwise structural alignments
BIOINFORMATICS
MANUSCRIPT CATEGORY: ORIGINAL PAPER
Structural bioinformatics
Vol. 28 no. 9 2012, pages 1209–1215
doi:10.1093/bioinformatics/bts103
Advance Access publication March 6, 2012
M. I. Sadowski∗ and W. R. Taylor
Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London
NW7 1AA, UK
Associate Editor: Alfonso Valencia
Received on August 23, 2011; revised on January 13, 2012; accepted
on February 24, 2012
ABSTRACT
Motivation: Structural alignment methods are widely used to
generate gold standard alignments for improving multiple sequence
alignments and transferring functional annotations, as well as
for assigning structural distances between proteins. However, the
correctness of the alignments generated by these methods is
difficult to assess objectively since little is known about the
exact evolutionary history of most proteins. Since homology is an
equivalence relation, an upper bound on alignment quality can
be found by assessing the consistency of alignments. Measuring
the consistency of current methods of structure alignment and
determining the causes of inconsistencies can, therefore, provide
information on the quality of current methods and suggest
possibilities for further improvement.
Results: We analyze the self-consistency of seven widely used
structural alignment methods (SAP, TM-align, Fr-TM-align,
MAMMOTH, DALI, CE and FATCAT) on a diverse, non-redundant set
of 1863 domains from the SCOP database and demonstrate that
even for relatively similar proteins the degree of inconsistency of
the alignments on a residue level is high (30%). We further show
that levels of consistency vary substantially between methods, with
two methods (SAP and Fr-TM-align) producing more consistent
alignments than the rest. Inconsistency is found to be higher near
gaps and for proteins of low structural complexity, as well as
for helices. The ability of the methods to identify good structural
alignments is also assessed using geometric measures, for which
FATCAT (flexible mode) is found to be the best performer despite
being highly inconsistent. We conclude that there is substantial scope
for improving the consistency of structural alignment methods.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and
reproduction in any medium, provided the original work is properly cited.
∗ To whom correspondence should be addressed.
1 INTRODUCTION
Despite its apparent simplicity, the problem of aligning pairs of
protein structures has attracted a significant level of research effort.
Methods vary in the details of their objective function, problem
representation, null model of comparison statistics and approaches to
searching alignment space (Alesker et al., 1996; Birzele et al., 2007;
Chen and Crippen, 2005; Holm and Sander, 1993; Kifer et al., 2011;
Kolodny et al., 2005; Lackner et al., 2000; Novosad et al., 2010;
Pandit and Skolnick, 2008; Shibberu and Holder, 2011; Shindyalov
and Bourne, 1998; Taylor, 1999; Vesterstrom and Taylor, 2006;
Zhang and Skolnick, 2004), reviewed in Carugo (2007). Common
variations include the use of flexible alignment (Mosca et al., 2008;
Rocha et al., 2009; Salem et al., 2010; Shatsky et al., 2004; Ye and
Godzik, 2003) or using fragments and topological filters for initial
alignments to improve quality and speed (Budowski-Tal et al., 2010;
Gibrat et al., 1996; Krissinel and Henrick, 2004; Veeramalai et al.,
2009).
Two previous benchmarks of pairwise structure alignment
methods have been published in the last decade (Kolodny et al.,
2005; Mayr et al., 2007). These considered the degree to which
the methods tested find a good solution as judged by geometric
criteria (Kolodny et al., 2005) and the agreement of the aligned
residues with a set of manually curated ‘gold standard’ alignments
(Mayr et al., 2007). Both studies are important contributions, but
a recent study that covers current methods is lacking, as is one
that simultaneously considers geometric performance and the
ability to find homologous relationships between positions.
We would ideally like to assess the ability of these aligners to
find homologous relationships as well as geometric similarities for
a large number of proteins, but this is problematic since for distantly
related proteins we rarely know how individual positions are related.
As an alternative to using gold standards and limiting the size of the
dataset, we propose to make use of the fundamental property that
homology is transitive: if A and B are homologous and B and C are
homologous, then A and C must also be homologous. Homology,
therefore, establishes a set of equivalence classes over the residues
in sets of related protein structures (symmetry and self-identity being
obvious properties) and the more closely a structural alignment
method approaches this situation the better its performance. The
exception to this occurs only if a set of residues is related by a
star phylogeny, for example where a gene duplication has resulted
in duplicate internal structures such as are found in repeat proteins
(Taylor and Sadowski, 2010a).
In this study, we use this idea to compare the most widely used
methods for pairwise structural alignment, in addition to
considering alignment accuracy relative to other annotation sources:
DSSP structural classes (Kabsch and Sander, 1983) and solvent
accessibilities. Additionally, following Kolodny et al. (2005) we
consider the quality of the scores implemented by the methods with
respect to external annotations [SCOP folds (Andreeva et al., 2008),
GO annotations (Morais et al., 2011) and topological distances
(Hollup et al., 2011; Sadowski and Taylor, 2010b)] and several
geometric scores. Seven methods were chosen on the basis of their
free availability for general academic use, their importance for
publicly available resources or their wide use as judged by
citation counts. We were interested in examining not only the relative
performance of the methods as judged on several criteria but also
the relationship between geometric accuracy and the accuracy of
homology assignment.
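To make the transitivity argument in this section concrete, the following is a minimal Python sketch of a residue-level consistency check over triples of structures. It is an illustration under stated assumptions, not the measure used in this study: the function name transitive_consistency, the dictionary representation of an alignment (residue index in the first structure mapped to residue index in the second) and the toy data are all hypothetical. A chain i -> j -> k through structures A, B and C is counted as consistent only if the direct A-to-C alignment also pairs i with k; a residue left unaligned in the direct alignment is counted as inconsistent, which is one of several possible conventions.

```python
from itertools import permutations


def transitive_consistency(alignments):
    """Fraction of residue chains A->B->C confirmed by the direct A->C alignment.

    `alignments` maps an ordered pair of structure names (a, b) to a dict
    {residue index in a: residue index in b}. This representation is an
    assumption made for the sketch. Returns None if no chain can be tested.
    """
    names = sorted({name for pair in alignments for name in pair})
    checked = 0
    consistent = 0
    for a, b, c in permutations(names, 3):
        ab = alignments.get((a, b))
        bc = alignments.get((b, c))
        ac = alignments.get((a, c))
        if ab is None or bc is None or ac is None:
            continue  # this triple lacks one of the three pairwise alignments
        for i, j in ab.items():
            k = bc.get(j)
            if k is None:
                continue  # chain breaks at B, so there is nothing to test
            checked += 1
            # Assumed convention: an unaligned residue in A->C is a mismatch.
            if ac.get(i) == k:
                consistent += 1
    return consistent / checked if checked else None


# Toy data (hypothetical): three structures with three aligned positions each.
example = {
    ("A", "B"): {0: 0, 1: 1, 2: 2},
    ("B", "C"): {0: 1, 1: 2, 2: 3},
    ("A", "C"): {0: 1, 1: 2, 2: 4},  # residue 2 disagrees with the A->B->C chain
}
score = transitive_consistency(example)  # 2 of the 3 testable chains agree
```

With alignments supplied in both directions for every pair, the same function tests every ordered triple; a method whose alignments always reflected true homology would score 1.0 under this convention, apart from the star-phylogeny exception noted above.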
We find that the different assessment methods highlight different
strengths and weaknesses of each of the methods, although TM-align and its newer sibling Fr-TM-align generally perform very well
overall. We note that for the FATCAT method, flexible alignment
increases geometric accuracy (...truncated)