Refining multiple sequence alignments with conserved core regions
Saikat Chakrabarti
Christopher J. Lanczycki
Anna R. Panchenko
Teresa M. Przytycka
Paul A. Thiessen
Stephen H. Bryant
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile and at the same time preserves the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and enhanced sensitivity for database searching using the sequence profiles derived from refined alignments compared with the original alignments. A standalone version of the program is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/REFINER) and will be incorporated into the next release of the Cn3D structure/alignment viewer.
The advent of large genome projects has led to an explosion of
sequence data in public databases. In this connection, the
establishment of structural, functional and evolutionary
similarity between proteins and protein domains is a challenging
task. Several domain databases are now available which
combine homologous protein domains into the distinct families
and represent them in the form of domain multiple sequence
alignments. The accuracy of domain identification, protein
classification and reconstruction of phylogenetic history of
domain families crucially depends on the quality of underlying
sequence alignments. Some domain resources, such as PFAM
(1) and ProDom (2), rely on the automated methods of
multiple sequence alignment while others, such as SMART (3) and
CDD (4), employ careful manual intervention in constructing
the domain models. The CDD database contains alignments
that are carefully curated to be consistent with
structure-structure alignments to preserve the conserved core of a
protein domain family. Each curated CDD alignment records
conserved features within the family members in terms of
blocks, the regions where every sequence is aligned without
gaps.
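To make the notion of a block concrete, the following minimal sketch scans an alignment column by column and reports the maximal runs of columns in which no sequence carries a gap. The function name `find_blocks`, the use of `-` as the gap character and the toy alignment are illustrative assumptions, not part of CDD or REFINER.

```python
def find_blocks(alignment, gap="-"):
    """Return (start, end) column ranges (end exclusive) that contain no gaps.

    `alignment` is a list of equal-length strings, one per aligned sequence.
    This is an illustrative helper, not the CDD block definition itself.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    blocks = []
    start = None  # first column of the block currently being extended
    for col in range(width):
        ungapped = all(row[col] != gap for row in alignment)
        if ungapped and start is None:
            start = col                    # a new block opens here
        elif not ungapped and start is not None:
            blocks.append((start, col))    # a gap closes the open block
            start = None
    if start is not None:
        blocks.append((start, width))      # block runs to the last column
    return blocks


# Hypothetical toy alignment of three sequences.
example = [
    "MKV-LAGT--PW",
    "MRVALSGTQ-PW",
    "MKI-LAGSE-PF",
]
print(find_blocks(example))
```

In this toy case the gap-free columns group into three blocks; columns in which at least one sequence has a gap fall between blocks.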
Different methods have been proposed to produce a
multiple sequence alignment. Some of them align all
sequences simultaneously (5,6), while others apply a
progressive alignment strategy (7-10). According to the latter, the
sequences are aligned in a predetermined order, usually
dictated by a guide tree that groups similar sequences first
and adds more dissimilar ones subsequently. This
approach has been implemented in a variety of programs and
packages such as MULTALIGN (11), MULTAL (9) and
CLUSTALW (10). Although widely accepted, progressive
alignment has its own pitfalls: misalignments made at
earlier stages cannot be corrected afterwards and can
propagate into serious alignment errors. Moreover, the final
alignment strongly depends on the order of sequences being
aligned. To overcome these flaws, iterative approaches have
introduced the capacity to reconsider and realign previously
aligned sequences at each iteration with the goal of improving
the overall alignment score (7,12-19). Sequences are realigned
in a random order and the iteration cycle ends as soon as a
convergence criterion has been satisfied. Although this strategy,
like most other multiple sequence alignment methods, can
become trapped in a local minimum and produce a suboptimal
alignment, it has proved robust and
produces more accurate alignments (14,18-19).
In this article we present an algorithm named REFINER
that aims to refine an existing multiple alignment using the
predetermined block model of that domain family as a
biologically relevant constraint on the search space. A block
model represents conserved sequence/structure regions
which are highly unlikely to contain gaps and are common
to all family members. The refinement protocol works by
iterative random selection and realignment of sequences
with the family block model until the alignment score saturates
to a stable value or until the iteration cycle terminates.
Realignment of each sequence can correct misalignments
between a given sequence and the rest of the profile by shifting
the individual blocks on that sequence and at the same time
preserves the family's overall block model (i.e. the conserved
core regions). The latter constraint prohibits the insertion of
gap characters in the middle of conserved blocks. Following a
cycle of shifting for each sequence, the blocks in the block
model can also be extended or shortened in size depending on
the overall score improvement. The algorithm has been
benchmarked against structure-based, manually curated (CDD) and
un-curated (PFAM) alignments and shows an overall score
improvement. In a comparison with another independent
alignment refinement method (19), REFINER showed better
performance. The refinement is further tested and validated by
checking the reliability in retrieval of functionally important
sites and enhanced sensitivity for profile-based database
searches compared with the original curated alignments.
This method is reasonably fast and can realign several
hundred highly diverse sequences within minutes. The
refinement method also provides a means to detect the outlier
sequences within an alignment and may thereby point a
way towards new subfamily identification schemes.
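The block-shifting step described above can be illustrated with a small sketch. It is not the REFINER implementation: it assumes a simple identity-count score in place of the program's actual scoring scheme, omits the block extension and shortening step, and uses hypothetical names (`block_profile`, `realign`, `refine`) and toy sequences. Each sequence is removed in random order, its blocks are re-placed by dynamic programming against the profile of the remaining sequences (blocks stay ungapped, ordered and non-overlapping), and cycles stop once the score no longer improves.

```python
import random
from collections import Counter

NEG = float("-inf")


def block_profile(seqs, starts, lengths, b, skip=None):
    """Residue counts per column of block b, leaving out sequence `skip`."""
    cols = [Counter() for _ in range(lengths[b])]
    for i, (seq, st) in enumerate(zip(seqs, starts)):
        if i != skip:
            for j in range(lengths[b]):
                cols[j][seq[st[b] + j]] += 1
    return cols


def placement_score(seq, pos, cols):
    """Identity-count score (an assumed stand-in) for a block placed at pos."""
    return sum(col[seq[pos + j]] for j, col in enumerate(cols))


def alignment_score(seqs, starts, lengths):
    """Number of identical residue pairs summed over all block columns."""
    total = 0
    for b in range(len(lengths)):
        for col in block_profile(seqs, starts, lengths, b):
            total += sum(c * (c - 1) // 2 for c in col.values())
    return total


def realign(seqs, starts, lengths, i):
    """Best ordered, non-overlapping placement of every block on sequence i."""
    seq = seqs[i]
    prefix = []  # prefix[b][p] = (best score, start) for block b starting <= p
    for b, length in enumerate(lengths):
        cols = block_profile(seqs, starts, lengths, b, skip=i)
        row, best = [], (NEG, -1)
        for p in range(len(seq) - length + 1):
            s = placement_score(seq, p, cols)
            if b > 0:
                q = p - lengths[b - 1]  # latest allowed start of block b-1
                s = s + prefix[b - 1][q][0] if q >= 0 else NEG
            if s > best[0]:
                best = (s, p)
            row.append(best)
        prefix.append(row)
    # Trace back from the last block to recover all start positions.
    new_starts = [0] * len(lengths)
    limit = len(seq) - lengths[-1]
    for b in range(len(lengths) - 1, -1, -1):
        new_starts[b] = prefix[b][limit][1]
        if b > 0:
            limit = new_starts[b] - lengths[b - 1]
    return new_starts


def refine(seqs, starts, lengths, max_cycles=10, seed=0):
    """Realign sequences in random order until the score stops improving."""
    rng = random.Random(seed)
    starts = [list(s) for s in starts]  # work on a copy
    score = alignment_score(seqs, starts, lengths)
    for _ in range(max_cycles):
        order = list(range(len(seqs)))
        rng.shuffle(order)
        for i in order:
            starts[i] = realign(seqs, starts, lengths, i)
        new_score = alignment_score(seqs, starts, lengths)
        if new_score <= score:  # score has saturated
            break
        score = new_score
    return starts, score


# Hypothetical toy family: two blocks of length 4; the last sequence starts misaligned.
seqs = ["AGHWKTTPLDVE", "GHWKSPLDVQQQ", "MMGHWKAAPLDV", "QGHWKNPLDVRR"]
lengths = [4, 4]
starts = [[1, 7], [0, 5], [2, 8], [0, 7]]
refined, score = refine(seqs, starts, lengths)
print(refined, score)
```

Under this toy score, re-placing one sequence changes only the residue pairs that involve it, so no realignment step can lower the total, and a cycle without improvement is a natural stopping point. A real implementation would score against a PSSM and would bound how far a block may shift.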
MATERIALS AND METHODS
A benchmark to evaluate the accuracy of
the refinement algorithm
To test the overall performance of the refinement algorithm we
used a collection of 362 manually curated parent alignments
(set_362) from the CDD version 2.00 (4). The current version
of CDD is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
A parent alignment corresponds to the most
ancient (i.e. topmost) family in a domain family hierarchy,
several hundred of which are currently defined by CDD.
A smaller test set (set_94) of 94 multiple alignments from
CDD (with more than five protein structure entries) was
used to optimize the parameters for block extension cycles.
In addition, we also applied the refinement algorithm to
900 un-curated PFAM (1) alignments [generated either by
CLUSTALW (10) or T-Coffee (20)].
To compare the database search sensitivity of the Position
Specific Scoring Matrices (PSSMs) computed from multiple
sequence alignments before and after the refinement
procedure, we first constructed a list of true positives for the
conserved domain families from set_362. True positives
here are defined as those proteins/domains which a (...truncated)