Refining multiple sequence alignments with conserved core regions

https://nar.oxfordjournals.org/content/34/9/2598.full.pdf

Saikat Chakrabarti, Christopher J. Lanczycki, Anna R. Panchenko, Teresa M. Przytycka, Paul A. Thiessen, Stephen H. Bryant

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of the phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile and at the same time preserves the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and the enhanced sensitivity of database searches that use sequence profiles derived from refined alignments, compared with those derived from the original alignments. A standalone version of the program is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/REFINER) and will be incorporated into the next release of the Cn3D structure/alignment viewer.

The advent of large genome projects has led to an explosion of sequence data in public databases. In this connection, the establishment of structural, functional and evolutionary similarity between proteins and protein domains is a challenging task. Several domain databases are now available which combine homologous protein domains into distinct families and represent them in the form of domain multiple sequence alignments. The accuracy of domain identification, protein classification and reconstruction of the phylogenetic history of domain families crucially depends on the quality of the underlying sequence alignments.
Some domain resources, such as PFAM (1) and ProDom (2), rely on automated methods of multiple sequence alignment, while others, such as SMART (3) and CDD (4), employ careful manual intervention in constructing the domain models. The CDD database contains alignments that are carefully curated to be consistent with structure-structure alignments, so as to preserve the conserved core of a protein domain family. Each curated CDD alignment records conserved features within the family members in terms of blocks, the regions where every sequence is aligned without gaps.

Different methods have been proposed to produce a multiple sequence alignment. Some of them align all sequences simultaneously (5,6), while others apply a progressive alignment strategy (7-10). In the latter, the sequences are aligned in a predetermined order, usually dictated by a guide tree that groups similar sequences together first and adds more dissimilar ones subsequently. This approach has been implemented in a variety of programs and packages such as MULTALIGN (11), MULTAL (9) and CLUSTALW (10). Although widely accepted, progressive alignment has its own pitfalls: a misalignment made at an earlier stage cannot be corrected afterwards and can propagate into serious alignment errors. Moreover, the final alignment strongly depends on the order in which the sequences are aligned. To overcome these flaws, iterative approaches have introduced the capacity to reconsider and realign previously aligned sequences at each iteration with the goal of improving the overall alignment score (7,12-19). Sequences are realigned in a random order and the iteration cycle ends as soon as a convergence criterion has been satisfied. While this strategy faces the problem of being trapped in a local minimum and producing a suboptimal alignment (like most other multiple sequence alignment methods), it proves to be robust and produces more accurate alignments (14,18-19).
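To make the notion of a block concrete (a region where every sequence is aligned without gaps, as described above for CDD alignments), the following minimal Python sketch scans a gapped alignment for maximal runs of gap-free columns. The function name `find_blocks`, the `-` gap character and the toy sequences are assumptions made for illustration only; curated CDD blocks also reflect structural evidence, which this sketch does not attempt to capture.

```python
def find_blocks(alignment, gap="-"):
    """Return (start, end) column ranges, end exclusive, that contain no gaps."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    blocks, start = [], None
    for col in range(width):
        gap_free = all(row[col] != gap for row in alignment)
        if gap_free and start is None:
            start = col                  # a new gap-free run begins here
        elif not gap_free and start is not None:
            blocks.append((start, col))  # the run ended at the previous column
            start = None
    if start is not None:
        blocks.append((start, width))    # the run reaches the last column
    return blocks


# Hypothetical toy alignment (not taken from CDD or PFAM).
toy = [
    "MKV-LTAGG--DE",
    "MRVALTA-GKQDE",
    "MKI-LSAGG--DE",
]
print(find_blocks(toy))  # [(0, 3), (4, 7), (8, 9), (11, 13)]
```

Any column in which at least one sequence carries a gap ends the current block, which is why the single gap in the second row splits what would otherwise be one longer conserved run.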
In this article we present an algorithm named REFINER that aims to refine an existing multiple alignment using the predetermined block model of that domain family as a biologically relevant constraint on the search space. A block model represents conserved sequence/structure regions which are highly unlikely to contain gaps and are common to all family members. The refinement protocol works by iterative random selection and realignment of sequences with the family block model until the alignment score saturates to a stable value or until the iteration cycle terminates. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile by shifting the individual blocks on that sequence, while at the same time preserving the family's overall block model (i.e. the conserved core regions). The latter constraint prohibits the insertion of gap characters in the middle of conserved blocks. Following a cycle of shifting for each sequence, the blocks in the block model can also be extended or shortened, depending on the overall score improvement. The algorithm has been benchmarked against structure-based, manually curated (CDD) and un-curated (PFAM) alignments and shows an overall score improvement. In a comparison with another independent alignment refinement method (19), the algorithm showed better performance. The refinement is further tested and validated by checking the reliability of retrieval of functionally important sites and the enhanced sensitivity of profile-based database searches compared with the original curated alignments. The method is reasonably fast and can realign several hundred highly diverse sequences within minutes. The refinement method also provides a means to detect outlier sequences within an alignment and may thereby point a way towards new subfamily identification schemes.
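The block-shifting idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the REFINER implementation: it scores a placement by counting identical residues in the other sequences (a stand-in for a PSSM-based score), it omits the block extension and shrinking step, and it assumes at least one block. All function names, parameters and the toy data are hypothetical.

```python
import random

NEG = float("-inf")

def block_score(seq, start, length, columns):
    """Score one gapless block placement against per-column residue counts."""
    return sum(columns[i].get(seq[start + i], 0) for i in range(length))

def profile_without(seqs, starts, lengths, skip):
    """Residue counts per block column, built from every sequence except `skip`."""
    profile = []
    for b, length in enumerate(lengths):
        columns = [{} for _ in range(length)]
        for s, seq in enumerate(seqs):
            if s == skip:
                continue
            for i in range(length):
                residue = seq[starts[s][b] + i]
                columns[i][residue] = columns[i].get(residue, 0) + 1
        profile.append(columns)
    return profile

def realign(seq, lengths, profile):
    """Choose ordered, non-overlapping gapless block starts by dynamic programming."""
    n = len(seq)
    best, back = [], []  # best[b][p]: top score for blocks 0..b with block b at p
    for b, length in enumerate(lengths):
        row, brow = [NEG] * (n + 1), [None] * (n + 1)
        for p in range(n - length + 1):
            here = block_score(seq, p, length, profile[b])
            if b == 0:
                row[p] = here
                continue
            # The previous block must end at or before position p.
            q = max(range(p - lengths[b - 1] + 1),
                    key=lambda j: best[b - 1][j], default=None)
            if q is not None and best[b - 1][q] > NEG:
                row[p], brow[p] = best[b - 1][q] + here, q
        best.append(row)
        back.append(brow)
    p = max(range(n + 1), key=lambda j: best[-1][j])
    if best[-1][p] == NEG:
        raise ValueError("sequence too short to hold every block")
    placed = []
    for b in range(len(lengths) - 1, -1, -1):  # trace back from the last block
        placed.append(p)
        p = back[b][p]
    return placed[::-1]

def total_score(seqs, starts, lengths):
    """Sum of each sequence's score against the profile of all the others."""
    total = 0
    for s, seq in enumerate(seqs):
        prof = profile_without(seqs, starts, lengths, s)
        total += sum(block_score(seq, starts[s][b], length, prof[b])
                     for b, length in enumerate(lengths))
    return total

def refine(seqs, starts, lengths, max_cycles=10, seed=0):
    """Realign sequences in random order until the score stops improving."""
    rng = random.Random(seed)
    starts = [list(s) for s in starts]  # work on a copy of the input placements
    score = total_score(seqs, starts, lengths)
    for _ in range(max_cycles):
        order = list(range(len(seqs)))
        rng.shuffle(order)
        for s in order:
            starts[s] = realign(seqs[s], lengths,
                                profile_without(seqs, starts, lengths, s))
        new_score = total_score(seqs, starts, lengths)
        improved = new_score > score
        score = new_score
        if not improved:
            break
    return starts, score

# Hypothetical toy family: two blocks of lengths 3 and 2 on ungapped sequences.
seqs = ["MKVAAGG", "MKVGG", "AMKVAGG"]
lengths = [3, 2]
initial = [[0, 5], [0, 3], [0, 3]]  # third sequence starts out shifted
refined, score = refine(seqs, initial, lengths)
print(refined, score)  # [[0, 5], [0, 3], [1, 5]] 30
```

Because each sequence is represented by one start offset per block, a gap can never appear inside a block; the only freedom is where each block sits on the sequence, which mirrors the constraint described in the text. With this pairwise identity score, optimally re-placing one sequence cannot lower the total, so the loop stops once a full cycle brings no gain.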
MATERIALS AND METHODS

A benchmark to evaluate the accuracy of refinement algorithm

To test the overall performance of the refinement algorithm we used a collection of 362 manually curated parent alignments (set_362) from the CDD version 2.00 (4). The current version of CDD is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. A parent alignment corresponds to the most ancient (i.e. topmost) family in a domain family hierarchy, several hundred of which are currently defined by CDD. A smaller test set (set_94) of 94 multiple alignments from CDD (with more than five protein structure entries) was used to optimize the parameters for block extension cycles. In addition, we also applied the refinement algorithm to 900 un-curated PFAM (1) alignments [generated either by CLUSTALW (10) or T-Coffee (20)]. To compare the database search sensitivity of the Position Specific Scoring Matrices (PSSMs) computed from multiple sequence alignments before and after the refinement procedure, we first constructed a list of true positives for the conserved domain families from set_362. True positives here are defined as those proteins/domains which a (...truncated)