Accounting For Alignment Uncertainty in Phylogenomics

PLOS ONE, Dec 2019

Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning a confidence score to each alignment position. Using the BALIBASE database and simulation studies, we demonstrate that masking by ZORRO significantly reduces alignment uncertainty and improves tree accuracy.


Martin Wu, Sourav Chatterji, Jonathan A. Eisen

Affiliations: 1 Department of Biology, University of Virginia, Charlottesville, Virginia, United States of America; 2 Genome Center, University of California, Davis, California, United States of America; 3 Section of Evolution and Ecology, College of Biological Sciences, University of California, Davis, California, United States of America; 4 Department of Medical Microbiology and Immunology, School of Medicine, University of California, Davis, California, United States of America

Editor: Marco Salemi, University of Florida, United States of America

Multiple sequence alignment is critical for many biological studies that make use of sequence data. For evolutionary analysis, columns in multiple sequence alignments are hypothesized to contain homologous residues in different sequences. This is known generally as positional homology. The assignment of positional homology can be problematic, however. It follows that the quality of a sequence alignment has a large impact on the final phylogenetic trees [1,2,3,4,5,6], so much so that the inferred phylogeny may be more dependent upon the methods of alignment than on the methods of phylogenetic reconstruction [1,5,6,7,8,9]. This is especially true for highly divergent sequences, whose alignments are more difficult and less consistent.
A plethora of recently developed programs has significantly improved overall alignment accuracy [10,11,12,13,14,15,16,17,18]. Despite this, alignment uncertainty in typical real sequence datasets continues to cause problems for phylogenetic studies. In one striking example, Landan and Graur showed that aligning protein sequences from the N-terminus (the head) to the C-terminus (the tail) can, and in many cases does, produce alignments that are highly different from those of the same sequences aligned from the C- to the N-terminus, even though identical sequences and alignment algorithms were used [9]. This is thought to be caused by the presence of multiple equally optimal but distinct solutions during the alignment process. To deal with this equivocality, alignment programs, intentionally or not, end up making arbitrary decisions that can lead to significantly different alignments [19] and incongruent phylogenies [9].

Alignment uncertainty has become an even bigger problem in the era of phylogenomics, when phylogenetic analyses of thousands of genes are routinely carried out automatically without accounting for alignment reliability. For example, using genomic data from seven yeast species, Wong and colleagues demonstrated that variations in sequence alignments produced by different alignment methods were significant enough to lead to different phylogenetic conclusions: 46.2% of the 1,507 genes had one or more differing trees depending on the alignment method used [20]. In one particularly striking case, seven alignments of a gene family produced six different phylogenies of seven yeast species [20].

Several metrics have been introduced to help assess the overall alignment quality [21,22,23,24]. Although they appear in slightly different forms, they all quantitatively measure the differences between alignment variants of the same set of sequences and use these scores to evaluate the overall alignment quality.
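To make the idea of scoring differences between alignment variants concrete, here is a minimal Python sketch. It is not one of the published metrics cited above [21,22,23,24]; it assumes, purely for illustration, that agreement is measured as the Jaccard index over the residue pairs that each variant places in the same column. The function names `aligned_pairs` and `pair_agreement` are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings. Each residue is
    identified by (sequence index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for seq_index, row in enumerate(alignment):
            if row[col] != gap:
                residues.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        # Every two residues in the same column count as one aligned pair.
        pairs.update(combinations(residues, 2))
    return pairs


def pair_agreement(aln_a, aln_b):
    """Jaccard index of aligned residue pairs (illustrative assumption)."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Two hypothetical alignments of the same sequences, "ACGT" and "AGT".
variant_1 = ["ACGT", "A-GT"]
variant_2 = ["ACGT", "AG-T"]
print(pair_agreement(variant_1, variant_2))
```

A score near 1 means the two variants agree on almost every pairing; a low score flags an unstable alignment of the kind discussed above.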
The underlying assumption in this approach is that if the alignment fluctuates considerably with different methods (and thus can be considered unstable), the alignment is difficult and might be of poor quality. When an alignment is compared to a high-quality reference alignment, its overall sensitivity (defined as the number of correctly aligned residues divided by the number of residues aligned in the reference alignment) and specificity (defined as the number of correctly aligned residues divided by the number of residues aligned by a particular alignment program) can be calculated as well. For example, using simulated datasets and the SABmark database [25], Pachter and his colleagues benchmarked several commonly used alignment programs. They found that all were heavily biased toward maximizing sensitivity at the expense of specificity [26]; i.e., although many residues were correctly aligned, a large fraction of the characters were aligned incorrectly. In other words, the assumption of positional homology was invalid for many of the aligned positions.

A critical yet largely unsolved problem in the field is how to assess the quality of the alignment at each individual position (i.e., the validity of the assignment of positional homology). Knowing the quality of each individual position is important because poorly aligned columns are more likely to contribute noise than signal. Detecting and removing ambiguously aligned regions, a step known as masking and trimming in molecular phylogenetics, increases the signal-to-noise ratio and improves the discriminatory power of phylogenetic methods [27,28,29]. Traditionally, masking and trimming of regions thought to be poorly aligned were done as part of a manual curation process. Such manual efforts are not only subjective but also impractical in large-scale phylogenetic inferences. Positions with gaps are often considered unreliable and are therefore trimmed.
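The sensitivity and specificity defined earlier in this passage can be illustrated with a short Python sketch. It assumes that a "correctly aligned residue" is counted as a residue pair placed in the same column in both the test and the reference alignment; that pairwise counting, and the names `aligned_pairs` and `sensitivity_specificity`, are illustrative choices rather than the procedure of the cited benchmark [26].

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs sharing a column; residues are identified by
    (sequence index, position in the ungapped sequence)."""
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for seq_index, row in enumerate(alignment):
            if row[col] != gap:
                residues.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sensitivity_specificity(test_aln, reference_aln):
    """Sensitivity = correct / aligned in reference;
    specificity = correct / aligned by the program under test."""
    test_pairs = aligned_pairs(test_aln)
    ref_pairs = aligned_pairs(reference_aln)
    correct = len(test_pairs & ref_pairs)
    sensitivity = correct / len(ref_pairs) if ref_pairs else 0.0
    specificity = correct / len(test_pairs) if test_pairs else 0.0
    return sensitivity, specificity


# Hypothetical example: the reference aligns only the first and last
# residues, while the test alignment forces every residue into a column.
reference = ["ACD--E", "A--FGE"]
test = ["ACDE", "AFGE"]
print(sensitivity_specificity(test, reference))
```

In this toy case the test alignment recovers every reference pairing (sensitivity 1.0) but half of its own pairings are unsupported (specificity 0.5), which mirrors the bias toward sensitivity described above.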
However, it has been shown that trimming by simply removing positions that contain gaps results in excessive loss of informative sites and does not necessarily lead to better trees [20,30].

GBLOCKS is currently the most frequently used masking program that attempts to assess the quality of an alignment position by position. It calculates the degree of conservation for each aligned position and then uses it to select conserved blocks for further analyses [27]. However, positions with low conservation scores could still be homologous (e.g., fast-evolving sites). Such sites might contain useful phylogenetic information, sometimes more so than highly conserved positions. To overcome this limitation, GBLOCKS tries to rescue these poorly conserved but potentially homologous positions as long as they belong to a block flanked by highly conserved columns at both ends and satisfy a set of ad hoc rules (e.g., the maximum number of contiguous nonconserved positions allowed is 8 and the minimum length of a block is 10). However, in real al (...truncated)
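The conservation-based block selection described for GBLOCKS above can be sketched in Python as follows. This is a deliberately simplified illustration and not the GBLOCKS implementation: it assumes a single conservation threshold (the hypothetical `threshold=0.8`), scores a column by the frequency of its most common residue, and applies only the two example rules quoted in the text (at most 8 contiguous nonconserved positions, minimum block length 10). The function names are hypothetical.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Fraction of sequences carrying the most common residue in each
    column; gaps count against conservation (an assumption)."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def select_blocks(scores, threshold=0.8, max_nonconserved=8, min_block=10):
    """Return (start, end) column ranges, inclusive, of retained blocks.

    Blocks start and end on conserved columns. A run of nonconserved
    columns is kept inside a block only if it is no longer than
    `max_nonconserved` and is flanked by conserved columns on both sides.
    """
    conserved = [i for i, s in enumerate(scores) if s >= threshold]
    blocks = []
    for i in conserved:
        # Bridge to the previous block if the nonconserved run is short.
        if blocks and i - blocks[-1][1] - 1 <= max_nonconserved:
            blocks[-1][1] = i
        else:
            blocks.append([i, i])
    # Discard blocks shorter than the minimum length.
    return [(s, e) for s, e in blocks if e - s + 1 >= min_block]


# Hypothetical per-column scores: a short nonconserved run (3 columns) is
# rescued, a long one (9 columns) splits the alignment, and the trailing
# 4-column block is too short to keep.
scores = [1.0] * 5 + [0.2] * 3 + [1.0] * 5 + [0.2] * 9 + [1.0] * 4
print(select_blocks(scores))
```

Because the rescue depends only on run length and flanking columns, the fixed cutoffs (8 and 10) decide the outcome regardless of how confidently the bridged positions are actually aligned, which is the ad hoc character the text points to.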


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0030288&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030288

Martin Wu, Sourav Chatterji, Jonathan A. Eisen. Accounting For Alignment Uncertainty in Phylogenomics, PLOS ONE, 2012, 1, DOI: 10.1371/journal.pone.0030288