Accounting For Alignment Uncertainty in Phylogenomics
Citation: Wu M, Chatterji S, Eisen JA (
Martin Wu 0
Sourav Chatterji 0
Jonathan A. Eisen 0
Marco Salemi, University of Florida, United States of America
0 1 Department of Biology, University of Virginia, Charlottesville, Virginia, United States of America, 2 Genome Center, University of California, Davis, California, United States of America, 3 Section of Evolution and Ecology, College of Biological Sciences, University of California, Davis, California, United States of America, 4 Department of Medical Microbiology and Immunology, School of Medicine, University of California, Davis, California, United States of America
Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning confidence scores to each alignment position. Using the BALIBASE database and in simulation studies, we demonstrate that masking by ZORRO significantly reduces the alignment uncertainty and improves the tree accuracy.
Multiple sequence alignment is critical for many biological
studies making use of sequence data. For evolutionary analysis,
columns in multiple sequence alignments are hypothesized to
contain homologous residues in different sequences. This is known
generally as positional homology. The assignment of positional
homology can be problematic, however. It follows that the quality
of a sequence alignment has a large impact on the final
phylogenetic trees [1,2,3,4,5,6], so much so that the inferred phylogeny
may be more dependent upon the methods of alignment than on
the methods of phylogenetic reconstruction [1,5,6,7,8,9]. This is
especially true for highly divergent sequences whose alignments
are more difficult and less consistent.
A plethora of programs developed recently have led to
significant improvement of the overall alignment accuracy
[10,11,12,13,14,15,16,17,18]. Despite this, the alignment
uncertainty in typical real sequence datasets continues to cause problems
for phylogenetic studies. In one striking example, Landan and
Graur showed that aligning protein sequences from the
N-terminus (the head) to the C-terminus (the tail) can, and in many
cases does, produce alignments that are highly different from the
same sequences aligned from the C- to the N-terminus, despite the
fact that identical sequences and alignment algorithms were used [9].
This is thought to be caused by the presence of multiple equally
optimal but distinct solutions during the alignment process. To
deal with the equivocality, alignment programs, intentionally
or not, end up making arbitrary decisions that can lead to
significantly different alignments [19] and incongruent phylogenies
[9]. Alignment uncertainty has become an even bigger problem in
the era of phylogenomics, when phylogenetic analyses of
thousands of genes are routinely carried out automatically without
accounting for the alignment reliability. For example, using
genomic data from seven yeast species, Wong and colleagues
demonstrated that variations in sequence alignments produced by
different alignment methods were significant enough to lead to
different phylogenetic conclusions: 46.2% of the 1,507 genes had
one or more differing trees depending on the alignment method
used [20]. In one particularly striking case, seven alignments of a
gene family produced six different phylogenies of seven yeast
species [20].
Several metrics have been introduced to help assess the overall
alignment quality [21,22,23,24]. Although appearing in slightly
different forms, they all essentially quantify the
differences between alignment variants of the same set of
sequences and use these scores to evaluate the overall alignment
quality. The underlying assumption in this approach is that if the
alignment fluctuates considerably with different methods (and thus
can be considered unstable), this implies that the alignment is
difficult and might be of poor quality. When compared to high
quality reference alignments, the overall sensitivity (defined as the
number of correctly aligned residues divided by the number of
residues aligned in the reference alignment) and specificity (defined
as the number of correctly aligned residues divided by the number
of residues aligned by a particular alignment program) of an
alignment can be calculated as well. For example, using simulated
datasets and the SABmark database [25], Pachter and his
colleagues benchmarked several commonly used alignment
programs. They found that all were heavily biased toward
maximizing the sensitivity at the expense of the specificity [26],
i.e., although many residues were correctly aligned, a large
fraction of the characters were aligned incorrectly.
In other words, the assumption of positional homology was invalid
for many of the aligned positions.
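The sensitivity and specificity defined above can be illustrated with a minimal Python sketch. It assumes that a "correctly aligned residue" is counted at the level of residue pairs sharing a column, which is one common convention and not necessarily the exact counting used in the cited benchmark. The function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped residue index),
    so the same residue is comparable across different alignments.
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_idx, char in enumerate(column):
            if char != gap:
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # every pair of non-gap residues in this column is "aligned"
        pairs.update(combinations(present, 2))
    return pairs


def sensitivity_specificity(test, reference):
    """Pair-based sensitivity and specificity of `test` against `reference`."""
    test_pairs = aligned_pairs(test)
    ref_pairs = aligned_pairs(reference)
    correct = len(test_pairs & ref_pairs)
    sensitivity = correct / len(ref_pairs) if ref_pairs else 0.0
    specificity = correct / len(test_pairs) if test_pairs else 0.0
    return sensitivity, specificity


# Toy example (assumed data): the test alignment misplaces one gap.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
sens, spec = sensitivity_specificity(test, reference)
```

In this toy case three of the four reference pairs are recovered and three of the four pairs proposed by the test alignment are correct, so both measures equal 0.75. An aligner that pairs more residues than the reference does would keep its sensitivity while lowering its specificity.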
A critical, yet largely unsolved problem in the field is how to
assess the quality of the alignment at each individual position (i.e.,
the validity of the assignment of positional homology). Knowing
the quality of the individual position is important as poorly
aligned columns are more likely to contribute noise than signal.
Detecting and removing ambiguously aligned regions, a step
known as masking and trimming in molecular phylogenetics,
increases the signal-to-noise ratio and improves the discriminatory
power of phylogenetic methods [27,28,29]. Traditionally, masking
and trimming of regions thought to be poorly aligned were done as
part of a manual curation process. Such manual efforts are not only
subjective but also impractical in large-scale phylogenetic
inferences. Positions with gaps are often considered unreliable
and therefore are trimmed. However, it has been shown that
trimming by simply removing positions that contain gaps results in
excessive loss of informative sites and does not necessarily lead to
better trees [20,30].
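The simple gap-based trimming described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the toy alignment are hypothetical, and it shows the naive rule of dropping every column that contains at least one gap.

```python
def trim_gapped_columns(alignment, gap="-"):
    """Remove every column that contains at least one gap character."""
    kept = [col for col in zip(*alignment) if gap not in col]
    if not kept:
        # every column had a gap: return empty sequences
        return ["" for _ in alignment]
    # transpose the kept columns back into row-wise sequences
    return ["".join(chars) for chars in zip(*kept)]


# Toy alignment (assumed data): two of five columns contain a gap.
alignment = ["MK-LV", "MKALV", "M-ALI"]
trimmed = trim_gapped_columns(alignment)
```

Here two of the five columns are discarded even though each has a gap in only one sequence, which shows how quickly informative sites can be lost under this rule.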
GBLOCKS is currently the most frequently used masking
program that attempts to assess the quality of alignment position
by position. It calculates the degree of conservation for each
aligned position and then uses it to select conserved blocks for
further analyses [27]. However, positions with low conservation
scores could still be homologous (e.g., fast evolving sites). Such sites
might contain useful phylogenetic information, sometimes more so
than the highly conserved positions. To overcome this limitation,
GBLOCKS tries to rescue these poorly conserved but
potentially homologous positions as long as they belong to a block
flanked by highly conserved columns at both ends and satisfy a set
of ad hoc rules (e.g., the maximum number of contiguous
nonconserved positions allowed is 8 and the minimum length of a
block is 10). However, in real al (...truncated)