FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs
Vol. 30 ECCB 2014, pages i371–i378
doi:10.1093/bioinformatics/btu442
BIOINFORMATICS
Sepideh Mazrouee* and Wei Wang
Computer Science Department, University of California Los Angeles (UCLA), 3551 Boelter Hall, Los Angeles,
CA 90095-1596, USA
ABSTRACT
1 INTRODUCTION
All diploid organisms have two homologous copies of each
chromosome, one inherited from each parent. The two DNA
sequences of a homologous chromosome pair are usually not
identical to each other. The most common DNA sequence variants are single nucleotide polymorphisms (SNPs). We refer to the
sites at which the two DNA sequences differ as heterozygous
sites. Current high-throughput sequencing technologies (Eid
et al., 2009) are incapable of reading the DNA sequence of an
entire chromosome. Instead, they produce a huge collection of
short reads of DNA fragments. The process of inferring two
DNA sequences (i.e. haplotypes) from a set of reads is referred
to as haplotype assembly, which has become a crucial computational task to reconstruct one’s genome from these reads.
Haplotype assembly methods usually involve three main
stages before the reconstruction phase. First, a sequence aligner is
used to align the reads to the reference genome. Then, only the
read alignments at the heterozygous sites are kept for haplotype
reconstruction. Last, reads that span multiple heterozygous sites
are used to infer the alleles belonging to each haplotype.
*To whom correspondence should be addressed.
The quality of the reconstructed haplotypes may be dramatically
affected by errors in sequencing and alignment. The objective,
therefore, is to design algorithms that mitigate this impact and
rebuild the most likely copies of each chromosome accurately.
This has led to the development of accurate haplotype reconstruction algorithms in the past few years. We are, however, observing
a critical shift in sequencing technology where larger datasets
with longer reads and higher coverage become available. This
shift necessitates the development of algorithms that not only
reconstruct haplotypes accurately but also require low computation time and can scale to large datasets. In this article, we introduce a new framework for fast and accurate haplotype assembly.
1.1 Motivation and related work
The past decade has witnessed much research effort on enhancing the accuracy of haplotype assembly methods. The literature,
however, still lacks a method that is both accurate and fast
enough to be used widely on large-scale datasets. In
particular, current trends in sequencing technologies indicate that read lengths are increasing significantly, and reads up to several thousand base pairs
long will become a reality in the near future.
Haplotype assembly approaches can be divided into two categories: (i) fragment partitioning and (ii) SNP partitioning. Fragment partitioning techniques partition the set of fragments into
two disjoint sets, each representing one copy of the haplotype.
Examples of such techniques are FastHare (Panconesi and Sozio,
2004) and the greedy heuristic of Levy et al. (2007). SNP
partitioning approaches, such as HapCut (Bansal and Bafna,
2008), HapCompass (Aguiar and Istrail, 2012) and the approach
of He et al. (2010), rely on partitioning the SNPs into two disjoint
sets and finding those variants whose corresponding haplotype
bits need to be flipped to improve the minimum error correction
(MEC) score. In either case, an iterative process is
involved. From a computational complexity point of view, the
main drawback of existing techniques is that they perform
a large amount of computation during each iteration of the algorithm.
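To make the MEC objective concrete, the following minimal sketch computes it under the standard definition: each fragment is charged the smaller of its mismatch counts against a candidate haplotype and against that haplotype's complement. The names `mismatches` and `mec_score`, the use of `None` for uncovered sites, and the toy data are illustrative assumptions, not part of any of the cited tools.

```python
from typing import List, Optional, Sequence

# An allele is 0 or 1; None marks a site the fragment does not cover.
Allele = Optional[int]


def mismatches(fragment: Sequence[Allele], haplotype: Sequence[int]) -> int:
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(
        1
        for allele, h in zip(fragment, haplotype)
        if allele is not None and allele != h
    )


def mec_score(fragments: Sequence[Sequence[Allele]], haplotype: Sequence[int]) -> int:
    """Sum, over fragments, of the mismatches against the closer haplotype copy."""
    complement: List[int] = [1 - h for h in haplotype]
    return sum(
        min(mismatches(f, haplotype), mismatches(f, complement))
        for f in fragments
    )


# Toy data (hypothetical): four fragments over four heterozygous sites.
fragments = [
    [0, 0, 1, None],
    [None, 0, 1, 1],
    [1, 1, 0, None],
    [1, 0, 0, 0],  # carries one erroneous call at the second site
]

score = mec_score(fragments, [0, 0, 1, 1])
flipped_score = mec_score(fragments, [0, 1, 1, 1])  # second haplotype bit flipped
```

In this toy case, flipping the second bit of the candidate haplotype raises the score, so an SNP partitioning method would reject that flip. Evaluating such candidate changes repeatedly is where the per-iteration cost described above arises.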
HapCut (Bansal et al., 2008) is an example of an algorithm
that uses the SNP partitioning technique to minimize the MEC criterion.
The process involves iteratively reconstructing a weighted graph
and finding a max-cut of the graph, so most of the
computation occurs inside a loop. The algorithm has proved to be
fairly accurate, at the cost of high computation time. The greedy heuristic algorithm of Levy et al. (2007) is a fragment partitioning
approach. Each iteration, however, involves two major computing
tasks: (i) reconstructing a partial haplotype based on the fragments that are already assigned to a partition; (ii) calculating
the distance between unassigned fragments and each one of the
© The Author 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
Motivation: Understanding the exact structure of an individual’s haplotype plays a significant role in various fields of human genetics.
Despite tremendous research effort in recent years, fast and accurate
haplotype reconstruction remains an active research topic, mainly
owing to the computational challenges involved. Existing haplotype
assembly algorithms focus primarily on improving the accuracy of the
assembly, making them computationally challenging for applications
on large high-throughput sequence data. Therefore, there is a need to
develop haplotype reconstruction algorithms that are not only accurate but also highly scalable.
Results: In this article, we introduce FastHap, a fast and accurate
haplotype reconstruction approach, which is up to one order of magnitude faster than the state-of-the-art haplotype inference algorithms
while also delivering higher accuracy than these algorithms. FastHap
leverages a new similarity metric that allows us to precisely measure
distances between pairs of fragments. The distance is then used in
building the fuzzy conflict graphs of fragments. Given that optimal
haplotype reconstruction based on minimum error correction is
known to be NP-hard, we use our fuzzy conflict graphs to develop a
fast heuristic for fragment partitioning and haplotype reconstruction.
Availability: An implementation of FastHap is available for sharing on
request.
Contact:
haplotype copies. FastHare (Panconesi and Sozio, 2004) is another fragment partitioning algorithm. It sorts all fragments
based on their positions before executing its iterative
module. Computationally intensive tasks that occur iteratively
in FastHare include (i) reconstructing a partial haplotype
based on the fragments that are already assigned to a partition;
(ii) calculating the distance between the current fragment and each
one of the two haplotype copies.
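The two per-iteration tasks named above can be sketched as follows. This is a simplified illustration, not the procedure of Levy et al. (2007) or FastHare: it assumes a per-site majority vote for the partial haplotype and a mismatch count over jointly covered sites as the distance. The function names, the `None` encoding for uncovered sites and the tie-breaking rule are hypothetical.

```python
from typing import List, Optional, Sequence

# An allele is 0 or 1; None marks an uncovered (or undecided) site.
Allele = Optional[int]


def consensus(fragments: Sequence[Sequence[Allele]], n_sites: int) -> List[Allele]:
    """Task (i): rebuild a partial haplotype by per-site majority vote."""
    partial: List[Allele] = []
    for j in range(n_sites):
        ones = sum(1 for f in fragments if f[j] == 1)
        zeros = sum(1 for f in fragments if f[j] == 0)
        if ones == zeros:
            partial.append(None)  # no coverage or a tie: leave undecided
        else:
            partial.append(1 if ones > zeros else 0)
    return partial


def distance(fragment: Sequence[Allele], partial: Sequence[Allele]) -> int:
    """Task (ii): mismatches over sites decided in both fragment and haplotype."""
    return sum(
        1
        for a, h in zip(fragment, partial)
        if a is not None and h is not None and a != h
    )


def assign(fragment, part_a, part_b, n_sites: int) -> str:
    """Place a fragment in the partition whose partial haplotype is closer."""
    d_a = distance(fragment, consensus(part_a, n_sites))
    d_b = distance(fragment, consensus(part_b, n_sites))
    return "A" if d_a <= d_b else "B"  # ties go to A (arbitrary choice)


# Toy data (hypothetical): two partitions built so far and one new fragment.
part_a = [[0, 0, 1, None], [None, 0, 1, 1]]
part_b = [[1, 1, 0, None]]
choice = assign([None, 1, 0, 0], part_a, part_b, n_sites=4)
```

Note that `assign` recomputes both partial haplotypes on every call. That repeated work inside the loop is the kind of per-iteration cost the text attributes to these fragment partitioning methods.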
1.2 Contributions
2 MATERIALS AND METHODS
2.1 Problem statement and assumptions
We assume that the input to the haplotype assembly algorithm is a 2D
array containing only the heterozygous sites of the aligned fragments, called the
variant matrix, X, of size m (...truncated)