FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs (pdf)

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/30/17/i371.full.pdf

FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs

Vol. 30 ECCB 2014, pages i371–i378 doi:10.1093/bioinformatics/btu442 BIOINFORMATICS FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs Sepideh Mazrouee* and Wei Wang Computer Science Department, University of California Los Angeles (UCLA), 3551 Boelter Hall, Los Angeles, CA 90095-1596, USA ABSTRACT 1 INTRODUCTION All diploid organisms have two homologous copies of each chromosome, one inherited from each parent. The two DNA sequences of a homologous chromosome pair are usually not identical to each other. The most common DNA sequence variants are single nucleotide polymorphism (SNP). We refer to the sites at which the two DNA sequences differ as heterozygous sites. Current high-throughput sequencing technologies (Eid et al., 2009) are incapable of reading the DNA sequence of an entire chromosome. Instead, they produce a huge collection of short reads of DNA fragments. The process of inferring two DNA sequences (i.e. haplotypes) from a set of reads is referred to as haplotype assembly, which has become a crucial computational task to reconstruct one’s genome from these reads. Haplotype assembly methods usually involve three main stages before reconstruction phase. First, a sequence aligner is used to align the reads to the reference genome. Then, only the read alignments at the heterozygous sites are kept for haplotype reconstruction. Last, reads that span multiple heterozygous sites are used to infer the alleles belonging to each haplotype. *To whom correspondence should be addressed. The quality of the reconstructed haplotypes may be dramatically affected by errors in sequencing and alignment. The objective, therefore, is to design algorithms that mitigate this impact and rebuild the most likely copies of each chromosome accurately. This has led to development of accurate haplotype reconstruction algorithms in the past few years. We are, however, observing a critical shift in sequencing technology where larger datasets with longer reads and higher coverage become available. This shift necessitates the development of algorithms that not only reconstruct haplotypes accurately but also require low computation time and can scale to large datasets. In this article, we introduce a new framework for fast and accurate haplotype assembly. 1.1 Motivation and related work The past decade has witnessed much research effort on enhancing accuracy of haplotype assembly methods. The research, however, lacks a method that is not only accurate but also fast enough that can be used widely on large-scale datasets. In particular, current trends in sequencing technologies demonstrate that the sequence read lengths are being extended significantly and access to reads of up to several thousand base pair long will become a reality in near future. Haplotype assembly approaches can be divided into two categories: (i) fragment partitioning; (ii) SNP partitioning. The fragment partitioning techniques partition the set of fragments into two disjoint sets each representing one copy of the haplotype. Examples of such techniques are FastHare (Panconesi and Sozio, 2004) and the greedy heuristic in (Levy et al., 2007). The SNP partitioning approaches such as HapCut (Bansal and Bafna, 2008), HapCompass (Aguiar and Istrail, 2012) and the approach in (He et al., 2010) rely on partitioning the SNPs into two disjoint sets and finding those variants whose corresponding haplotype bits need to be flipped to improve minimum error correction (MEC). In any of the two scenarios, an iterative process is involved. From a computational complexity point of view, the main drawback with existing techniques is that they perform much computation during each iteration of the algorithm. HapCut (Bansal et al., 2008) is an example of the algorithms that use SNP partitioning technique to minimize MEC criterion. The process involves iteratively reconstructing a weighted graph and finding a max-cut of the graph. Clearly, most of the computation occurs in a loop. The algorithm has proved to be fairly accurate at the cost of high computation. The greedy heuristic algorithm in (Levy et al., 2007) is a fragment partitioning approach. The iteration, however, involves two major computing tasks: (i) reconstructing a partial haplotype based on the fragments that are already assigned to a partition; (ii) calculating distance between unassigned fragments and each one of the ß The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact Motivation: Understanding exact structure of an individual’s haplotype plays a significant role in various fields of human genetics. Despite tremendous research effort in recent years, fast and accurate haplotype reconstruction remains as an active research topic, mainly owing to the computational challenges involved. Existing haplotype assembly algorithms focus primarily on improving accuracy of the assembly, making them computationally challenging for applications on large high-throughput sequence data. Therefore, there is a need to develop haplotype reconstruction algorithms that are not only accurate but also highly scalable. Results: In this article, we introduce FastHap, a fast and accurate haplotype reconstruction approach, which is up to one order of magnitude faster than the state-of-the-art haplotype inference algorithms while also delivering higher accuracy than these algorithms. FastHap leverages a new similarity metric that allows us to precisely measure distances between pairs of fragments. The distance is then used in building the fuzzy conflict graphs of fragments. Given that optimal haplotype reconstruction based on minimum error correction is known to be NP-hard, we use our fuzzy conflict graphs to develop a fast heuristic for fragment partitioning and haplotype reconstruction. Availability: An implementation of FastHap is available for sharing on request. Contact: S.Mazrouee and W.Wang haplotype copies. FastHare (Panconesi and Sozio, 2004) is another fragment partitioning algorithm. It sorts all fragments based on their positions before execution of the iterative module. Computationally intensive tasks that occur iteratively in FastHare include (i) reconstruction of a partial haplotype based on the fragments that are already assigned to a partition; (ii) calculating distance between the current fragment and each one of the two haplotype copies. 1.2 Contributions i372 MATERIALS AND METHODS 2.1 Problem statement and assumptions We assume that the input to the haplotype assembly algorithm is a 2D array containing only heterozygous sites of the aligned fragments, called variant matrix, X, of size m (...truncated)