Integrating dilution-based sequencing and population genotypes for single individual haplotyping
Hirotaka Matsumoto, Hisanori Kiryu
Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
Background: Haplotype information is useful for many genetic analyses, and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques can produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes determined from pedigree genotypes, pedigree genotypes are generally not available. Read cluster length and heterozygous calls have also been used to detect chimeric fragments. Although some chimeric fragments can be removed with these features, a considerable number of chimeric fragments will go undetected because of the dispersion of cluster lengths and the absence of SNPs in the overlapping regions. For these reasons, a general method to detect and remove chimeric fragments is needed.
Results: In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment corresponds to an artificial recombinant haplotype and therefore differs from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, a haplotype inference approach based on population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. The AUC values of our method were higher than those obtained using only cluster length and heterozygous calls. We then used multiple SIH algorithms to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of the assembled haplotypes increased significantly after removing the chimeric fragment candidates.
Conclusions: Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at https://sites.google.com/site/hmatsu1226/software/csp.
Background
Advances in experimental techniques for DNA
sequencing and genotyping have made it possible to determine
many individual human genomes and detect variations,
such as single nucleotide polymorphisms (SNPs) [1,2].
This has brought about great progress in genome
analyses, such as genome-wide association studies (GWAS)
[3], inference of population structure [4], and expression
phenotypes [5]. However, most technologies give only
genotype information and most current research does not
determine the haplotype origin of the variations.
Haplotypes contain more detailed information than genotypes
and are valuable for GWAS [6], and for analyzing genetic
structures such as linkage disequilibrium,
recombination patterns [1], and correlations between variations and
diseases [7].
Determining haplotypes experimentally is difficult, and
there are three main computational approaches for
haplotype inference. The first approach is the statistical phasing
method, which infers population haplotypes from
population genotypes using statistical computation [8-12].
Algorithms for statistical phasing have been developed in
response to technological advances in genotyping, and
they are based on the observation that haplotype diversity is
limited and that some haplotypes are conserved [13]. Because of this
strategy, statistical phasing does not work well in
chromosomal regions that exhibit several different haplotypes,
particularly regions of low linkage disequilibrium. This
approach is also weak at inferring long haplotypes because
the complexity of population haplotypes increases
exponentially with the number of SNPs.
In the second approach, haplotypes are inferred from
genotypes of pedigrees. For example, a child's haplotypes
are determined from the genotypes of the child and its
parents (trio-based haplotyping). The parental origin of a
child's allele can be determined if only one of the parents
carries that allele. However, the haplotypes of sites at which all
family members have the same heterozygous genotype cannot be
determined and, furthermore, family genotype data are not
always available. In addition, neither the statistical
phasing method nor this approach can identify spontaneous
mutations.
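The trio-based rule described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and the genotype encoding (alternate-allele counts 0, 1, or 2 per site) are assumptions made for this example, and the genotypes are assumed to be Mendelian-consistent, with no error checking.

```python
from typing import List, Optional, Tuple

# (paternal allele, maternal allele) for the child, or None if ambiguous.
Phase = Optional[Tuple[int, int]]


def phase_trio_site(child: int, father: int, mother: int) -> Phase:
    """Phase one biallelic site from alternate-allele counts (0, 1, or 2)."""
    if child != 1:
        # Homozygous child: both haplotypes carry the same allele.
        allele = child // 2
        return (allele, allele)
    if father != 1:
        # Homozygous father can only transmit one allele.
        paternal = father // 2
        return (paternal, 1 - paternal)
    if mother != 1:
        # Homozygous mother can only transmit one allele.
        maternal = mother // 2
        return (1 - maternal, maternal)
    # All three are heterozygous: the parental origin cannot be resolved.
    return None


def phase_trio(child: List[int], father: List[int], mother: List[int]) -> List[Phase]:
    """Apply the per-site rule to every site of a trio."""
    return [phase_trio_site(c, f, m) for c, f, m in zip(child, father, mother)]


phased = phase_trio(child=[1, 1, 1, 0], father=[0, 1, 1, 0], mother=[1, 2, 1, 0])
print(phased)
```

In this toy input, the third site is heterozygous in all three family members, so it is the one site left unphased.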
The third approach uses DNA sequencing data and is
called single individual haplotyping (SIH) or haplotype
assembly [14-22]. SIH utilizes the fact that each sequenced
read is derived from only one of the haplotypes. If a read
spans two or more heterozygous sites, the haplotype can
be determined from the co-occurrence of alleles in the
read. Two reads are determined to originate from the same
chromosome if they overlap at a region that has at least
one heterozygous site, and they have the same alleles at
these sites.
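As a rough illustration of this co-occurrence rule, the following Python sketch compares two fragments at their shared heterozygous sites and extends a partial haplotype accordingly. It is a simplified toy, not one of the SIH algorithms cited above: the fragment representation (a mapping from heterozygous site index to a binary allele) and the function names are assumptions, and fragments with mixed agreement are simply skipped rather than modeled probabilistically.

```python
from typing import Dict, Optional

# A fragment maps a heterozygous site index to the observed allele (0 or 1).
Fragment = Dict[int, int]


def compare_fragments(a: Fragment, b: Fragment) -> Optional[bool]:
    """Return True if a and b agree at every shared site (same haplotype),
    False if they disagree at every shared site (opposite haplotypes),
    and None if they share no site or the evidence is mixed."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    matches = sum(a[s] == b[s] for s in shared)
    if matches == len(shared):
        return True
    if matches == 0:
        return False
    return None  # mixed agreement: possible sequencing error or chimera


def extend_haplotype(hap: Fragment, frag: Fragment) -> Fragment:
    """Extend a partial haplotype with a fragment when the relation is clear."""
    relation = compare_fragments(hap, frag)
    merged = dict(hap)
    if relation is None:
        return merged
    for site, allele in frag.items():
        # Flip alleles of fragments that come from the other haplotype.
        merged.setdefault(site, allele if relation else 1 - allele)
    return merged


hap = {0: 0, 1: 1}
hap = extend_haplotype(hap, {1: 1, 2: 0})  # agrees at site 1
hap = extend_haplotype(hap, {2: 1, 3: 1})  # opposite at site 2
print(hap)
```

The complementary haplotype follows by flipping every allele, since only heterozygous sites are tracked.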
SIH did not attract much attention until recently, since it
needed long DNA sequencing reads that spanned multiple
heterozygous sites, and obtaining such reads quickly and
economically was difficult. However, this situation is
changing rapidly with the advent of new experimental
techniques, such as fosmid pool-based next-generation
sequencing [17,23,24], long read fragment technology
[25], and dilution-amplification-based sequencing [26]
that can produce virtual long reads. In these methods,
long DNA fragments are separated into distinct
low-concentration aliquots, each of which contains less than one
fragment per chromosomal region. After sequencing an
aliquot with a next-generation sequencer and mapping
short reads, clusters are formed in which the reads are
close to each other. A cluster corresponds to a long DNA
fragment and is supposed to be derived from a single
haplotype. Thus, virtual long reads can be obtained by
merging the short reads in a cluster (see Figure 1).
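The clustering step can be sketched in Python as follows. This is only an assumed, simplified version of the idea: mapped reads from one aliquot are given as (start, end) coordinates on a single chromosome, and a new cluster is opened whenever the gap to the current cluster exceeds a hypothetical `max_gap` threshold. Neither the threshold nor the function name comes from the cited protocols.

```python
from typing import List, Tuple

Read = Tuple[int, int]  # (start, end) mapping coordinates on one chromosome


def cluster_reads(reads: List[Read], max_gap: int) -> List[List[Read]]:
    """Group the mapped reads of one aliquot into clusters of nearby reads."""
    clusters: List[List[Read]] = []
    cluster_end = None
    for read in sorted(reads):
        start, end = read
        if cluster_end is None or start - cluster_end > max_gap:
            # Too far from the current cluster: start a new one.
            clusters.append([read])
            cluster_end = end
        else:
            clusters[-1].append(read)
            cluster_end = max(cluster_end, end)
    return clusters


reads = [(5000, 5100), (100, 200), (250, 350), (5150, 5250)]
for cluster in cluster_reads(reads, max_gap=1000):
    # Each cluster stands for one virtual long fragment.
    print(cluster[0][0], max(end for _, end in cluster), len(cluster))
```

Each resulting cluster would then be merged into one virtual long read. Note that such a distance-based grouping cannot tell apart two fragments from different chromosomal copies that cover the same region within one aliquot.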
Although such experimental techniques are
sophisticated, they have the problem of producing chimeric
fragments whose left and right parts match different
chromosomes very well. Because long DNA fragments are
separated into aliquots randomly, there are cases where
an aliquot has some long DNA fragments derived from
the same region of different chromosomes and,
consequently, reads with different chromosomal origins are
regarded as one cluster and merged into a single fragment
(see Figure 1). In the process of developing MixSIH [22],
which is the first SIH algorithm that can evaluate the
reliability of a haplotype region, we have shown that such
chimeric fragments significantly decrea (...truncated)