Whole Genome Mapping with Feature Sets from High-Throughput Sequencing Data
September
Yonglong Pan 0 1
Xiaoming Wang 0 1
Lin Liu 0 1
Hao Wang 0 1
Meizhong Luo 0 1
0 National Key Laboratory of Crop Genetic Improvement and College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
1 Editor: Frank Alexander Feltus, Clemson University, UNITED STATES
A good physical map is essential to guide sequence assembly in de novo whole genome sequencing, especially when sequences are produced by high-throughput sequencing such as next-generation sequencing (NGS) technology. Here we present a novel method, Feature sets-based Genome Mapping (FGM). With FGM, a physical map and draft whole genome sequences can be generated, anchored and integrated using the same set of NGS sequences, independent of restriction digestion. A model of the method was created and its parameters were examined by simulations using the Arabidopsis genome sequence. In the simulations, when a BAC library of 4,096 clones (~4.8X genome coverage) was used to sequence the whole genome, ~90% of clones were successfully connected into physical contigs, and 91.58% of genome sequences were mapped and connected to chromosomes. The method was experimentally verified using the existing physical map and genome sequence of rice. Of 4,064 clones covering 115 Mb of sequence, selected from ~3 tiles of 3 chromosomes of a rice draft physical map, 3,364 clones were reconstructed into physical contigs and 98 Mb of sequence was integrated into the 3 chromosomes. The physical map-integrated draft genome sequences can provide permanent frameworks for eventually obtaining high-quality reference sequences by targeted sequencing, gap filling and combining other sequences.
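The proportions implied by the figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the FGM software: the function names are hypothetical, and the Arabidopsis genome size of about 120 Mb is an assumption introduced here only to illustrate the relation coverage = (number of clones × mean insert size) / genome size; the abstract does not state the genome size or insert size used.

```python
def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def implied_mean_insert_bp(coverage: float, genome_bp: float, n_clones: int) -> float:
    """Mean insert size implied by coverage = n_clones * insert / genome."""
    if n_clones <= 0:
        raise ValueError("n_clones must be positive")
    return coverage * genome_bp / n_clones


# Rice verification figures quoted in the abstract.
clones_in_contigs = fraction(3364, 4064)   # clones rebuilt into physical contigs
sequence_integrated = fraction(98, 115)    # Mb integrated / Mb covered

# ASSUMPTION (not from the paper): an Arabidopsis genome of roughly 120 Mb.
ASSUMED_GENOME_BP = 120_000_000
mean_insert = implied_mean_insert_bp(4.8, ASSUMED_GENOME_BP, 4096)

print(f"Clones in contigs:   {clones_in_contigs:.1%}")
print(f"Sequence integrated: {sequence_integrated:.1%}")
print(f"Implied mean insert: {mean_insert / 1000:.0f} kb (under the assumed genome size)")
```

Under these inputs, roughly 83% of the selected rice clones were placed in contigs and roughly 85% of the covered sequence was integrated; the implied insert size depends entirely on the assumed genome size and should be read only as an order-of-magnitude check.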
Data Availability Statement: All relevant data are
within the paper and its Supporting Information files.
The main result data can be accessed publicly at
http://gresource.hzau.edu.cn/fgm. The raw data were uploaded to the
European Nucleotide Archive at EMBL-EBI (https://www.ebi.ac.uk/ena/)
[study accession number: PRJEB12942].
Introduction
Since 2005, the number of registered genome sequencing projects has doubled every two years,
reaching 11,472 as of September 2011 [1]. Recent projects have expended a tremendous
amount of effort to sequence more complex genomes [2]. Many projects aimed to generate
reference genome sequences for the genus or species of interest. A reference genome sequence is
an important tool to explore genome structure and function, identify genomic variations, infer
information about species evolution, and guide the genome assembly of closely related species
[3–8]. However, in all cases, the high quality of a reference genome sequence is critical to
ensure reliable outcomes [9].
Two approaches, clone-by-clone (CBC) and whole genome shotgun (WGS), were developed
for whole genome sequencing [10–13]. WGS has been widely used along with high-throughput
sequencing such as next-generation sequencing (NGS) technologies [14]. Owing to the
high-throughput and cost-effective nature of this approach, many genomes have been sequenced using WGS/NGS.
However, this approach suffers from the key problem that NGS reads are too short to
reliably locate and order scaffolds on chromosomes and complete chromosome assemblies,
especially when a genome is large and contains an abundance of repetitive sequences, large gene
families, and extensive segmental duplications [5]. With the development of single-molecule
(third-generation) sequencing technology, longer sequencing reads and more
continuous contigs can be obtained [15–17]. However, it remains difficult
to complete the sequences of complex genomes with this technology alone. CBC does not suffer from these
problems and is considered a “gold standard” for genome sequencing [18, 19]. In the CBC
approach, a physical map is first constructed using large-insert clones, mainly bacterial
artificial chromosomes (BACs) [20], and used as a framework for the allocation of assembled
sequences to chromosomes [10, 12, 21]. Physical clone maps are also important tools for
locating genes for map-based cloning [22, 23], assembling genomic repeats [24] and filling gaps
[25].
Fingerprinting technology has been widely used for physical clone mapping [26–28]. In this
technology, large insert clones such as BACs are fingerprinted with restriction enzyme(s), and
the shared restriction bands are used to identify overlaps between clones [29]. This technology
has been implemented in automated and high-throughput systems [26]. However, it is costly
and has a limited resolution for large genome mapping [30]. Optical mapping [31],
nanochannel genome mapping [32] and whole genome profiling (WGP) [30] methods have been
developed as alternatives to con (...truncated)