Using paired-end sequences to optimise parameters for alignment of sequence reads against related genomes
BMC Genomics
Abhirami Ratnakumar 0 1
Sean McWilliam 1
Wesley Barris 1
Brian P Dalrymple 1
0 Department of Medical Biochemistry and Microbiology, Uppsala University, Box 582, 751 23 Uppsala, Sweden
1 CSIRO Livestock Industries, 306 Carmody Road, St. Lucia, QLD 4067, Australia
Background: The advent of cheap high-throughput sequencing methods has facilitated low-coverage skims of a large number of organisms. To maximise the utility of the sequences, assembly into contigs and then ordering of those contigs is required. Whilst sequences can be assembled into contigs de novo, using assembled genomes of closely related organisms as a framework can considerably aid the process. However, the preferred search programs and parameters that will optimise the sensitivity and specificity of the alignments between the sequence reads and the framework genome(s) are not necessarily obvious. Here we demonstrate a process that uses paired-end sequence reads to choose an optimal program and alignment parameters.
Results: Unlike two single fragment reads, in paired-end sequence reads, such as BAC-end sequences, the two sequences in the pair have a known positional relationship in the original genome. This provides an additional level of confidence, over and above match scores and e-values, in the accuracy of the positional assignment of the reads in the comparative genome. Three commonly used sequence alignment programs, MegaBLAST, Blastz and PatternHunter, were used to align a set of ovine BAC-end sequences against the equine genome assembly. A range of different search parameters, with a particular focus on contiguous and discontiguous seeds, was used for each program. The number of reads with a hit and the number of read pairs with hits for the two end sequences in the tail-to-tail paired-end configuration were plotted relative to the theoretical maximum expected curve. Of the programs tested, MegaBLAST with short contiguous seed lengths (word size 8-11) performed best in this particular task. In addition, the data provide estimates of the false positive and false negative rates, which can be used to determine the appropriate values of additional parameters, such as score cut-off, to balance sensitivity and specificity. To determine whether the approach also worked for the alignment of shorter reads, the first 240 bases of each BAC-end sequence were also aligned to the equine genome. Again, contiguous MegaBLAST performed the best in optimising the sensitivity and specificity with which sheep BAC-end reads map to the equine and bovine genomes.
Conclusions: Paired-end reads, such as BAC-end sequences, provide an efficient mechanism to optimise sequence alignment parameters, for example for comparative genome assemblies, by providing an objective standard to evaluate performance.
Background
The availability of so-called Next Generation
Sequencing (NGS), comprising relatively cheap high-throughput
short-read sequencing technologies such as Illumina
GA and ABI SOLiD, and medium-length sequencing
technologies such as Roche 454, is giving non-specialist
laboratories the ability to sequence large genomes.
However, the large number of reads produced by these NGS
technologies creates problems for the utilisation of the
sequence data. In the last few years a number of new
programs for the alignment of short reads, for example
in the range 30-150 bases, have been described; these
include Maq [1], SOAP [2] and Bowtie [3]. In general,
these programs are designed for resequencing projects,
where few nucleotide sequence differences are expected
between the sequence reads and the reference genome.
However, many projects are likely to be low-coverage
skims of previously unsequenced genomes [4], possibly
combining identification of SNPs with a survey of the
genome sequence. The optimal design of SNP chips and
their effective utilisation in whole-genome
association analyses require the relative order of, and the
distance between, the SNPs and their association with
genes to be known. Obtaining this information is likely
to rely on comparative genomics by utilising the
assemblies of related genomes to order and orientate sequence
reads and contigs. The assembly of the cat genome
based on Sanger sequencing used such a process to
build an assembly from 1.9-fold coverage of the
genome [5]. For the cat, a combination of MegaBLAST and
Blastz was used to generate the genome assembly, which
utilised alignments to a number of other genomes, including
human, chimpanzee, mouse, rat, dog and bovine [5].
In recent years a wide range of different programs
have emerged to complement BLAST, itself a
compromise between specificity and sensitivity relative to the
Smith-Waterman algorithm. True Smith-Waterman is
too slow for large-scale projects, but in an effort to
approach its sensitivity and specificity at much greater speed,
MegaBLAST [6] and PatternHunter [7,8], amongst others,
have been developed. A key to increasing the speed of
the sequence alignment programs has been the
utilisation of discontiguous seeds [7,9], allowing the matches
to be spread over longer sequences with internal
mismatches and therefore the utilisation of longer seeds for
the same sensitivity. This approach has been
implemented in MegaBLAST, Blastz [10] and PatternHunter
amongst other programs. Using discontiguous seeds
improves the specificity and sensitivity of the programs.
Further innovations have included using multiple
discontiguous seeds and refining the patterns of the seeds
[11,12]. However, much of the analysis and comparisons
of approaches have been carried out on mRNA/EST
sequence sets [9,12] and not on genomic DNA which,
in the eukaryotes, has quite different distributions of
repeats. The new sequence alignment programs that
have been developed for the alignment of sequence
reads against reference genome sequences for
resequencing projects (see above) do not appear to be suitable
for comparative genomics approaches. For aligning
medium length genomic sequence reads (150-500 bases)
against related genomes, it is not immediately clear
which program and which parameters would yield the
best compromise between sensitivity and specificity.
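To make the distinction between contiguous and discontiguous seeds discussed above concrete, the following minimal Python sketch tests whether two ungapped, equal-length sequences share a seed hit. The function name `has_seed_hit` and the toy sequences are illustrative assumptions and are not taken from any of the programs discussed; the spaced pattern is the weight-11, span-18 seed commonly cited for PatternHunter. Real programs index seeds rather than scanning windows, so this shows only the matching rule, not the implementation.

```python
# Illustrative sketch only: a seed is a string of '1' (position must match)
# and '0' (position is ignored). A contiguous seed is all '1's.
CONTIGUOUS_11 = "1" * 11
SPACED_11 = "111010010100110111"  # weight 11, span 18 (assumed example pattern)


def has_seed_hit(query: str, target: str, pattern: str) -> bool:
    """Return True if any window of the ungapped pair matches at every '1'."""
    if len(query) != len(target):
        raise ValueError("sketch assumes an ungapped, equal-length pair")
    required = [j for j, p in enumerate(pattern) if p == "1"]
    for i in range(len(query) - len(pattern) + 1):
        if all(query[i + j] == target[i + j] for j in required):
            return True
    return False


# Toy pair (hypothetical data): 18 bases with mismatches at positions 3, 10, 14,
# so the longest run of consecutive matches is only 6 bases.
query = "ACGTACGTACGTACGTAC"
target = "ACGAACGTACCTACTTAC"

contiguous_hit = has_seed_hit(query, target, CONTIGUOUS_11)
spaced_hit = has_seed_hit(query, target, SPACED_11)
print("contiguous seed hit:", contiguous_hit)
print("spaced seed hit:", spaced_hit)
```

Both seeds require 11 matching positions, but in this toy pair only the spaced seed finds a hit, because the mismatches fall on positions the pattern ignores. Whether that advantage carries over to genomic reads with eukaryotic repeat content is the question the paired-end evaluation is intended to answer.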
Here we assess the effectiveness of three widely used
DNA sequence alignment programs in positioning ovine
BAC-end sequence (BES) reads against the equine
and bovine genome assemblies to demonstrate the
utility of the approach. We use the information about
the positional relationship of the end sequences of each
BAC in the ovine genome to estimate the sensitivity and
specificity of the methods of determining the positions
in the related, but not identical, genomes.
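The use of the positional relationship between the two end sequences can be sketched as a simple consistency check on the best hits of each pair. This is a minimal sketch under stated assumptions, not the procedure used in this study: the `Hit` record, the function names, the toy coordinates and the `max_span` threshold are all hypothetical, and a pair is treated as consistent when both ends hit the same chromosome on opposite strands, face each other, and lie within `max_span` bases.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Best alignment of one end read in the reference genome (hypothetical record)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def is_concordant(a: Hit, b: Hit, max_span: int) -> bool:
    """Assumed rule: same chromosome, opposite strands, reads facing each other."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    span = rev.end - fwd.start  # distance covered by the pair in the reference
    return 0 < span <= max_span


def summarise(pairs: list[tuple[Optional[Hit], Optional[Hit]]], max_span: int) -> dict:
    """Count reads with a hit, pairs with both ends hit, and concordant pairs."""
    reads_with_hit = sum(h is not None for pair in pairs for h in pair)
    both = [(a, b) for a, b in pairs if a is not None and b is not None]
    concordant = sum(is_concordant(a, b, max_span) for a, b in both)
    return {
        "reads_with_hit": reads_with_hit,
        "pairs_both_hit": len(both),
        "pairs_concordant": concordant,
    }


# Toy data (hypothetical): five BACs, None means the end read had no hit.
pairs = [
    (Hit("chr1", 1000, 1500, "+"), Hit("chr1", 150000, 150500, "-")),
    (Hit("chr1", 1000, 1500, "+"), Hit("chr2", 2000, 2500, "-")),
    (Hit("chr3", 5000, 5500, "+"), None),
    (None, None),
    (Hit("chr4", 900000, 900500, "-"), Hit("chr4", 800000, 800500, "+")),
]
summary = summarise(pairs, max_span=300_000)  # max_span is an assumed threshold
print(summary)
```

Pairs in which both ends have hits but the hits are not consistent give an indication of false positive placements, while the counts of reads and pairs with hits reflect sensitivity; comparing these counts across programs and parameter settings is the basis of the evaluation described here.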
Results and Discussion
Alignment of ovine BAC-end sequences (BESs) (...truncated)