Using paired-end sequences to optimise parameters for alignment of sequence reads against related genomes

http://www.biomedcentral.com/content/pdf/1471-2164-11-458.pdf

BMC Genomics

Abhirami Ratnakumar (0, 1), Sean McWilliam (1), Wesley Barris (1), Brian P Dalrymple (1)

(0) Department of Medical Biochemistry and Microbiology, Uppsala University, Box 582, 751 23 Uppsala, Sweden
(1) CSIRO Livestock Industries, 306 Carmody Road, St. Lucia, QLD 4067, Australia

Background: The advent of cheap high-throughput sequencing methods has facilitated low coverage skims of a large number of organisms. To maximise the utility of the sequences, assembly into contigs and then ordering of those contigs is required. Whilst sequences can be assembled into contigs de novo, using assembled genomes of closely related organisms as a framework can considerably aid the process. However, the preferred search programs and parameters that will optimise the sensitivity and specificity of the alignments between the sequence reads and the framework genome(s) are not necessarily obvious. Here we demonstrate a process that uses paired-end sequence reads to choose an optimal program and alignment parameters.

Results: Unlike two single-fragment reads, the two sequences in a paired-end read, such as a pair of BAC-end sequences, have a known positional relationship in the original genome. This provides an additional level of confidence, over and above match scores and e-values, in the accuracy of the positional assignment of the reads in the comparative genome. Three commonly used sequence alignment programs (MegaBLAST, Blastz and PatternHunter) were used to align a set of ovine BAC-end sequences against the equine genome assembly. A range of different search parameters, with a particular focus on contiguous and discontiguous seeds, was used for each program. The number of reads with a hit and the number of read pairs with hits for the two end sequences in the tail-to-tail paired-end configuration were plotted relative to the theoretical maximum expected curve.
Of the programs tested, MegaBLAST with short contiguous seed lengths (word size 8-11) performed best in this particular task. In addition, the data provide estimates of the false positive and false negative rates, which can be used to determine the appropriate values of additional parameters, such as score cut-off, to balance sensitivity and specificity. To determine whether the approach also worked for the alignment of shorter reads, the first 240 bases of each BAC-end sequence were also aligned to the equine genome. Again, contiguous MegaBLAST performed the best in optimising the sensitivity and specificity with which sheep BAC-end reads map to the equine and bovine genomes.

Conclusions: Paired-end reads, such as BAC-end sequences, provide an efficient mechanism to optimise sequence alignment parameters, for example for comparative genome assemblies, by providing an objective standard to evaluate performance.

* Correspondence: CSIRO Livestock Industries, 306 Carmody Road, St. Lucia, QLD 4067, Australia. Full list of author information is available at the end of the article.

Background

The availability of so-called Next Generation Sequencing (NGS), that is, relatively cheap high-throughput short-read sequencing technologies such as Illumina GA and ABI SOLiD and medium-length sequencing technologies such as Roche 454, is giving non-specialist laboratories the ability to sequence large genomes. However, the large number of reads produced by these NGS technologies creates problems for the utilisation of the sequence data. In the last few years a number of new programs for the alignment of short reads, for example in the range 30-150 bases, have been described, including Maq [1], SOAP [2] and Bowtie [3]. In general, these programs are designed for resequencing projects, where few nucleotide sequence differences are expected between the sequence reads and the reference genome.
However, many projects are likely to be low coverage skims of previously unsequenced genomes [4], possibly combining identification of SNPs with a survey of the genome sequence. The optimal design of SNP chips and effective utilisation of the chips in whole genome association analyses require the relative order of the SNPs, the distances between them and their association with genes to be known. Obtaining this information is likely to rely on comparative genomics, utilising the assemblies of related genomes to order and orientate sequence reads and contigs. The assembly of the cat genome based on Sanger sequencing used such a process to build an assembly from a 1.9 fold coverage of the genome [5]. For the cat, a combination of MegaBLAST and Blastz was used to generate the genome assembly, which utilised alignments to a number of other genomes such as human, chimpanzee, mouse, rat, dog and bovine [5].

In recent years a wide range of different programs has emerged to complement BLAST, itself a compromise between specificity and sensitivity relative to the Smith-Waterman algorithm. True Smith-Waterman is too slow for large scale projects, but in an effort to approach its sensitivity and specificity at a practical speed, MegaBLAST [6] and PatternHunter [7,8], amongst others, have been developed. A key to increasing the speed of the sequence alignment programs has been the utilisation of discontiguous seeds [7,9], which allow the matching positions to be spread over a longer stretch of sequence with internal mismatches, and therefore allow longer seeds to be used for the same sensitivity. This approach has been implemented in MegaBLAST, Blastz [10] and PatternHunter amongst other programs. Using discontiguous seeds improves the specificity and sensitivity of the programs. Further innovations have included using multiple discontiguous seeds and refining the patterns of the seeds [11,12].
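The difference between contiguous and discontiguous seeds can be illustrated with a small sketch. The code below is not the implementation used by MegaBLAST, Blastz or PatternHunter; it is a minimal exact-lookup seed finder written for illustration. The function name `seed_hits`, the two toy sequences and the seed pattern `11011011011` are all assumptions made for this example (a `1` marks a position that must match, a `0` a position that is ignored). The target differs from the query at every third base, so no contiguous 8-base seed can match along the true alignment, whereas the spaced seed with the same number of matching positions still can.

```python
from collections import defaultdict

# Hypothetical toy data: TARGET is QUERY with a transition at every third base.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
QUERY = "ACGTTGCAAGCTTACGGATCCATG"
TARGET = "".join(
    TRANSITION[base] if i % 3 == 2 else base for i, base in enumerate(QUERY)
)


def seed_hits(query, target, pattern):
    """Return (query_pos, target_pos) pairs where the seed pattern matches.

    pattern is a string of '1' (position must match) and '0' (ignored).
    A pattern made only of '1' characters is a contiguous seed.
    """
    care = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)

    # Index every window of the target by the bases at the '1' positions.
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[tuple(target[t + i] for i in care)].append(t)

    # Look up every window of the query in that index.
    hits = []
    for q in range(len(query) - span + 1):
        key = tuple(query[q + i] for i in care)
        for t in index.get(key, []):
            hits.append((q, t))
    return hits


if __name__ == "__main__":
    contiguous = seed_hits(QUERY, TARGET, "1" * 8)        # weight 8, span 8
    spaced = seed_hits(QUERY, TARGET, "11011011011")      # weight 8, span 11
    print("contiguous seed hits:", contiguous)
    print("spaced seed hits on the true diagonal:",
          [hit for hit in spaced if hit[0] == hit[1]])
```

Both seeds require eight matching bases, so their chance-match rates on random sequence are similar; only the spaced seed tolerates the regularly placed mismatches. Real genomic divergence is not this regular, which is one reason the choice between seed types has to be evaluated empirically.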
However, much of the analysis and comparison of approaches has been carried out on mRNA/EST sequence sets [9,12] and not on genomic DNA which, in the eukaryotes, has quite different distributions of repeats. The new sequence alignment programs that have been developed for the alignment of sequence reads against reference genome sequences for resequencing projects (see above) do not appear to be suitable for comparative genomics approaches. For aligning medium length genomic sequence reads (150-500 bases) against related genomes, it is not immediately clear which program and which parameters would yield the best compromise between sensitivity and specificity. Here we demonstrate the utility of the approach by analysing how effectively three widely used DNA sequence alignment programs position ovine BAC-end sequence (BES) reads against the equine and bovine genome assemblies. We use the information about the positional relationship of the end sequences of each BAC in the ovine genome to estimate the sensitivity and specificity of the methods for determining the positions in the related, but not identical, genomes.

Results and Discussion

Alignment of ovine BAC-end sequences (BESs) (...truncated)
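The use of the positional relationship between the two end sequences of a BAC, as outlined in the Background, can be sketched as a simple consistency check on the best hit of each end. This is an illustrative sketch only and not the authors' pipeline: the `Hit` record, the function names, the category labels and the `max_span` value of 500,000 bases are assumptions introduced here, and the criteria actually used in the study may differ. A pair is treated as concordant when both ends hit the same reference chromosome on opposite strands, the two reads point towards each other, and the implied span is no larger than `max_span`.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Best alignment of one end sequence on the reference genome."""
    chrom: str
    start: int   # leftmost reference coordinate of the alignment
    end: int     # rightmost reference coordinate of the alignment
    strand: str  # '+' or '-'


def classify_pair(a: Optional[Hit], b: Optional[Hit],
                  max_span: int = 500_000) -> str:
    """Classify one BAC from the hits (or None) of its two end sequences.

    max_span is a hypothetical upper limit on the implied insert size.
    """
    if a is None and b is None:
        return "no_hit"
    if a is None or b is None:
        return "single_hit"
    if a.chrom != b.chrom or a.strand == b.strand:
        return "discordant"
    # Opposite strands: require the '+' read to lie upstream of the '-' read.
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    span = rev.end - fwd.start
    if fwd.start <= rev.start and 0 < span <= max_span:
        return "concordant"
    return "discordant"


def summarise(pairs, max_span: int = 500_000) -> Counter:
    """Count how many BACs fall into each category."""
    return Counter(classify_pair(a, b, max_span) for a, b in pairs)


if __name__ == "__main__":
    example_pairs = [
        (Hit("chr1", 1_000, 1_500, "+"), Hit("chr1", 151_000, 151_500, "-")),
        (Hit("chr1", 1_000, 1_500, "+"), Hit("chr7", 9_000, 9_500, "-")),
        (Hit("chr2", 5_000, 5_500, "-"), None),
        (None, None),
    ]
    print(summarise(example_pairs))
```

Under these assumptions, the count of concordant pairs for a given program and parameter set can be compared with the count of reads that have any hit. Discordant pairs indicate that at least one end is likely to have been placed incorrectly.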