STRsearch: a new pipeline for targeted profiling of short tandem repeats in massively parallel sequencing data

Wang et al. Hereditas (2020) 157:8
https://doi.org/10.1186/s41065-020-00120-6

RESEARCH Open Access

Dong Wang1†, Ruiyang Tao2†, Zhiqiang Li1, Dun Pan1, Zhuo Wang1*, Chengtao Li2* and Yongyong Shi1*

Abstract

Background: Short tandem repeats (STRs) are important polymorphic markers for human identification and kinship analysis in forensic science. With the continuous development of massively parallel sequencing (MPS), more laboratories are using this technology for forensic applications. Existing STR genotyping tools, mostly developed for whole-genome sequencing data, are not effective for MPS data. More importantly, their backward compatibility with conventional capillary electrophoresis (CE) technology has not been evaluated or guaranteed.

Results: In this study, we developed a new end-to-end pipeline called STRsearch for STR-MPS data analysis. STRsearch not only determines the allele by counting the repeat patterns and INDELs that actually lie in the STR region, but also translates MPS results into standard STR nomenclature (numbers and letters). We evaluated the performance of STRsearch on two forensic sequencing datasets; the concordance with CE genotypes was 75.73 and 75.75%, which was 12.32 and 9.05% higher, respectively, than that of the existing tool STRScan. Additionally, we trained a base classifier using sequence properties and used it to predict the probability of correct genotyping at a given locus, achieving a highest accuracy of 96.13%.

Conclusions: These results demonstrate that STRsearch is a better tool for preserving backward compatibility with CE in targeted STR profiling of MPS data. STRsearch is available as open-source software at https://github.com/AnJingwd/STRsearch.
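As a rough illustration of what translating a sequenced STR region into CE-compatible nomenclature can involve, the following Python sketch derives a length-based allele name (the number of complete repeats, followed by a dot and the number of leftover bases, if any) and a bracketed sequence description. It is a minimal sketch under stated assumptions, not the STRsearch implementation: the function names, the single fixed motif, and the TH01-like example sequence are all hypothetical choices made for illustration.

```python
def ce_allele_designation(region: str, period: int) -> str:
    """Length-based allele name, e.g. '9' or '9.3' (illustrative helper).

    `region` is assumed to be exactly the STR region (no flanks) and
    `period` the repeat unit length in bases.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    complete, partial = divmod(len(region), period)
    return str(complete) if partial == 0 else f"{complete}.{partial}"


def bracket_notation(region: str, motif: str) -> str:
    """Collapse consecutive copies of `motif` into '[motif]n' blocks.

    Bases that do not belong to a motif run are emitted as plain letters.
    """
    if not motif:
        raise ValueError("motif must be non-empty")
    parts = []
    loose = ""  # bases seen since the last motif run
    i = 0
    while i < len(region):
        n = 0
        while region.startswith(motif, i + n * len(motif)):
            n += 1
        if n > 0:
            if loose:
                parts.append(loose)
                loose = ""
            parts.append(f"[{motif}]{n}")
            i += n * len(motif)
        else:
            loose += region[i]
            i += 1
    if loose:
        parts.append(loose)
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical TH01-like region: 6 repeats, a partial repeat, 3 repeats.
    region = "AATG" * 6 + "ATG" + "AATG" * 3
    print(ce_allele_designation(region, 4))
    print(bracket_notation(region, "AATG"))
```

The length-based name depends only on the size of the region, which is the information a CE instrument reports, while the bracketed form retains the sequence-level detail that MPS adds.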
Keywords: Short tandem repeats, Massively parallel sequencing, STR genotyping, Validation studies, Forensic sequencing

* Correspondence: Zhuo Wang, Chengtao Li and Yongyong Shi
† Dong Wang and Ruiyang Tao contributed equally to this work.
1 Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
2 Shanghai Key Laboratory of Forensic Medicine, Shanghai Forensic Service Platform, Academy of Forensic Science, Ministry of Justice, Shanghai 200063, People’s Republic of China

© The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Short tandem repeats (STRs) are short, tandemly repeated DNA sequences composed of repeat units of 1–6 bp [1]. STRs are widespread throughout the human genome and serve as widely used polymorphic markers in forensic science [1, 2]. For forensic casework, ideal STR loci should generally have fragment sizes of approximately 100 to 500 bp, high heterozygosity, low stutter, and a low mutation rate, among other characteristics [3, 4]. Currently, capillary electrophoresis (CE) is the gold standard for STR genotyping, and it is commonly used in national DNA databases. The CE workflow consists of PCR amplification of multiple STR loci, STR allele separation and sizing, and profile interpretation [3, 5, 6]. Each STR amplicon is fluorescently labeled during PCR, and the STR alleles are then separated via gel or CE based on dye color and migration time. Finally, the number of repeats of each allele is determined by comparison with an allelic ladder of calibrated repeat numbers [3]. However, the CE method can identify only length variation and does not account for sequence variation in the repeat or flanking regions [7]. Compared with the CE method, massively parallel sequencing (MPS) can not only analyze more STR loci simultaneously but also provide higher discrimination power by detecting sequence variants such as SNPs and INDELs [8].

However, there are three main difficulties in developing a new tool for STR-MPS data analysis: (i) the amplification of STR loci during sequencing is also subject to slippage, creating copy number errors in the read data; (ii) the low information content of repetitive reads makes it difficult to align them reliably [9]; and (iii) most existing bioinformatics tools can make reliable calls only if sequencing reads completely span the actual repeat region [10]. Regarding the first challenge, these errors are usually termed stutters, which are commonly encountered artifacts in STR analysis of both CE and MPS data. They are caused by slippage of the DNA polymerase during the extension phase of PCR, which deletes or adds one repeat unit in the nascent DNA strand [11]. Regarding the second challenge, a previous comprehensive survey [10] demonstrated that Stampy [12] was the most accurate at mapping reads in STR regions, whereas Novoalign (http://www.novocraft.com), Bowtie2 [13] and BWA [14] had much shorter running times.
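The stutter mechanism described above, in which polymerase slippage yields products one repeat unit shorter or longer than the true allele, can be sketched in Python as follows. This is only an illustrative model, not a method taken from the article or from any named tool: the function names are hypothetical, and the 15% read-count ratio used to flag a candidate as stutter is an assumed threshold chosen for the example.

```python
from typing import Dict


def stutter_products(motif: str, n_repeats: int) -> Dict[int, str]:
    """Return the n-1 and n+1 repeat sequences expected from slippage."""
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    return {
        n_repeats - 1: motif * (n_repeats - 1),  # one repeat unit deleted
        n_repeats + 1: motif * (n_repeats + 1),  # one repeat unit added
    }


def flag_stutter(read_counts: Dict[int, int],
                 max_ratio: float = 0.15) -> Dict[int, bool]:
    """Flag alleles whose read count is small relative to a neighbour.

    `read_counts` maps repeat number -> supporting reads. A candidate is
    flagged when a neighbouring allele (one repeat longer or shorter)
    exists and the candidate has at most `max_ratio` times its reads.
    The 0.15 default is an assumption, not a published threshold.
    """
    flags = {}
    for n, count in read_counts.items():
        parent = max(read_counts.get(n + 1, 0), read_counts.get(n - 1, 0))
        flags[n] = parent > 0 and count / parent <= max_ratio
    return flags


if __name__ == "__main__":
    counts = {9: 80, 10: 1000, 11: 30}  # hypothetical locus read counts
    print(flag_stutter(counts))
    print(stutter_products("AATG", 10).keys())
```

In practice stutter ratios vary by locus and by repeat length, so a fixed threshold like this one serves only to show the shape of the check.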
Regarding the third challenge, research has demonstrated that long-read sequencing technologies (such as Nanopore or PacBio) could potentially sequence through larger repeat loci accurately and cost-effectively [15]. Furthermore, short paired-end reads with overlapping sequences can be assembled into longer sequences, so that the assembled reads span the full length of the original DNA fragment.

So far, many tools have been developed for STR analysis in whole-genome sequencing data, the most notable of which are LobSTR [16], HipSTR [17] and RepeatSeq [18]. However, these tools were severely restricted to detecting STR variation within the read length. To solve this problem, another tool called STRetch [19] estimated the approximate size of an STR allele from normalized read counts, which are linearly related to allele length. For targeted profiling of STRs, STRScan [20] identified STRs by comparing read sequences with repeat patterns; however, its a priori assumption about allele size had the potential to induce allelic dropout. STRaitRazor [21] adopted approximate string matching of flanking sequences to characterize STR haplotypes, so sufficient and unique flanking sequences were required for them to be mapped correctly. Although the importance of internal and external quality control (QC) was highlighted for STRs analysi (...truncated)