STRsearch: a new pipeline for targeted profiling of short tandem repeats in massively parallel sequencing data
Wang et al. Hereditas
(2020) 157:8
https://doi.org/10.1186/s41065-020-00120-6
RESEARCH
Open Access
Dong Wang1†, Ruiyang Tao2†, Zhiqiang Li1, Dun Pan1, Zhuo Wang1*, Chengtao Li2* and Yongyong Shi1*
Abstract
Background: Short tandem repeats (STRs) are important polymorphism markers for human identification and
kinship analyses in forensic science. With the continuous development of massively parallel sequencing (MPS), more
laboratories have adopted this technology for forensic applications. Existing STR genotyping tools, mostly developed
for whole-genome sequencing data, are not effective for MPS data. More importantly, their backward compatibility
with conventional capillary electrophoresis (CE) technology has been neither evaluated nor guaranteed.
Results: In this study, we developed a new end-to-end pipeline called STRsearch for STR-MPS data analysis.
STRsearch not only determines the allele by counting the repeat patterns and INDELs that actually lie within the STR
region, but also translates MPS results into standard STR nomenclature (numbers and letters). We evaluated the
performance of STRsearch on two forensic sequencing datasets; the concordance with CE genotypes was 75.73
and 75.75%, which was 12.32 and 9.05% higher, respectively, than that of the existing tool STRScan. Additionally, we trained
a base classifier using sequence properties and used it to predict the probability of correct genotyping at a given
locus, reaching a highest accuracy of 96.13%.
Conclusions: These results demonstrate that STRsearch better preserves backward compatibility
with CE for targeted STR profiling in MPS data. STRsearch is available as open-source software at
https://github.com/AnJingwd/STRsearch.
Keywords: Short tandem repeats, Massively parallel sequencing, STR genotyping, Validation studies, Forensic
sequencing
* Correspondence: Zhuo Wang, Chengtao Li, Yongyong Shi
† Dong Wang and Ruiyang Tao contributed equally to this work.
1 Bio-X Institutes, Key Laboratory for the Genetics of Developmental and
Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation
Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
2 Shanghai Key Laboratory of Forensic Medicine, Shanghai Forensic Service
Platform, Academy of Forensic Science, Ministry of Justice, Shanghai 200063,
People’s Republic of China
Background
Short tandem repeats (STRs) are short, tandemly repeated DNA sequences composed of repetitive units of
1–6 bp [1]. STRs are widespread throughout the human
genome and serve as widely used polymorphism markers
in forensic science [1, 2]. For forensic casework, ideal
STR loci should generally have characteristics such as fragment sizes of approximately 100 to
500 bp, high heterozygosity, low stutter, and a low mutation
rate [3, 4]. Currently, capillary electrophoresis (CE) is the gold standard for STR
genotyping, and it is commonly used in national DNA
databases. The CE workflow comprises
PCR amplification of multiple STR loci, STR allele separation and sizing, and profile interpretation [3, 5, 6].
Each STR amplicon is fluorescently labeled during PCR, and the STR alleles are then separated via gel or
CE based on dye color and migration time. Finally, the number of repeats of each allele is determined
by comparison with an allelic ladder of calibrated repeat numbers
[3]. However, the CE method can identify only length
variation and does not account for sequence variation in the repeat or flanking regions [7].
© The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Compared to the CE method, massively parallel sequencing (MPS) can not only analyze an increased
number of STR loci simultaneously, but also provides
higher discrimination power by detecting sequence variants such as SNPs or INDELs [8]. However,
there are three main difficulties in developing new tools
for STR-MPS data analysis: (i) the amplification of STR
loci during sequencing is also subject to slippage, creating copy number errors in read data; (ii) the low information
content of repetitive sequence reads makes it
difficult to align them reliably [9]; (iii) most existing bioinformatics tools can make reliable calls only if
sequencing reads completely span the actual repeat
region [10].
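Difficulty (iii) can be made concrete with a minimal sketch that checks whether a read spans a locus by locating both flanking sequences and returning what lies between them. Exact string matching, the function name `extract_repeat_region`, and the flank sequences are simplifying assumptions for illustration; this is not how any of the cited tools is implemented.

```python
from typing import Optional


def extract_repeat_region(read: str, left_flank: str, right_flank: str) -> Optional[str]:
    """Return the sequence between the flanks, or None if the read does not span the locus."""
    left_pos = read.find(left_flank)
    if left_pos == -1:
        return None  # left flank missing: the read starts inside or after the repeat
    region_start = left_pos + len(left_flank)
    right_pos = read.find(right_flank, region_start)
    if right_pos == -1:
        return None  # right flank missing: the read ends before leaving the repeat
    return read[region_start:right_pos]


# Hypothetical flanks and reads, chosen only for illustration.
LEFT_FLANK = "GGTCA"
RIGHT_FLANK = "CCTAG"
spanning_read = "TT" + LEFT_FLANK + "AATG" * 5 + RIGHT_FLANK + "AC"
truncated_read = "TT" + LEFT_FLANK + "AATG" * 5  # ends inside the repeat

print(extract_repeat_region(spanning_read, LEFT_FLANK, RIGHT_FLANK))
print(extract_repeat_region(truncated_read, LEFT_FLANK, RIGHT_FLANK))
```

Only the spanning read yields a repeat region. For the truncated read the true number of repeats cannot be known, because the repeat may continue beyond the end of the read.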
Regarding the first challenge, these errors are usually termed
stutters, which are commonly encountered artifacts
in STR analysis of both CE and MPS data. They are
caused by slippage of the DNA polymerase during
the extension phase of PCR, which deletes
or adds one repeat unit in the nascent DNA strand [11].
Regarding the second challenge, a previous study [10] performed a comprehensive survey and demonstrated
that Stampy [12] was the most accurate at
mapping reads in STR regions, while Novoalign
(http://www.novocraft.com), Bowtie2 [13] and BWA [14] required much shorter running times.
Regarding the third challenge, research demonstrated that long-read sequencing
technologies (such as Nanopore or PacBio) could potentially sequence through larger repeat loci
accurately and cost-effectively [15]. Furthermore, short paired-end
reads with sequence overlaps can be assembled to create
longer sequences, and the assembled reads will span the full
length of the original DNA fragment.
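The idea of assembling overlapping paired-end reads can be sketched as follows. This is a minimal illustration that assumes error-free reads and exact overlap matching; dedicated read mergers also handle mismatches and base qualities. The function names and the example fragment are hypothetical and do not describe the STRsearch implementation.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair using the longest exact overlap, or return None if none is found."""
    mate = reverse_complement(read2)  # bring read 2 onto the same strand as read 1
    longest_possible = min(len(read1), len(mate))
    for overlap in range(longest_possible, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


# Hypothetical 44-bp fragment: 10-bp flank, six AATG repeats, 10-bp flank.
fragment = "GGTCATTACG" + "AATG" * 6 + "CCTAGGATCA"
read1 = fragment[:38]                      # forward read
read2 = reverse_complement(fragment[6:])   # reverse read from the opposite end
merged = merge_pair(read1, read2)
print(merged == fragment)
```

The example keeps unique flanking sequence inside the overlap on purpose. When the overlap lies entirely within the repeat, several overlap lengths match equally well, and a merge based on sequence identity alone can add or drop repeat units.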
So far, many tools have been developed for STR analysis in whole-genome sequencing
data, the most notable
of which are LobSTR [16], HipSTR [17] and RepeatSeq
[18]. However, these tools were severely
limited to detecting STR variation within the read length.
To solve this problem, another tool called STRetch [19]
estimated the approximate size of an STR allele using
normalized read counts, which were linearly related to the
allele length. For targeted profiling of STRs, STRScan [20]
identified STRs by comparing read sequences with repeat patterns. However, its a priori assumption on allele
size had the potential to induce allelic dropout.
STRaitRazor [21], in contrast, adopted approximate string matching
of flanking sequences to characterize haplotypes of
STRs, so sufficient and unique flanking sequences were
required to allow them to be mapped correctly. Although the importance of internal and external quality
control (QC) was highlighted for STRs analysi (...truncated)