Linked-read sequencing for detecting short tandem repeat expansions

Scientific Reports, Jun 2022

Detection of short tandem repeat (STR) expansions with standard short-read sequencing is challenging due to the difficulty in mapping multicopy repeat sequences. In this study, we explored how the long-range sequence information of barcode linked-read sequencing (BLRS) can be leveraged to improve repeat-read detection. We also devised a novel algorithm using BLRS barcodes for distance estimation and evaluated its application for STR genotyping. Both approaches were designed for genotyping large expansions (> 1 kb) that cannot be sized accurately by existing methods. Using simulated and experimental data of genomes with STR expansions from multiple BLRS platforms, we validated the utility of barcode and phasing information in attaining better STR genotypes compared to standard short-read sequencing. Although the coverage bias of extremely GC-rich STRs is an important limitation of BLRS, BLRS is an effective strategy for genotyping many other STR loci.

Readman Chiu (1,5), Indhu-Shree Rajan-Babu (2,3,5), Inanc Birol (1,2,*) & Jan M. Friedman (2,4)

Barcode linked-read sequencing (BLRS) technologies [1–3] combine the high per-base accuracy of short-read sequencing (SRS) with long-range sequence information [4]. BLRS has significantly advanced our ability to map complex genomic regions that are inaccessible to standard SRS, perform de novo diploid genome assemblies, and detect complex structural rearrangements [4,5]. BLRS enables haplotype reconstruction and accurate variant phasing [1,6], which are crucial to identifying putative disease-causing biallelic mutations. BLRS has also been successfully applied to diagnose patients with suspected genetic diseases for whom exome/genome SRS had failed to provide a diagnosis [4].

The human genome harbors over a million short tandem repeats (STRs) [7].
STR expansions are responsible for at least 50 known genetic disorders, and others probably remain to be discovered [8]. Available computational methods [9–14] and pipelines [15] for STR genotyping in SRS data perform reasonably well in detecting pathogenic repeat expansions (REs) of some disease genes. Detection of in-repeat reads (IRRs), which are reads composed entirely of repeat motif sequences, signals a potential RE event, and their abundance correlates with expansion size. However, accurate alignment of IRRs in SRS is often difficult because these reads may either not map to the reference genome or map ambiguously to multiple STR loci with the same repeat motif. As a result, existing short-read STR genotypers may report false-positive as well as false-negative calls for STRs that are longer than the sequencing fragment length [15].

We hypothesized that the molecular barcodes in BLRS can be utilized to retrieve IRRs more robustly and improve repeat length estimation and detection of expanded STRs. Furthermore, using barcodes could help assign IRRs to the correct haplotype and enable reliable allele segregation and genotyping of pathogenic biallelic STR expansions. To leverage this information, we devised a novel BLRS STR genotyping algorithm that uses the theoretical relationship between barcode sharing across genomic intervals and interval sizes.

We analysed data from three different BLRS methods: 10x Genomics Chromium, MGI stLFR, and Universal Sequencing Technology TELL-Seq. The BLRS datasets we analysed include (1) simulated 10x and stLFR whole-genome sequencing (WGS) data containing a heterozygous 4000 ATTCT RE in the ATXN10 gene; (2) publicly available NA12878 Genome in a Bottle (GIAB) data from the 10x, stLFR, and TELL-Seq BLRS platforms with a 1.1 kilobase (kb) expansion (relative to the reference) of a CCAT repeat in chromosome 20; and (3) four 10x WGS datasets with FXN GAA RE and four with FMR1 CGG RE (see supplementary information).
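
The barcode-sharing quantity that the genotyping algorithm relies on can be illustrated with a small sketch. This is not the authors' implementation: the text above does not state the theoretical relationship between sharing and interval size, so the code only measures sharing between two intervals. The `Alignment` record, the function names, and the choice of the Jaccard index as the sharing statistic are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Alignment(NamedTuple):
    """Hypothetical minimal record for one aligned linked read."""
    chrom: str
    pos: int        # leftmost mapped position of the read
    barcode: str    # molecular barcode attached to the read


def barcodes_in_interval(alignments: Iterable[Alignment],
                         chrom: str, start: int, end: int) -> Set[str]:
    """Collect the barcodes of reads whose start lies in [start, end]."""
    return {a.barcode for a in alignments
            if a.chrom == chrom and start <= a.pos <= end}


def barcode_sharing(left: Set[str], right: Set[str]) -> float:
    """Jaccard index of two barcode sets (assumed sharing statistic)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# Toy data: barcodes B and C are seen in both flanking intervals.
toy = [
    Alignment("chr20", 100, "A"), Alignment("chr20", 150, "B"),
    Alignment("chr20", 200, "C"), Alignment("chr20", 310, "B"),
    Alignment("chr20", 350, "C"), Alignment("chr20", 390, "D"),
    Alignment("chr1", 120, "E"),
]
left = barcodes_in_interval(toy, "chr20", 100, 200)
right = barcodes_in_interval(toy, "chr20", 300, 400)
sharing = barcode_sharing(left, right)
```

Because reads with the same barcode tend to come from the same long DNA molecule, two intervals on either side of an STR would be expected to share fewer barcodes as the sequence between them grows. Turning a sharing value into a distance estimate requires the model described in the paper's Methods, which is not reproduced here.
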
We benchmarked our approaches against ExpansionHunter (EH) [10,14], the most widely used short-read STR analysis tool, on BLRS and standard Illumina sequencing data (see supplementary information). To our knowledge, our study is the first to assess the utility of BLRS in detecting and genotyping STR expansions.

1 Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada.
2 Department of Medical Genetics, University of British Columbia, Vancouver, BC V5Z 4H4, Canada.
3 Department of Medical and Molecular Genetics, King’s College London, Strand, London WC2R 2LS, UK.
4 BC Children’s Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada.
5 These authors contributed equally: Readman Chiu and Indhu-Shree Rajan-Babu.
*email:

Scientific Reports | (2022) 12:9352 | https://doi.org/10.1038/s41598-022-13024-4

Results and discussion

EH has not been evaluated on BLRS before. We ran EH (v2.5.5) on the BLRS datasets with and without off-target sites (OTS), which are STR loci in the reference genome where IRRs containing the same repeat motif may have been mismapped (see supplementary information). Although helpful in detecting and accurately genotyping some large REs in SRS data [15], using OTS may result in overestimation of repeat lengths and generate ambiguous genotype calls when an individual has more than one “expanded” STR with a shared repeat motif (data not shown).

To ascertain whether EH’s performance improves with barcode-retrieved IRR counts, we first identified barcodes from all linked reads that map to the target region and extracted all reads carrying the identified barcodes from the FASTQ files. We then screened these reads to detect IRRs composed of the target motif, and supplied the IRR counts, read length, and sequencing depth to the formula EH uses to impute STR sizes [14] (see the Methods section; Fig. 1a).
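
The barcode-based retrieval steps described above (collect barcodes at the target, pull every read carrying them, then screen for IRRs) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Read` record, the function names, the 0.9 identity threshold, and the mismatch-only comparison against a perfect repeat (no indels) are all hypothetical choices. A real workflow would parse BAM and FASTQ files rather than in-memory tuples.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for one barcoded read."""
    name: str
    barcode: str
    sequence: str


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_irr(sequence: str, motif: str, min_identity: float = 0.9) -> bool:
    """True if the read matches a perfect tandem repeat of the motif.

    Both strands and every phase (rotation) of the motif are tried;
    identity is the fraction of matching positions (assumed criterion).
    """
    n = len(sequence)
    if n < len(motif):
        return False
    for unit in (motif, reverse_complement(motif)):
        for shift in range(len(unit)):
            rotated = unit[shift:] + unit[:shift]
            template = (rotated * (n // len(unit) + 1))[:n]
            matches = sum(a == b for a, b in zip(sequence, template))
            if matches / n >= min_identity:
                return True
    return False


def retrieve_irrs_by_barcode(target_reads: Iterable[Read],
                             all_reads: Iterable[Read],
                             motif: str) -> List[Read]:
    """Return IRRs among reads sharing a barcode with the target reads."""
    barcodes = {r.barcode for r in target_reads}
    return [r for r in all_reads
            if r.barcode in barcodes and is_irr(r.sequence, motif)]


# Toy example: only reads "a" and "d" carry barcode BC1 and are IRRs.
flank = "ACGTTGCAAGCTTAGGCTAACGTTAGCCTA"
target = [Read("t1", "BC1", flank)]
pool = [
    Read("t1", "BC1", flank),
    Read("a", "BC1", "ATTCT" * 6),   # repeat on the forward strand
    Read("b", "BC2", "ATTCT" * 6),   # IRR, but barcode not at the target
    Read("c", "BC1", flank),         # right barcode, not a repeat read
    Read("d", "BC1", "AGAAT" * 6),   # reverse-complement repeat
]
irrs = retrieve_irrs_by_barcode(target, pool, "ATTCT")
irr_count = len(irrs)
```

The resulting `irr_count` corresponds to the barcode-retrieved IRR count that the text says is passed, together with read length and sequencing depth, to EH's size formula. That formula is not reproduced here.
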
By matching the identities of IRRs collected using barcodes against the “ground truth” determined from the simulated ATXN10 data, we observed both high sensitivity (97% for 10x; 86% for stLFR) and specificity (100% for both 10x and stLFR) in IRR extraction. All identified IRRs also originated from the haplotype that was simulated to contain the ATXN10 RE. The EH repeat length estimates obtained from applying these barcode-derived IRR counts were close to the ground truth (96% for 10x; 84% for stLFR). EH analysis with OTS and barcode-retrieved IRRs yielded higher IRR counts (Fig. 1b; left panel) and much better repeat length estimates (Fig. 1b; right panel) compared to EH analysis without OTS.

Next, we genotyped a 1.1 kb CCAT STR (chr20:38194564-38194636, GRCh38) in the NA12878 BLRS data (Fig. 1c). This 76 base pair (bp) long locus in the reference genome was selected based on our analysis of the Nanopore long-read data of the same sample with Straglr [16] and was independently confirmed to have a biallelic expansion (1050 and 1106 bp) from a high-quality haplotype-resolved assembly [17]. In both stLFR and TELL-Seq barcoded datasets, more IRRs were retrieved in comparison to EH analysis without OTS (Fig. 1c; left panel), (...truncated)


This is a preview of a remote PDF: https://www.nature.com/articles/s41598-022-13024-4.pdf
Article home page: https://www.nature.com/articles/s41598-022-13024-4

Chiu, Readman, Rajan-Babu, Indhu-Shree, Birol, Inanc, Friedman, Jan M. Linked-read sequencing for detecting short tandem repeat expansions, Scientific Reports, DOI: 10.1038/s41598-022-13024-4