Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets (pdf)

Article PDF cannot be displayed. You can download it here:

https://academic.oup.com/gbe/article-pdf/10/10/2697/26035116/evy199.pdf

Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets

GBE Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets Fabrizio Mafessoni1,*, Rashmi B. Prasad2, Leif Groop2,3, Ola Hansson2, and Kay Prüfer1,* 1 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany 2 Department of Clinical Sciences, Diabetes and Endocrinology, Lund University Diabetes Center, Malmö, Sweden 3 *Corresponding authors: E-mails: ; . Accepted: September 8, 2018 Data deposition: This code for this project has been deposited at http://bioinf.eva.mpg.de/LDLD/, last accessed September 8, 2018. Abstract It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes data set, we find that coding regions are enriched for errors, where 1% of the higher frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a data set that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large data sets; our observation is thus not specific to the 1000 Genomes data set. Our results show that batch effects can be turned into a virtue by using the resulting variation in large scale data sets to detect systematic errors. Key words: sequencing errors, 1000 Genomes data set, Illumina, next-generation sequencing, exome sequencing. Introduction Next generation sequencing technologies allowed for the generation of data sets that include genetic data of a large number of individuals. To produce these data sets, sequencing data of different coverage, and from different platforms or different batches of sequencing chemistry may need to be combined. This can result in differences in the type and number of errors across samples (Wall et al. 2014; Wolpin et al. 2014; Schirmer et al. 2015; Torkamaneh et al. 2016; Kircher et al. 2011). Here, we introduce a method to identify individual genomes with a higher error rate in large data sets and to predict which variants are likely due to error. The method first tests pairs of variants that reside on different chromosomes for signals of linkage disequilibrium. Linkage between separate chromosomes is not expected by population genetics theory for a randomly mating population, unless strong epistatic interactions are present. However, such signal can occur if errors affect individual genomes differently, leading to cooccurring erroneous variants in the same individuals but on different chromosomes (fig. 1). This first step is computationally expensive and we therefore limited the computation of linkage to pairs of variants in a subset of the genome. In the second step, we compare the contribution of individual genomes to the total linkage signal to identify outlier individuals that carry more potentially erroneous variants. As a last step, we use the differences in the number of linked pairs between individuals to identify which variants are present primarily in those individuals that carry more predicted errors (fig. 1). This last step can be applied to all variants and not only those that have been tested for linkage, resulting in a list of predicted erroneous variants for the ß The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Genome Biol. Evol. 10(10):2697–2708. doi:10.1093/gbe/evy199 Advance Access publication September 10, 2018 2697 Finnish Institute for Molecular Medicine (FIMM), Helsinki University, Finland GBE Mafessoni et al. in Finland (Fuchsberger et al. 2016). We excluded all offspring and related individuals. Data from the Genome of the Netherlands were filtered and annotated analogously to the 1000 genomes. All analyses shown refer to variants with a 5% MAF cutoff. (a) Outline of Pipeline (b) Step 1: Linkage Disequilibrium FIG. 1.—Outline of the method. (a) Sequencing data generated from samples with different sequencing quality or processing might introduce different errors (black dots). Since these errors will be present in samples coming from the same platform, they will give a signal of linkage between different chromosomes (dashed lines). (b) The contribution to the linkage signal can be computed for each sample (dashed lines), and used to identify samples coming from the same batch and with similar error profiles, as well as the errors. See also supplementary figure 1, Supplementary Material online. When the phase is unknown, as for two physically unlinked loci A and B with possible alleles A-a and B-b, respectively, a composite genotypic linkage disequilibrium can be calculated, by relying on a maximum likelihood estimate of the amount of AB-gametes that are present in samples. Following Weir (Weir 1996), we can arrange the counts of the nine possible observed genotypes for the two loci in a matrix: complete data set. Removing these errors, we repeat the procedure starting from the second step, until no significant differences in the burden of predicted errors is observed between individuals. No knowledge of differences in sequencing technologies or other factors is required by this approach. A/A A/a a/a Materials and Methods Data Handling We downloaded the 1000 genomes phase 3 data set (version June 25, 2014). We used only one representative individual for each set of related individuals, using the 1000 genomes annotation. Only populations with at least 95 unrelated individuals were analyzed further, retaining 12 populations and 1,117 individuals (supplementary table 1, Supplementary Material online). Variants were classified according to frequency using bcftools (common variants: >5% frequency in at least one population, rare variants: 1% < frequency in at least one population, but 5% in all) (Li 2011). We performed all analyses on both common and rare variants, or only on common. Variants were annotated as coding when they fell within 200 bp of the coding exons of the UCSC known gene annotation (Rosenbloom et al. 2015), and as interg (...truncated)