Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets
GBE
Turning Vice into Virtue: Using Batch-Effects to Detect Errors
in Large Genomic Data Sets
Fabrizio Mafessoni1,*, Rashmi B. Prasad2, Leif Groop2,3, Ola Hansson2, and Kay Prüfer1,*
1
Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
2
Department of Clinical Sciences, Diabetes and Endocrinology, Lund University Diabetes Center, Malmö, Sweden
3
*Corresponding authors: E-mails: ; .
Accepted: September 8, 2018
Data deposition: This code for this project has been deposited at http://bioinf.eva.mpg.de/LDLD/, last accessed September 8, 2018.
Abstract
It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets
with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear
as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between
individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The
abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects.
Applying our method to the 1000 Genomes data set, we find that coding regions are enriched for errors, where 1% of the
higher frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer
(<0.001%). As expected, predicted errors are found less often than other variants in a data set that was generated with
a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000
Genomes errors are also found in other large data sets; our observation is thus not specific to the 1000 Genomes data set. Our
results show that batch effects can be turned into a virtue by using the resulting variation in large scale data sets to detect
systematic errors.
Key words: sequencing errors, 1000 Genomes data set, Illumina, next-generation sequencing, exome sequencing.
Introduction
Next generation sequencing technologies allowed for the
generation of data sets that include genetic data of a large
number of individuals. To produce these data sets, sequencing data of different coverage, and from different platforms
or different batches of sequencing chemistry may need to be
combined. This can result in differences in the type and number of errors across samples (Wall et al. 2014; Wolpin et al.
2014; Schirmer et al. 2015; Torkamaneh et al. 2016; Kircher
et al. 2011).
Here, we introduce a method to identify individual
genomes with a higher error rate in large data sets and to
predict which variants are likely due to error. The method first
tests pairs of variants that reside on different chromosomes
for signals of linkage disequilibrium. Linkage between separate chromosomes is not expected by population genetics
theory for a randomly mating population, unless strong epistatic interactions are present. However, such signal can occur
if errors affect individual genomes differently, leading to cooccurring erroneous variants in the same individuals but on
different chromosomes (fig. 1). This first step is computationally expensive and we therefore limited the computation of
linkage to pairs of variants in a subset of the genome. In the
second step, we compare the contribution of individual
genomes to the total linkage signal to identify outlier individuals that carry more potentially erroneous variants. As a
last step, we use the differences in the number of linked
pairs between individuals to identify which variants are present primarily in those individuals that carry more predicted
errors (fig. 1). This last step can be applied to all variants
and not only those that have been tested for linkage,
resulting in a list of predicted erroneous variants for the
ß The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
Genome Biol. Evol. 10(10):2697–2708. doi:10.1093/gbe/evy199 Advance Access publication September 10, 2018
2697
Finnish Institute for Molecular Medicine (FIMM), Helsinki University, Finland
GBE
Mafessoni et al.
in Finland (Fuchsberger et al. 2016). We excluded all offspring
and related individuals.
Data from the Genome of the Netherlands were filtered
and annotated analogously to the 1000 genomes. All analyses shown refer to variants with a 5% MAF cutoff.
(a)
Outline of Pipeline
(b)
Step 1: Linkage Disequilibrium
FIG. 1.—Outline of the method. (a) Sequencing data generated from
samples with different sequencing quality or processing might introduce
different errors (black dots). Since these errors will be present in samples
coming from the same platform, they will give a signal of linkage between
different chromosomes (dashed lines). (b) The contribution to the linkage
signal can be computed for each sample (dashed lines), and used to identify samples coming from the same batch and with similar error profiles, as
well as the errors. See also supplementary figure 1, Supplementary
Material online.
When the phase is unknown, as for two physically unlinked
loci A and B with possible alleles A-a and B-b, respectively, a
composite genotypic linkage disequilibrium can be calculated,
by relying on a maximum likelihood estimate of the amount
of AB-gametes that are present in samples. Following Weir
(Weir 1996), we can arrange the counts of the nine possible
observed genotypes for the two loci in a matrix:
complete data set. Removing these errors, we repeat the
procedure starting from the second step, until no significant differences in the burden of predicted errors is observed between individuals. No knowledge of differences
in sequencing technologies or other factors is required by
this approach.
A/A
A/a
a/a
Materials and Methods
Data Handling
We downloaded the 1000 genomes phase 3 data set (version
June 25, 2014). We used only one representative individual
for each set of related individuals, using the 1000 genomes
annotation. Only populations with at least 95 unrelated individuals were analyzed further, retaining 12 populations and
1,117 individuals (supplementary table 1, Supplementary
Material online). Variants were classified according to frequency using bcftools (common variants: >5% frequency in
at least one population, rare variants: 1% < frequency in at
least one population, but 5% in all) (Li 2011). We performed all analyses on both common and rare variants, or
only on common. Variants were annotated as coding when
they fell within 200 bp of the coding exons of the UCSC
known gene annotation (Rosenbloom et al. 2015), and as
interg (...truncated)