The Doppelgänger Effect: Hidden Duplicates in Databases of Transcriptome Profiles

JNCI: Journal of the National Cancer Institute, Nov 2016

Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called “doppelgänger” effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelgänger-checking should be a part of standard procedure for combining multiple genomic datasets.


Levi Waldron, Markus Riester, Marcel Ramos, Giovanni Parmigiani, Michael Birrer

Affiliations of authors: City University of New York School of Public Health, New York, NY (LW, MRa); Novartis Institutes for BioMedical Research, Cambridge, MA (MRi); Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute/Harvard Medical School, Boston, MA (GP); Center for Cancer Research, Massachusetts General Hospital, Boston, MA (MB).

© The Author 2016. Published by Oxford University Press. All rights reserved.

Sufficient germ-line sequence markers provide a “fingerprint” that can be matched uniquely in a database of genotypes (1). Publicly available human genomic data is therefore normally summarized at a level that cannot be identified uniquely to protect patient privacy. Cancer transcriptomes undergo alterations that are highly distinctive but much more difficult to identify uniquely in summarized form. Re-use of tissue specimens is widespread in clinical genomic studies, creating a “doppelgänger effect” in publicly available datasets: hidden duplicates that, if left undetected, can inflate statistical significance or apparent accuracy of genomic models when combining data from different studies (Figure 1A).

The proposed method relies on exhaustive comparisons of dataset pairs and sample pairs to empirically estimate the distribution of pairwise transcriptome correlations between biological replicates within a dataset or between two datasets where potentially different profiling technologies were used. The key aspects to identifying duplicates in a pair of datasets are 1) using transcript identifiers available in both datasets, 2) batch correction (2), 3) calculating Pearson’s Correlation Coefficient (PCC) between every sample in one dataset against every sample in the other dataset, and 4) duplicate-oriented outlier detection. The background distribution of pairwise PCC values varies depending on the tissue assayed and the technologies used, and must be estimated for every dataset pair. Doppelgängers can be identified as outliers at the high end of the distribution of batch-corrected correlations. The detailed methodology of package development and validation can be found in the Supplementary Material (available online).
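The four steps listed above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the doppelgangR implementation: the function names (`center_genes`, `pearson`, `find_doppelgangers`) and the `n_mads` parameter are hypothetical, per-gene mean-centering within each dataset stands in for the batch correction cited in the text, and the Fisher z-transform with a median-plus-MAD cutoff is an assumed outlier rule. Each dataset is assumed to be a dictionary mapping sample identifiers to dictionaries of transcript identifier and expression value.

```python
import math
from statistics import mean, median


def center_genes(dataset, genes):
    """Subtract each gene's within-dataset mean.

    Assumption: a crude stand-in for batch correction, not the
    method cited in the text.
    """
    gene_means = {g: mean(profile[g] for profile in dataset.values()) for g in genes}
    return {
        sample_id: [profile[g] - gene_means[g] for g in genes]
        for sample_id, profile in dataset.items()
    }


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def find_doppelgangers(ds_a, ds_b, n_mads=6.0):
    """Flag sample pairs whose correlation is an upper outlier.

    Returns (flagged_pairs, pcc), where pcc maps (sample_a, sample_b)
    to the PCC computed on centered, shared transcripts.
    """
    # 1) Keep only transcript identifiers present in every sample of both datasets.
    genes_a = set.intersection(*(set(p) for p in ds_a.values()))
    genes_b = set.intersection(*(set(p) for p in ds_b.values()))
    shared = sorted(genes_a & genes_b)

    # 2) Remove dataset-specific offsets (hypothetical batch-correction stand-in).
    a = center_genes(ds_a, shared)
    b = center_genes(ds_b, shared)

    # 3) PCC between every sample in one dataset and every sample in the other.
    pcc = {(i, j): pearson(a[i], b[j]) for i in a for j in b}

    # 4) Outlier detection on the Fisher z scale, with the background
    #    estimated from this dataset pair alone (assumed rule: median + n_mads * MAD).
    z = {pair: math.atanh(max(-0.999999, min(0.999999, r))) for pair, r in pcc.items()}
    centre = median(z.values())
    spread = 1.4826 * median(abs(v - centre) for v in z.values())
    cutoff = centre + n_mads * spread
    flagged = sorted(pair for pair, v in z.items() if v > cutoff)
    return flagged, pcc
```

Because the cutoff is derived from the correlations of the dataset pair being compared, the background is re-estimated for every pair, as the text requires. The sketch does not guard against constant profiles (zero variance) or a zero MAD, which a real screening tool would need to handle.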
We studied databases of ovarian, breast, bladder, and colorectal cancers and of cell lines and assessed their accuracy against a “gold standard” of duplicated samples generated through further manual inspection of expression data, clinical annotations, and sample identifiers (Supplementary Table 1, available online). Confirmed doppelgängers were identified in more than half of all studies (Table 1). For example, among the 1467 breast cancer gene expression profiles, doppelgangR identifies 59 samples present in both the Sotiriou et al. (3) and Miller et al. (4) studies (Figure 1B; additional samples are duplicated by the TRANSBIG dataset, see Table 1). Although these studies were published by Belgian and Singaporean groups, respectively, careful reading of the papers reveals that their datasets shared a cohort of samples originating from Uppsala County, Sweden. Such international collaborations are beneficial to the cancer research community, but pose challenges to investigators developing independent validations and meta-analyses. In the ovarian cancer database, which we have inspected in great detail (5), we identified 17% of records as non (...truncated)


This is a preview of a remote PDF: https://academic.oup.com/jnci/article-pdf/108/11/djw146/17314782/djw146.pdf

Waldron, Levi, Riester, Markus, Ramos, Marcel, Parmigiani, Giovanni, Birrer, Michael. The Doppelgänger Effect: Hidden Duplicates in Databases of Transcriptome Profiles, JNCI: Journal of the National Cancer Institute, 2016, Volume 108, Issue 11, DOI: 10.1093/jnci/djw146