Statistically controlled identification of differentially expressed genes in one-to-one cell line comparisons of the CMAP database for drug repositioning

Journal of Translational Medicine, Sep 2017

Background: The Connectivity Map (CMAP) database, an important public data source for drug repositioning, archives gene expression profiles from cancer cell lines treated with and without bioactive small molecules. However, there are only one or two technical replicates for each cell line under one treatment condition. For such small-scale data, current fold-change-based methods lack statistical control in identifying differentially expressed genes (DEGs) in treated cells. In particular, a one-to-one comparison may yield too many drug-irrelevant DEGs due to random experimental factors. To tackle this problem, CMAP adopts a pattern-matching strategy to build a “connection” between disease signatures and gene expression changes associated with drug treatments. However, many drug-irrelevant genes may blur the “connection” if all genes are used instead of pre-selected DEGs induced by drug treatments.

Methods: We applied OneComp, a customized version of RankComp, to identify DEGs in such small-scale cell line datasets. For a cell line, a list of gene pairs with stable relative expression orderings (REOs) was identified in a large collection of control cell samples measured in different experiments; these pairs formed the background stable REOs. When applying OneComp to a small-scale cell line dataset, the background stable REOs were customized by filtering out the gene pairs whose REOs were reversed in the control samples of the analyzed dataset.

Results: In simulated data, the consistency scores of the overlapping genes between DEGs identified by OneComp and SAM were all higher than 99%, while the consistency score of the DEGs identified solely by OneComp was 96.85% according to the observed expression difference method. The usefulness of OneComp in drug repositioning was exemplified by identifying phenformin- and metformin-related genes from small-scale cell line datasets, which helped to support both drugs as potential anti-tumor drugs for non-small-cell lung carcinoma, whereas the pattern-matching strategy adopted by CMAP missed the two connections. The implementation of OneComp is available at https://github.com/pathint/reoa.

Conclusions: OneComp performed well on both the simulated and real data. It is useful in drug repositioning studies by helping to find hidden “connections” between drugs and diseases.
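The two background steps described in the abstract (collecting stable REOs from a large set of control samples, then dropping pairs that are reversed in the controls of the dataset being analyzed) can be illustrated with a minimal Python sketch. This is not the reoa implementation: the function names, the `min_frac` stability threshold, the rule that a single reversed control sample removes a pair, and the handling of ties are all assumptions made for illustration, and the subsequent DEG-calling step of OneComp is not shown.

```python
from itertools import permutations


def stable_reos(expr, min_frac=1.0):
    """Return ordered gene-index pairs (i, j) where gene i is expressed above
    gene j in at least `min_frac` of the samples.

    `expr` is a list of per-gene rows, one value per control sample.
    `min_frac` is a hypothetical threshold; the abstract does not state one.
    """
    if not 0.5 < min_frac <= 1.0:
        raise ValueError("min_frac must be in (0.5, 1.0]")
    n_samples = len(expr[0])
    stable = set()
    # Exhaustive pairwise scan: fine for a toy example, quadratic in gene count.
    for i, j in permutations(range(len(expr)), 2):
        n_greater = sum(a > b for a, b in zip(expr[i], expr[j]))
        if n_greater / n_samples >= min_frac:
            stable.add((i, j))
    return stable


def customize_background(background, dataset_controls):
    """Keep only background pairs that are not reversed in any control sample
    of the analyzed dataset (assumption: one reversal is enough to drop a pair;
    ties are kept)."""
    return {
        (i, j)
        for i, j in background
        if all(a >= b for a, b in zip(dataset_controls[i], dataset_controls[j]))
    }


# Toy data: 3 genes measured in 3 accumulated control samples.
accumulated_controls = [
    [5.0, 6.0, 7.0],  # gene 0
    [3.0, 2.0, 4.0],  # gene 1
    [1.0, 1.5, 0.5],  # gene 2
]
background = stable_reos(accumulated_controls)

# One control sample from the analyzed dataset; gene 1 < gene 2 here.
dataset_controls = [[5.5], [1.0], [2.0]]
customized = customize_background(background, dataset_controls)
print(sorted(customized))
```

In this toy example the pair (1, 2) is stable in the accumulated controls but reversed in the dataset's own control sample, so it is excluded from the customized background.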



He et al. J Transl Med

Jun He (0), Haidan Yan (0), Hao Cai (0), Xiangyu Li (0), Qingzhou Guan (0), Weicheng Zheng (0), Rou Chen (0), Huaping Liu (0), Kai Song (2), Zheng Guo (0, 1), Xianlong Wang (0, 1)

0 Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, Fujian Medical University, Fuzhou 350122, China
1 Fujian Key Laboratory of Tumor Microbiology, Fujian Medical University, Fuzhou 350122, China
2 Department of Systems Biology, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China

Keywords: The Connectivity Map; Differentially expressed genes; Drug repositioning; Phenformin; Metformin

Background

There are many strategies for drug repositioning based on different data, such as chemical structural similarities, genetic variation (SNP-disease correlations, SNP-drug response) and gene expression profiling [1]. The strategy based on gene expression profiling has the advantage that it does not require a great amount of a priori knowledge of the diseases or drugs [2–4]. An ideal database for drug repositioning studies should include gene expression profiles of many cell lines, representing a diverse range of diseases, before and after treatment with thousands of drugs or candidate drugs. Creating such a data source is therefore a huge and very costly project. As far as we know, there are currently only two such large databases [2].
One is the LINCS database, which, however, has profiled only 978 genes for 25,581 drugs. The other is the CMAP database, which has profiled more than 12,000 genes, mainly on three cancer cell lines (MCF7, PC3 and HL60) treated with 1309 bioactive small molecules at various concentrations [5]. As an important public data source, the Connectivity Map (CMAP) [3, 6] has been widely applied to study drug repositioning [2, 7, 8] and drug action mechanisms [9, 10]. According to the PubMed Central database, it has been cited over 940 times in the past 10 years. However, in the CMAP project, usually only one drug-treated sample was measured for each drug concentration against several control samples. For such small cell line datasets, traditional statistical methods such as the significance analysis of microarrays (SAM) [11, 12] and Student’s t test [13] lack power in identifying differentially expressed genes (DEGs) in the treated cells, while the commonly used fold-change (FC) method with an arbitrary cut-off value [ 14–1 (...truncated)



Jun He, Haidan Yan, Hao Cai, Xiangyu Li, Qingzhou Guan, Weicheng Zheng, Rou Chen, Huaping Liu, Kai Song, Zheng Guo, Xianlong Wang. Statistically controlled identification of differentially expressed genes in one-to-one cell line comparisons of the CMAP database for drug repositioning, Journal of Translational Medicine, 2017, pp. 198, DOI: 10.1186/s12967-017-1302-9