A Review of Computational Methods for Clustering Genes with Similar Biological Functions

Aug 2019

Clustering techniques can group genes based on similarity in biological functions. However, a drawback of clustering techniques is that the optimal number of clusters cannot be identified beforehand. Several existing optimization techniques can address this issue. In addition, clustering validation can predict the likely number of clusters and hence increase the chances of identifying biologically informative genes. This paper reviews and provides examples of existing methods for clustering genes, optimizing the objective function, and validating clusters. Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based techniques, and we highlight the advantages and disadvantages of each category. To optimize the objective function, we introduce the swarm intelligence technique and compare the performance of other methods. Moreover, we discuss how internal and external validation criteria differ in measuring cluster quality. We also investigate the performance of several clustering techniques by applying them to a leukemia dataset. The results show that grid-based clustering techniques provide better classification accuracy, whereas partitioning clustering techniques are superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
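
The abstract's point that clustering validation can suggest the number of clusters can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the 2-D toy profiles, the helper names `kmeans` and `mean_silhouette`, the farthest-point initialisation, and the choice of the silhouette coefficient as the internal criterion are all hypothetical and are not taken from the paper's experiments.

```python
import math

# Hypothetical 2-D "expression profiles" (toy data, not from the paper):
# three visibly separated groups of three genes each.
PROFILES = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
    (5.0, 6.2), (5.1, 5.9), (4.8, 6.0),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]


def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def mean_silhouette(points, labels):
    """Average silhouette coefficient, an internal validation criterion."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Score each candidate number of clusters and keep the best-scoring one.
scores = {k: mean_silhouette(PROFILES, kmeans(PROFILES, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real expression data the candidate range, the distance measure, and the validation index would all need to be chosen for the dataset at hand; the sketch only shows the general pattern of scoring several values of k instead of fixing one in advance.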


Processes 2019, 7, 550; doi:10.3390/pr7090550; www.mdpi.com/journal/processes

Review

Hui Wen Nies 1, Zalmiyah Zakaria 1, Mohd Saberi Mohamad 2,*, Weng Howe Chan 1, Nazar Zaki 3, Richard O. Sinnott 4, Suhaimi Napis 5, Pablo Chamoso 6, Sigeru Omatu 7 and Juan Manuel Corchado 6

1 School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Johor, Malaysia
2 Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu 16100, Kelantan, Malaysia
3 Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirates University, Al Ain 15551, UAE
4 School of Computing and Information Systems, University of Melbourne, Parkville 3010, Victoria, Australia
5 Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
6 BISITE Research Group, Digital Innovation Hub, University of Salamanca, Edificio I+D+i, C/ Espejos s/n, 37007 Salamanca, Spain
7 Division of Data-Driven Smart Systems Design, Digital Monozukuri (Manufacturing) Education and Research Center, Hiroshima University, #210, 3-10-31 Kagamiyama, Higashi-Hiroshima 739-0046, Hiroshima Prefecture, Japan
* Corresponding author: Mohd Saberi Mohamad

Received: 8 July 2019; Accepted: 16 August 2019; Published: 21 August 2019

Keywords: gene clustering; swarm intelligence; biological functions detection; informative genes

1. Introduction

Analysis of gene expression levels is essential in studying and detecting gene functions. According to Chandra and Tripathi [1], genes that have similar expression levels are likely to be involved in similar biological functions. The authors showed that the clustering process was quite useful to identify co-expressed genes in a group of genes and, in addition, to detect unique genes in different groups. Therefore, clustering can be quite helpful to extract valuable knowledge from a large amount of biological data [2], which could lead to prevention, prognosis, and treatment in biomedical research.

Cai et al. [3] developed a random walk-based technique to cluster similar genes. The authors showed that the proposed method was useful in strengthening the interaction between genes by considering the types of interactions that exist in the same group of genes. Many previous random walk-based methods managed to extract local information from a large graph without knowledge of the whole graph data [4]. In a random walk-based method, a gene is important if it interacts with many other genes [5–8]. As illustrated in Figure 1, gene 1 has a higher degree than gene 2 (two outgoing links) compared to one outgoing link from gene 3 to gene 4. In this case, gene 1 is the most important gene among the four genes shown in the hypothetical gene network.

Figure 1. A hypothetical gene network to illustrate the importance of genes in a random walk.

Several previous studies have noted the importance of clustering to identify co-expressed genes in a cluster and inactive genes in another cluster [1,9]. Clustering can also discover the fundamental hidden structure of biomedical data, which can be used for diagnosis and treatments [9]. In addition, clustering is extremely vital for identifying cancer subtypes and detecting tumors.

Researchers typically focus on clustering by assuming the number of clusters beforehand, which can be seen in [10,11]. This assumption can leave clustering techniques unable to obtain an optimal number of centroids, which results in poor-quality clusters [11,12]. In previous studies, several proposed approaches (...truncated)


This is a preview of a remote PDF: https://www.mdpi.com/2227-9717/7/9/550/pdf
Article home page: https://doaj.org/article/2c241bb5030e48238ba40a9a36cfbe1e

Hui Wen Nies, Zalmiyah Zakaria, Mohd Saberi Mohamad, Weng Howe Chan, Nazar Zaki, Richard O. Sinnott, Suhaimi Napis, Pablo Chamoso, Sigeru Omatu, Juan Manuel Corchado. A Review of Computational Methods for Clustering Genes with Similar Biological Functions. Processes, 2019, Volume 7, Issue 9, Article 550, DOI: 10.3390/pr7090550