A Review of Computational Methods for Clustering Genes with Similar Biological Functions
processes
Review
Hui Wen Nies 1 , Zalmiyah Zakaria 1 , Mohd Saberi Mohamad 2, *, Weng Howe Chan 1 ,
Nazar Zaki 3 , Richard O. Sinnott 4 , Suhaimi Napis 5 , Pablo Chamoso 6 , Sigeru Omatu 7 and
Juan Manuel Corchado 6
1 School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Johor, Malaysia
2 Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan,
  Kota Bharu 16100, Kelantan, Malaysia
3 Department of Computer Science and Software Engineering, College of Information Technology,
  United Arab Emirate University, Al Ain 15551, UAE
4 School of Computing and Information Systems, University of Melbourne, Parkville 3010, Victoria, Australia
5 Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia,
  Serdang 43400, Selangor, Malaysia
6 BISITE Research Group, Digital Innovation Hub, University of Salamanca, Edificio I+D+i, C/ Espejos s/n,
  37007 Salamanca, Spain
7 Division of Data-Driven Smart Systems Design, Digital Monozukuri (Manufacturing) Education and
  Research Center, Hiroshima University, #210, 3-10-31 Kagamiyama, Higashi-Hiroshima 739-0046,
  Hiroshima Prefecture, Japan
* Correspondence:
Received: 8 July 2019; Accepted: 16 August 2019; Published: 21 August 2019
Abstract: Clustering techniques can group genes based on similarity in biological functions. However,
the drawback of using clustering techniques is the inability to identify an optimal number of
potential clusters beforehand. Several existing optimization techniques can address this issue. In addition,
clustering validation can predict the possible number of potential clusters and hence increase the
chances of identifying biologically informative genes. This paper reviews and provides examples of
existing methods for clustering genes, optimization of the objective function, and clustering validation.
Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based
techniques. We also highlight the advantages and the disadvantages of each category. To optimize the
objective function, we introduce the swarm intelligence technique and compare the performance
of other methods. Moreover, we discuss how internal and external criteria differ in the way they
measure cluster quality. We also investigate the performance of several clustering
techniques by applying them to a leukemia dataset. The results show that grid-based clustering
techniques provide better classification accuracy; however, partitioning clustering techniques are
superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining
clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
Keywords: gene clustering; swarm intelligence; biological functions detection; informative genes
1. Introduction
Analysis of gene expression levels is essential in studying and detecting gene functions. According
to Chandra and Tripathi [1], genes that have similar gene expression levels are likely to be involved in
similar biological functions. The authors showed that the clustering process was quite useful to identify
co-expressed genes in a group of genes and, in addition, to detect unique genes in different groups.
Processes 2019, 7, 550; doi:10.3390/pr7090550
www.mdpi.com/journal/processes
Therefore, clustering can be quite helpful to extract valuable knowledge from a large amount of
biological data [2], which could lead to prevention, prognosis, and treatment in biomedical research.
Cai et al. [3] developed a random walk-based technique to cluster similar genes. The authors showed
that the proposed method was useful in strengthening the interaction between genes by considering
the types of interactions that exist in the same group of genes. Many previous random walk-based
methods managed to extract local information from a large graph without knowledge of the whole
graph data [4]. In a random walk-based method, a gene is important if it interacts with many other
genes [5–8]. As illustrated in Figure 1, gene 1 has a higher degree than gene 2 (two outgoing links)
compared to one outgoing link from gene 3 to gene 4. In this case, gene 1 is the most important gene
among the four genes shown in the hypothetical gene network.
Figure 1. A hypothetical gene network to illustrate the importance of genes in a random walk.
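The degree-based notion of importance described above can be sketched in a few lines of Python. This is only an illustration, not the method of any cited study: the edge list below is an assumed reading of the hypothetical network (gene 1 with two outgoing links, gene 3 with one link to gene 4), and the names EDGES, out_degrees, and most_important are hypothetical helpers introduced here.

```python
from collections import defaultdict

# Assumed edges for illustration only; the exact links of Figure 1
# are not reproduced here.
EDGES = [("gene1", "gene2"), ("gene1", "gene3"), ("gene3", "gene4")]


def out_degrees(edges):
    """Count outgoing links per gene; genes without outgoing links get 0."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 0  # ensures every gene appears in the result
    return dict(degree)


def most_important(edges):
    """Return the gene with the most outgoing links.

    Ties are broken by taking the first gene in sorted order.
    """
    degree = out_degrees(edges)
    return max(sorted(degree), key=lambda gene: degree[gene])


if __name__ == "__main__":
    print(out_degrees(EDGES))
    print(most_important(EDGES))
```

A full random walk-based method would go beyond a raw link count, for example by weighting a gene by how often a walker visits it, but the count is enough to show why gene 1 ranks highest in this small network.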
Several previous studies have noted the importance of clustering to identify co-expressed genes
in a cluster and inactive genes in another cluster [1,9]. Clustering can also discover the fundamental
hidden structure of biomedical data, which can be used for diagnosis and treatments [9]. In addition,
clustering is extremely vital for identifying cancer subtypes and detecting tumors.
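As a minimal sketch of how clustering separates co-expressed genes from inactive ones, the example below runs a plain k-means on made-up expression profiles. Everything in it is an assumption for illustration: the gene names, the expression values, the deterministic initialization, and the choice of k = 2, which is fixed in advance rather than estimated from the data.

```python
import math

# Hypothetical expression profiles: gene -> expression level per sample.
PROFILES = {
    "geneA": [8.1, 7.9, 8.3],
    "geneB": [7.8, 8.2, 8.0],
    "geneC": [0.2, 0.1, 0.3],
    "geneD": [0.1, 0.3, 0.2],
}


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def k_means(profiles, k, iterations=20):
    """Assign each gene to one of k clusters by Euclidean distance."""
    names = sorted(profiles)
    # Deterministic initialization (an assumption of this sketch):
    # the first k genes in sorted order serve as starting centroids.
    centroids = [list(profiles[name]) for name in names[:k]]
    labels = {}
    for _ in range(iterations):
        # Assignment step: nearest centroid for every gene.
        new_labels = {
            name: min(range(k), key=lambda c: math.dist(profiles[name], centroids[c]))
            for name in names
        }
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [profiles[n] for n in names if labels[n] == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


if __name__ == "__main__":
    print(k_means(PROFILES, k=2))
```

On this toy input the two highly expressed genes end up in one cluster and the two near-silent genes in the other. Real expression data would need normalization and a less naive initialization.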
Researchers typically focus on clustering by assuming the number of clusters beforehand, which
can be seen in [10,11]. This problem can lead to the inability of the clustering techniques to obtain an
optimal number of centroids and hence result in poor-quality clusters [11,12]. In previous studies,
several proposed approaches
appr (...truncated)