Clustering Algorithms: Their Application to Gene Expression Data.
Jelili Oyelade1,2,*, Itunuoluwa Isewon1,2,*, Funke Oladipupo1, Olufemi Aromolaran1,
Efosa Uwoghiren1, Faridah Ameh1, Moses Achas3 and Ezekiel Adebiyi1,2
1Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria. 2Covenant University Bioinformatics
Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria. 3Department of Computer Science and Information Technology,
Bells University of Technology, Ota, Ogun State, Nigeria. *JO and II are joint first authors.
Abstract: Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in
relation to its environment. Deciphering the hidden patterns in gene expression data offers a great opportunity to strengthen the understanding of
functional genomics. The complexity of biological networks and the large number of genes increase the challenge of comprehending and interpreting
the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Clustering is therefore
an essential first step in the data mining process: it reveals natural structures and identifies interesting patterns in the underlying data. The clustering
of gene expression data has proven useful in revealing the natural structure inherent in the data, understanding gene functions, cellular processes,
and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression
data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable
to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and
a high degree of accuracy in its analysis procedure.
Keywords: clustering algorithm, homology, biological process, gene expression data, bioinformatics
Citation: Oyelade et al. Clustering Algorithms: Their Application to Gene Expression
Data. Bioinformatics and Biology Insights 2016:10 237–253 doi: 10.4137/BBI.S38316.
TYPE: Review
Received: May 18, 2016. ReSubmitted: September 05, 2016. Accepted for
publication: September 09, 2016.
Academic editor: J. T. Efird, Associate Editor
Peer Review: Seven peer reviewers contributed to the peer review report. Reviewers’
reports totaled 1359 words, excluding any confidential comments to the academic editor.
Funding: Authors disclose no external funding sources.
Competing Interests: Authors disclose no potential conflicts of interest.
Correspondence:
Introduction
Clustering, which is an unsupervised learning technique, has
been widely applied in diverse fields of study such as machine
learning, data mining, pattern recognition, image analysis,
and bioinformatics. However, Pirim et al.1 stated that no clustering
algorithm exists with the best performance for all clustering
problems. This fact makes it necessary to intelligently
apply algorithms specialized for the task at hand. Our quest
for useful information from noisy gene expression data, to gain
insight and generate new hypotheses, is a significant one. The first
step is to create clusters of gene expression data that are similar
in expression within a cluster and dissimilar to the data
in other clusters. Similarity is commonly measured
with a distance; two or more genes are objects of the same
cluster if they are close to one another under a given distance.
Though several clustering approaches are available, it remains
difficult to find a suitable clustering technique for a given
experimental dataset.
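The idea of measuring similarity with a distance can be made concrete with a minimal sketch. The two measures below (Euclidean distance and one minus the Pearson correlation) are illustrative choices assumed here, not measures prescribed by the text above, and the gene profiles are hypothetical values.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: close to 0 when two profiles rise and fall together."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b))  # large: magnitudes differ
print(pearson_distance(gene_a, gene_b))    # near 0: identical trend
print(pearson_distance(gene_a, gene_c))    # near 2: opposite trend
```

The example shows why the choice of distance matters: gene_a and gene_b are far apart in Euclidean terms but indistinguishable by correlation, so the two measures could place them in different clusters or in the same one.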
Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is
an open-access article distributed under the terms of the Creative Commons CC-BY-NC
3.0 License.
Paper subject to independent expert blind peer review. All editorial decisions made
by independent academic editor. Upon submission manuscript was subject to anti-plagiarism
scanning. Prior to publication all authors have given signed confirmation of
agreement to article publication and compliance with all applicable ethical and legal
requirements, including the accuracy of author and contributor information, disclosure of
competing interests and funding sources, compliance with ethical requirements relating
to human and animal study participants, and compliance with any copyright requirements
of third parties. This journal is a member of the Committee on Publication Ethics (COPE).
Provenance: the authors were invited to submit this paper.
Published by Libertas Academica.
Clustering can be accomplished based on genes, samples,
and/or the time variable, depending on the type of dataset.2 The
significance of clustering both genes and samples cannot be
ignored in gene expression data; genes form a cluster that
displays related expression across conditions, while samples
form a cluster that displays related expression across all genes.
In gene-based clustering, the genes are regarded as the objects,
while the samples are regarded as the features. In sample-based
clustering, the samples are regarded as the objects and the genes
as the features, so that the samples can be segregated into
homogeneous groups.3 The distinction between gene-based clustering and
sample-based clustering rests on the different characteristics
of the clustering tasks for gene expression data.4
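The two views can be sketched as two orientations of the same expression matrix. The matrix, gene names, and sample names below are hypothetical, and the helper `transpose` is introduced only for illustration.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

# Hypothetical expression matrix: rows are genes, columns are samples.
gene_names = ["gene1", "gene2", "gene3"]
sample_names = ["sampleA", "sampleB"]
expression = [
    [5.1, 0.2],
    [4.9, 0.3],
    [0.4, 6.0],
]

# Gene-based clustering: each gene is an object whose features are
# its expression values across the samples.
gene_objects = dict(zip(gene_names, expression))

# Sample-based clustering: each sample is an object whose features are
# its expression values across the genes.
sample_objects = dict(zip(sample_names, transpose(expression)))

print(gene_objects["gene1"])      # [5.1, 0.2]
print(sample_objects["sampleA"])  # [5.1, 4.9, 0.4]
```

Any clustering routine that accepts a list of feature vectors could then be applied to either orientation; only the choice of rows as objects changes.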
Clustering can be partial or complete; a partial clustering does not
allocate every gene to a cluster, while a complete clustering does.
Partial clustering tends to be more suitable for gene expression data
because such data often contain irrelevant genes or samples.
Partial clustering allows some genes not to belong to any well-defined
cluster, since genes in the expression data often represent noise;
leaving these genes unassigned reduces their impact on the outcome
and keeps a number of irrelevant contributions out of the result.
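The difference between complete and partial clustering can be illustrated with a minimal sketch. The thresholding rule used here (leave a gene unassigned when its nearest centroid is farther than `max_distance`) is an assumption chosen for illustration, not a method described in this review; the centroids, profiles, and the names `partial_assign` and `complete_assign` are hypothetical.

```python
import math

UNASSIGNED = -1  # label for genes left out of every cluster

def partial_assign(profiles, centroids, max_distance):
    """Assign each profile to its nearest centroid, or leave it unassigned
    when even the nearest centroid is farther away than max_distance."""
    labels = []
    for profile in profiles:
        distances = [math.dist(profile, centroid) for centroid in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        labels.append(nearest if distances[nearest] <= max_distance else UNASSIGNED)
    return labels

def complete_assign(profiles, centroids):
    """Complete clustering: every profile is forced into its nearest cluster."""
    return partial_assign(profiles, centroids, math.inf)

# Hypothetical cluster centres and gene profiles (two conditions each).
centroids = [[0.0, 0.0], [10.0, 10.0]]
profiles = [[0.5, 0.2], [9.6, 10.1], [5.0, -4.0]]  # the last one resembles noise

partial_labels = partial_assign(profiles, centroids, max_distance=2.0)
complete_labels = complete_assign(profiles, centroids)

print(partial_labels)   # [0, 1, -1]
print(complete_labels)  # [0, 1, 0]
```

Under these assumptions the noisy third profile is left unassigned by the partial scheme, whereas the complete scheme forces it into cluster 0, where it would distort that cluster.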
Partial clustering thus helps preserve an
interesting subgroup in a cluster by not forcing
membership of unrelated genes.5 Clustering can also be categorized
as hard or overlapping.5 Hard clustering assigns each
gene to a single cluster during its operation and in its output,
while overlapping clustering assigns each input gene degrees of
membership in several clusters. An overlapping clustering
can be transformed into a hard clustering by assigning each gene
to the cluster with the dominant degree of membership. This
review ai (...truncated)