Clustering Algorithms: Their Application to Gene Expression Data

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5135122/pdf/

Jelili Oyelade1,2,*, Itunuoluwa Isewon1,2,*, Funke Oladipupo1, Olufemi Aromolaran1, Efosa Uwoghiren1, Faridah Ameh1, Moses Achas3 and Ezekiel Adebiyi1,2

1 Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.
2 Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria.
3 Department of Computer Science and Information Technology, Bells University of Technology, Ota, Ogun State, Nigeria.
* JO and II are joint first authors.

Abstract: Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a tremendous opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved increase the challenge of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. The use of clustering techniques is therefore a first step toward addressing these challenges, and an essential one in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. The clustering of gene expression data has proven useful in revealing the natural structure inherent in the data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design.
This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will ensure stability and a high degree of accuracy in the analysis procedure.

Keywords: clustering algorithm, homology, biological process, gene expression data, bioinformatics
Citation: Oyelade et al. Clustering Algorithms: Their Application to Gene Expression Data. Bioinformatics and Biology Insights 2016:10 237–253 doi: 10.4137/BBI.S38316.
TYPE: Review
Received: May 18, 2016. Resubmitted: September 05, 2016. Accepted for publication: September 09, 2016.
Academic editor: J. T. Efird, Associate Editor
Peer Review: Seven peer reviewers contributed to the peer review report. Reviewers' reports totaled 1359 words, excluding any confidential comments to the academic editor.
Funding: Authors disclose no external funding sources.
Competing Interests: Authors disclose no potential conflicts of interest.
Correspondence:

Introduction

Clustering, which is an unsupervised learning technique, has been widely applied in diverse fields of study such as machine learning, data mining, pattern recognition, image analysis, and bioinformatics. However, Pirim et al.1 stated that no clustering algorithm exists with the best performance for all clustering problems. This fact makes it necessary to intelligently apply algorithms specialized for the task at hand. Our quest for useful information from noisy gene expression data to gain insight and create new hypotheses is not insignificant. The first step is creating clusters of genes whose expression is similar within a cluster and dissimilar to that of genes in other clusters. Similarity in data is commonly measured with a distance; two or more genes belong to the same cluster if they are close to one another under a given distance.
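As a minimal sketch of what "closely related based on a given distance" can mean in practice, the example below compares expression profiles with two distances. The text does not name a specific measure at this point, so Euclidean distance and a correlation-based distance are assumed here purely for illustration, and the gene profiles are made-up values.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Hypothetical expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, twice the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b))
print(pearson_distance(gene_a, gene_b))
print(pearson_distance(gene_a, gene_c))
```

The two measures can disagree: gene_a and gene_b are far apart in Euclidean terms but have identical shape under the correlation-based distance, so the choice of distance changes which genes end up grouped together.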
Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License.
Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to antiplagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).
Provenance: the authors were invited to submit this paper. Published by Libertas Academica.

Though several clustering approaches are available, it remains difficult to find a suitable clustering technique for a given experimental dataset. Clustering can be performed on genes, samples, and/or a time variable, depending on the type of dataset.2 The significance of clustering both genes and samples cannot be ignored in gene expression data; genes form a cluster that displays related expression across conditions, while samples form a cluster that displays related expression across all genes. In gene-based clustering, the genes are regarded as the objects, while the samples are regarded as the features.
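The difference between clustering genes and clustering samples comes down to which axis of the expression matrix supplies the objects. The sketch below assumes a small hypothetical matrix with genes as rows and samples as columns; the names `expression`, `gene_view`, and `sample_view` are illustrative and do not come from the article.

```python
# Hypothetical expression matrix: rows are genes, columns are samples.
expression = {
    "gene1": [5.0, 5.2, 1.0, 1.1],
    "gene2": [4.8, 5.1, 0.9, 1.2],
    "gene3": [1.0, 0.9, 6.0, 6.2],
}
sample_names = ["s1", "s2", "s3", "s4"]


def gene_view(matrix):
    """Gene-based view: each gene is an object, the samples are its features."""
    return {gene: list(values) for gene, values in matrix.items()}


def sample_view(matrix, samples):
    """Sample-based view: each sample is an object, the genes are its features."""
    columns = zip(*matrix.values())  # transpose rows into columns
    return {name: list(column) for name, column in zip(samples, columns)}


genes_as_objects = gene_view(expression)
samples_as_objects = sample_view(expression, sample_names)

print(genes_as_objects["gene1"])  # one value per sample
print(samples_as_objects["s1"])   # one value per gene
```

Any clustering routine that accepts a collection of feature vectors can then be run on either view without modification; only the orientation of the input changes.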
In sample-based clustering, the samples can be segregated into homogeneous groups, with the genes treated as features and the samples as objects.3 The distinction between gene-based clustering and sample-based clustering rests on the different characteristics of the clustering tasks for gene expression data.4

Clustering can be partial or complete; a partial clustering does not allocate every gene to a cluster, while a complete clustering does. Partial clustering tends to be more suitable for gene expression data because such data often include irrelevant genes or samples. It allows some genes not to belong to any well-defined cluster, since genes in the expression data frequently represent noise; leaving them unassigned reduces their impact on the outcome and discards a number of irrelevant contributions. Partial clustering thus helps preserve an interesting subgroup within a cluster by not forcing membership of unrelated genes.5

Clustering can be categorized as hard or overlapping.5 Hard clustering assigns each gene to a single cluster, both during its operation and in its output, while overlapping clustering assigns each input gene degrees of membership in several clusters. An overlapping clustering can be transformed into a hard clustering by assigning each gene to the cluster in which it has the dominant degree of membership. This review ai (...truncated)