A highly efficient multi-core algorithm for clustering extremely large datasets
BMC Bioinformatics
Johann M Kraus 0 1
Hans A Kestler 0 1
0 Institute of Neural Information Processing, University of Ulm , 89069 Ulm , Germany
1 Department of Internal Medicine I, University Hospital Ulm , 89081 Ulm , Germany
Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols that connect, and therefore require, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
Background
The advent of high-throughput methods in the life sciences
has increased the need for computer-intensive
applications to analyze large data sets in the laboratory.
Currently, the field of bioinformatics is confronted with
data sets containing thousands of samples and up to
millions of features, e.g. gene expression arrays and
genome-wide association studies using single nucleotide
polymorphism (SNP) chips. To explore these data sets
that are too large for manual analysis, machine learning
methods are employed [1]. Among them, cluster
algorithms partition objects into different groups that have
similar characteristics. These methods have already
become a valuable tool to detect associations between
combinations of SNP markers and diseases and for the
selection of tag SNPs [2,3]. In this and other applications, the size of
the generated data sets has grown to as many as 1,000,000
markers per chip. The demand for performing these
computer-intensive applications is likely to increase
even further for two reasons: First, with the popularity
of next-generation sequencing methods rising, the
number of measurements per sample will soar. Second, the
need to assist researchers in answering questions such
as "How many groups are in my data?" or "How robust
is the identified clustering?" will increase. Cluster
number estimation techniques address these types of
questions by repeated use of a cluster algorithm with slightly
different initializations or data sets, ultimately
performing a sensitivity analysis.
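The repeated-run idea can be sketched in a few lines of Java. The following is only an illustration under stated assumptions, not the procedure used in this paper: it runs a plain one-dimensional k-means several times with different random initializations and reports the mean pairwise Rand index of the resulting partitions as a simple stability score. The class name `RepeatedKMeans`, the method names, and the choice of the Rand index are all hypothetical.

```java
import java.util.Random;

// Hypothetical illustration; names and the stability score are assumptions.
final class RepeatedKMeans {

    // Plain 1-D k-means (Lloyd's algorithm). The initial centroids are k
    // distinct data points chosen with the given seed (requires k <= data.length).
    static int[] kmeans(double[] data, int k, long seed, int maxIter) {
        Random rnd = new Random(seed);
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        double[] centroids = new double[k];
        for (int c = 0; c < k; c++) {
            // Partial Fisher-Yates shuffle: pick one unused index per centroid.
            int j = c + rnd.nextInt(idx.length - c);
            int tmp = idx[c]; idx[c] = idx[j]; idx[j] = tmp;
            centroids[c] = data[idx[c]];
        }
        int[] labels = new int[data.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid for every data point.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged
            // Update step: each centroid becomes the mean of its members.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[labels[i]] += data[i];
                count[labels[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return labels;
    }

    // Rand index: fraction of point pairs on which two partitions agree
    // (same cluster in both, or different clusters in both).
    static double randIndex(int[] a, int[] b) {
        int agree = 0, pairs = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                if ((a[i] == a[j]) == (b[i] == b[j])) agree++;
                pairs++;
            }
        }
        return (double) agree / pairs;
    }

    // Mean pairwise Rand index over 'runs' differently seeded runs (runs >= 2).
    static double stability(double[] data, int k, int runs) {
        int[][] results = new int[runs][];
        for (int r = 0; r < runs; r++) results[r] = kmeans(data, k, r, 100);
        double total = 0.0;
        int n = 0;
        for (int r = 0; r < runs; r++) {
            for (int s = r + 1; s < runs; s++) {
                total += randIndex(results[r], results[s]);
                n++;
            }
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 2, 100, 101, 102};
        for (int k = 2; k <= 4; k++) {
            System.out.println("k=" + k + " stability=" + stability(data, k, 10));
        }
    }
}
```

A score close to 1 for a given k indicates that the partition hardly depends on the initialization. Because every run is independent of the others, the cost of such an analysis grows linearly with the number of runs, which is what makes a fast underlying cluster algorithm valuable.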
In the past, computing speeds doubled approximately
every 2 years via increasing clock speeds, giving software
a free ride to better performance [4]. This is now over,
and such automatic performance improvements are no
longer possible. As clock speeds are stalling, the increase
in computational power is now due to the rapid increase
of the number of cores per processor. This makes
parallel computation a necessity for the time-consuming
analyses in the laboratory. Generally, two parallelization
schemes are available. The first is based on a network of
computers or computing nodes. The idea of such a
master-slave parallelization is to parallelize independent
tasks using a network of one master and several slave
computers. Because the slaves cannot communicate with
one another, this approach best fits scenarios
where the same serial algorithm is started several times
on different relatively small data sets or different
analyses are calculated in parallel on the same data set.
Data set size matters here, as distribution of large data
sets is time consuming and requires all computers to
have the appropriate memory configuration. The second
approach, called shared memory parallelization, is used
to parallelize the implementation of an algorithm itself.
This is an intrinsic parallelization via different
interwoven sub-processes (threads) on a single multi-core
computer accessing a common memory, and requires a
redesign of the original serial algorithm.
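As a minimal sketch of what shared memory parallelization means in practice, the Java fragment below parallelizes the assignment step of a k-means-style algorithm. It is an assumption-laden illustration and not the transactional-memory design presented in this paper: it uses standard Java parallel streams, and the class name `SharedMemoryAssignment` and its methods are hypothetical. All threads read the same `points` and `centroids` arrays, and each thread writes only to its own slot of `labels`, so no data has to be copied or sent over a network.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration; names are not taken from the paper.
final class SharedMemoryAssignment {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assigns every point to its nearest centroid. The worker threads share
    // 'points' and 'centroids' (read-only) and write disjoint slots of
    // 'labels', so no locking is required.
    static int[] assign(double[][] points, double[][] centroids) {
        int[] labels = new int[points.length];
        IntStream.range(0, points.length).parallel().forEach(i -> {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double d = squaredDistance(points[i], centroids[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            labels[i] = best;
        });
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {9, 9}, {10, 10}};
        double[][] centroids = {{0.5, 0}, {9.5, 9.5}};
        System.out.println(Arrays.toString(assign(points, centroids)));
    }
}
```

The assignment step is easy to parallelize because the iterations are independent. Steps in which several threads update the same data, such as recomputing a centroid, need additional coordination, which is why the serial algorithm has to be redesigned.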
Master-slave parallelization
Master-slave parallelization is heavily used by computer
clusters or supercomputers. The Message Passing
Interface (MPI) [5] protocol is the dominant model in
high-performance computing. Without shared memory, the
compute nodes are restricted to process independent
tasks. As long as the load-balancing of the compute
nodes is well handled, the parallelization of a complex
simulation scales linearly with the number of compute
nodes. Beyond massively parallel simulation runs of
complex algorithms, master-slave parallelization is also
used to parallelize a single algorithm. For this task, a large
dataset is usually first split into smaller pieces. The
subsets are then distributed through a computer network
and each compute node solves a subtask for its subset.
Finally, all results are transferred back to the master
computer, which combines them to a global result. The
user interacts with the hardware cluster through the
master computer or via a web-interface. However, in
addition to hardware requirements imposed on each
compute node, such as a minimum amount of memory,
the effort of distributing the data and
communicating with nodes of the computer network restricts the
speedup achievable with this method. An approach
similar to MPI by Kraj et al. [6] uses web-services for
parallel distribution of code, which can reduce the effort for
administrating a computer cluster, but is
platform-dependent. A very popular programming environment
in the bioinformatics and biostatistics community is R
[7,8]. In recent years several packages (snow, snowfall,
nws, multicore) have been developed that enable
master-slave parallelized R programs to run on computer
cluster platforms or multi-core computers, see Hill et al.
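The split, distribute, and combine pattern described above for master-slave parallelization can be imitated on a single machine. The Java sketch below is a simplified stand-in under stated assumptions: a thread pool plays the role of the compute nodes, and copying each chunk stands in for the network transfer that a real MPI or web-service setup would perform. The class name `MasterWorkerMean` and its methods are hypothetical, and the task (a global mean) is chosen only because its partial results are trivial to combine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; threads stand in for networked compute nodes.
final class MasterWorkerMean {

    // Work done by one "slave": it sees only its own chunk and returns
    // a partial result {sum, count}.
    static double[] partialSum(double[] chunk) {
        double sum = 0.0;
        for (double v : chunk) sum += v;
        return new double[] {sum, chunk.length};
    }

    // Work done by the "master": split the data, hand out the chunks,
    // collect the partial results, and combine them into the global mean.
    static double mean(double[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunkSize = (data.length + workers - 1) / workers;
            List<Future<double[]>> pending = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunkSize) {
                int end = Math.min(start + chunkSize, data.length);
                // The copy mimics sending the subset to a separate node.
                double[] chunk = Arrays.copyOfRange(data, start, end);
                Callable<double[]> task = () -> partialSum(chunk);
                pending.add(pool.submit(task));
            }
            double sum = 0.0;
            double count = 0.0;
            for (Future<double[]> f : pending) {
                double[] part = f.get(); // blocks until this worker is done
                sum += part[0];
                count += part[1];
            }
            return sum / count;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(mean(data, 3));
    }
}
```

In this sketch the workers never exchange information with each other, only with the master. In a real network setting, the chunk copy and the return of partial results correspond to the data distribution and communication effort that limits the achievable speedup.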
Shared memory parallelization
Today most desktop computers and even notebooks
provide at least dual-core processors. Compared to
master-slave parallelization, developing shared-memory
software reduces the overhead of communicating through a
network. Despite its perf (...truncated)