A highly efficient multi-core algorithm for clustering extremely large datasets (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-11-169.pdf

A highly efficient multi-core algorithm for clustering extremely large datasets

BMC Bioinformatics A highly efficient multi-core algorithm for clustering extremely large datasets Johann M Kraus 0 1 Hans A Kestler 0 1 0 Institute of Neural Information Processing, University of Ulm , 89069 Ulm , Germany 1 Department of Internal Medicine I, University Hospital Ulm , 89081 Ulm , Germany Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. - Background The advent of high-throughput methods to life sciences has increased the need for computer-intensive applications to analyze large data sets in the laboratory. Currently, the field of bioinformatics is confronted with data sets containing thousands of samples and up to millions of features, e.g. gene expression arrays and genome-wide association studies using single nucleotide polymorphism (SNP) chips. To explore these data sets that are too large for manual analysis, machine learning methods are employed [1]. Among them, cluster algorithms partition objects into different groups that have similar characteristics. These methods have already become a valuable tool to detect associations between combinations of SNP markers and diseases and for the selection of tag SNPs [2,3]. Not only here, the size of the generated data sets has grown up to 1000000 markers per chip. The demand for performing these computer-intensive applications is likely to increase even further for two reasons: First, with the popularity of next-generation sequencing methods rising, the number of measurements per sample will soar. Second, the need to assist researchers in answering questions such as How many groups are in my data? or How robust is the identified clustering? will increase. Cluster number estimation techniques address these types of questions by repeated use of a cluster algorithm with slightly different initializations or data sets, ultimately performing a sensitivity analysis. In the past, computing speeds doubled approximately every 2 years via increasing clock speeds, giving software a free ride to better performance [4]. This is now over, and such automatic performance improvements are no longer possible. As clock speeds are stalling, the increase in computational power is now due to the rapid increase of the number of cores per processor. This makes parallel computation a necessity for the time-consuming analyses in the laboratory. Generally, two parallelization schemes are available. The first is based on a network of computers or computing nodes. The idea of such a master-slave parallelization is to parallelize independent tasks using a network of one master and several slave computers. While there is no possibility for communication between the slaves, this approach best fits scenarios where the same serial algorithm is started several times on different relatively small data sets or different analyses are calculated in parallel on the same data set. Data set size matters here, as distribution of large data sets is time consuming and requires all computers to have the appropriate memory configuration. The second approach called shared memory parallelization is used to parallelize the implementation of an algorithm itself. This is an intrinsic parallelization via different interwoven sub-processes (threads) on a single multi-core computer accessing a common memory, and requires a redesign of the original serial algorithm. Master-slave parallelization Master-slave parallelization is heavily used by computer clusters or supercomputers. The Message Passing Interface (MPI) [5] protocol is the dominant model in highperformance computing. Without shared memory the compute nodes are restricted to process independent tasks. As long as the load-balancing of the compute nodes is well handled, the parallelization of a complex simulation scales linearly with the number of compute nodes. In contrast to massive parallel simulation runs of complex algorithms, master-slave parallelization is also used for parallelizing algorithms. For this task, a large dataset is usually first split into smaller pieces. The subsets are then distributed through a computer network and each compute node solves a subtask for its subset. Finally, all results are transferred back to the master computer, which combines them to a global result. The user interacts with the hardware cluster through the master computer or via a web-interface. However, in addition to hardware requirements, such as minimal amount of memory that are imposed on each compute node, the effort of distributing the data and communicating with nodes of the computer network restricts the speedup achievable with this method. An approach similar to MPI by Kraj et al. [6] uses web-services for parallel distribution of code, which can reduce the effort for administrating a computer cluster, but is platformdependent. A very popular programming environment in the bioinformatics and biostatistics community is R [7,8]. In recent years several packages (snow, snowfall, nws, multicore) have been developed that enable master-slave parallelized R programs to run on computer cluster platforms or multi-core computers, see Hill et al. Shared memory parallelization Today most desktop computers and even notebooks provide at least dual-core processors. Compared to master-slave parallelization, developing shared-memory software reduces the overhead of communicating through a network. Despite its perf (...truncated)