Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Computational Intelligence and Neuroscience, Jun 2015

Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions that have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions may no longer be valid. To overcome this weakness, we propose a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, which uses a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, the proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. An experiment on a designed two-dimensional benchmark dataset shows that the LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters that are adjacent, overlapping, or embedded in background noise. Finally, we compared the LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset of over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this dataset and extracts more knowledge from it.
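The k-means assumption mentioned in the abstract can be made precise. The following is a standard textbook identity, included here for context rather than taken from the paper: minimizing the k-means objective is equivalent to maximum-likelihood estimation under a mixture of spherical Gaussians with shared covariance $\sigma^2 I$ and hard cluster assignments. Writing the objective as

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,$$

and taking $p(x \mid z = i) = \mathcal{N}(x;\, \mu_i,\, \sigma^2 I)$ with each point hard-assigned to its cluster, the negative log-likelihood is

$$-\log L = \frac{1}{2\sigma^2}\, J + \text{const},$$

so the two criteria differ only by a positive scale and an additive constant. Clusters that are elongated, of unequal spread, or non-Gaussian therefore violate the implicit model, which is the weakness LASS is designed to avoid.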



Xiao Sun (1,2), Tongda Zhang (3), Yueting Chai (1), and Yi Liu (1)

(1) National Engineering Laboratory for E-Commerce Technology, Tsinghua University, Beijing 100084, China
(2) DNSLAB, China Internet Network Information Center, Beijing 100190, China
(3) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA

Received 10 March 2015; Accepted 28 May 2015
Academic Editor: J. Alfredo Hernandez

Copyright © 2015 Xiao Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Background and Related Work. Fast-growing Internet technologies and multidisciplinary integration, in fields such as social networks, e-commerce, and bioinformatics, have accumulated huge amounts of data, far beyond human beings' processing ability in terms of both scalability and structural complexity [1]. For example, scientists studying the working mechanism of the cell gather data about protein or genomic sequences, which can be as large as tens or hundreds of terabytes and have a fairly intricate internal structure; even the smartest person cannot deal with such a dataset without assistant tools. Data mining technologies [2], such as semisupervised learning [3] and deep learning [4], were developed to address this problem and now play an important role in many fields, including smart homes [5], decision support systems [6], biology [7], and marketing science [8]. In most of these areas, people constantly want to gain knowledge and learn structure from the data they collect. Clustering [9], one of the most important unsupervised learning methods in data mining, is designed to find hidden structure in unlabeled datasets, which can then be used for further processing such as data summarization [10] and compression [11].
Despite the dozens of clustering methods from a variety of fields, they can be roughly divided into two categories: partitional methods and hierarchical methods [12]. A partitional clustering method tries to generate a definite number of clusters directly; given the computationally prohibitive cost of optimizing the criterion function globally, an iterative strategy is usually adopted. A hierarchical clustering method, on the other hand, generates a family of clustering results, with different threshold parameters leading to different clusterings. Both kinds of methods have limitations that make them perform badly when applied unchanged to datasets such as human behavior data, which mixes features of various kinds and scales in a high-dimensional space. The first limitation is dimensionality. The datasets we deal with usually have more than three dimensions, which makes it almost impossible to form a clear intuition of the data distribution. Current clustering methods typically need a given parameter to decide the number of generated clusters; a simple illustration follows this paragraph. For example, k-means [13] requires a predetermined parameter k, the number of clusters to be generated, before the algorithm can run; in single link and complete link [8], a threshold parameter plays a similar role. In such cases the parameter selection is a highly subjective judgement and becomes harder as the dimension goes up. High dimensionality also makes the traditional Euclidean density notion meaningless, since the density tend (...truncated)
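A minimal sketch of the parameter choices described above, using scikit-learn and SciPy on synthetic two-dimensional data. This is an illustration of the limitation, not code from the paper; the blob positions, k=2, and the threshold t=1.0 are arbitrary choices for the example:

```python
# Both classic families of clustering methods need a user-chosen parameter:
# k-means needs the number of clusters k up front; agglomerative
# (single-link) clustering needs a distance threshold to cut the dendrogram.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for a real dataset.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.3, size=(100, 2)),
])

# Partitional: k must be fixed before the algorithm runs.
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: the dendrogram is cut at a threshold t; a different t
# yields a different clustering -- the subjective choice the text warns about.
Z = linkage(X, method="single")          # single-link agglomeration
labels_hc = fcluster(Z, t=1.0, criterion="distance")

print(len(set(labels_km)), len(set(labels_hc)))  # cluster counts found
```

In two dimensions a reasonable k or t can be eyeballed from a scatter plot; for the high-dimensional behavioral data the paper targets, no such visual check is available, which is exactly the difficulty the authors raise.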


This is a preview of a remote PDF: http://downloads.hindawi.com/journals/cin/2015/829201.pdf

Xiao Sun, Tongda Zhang, Yueting Chai, and Yi Liu, "Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation," Computational Intelligence and Neuroscience, vol. 2015, Article ID 829201, 2015. DOI: 10.1155/2015/829201.