Checkpoint evolution for volatile correlation computing

https://link.springer.com/content/pdf/10.1007%2Fs10994-010-5204-9.pdf

Wenjun Zhou, Hui Xiong

Given a set of data objects, the problem of correlation computing is concerned with the efficient identification of strongly related ones. Existing studies have mainly focused on static data. However, as observed in many real-world scenarios, input data are often dynamic and analytical results have to be continually updated. Therefore, there is a critical need to develop a dynamic solution for volatile correlation computing. To this end, we develop a checkpoint scheme, which can help us capture dynamic correlation values by establishing an evolving computation buffer. In this paper, we first provide a theoretical analysis of the properties of the volatile correlation, and derive a tight upper bound. This tight and evolving upper bound is used to identify a small list of candidate pairs, which are maintained as new transactions are added into the database. Once the total number of new transactions goes beyond the buffer size, the upper bound is re-computed according to the next checkpoint, and a new list of candidate pairs is identified. Based on this scheme, a new algorithm named CHECK-POINT+ has been designed. Experimental results on real-world data sets show that CHECK-POINT+ can significantly reduce the computation cost in dynamic data environments, and has the advantage of compact memory use.

1 Introduction

Recent years have witnessed increased interest in computing strongly related data objects (e.g., strongly correlated item pairs, as measured by Pearson's correlation coefficient). Many important applications in science and business (e.g. Alexander 2001; Cohen et al. 2002; Kuo et al. 2002) depend on efficient and effective correlation computing techniques to discover relationships within large collections of data. Despite the development of correlation computing techniques (e.g. Brin et al. 1997; Jermaine 2001; DuMouchel and Pregibon 2001; Jermaine 2003; Ilyas et al. 2004; Xiong et al. 2004, 2006), researchers and practitioners still face increasing challenges in measuring associations among data objects produced by emerging data-intensive applications, particularly when the data are dynamic and analytical results need to be continually updated. Indeed, with such large and growing data sets, research efforts are needed to develop a dynamic solution for volatile correlation computing. To that end, in this paper we provide a pilot study of dynamically finding all strongly related item pairs, whose correlation values are above a user-specified minimum threshold, as new data are constantly being collected.

As motivating examples, let us consider the following potential application scenarios. First, consider an e-commerce Web site that would like to promote sales by making recommendations to customers. In order to automate this process, the computerized system can recommend the items most highly correlated with those being purchased, according to past transactions. As new orders are placed, the recommendations should be updated automatically, and reflect recent interests in a timely manner. In this case, the underlying computation would involve finding the strongly correlated item pairs in real time.

A second example can be found in automatic stock picking. In order to monitor stocks with correlated price movements, a portfolio manager might be interested in knowing highly correlated stocks, whose prices tend to move in the same direction (either going up or going down). Despite the large number of stocks on the market, and the number of days with price quotes, the portfolio manager may want to maintain an up-to-date list of strong pairs as decision support.

Both of the above application scenarios require efficient computation of strongly correlated pairs in a dynamic fashion. A straightforward solution is to recompute the correlations of all item pairs every time new data become available.
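To make the cost of the straightforward solution concrete, the following is a minimal Python sketch of full recomputation. It assumes transactions are collections of item labels and uses the standard count-based form of Pearson's correlation for binary variables (the phi coefficient). The function names `phi` and `strong_pairs_bruteforce` are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import sqrt


def phi(n, n_a, n_b, n_ab):
    """Pearson's correlation of two binary items, computed from counts.

    n: number of transactions; n_a, n_b: transactions containing each item;
    n_ab: transactions containing both items.
    """
    denom = n_a * n_b * (n - n_a) * (n - n_b)
    if denom == 0:
        # Undefined when an item occurs in no or in all transactions;
        # this sketch reports 0.0 in that case (an assumption).
        return 0.0
    return (n * n_ab - n_a * n_b) / sqrt(denom)


def strong_pairs_bruteforce(transactions, threshold):
    """Rescan every transaction and score every item pair from scratch."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for t in transactions:
        items = sorted(set(t))
        for a in items:
            item_count[a] = item_count.get(a, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    result = {}
    for a, b in combinations(sorted(item_count), 2):
        r = phi(n, item_count[a], item_count[b], pair_count.get((a, b), 0))
        if r >= threshold:
            result[(a, b)] = r
    return result
```

Every call rescans the whole database and scores all pairs, so the work grows with both the number of transactions and the square of the number of items.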
However, for large data sets, this approach is not practical, particularly if the application needs the results in a timely fashion. An alternative method is to use more space to save time. Along this line, we present a SAVE-ALL algorithm, which saves the intermediate results for all item pairs. When new transactions are added into the database, SAVE-ALL only updates the stored values corresponding to each item pair, and answers correlation queries from these intermediate values. SAVE-ALL thus trades space for time. If the number of items in the data set becomes large, the number of pairs grows even larger, to the extent that it is impossible to keep the intermediate results of all item pairs in memory. This motivates our interest in volatile correlation computing.

In our preliminary work (Zhou and Xiong 2008), we proposed a CHECK-POINT algorithm that makes a time-space tradeoff and can efficiently incorporate new transactions for correlation computing as they become available. In CHECK-POINT, we set a checkpoint to establish a computation buffer, which can help us determine a correlation upper bound. This checkpoint bound can be exploited to identify a list of candidate pairs, whose frequencies are maintained and whose correlations are computed as new transactions are added into the database. If the total number of new transactions exceeds the buffer size, a new upper bound is computed according to the next checkpoint and a new list of candidate pairs is identified. The rationale behind CHECK-POINT is that, if the number of new transactions is much smaller than the total number of transactions in the database, the correlation coefficients of most item pairs will not change substantially. In other words, we only need to establish a very short list of candidate pairs at the checkpoint and maintain this candidate list in memory as new transactions are added into the database.
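The difference between keeping intermediate results for all pairs and keeping them only for a candidate list can be illustrated with a small Python sketch. It assumes the intermediate values are simple occurrence counts and that the candidate list is supplied by the caller. The checkpoint upper bound that selects the candidates is not reproduced here, and the class `PairCounter` is a hypothetical name, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


class PairCounter:
    """Incrementally maintained counts for correlation queries.

    candidates=None stores counts for every co-occurring pair (in the
    spirit of SAVE-ALL). Otherwise only the given pairs are tracked;
    each pair must be a tuple (a, b) with a < b.
    """

    def __init__(self, candidates=None):
        self.n = 0
        self.item_count = {}
        self.pair_count = {}
        self.candidates = None if candidates is None else set(candidates)

    def add(self, transaction):
        """Update the stored counts with one new transaction."""
        items = sorted(set(transaction))
        self.n += 1
        for a in items:
            self.item_count[a] = self.item_count.get(a, 0) + 1
        for pair in combinations(items, 2):
            if self.candidates is None or pair in self.candidates:
                self.pair_count[pair] = self.pair_count.get(pair, 0) + 1

    def correlation(self, a, b):
        """Pearson's correlation (phi) of a tracked pair from stored counts."""
        pair = (a, b) if a <= b else (b, a)
        n_a = self.item_count.get(a, 0)
        n_b = self.item_count.get(b, 0)
        denom = n_a * n_b * (self.n - n_a) * (self.n - n_b)
        if denom == 0:
            return 0.0  # undefined case, reported as 0.0 in this sketch
        n_ab = self.pair_count.get(pair, 0)
        return (self.n * n_ab - n_a * n_b) / sqrt(denom)

    def strong_pairs(self, threshold):
        """Return tracked pairs whose correlation meets the threshold."""
        if self.candidates is None:
            pairs = combinations(sorted(self.item_count), 2)
        else:
            pairs = self.candidates
        result = {}
        for a, b in pairs:
            r = self.correlation(a, b)
            if r >= threshold:
                result[(a, b)] = r
        return result
```

In this sketch a restricted counter reports correct values only if it has seen every transaction since its counts were initialised. A real checkpoint scheme would also need to compute the candidates' counts from the database when the list is rebuilt.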
Unlike SAVE-ALL, CHECK-POINT only maintains the intermediate results of a very small portion of the item pairs. This greatly reduces memory use, at the cost of slightly more time.

In this paper, we derive a tight upper bound for the evolving correlation. The tight bound is exploited to identify a more compact candidate list than that of CHECK-POINT, so that better correlation computing performance can be achieved. We also identify the local monotone property of this new upper bound, which can be used to search for the optimal points effectively. In addition, based on this new upper bound, we exploit the local monotone property and design a CHECK-POINT+ algorithm for volatile correlation computing.

As demonstrated by our experimental results on several real-world data sets, CHECK-POINT+ has much better computational performance than CHECK-POINT. Both CHECK-POINT and CHECK-POINT+ can significantly reduce the computational cost compared to existing correlation computing benchmark algorithms, e.g. TAPER, in dynamic data environments. Also, comp (...truncated)