Identifying High-Number-Cluster Structures in RFID Ski Lift Gates Entrance Data

Ann. Data. Sci. (2015) 2(2):145–155
DOI 10.1007/s40745-015-0038-8

Boris Delibašić¹ · Zoran Obradović²

Received: 25 June 2015 / Revised: 27 June 2015 / Accepted: 29 June 2015 / Published online: 9 July 2015
© Springer-Verlag Berlin Heidelberg 2015

¹ Faculty of Organizational Sciences, University of Belgrade, 154 Jove Ilića St., Belgrade, Serbia
² Center for Data Analytics and Biomedical Informatics, Temple University, 1925 N. 12th Street (SERC 035-02), Philadelphia, PA 19122-1801, USA

Abstract

In this paper we identify skier groups in data from RFID ski lift gate entrances. The ski lift gate entrance data are real-life data covering a 5-year period from the largest Serbian ski resort, which has a ski lift capacity of 32,000 skiers per hour. We utilize three representative algorithms from the three most widely used clustering algorithm families (representative-based, hierarchical, and density-based) and produce 40 algorithm settings for clustering skiing groups. Ski pass sales data were used to validate the produced clustering models, on the assumption that persons who bought ski tickets together are more likely to ski together. The adjusted mutual information (AMI) and adjusted Rand index (ARI) clustering validation measures are reported for each model. In addition, the applicability of the proposed models was evaluated for ski injury prevention: each clustering model was tested on whether skiing in groups increases risk of injury. Hierarchical clustering algorithms proved very efficient at finding the high-number-cluster structure (skiing groups) and at detecting models suitable for injury prevention. Most of the tested clustering models supported the hypothesis that skiing in groups increases risk of injury.

Keywords Skiing groups · Ski lift gates RFID data · Hierarchical clustering · K-means · OPTICS · Ski injury

1 Introduction

Clustering algorithms that automatically search for the right number of clusters in data tend to detect too few clusters, so a high-number-cluster structure is hard to reveal [20]. These algorithms work soundly in finding k clusters when k (the number of clusters) is much smaller than n (the number of objects). When this condition is not satisfied, clustering algorithms have difficulty identifying the hidden high-number-cluster structure. Such structures are often found in real-life applications (e.g. disease prediction with microarray data [20], clustering of human activity patterns [9]).

Extracting knowledge from sensor data (sensor data mining) is a new and important research field within big data research (e.g. [1]). Sensor data are increasingly available, and their sheer volume opens a wide area for analysis. Sensor data come from wireless sensor networks, sensor streams, sensor networks, mobile objects, RFID tags and similar sources. RFID data play an increasingly important role in our lives, and how to analyze and discover knowledge from RFID data sets is an urgent and challenging research problem [7].

Although most of the literature employs hierarchical clustering to find natural groups in data [9], D’Urso and Massari [9] propose a fuzzy approach for clustering path data, since sequences of human activities are typically characterized by switching behaviors, which are likely to produce overlapping clusters. They use two modifications of the fuzzy c-medoids algorithm to cluster human path data. Lv et al. [15] claim that trajectory clustering is usually performed with three approaches: partitioning (k-means) clustering, density-based clustering and time-based clustering. The same authors modeled mobile users’ similarity with a proposed hierarchical clustering algorithm that uses the cosine distance for measuring similarity.
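
To make the notion of a high-number-cluster structure concrete, the following is a minimal sketch of hierarchical (single-linkage) clustering with cosine distance that stops at a distance threshold rather than at a fixed k. It is an illustration only: it is not the algorithm of Lv et al. [15], nor one of the algorithm settings evaluated in this paper. The lift-count vectors, the threshold value and the function names are all assumptions introduced for the example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def single_linkage_clusters(vectors, threshold):
    """Cut a single-linkage dendrogram at `threshold`.

    Merging every pair whose distance is at most the threshold (union-find)
    yields the same partition as single-linkage agglomerative clustering
    stopped once the smallest inter-cluster distance exceeds the threshold.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine_distance(vectors[i], vectors[j]) <= threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels, ids = [], {}
    for i in range(len(vectors)):
        root = find(i)
        labels.append(ids.setdefault(root, len(ids)))
    return labels


# Hypothetical data (invented for illustration): each row is a skier,
# each column the number of entries at one of three ski lifts.
visits = [
    [5, 0, 1],  # skier 0
    [4, 0, 1],  # skier 1, pattern close to skier 0
    [0, 6, 0],  # skier 2
    [0, 5, 1],  # skier 3, pattern close to skier 2
    [2, 2, 2],  # skier 4, no close companion
]
labels = single_linkage_clusters(visits, threshold=0.05)
print(labels)  # → [0, 0, 1, 1, 2]
```

With a tight threshold, five skiers yield three clusters, one of them a singleton, so the number of clusters stays close to the number of objects. This is the regime in which k is not much smaller than n.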
Based on the literature review on clustering RFID data, this paper uses the most frequently used clustering approaches and their representatives, i.e. K-means [12], hierarchical clustering [12], and OPTICS [2], as representatives of three clustering algorithm families (representative-based, hierarchical, and density-based). The three algorithms were set up (with varying similarity measures and stopping criteria) so that they produce 40 different algorithm settings for clustering high-number-cluster structures. The 40 algorithm settings were applied to a large real-life ski lift gates entrance dataset covering a five-year period. A potential application of the produced models was shown for the case of ski injury prevention. The question of whether skiing in groups (i.e. |cluster| > 1) increases risk of injury was tested.

This paper makes a twofold contribution. On the one hand, it proposes algorithm settings that can be used for mining high-number-cluster structures (here skiing groups) in real-life applications. On the other hand, it contributes to ski injury prevention, as there is currently no research on the influence of skiing groups on ski injury, and methods for identifying groups from RFID ski lift gate data are missing.

The rest of the paper is structured as follows: In Sect. 2 the big data for analysis are presented. Section 3 explains the algorithms and their settings to analyze the data. Section 4 presents a background on ski injury research, for which the results of the clustering models could be valuable. Section 5 discusses the results. The conclusion of the paper and directions for further research are given in Sect. 6.

2 The Data

The data is from the largest Serbian ski resort, Mt. Kopaonik. The data spans five consecutive seasons (2006–2010). Regulations on Mt. Kopaonik require each person to buy a ski pass in order to use ski lifts. The radio-frequency identification (RFID, i.e. wireless non-contact use of radio-frequency electromagnetic fields to transfer data, for the purposes of automatically identifying and tracking tags attached to objects) ski pass is used each time a person wants to enter a ski lift through a ski lift gate. Therefore, for all skiers (in this paper, skiers will be used as a generic term for all persons using ski lift gates to enter ski lifts, i.e. skiers, snowboarders, etc.), motion data is collected at ski lift gates and stored in the central database.

Databases used in this research are the ski lift gates’ entrance database with spatiotemporal data from skiers’ movements, ski patrol injury records, and ski pass sales data. The following attributes were used for the analysis:

1. From the ski lift gates database:
   • Ski pass id,
   • Ski lift entered using a ski gate, and
   • Date and time of entering the ski gate.
2. From the ski patrol injury records:
   • Ski pass id from injured skier.
3. From the ski pass sales database:
   • Sales transaction id,
   • Date and time of sales transaction,
   • Ski pass id(i) (i=1,…,m where m is the number (...truncated)