Identifying High-Number-Cluster Structures in RFID Ski Lift Gates Entrance Data
Ann. Data. Sci. (2015) 2(2):145–155
DOI 10.1007/s40745-015-0038-8
Boris Delibašić1 · Zoran Obradović2
Received: 25 June 2015 / Revised: 27 June 2015 / Accepted: 29 June 2015 / Published online: 9 July 2015
© Springer-Verlag Berlin Heidelberg 2015
Abstract In this paper we identify skier groups in RFID ski lift gate entrance data.
The data are real-life records covering a 5-year period from the largest Serbian
ski resort, which has a ski lift capacity of 32,000 skiers per hour.
We apply three representative algorithms from the three most widely used clustering
algorithm families (representative-based, hierarchical, and density-based) and produce
40 algorithm settings for clustering skiing groups. Ski pass sales data were used to
validate the resulting clustering models, on the assumption that persons who bought ski
tickets together are more likely to ski together. The AMI and ARI clustering validation
measures are reported for each model. In addition, the applicability of the proposed
models to ski injury prevention was evaluated: each clustering model was used to test
whether skiing in groups increases the risk of injury. Hierarchical clustering algorithms
proved very efficient both at finding the high-number-cluster structure (skiing
groups) and at yielding models suitable for injury prevention. Most of the tested
clustering models supported the hypothesis that skiing in groups increases the
risk of injury.
Keywords Skiing groups · Ski lift gates RFID data · Hierarchical clustering ·
K-means · OPTICS · Ski injury
1 Faculty of Organizational Sciences, University of Belgrade,
154 Jove Ilića St., Belgrade, Serbia
2 Center for Data Analytics and Biomedical Informatics, Temple University,
1925 N. 12th Street (SERC 035-02), Philadelphia, PA 19122-1801, USA
1 Introduction
Clustering algorithms that automatically search for the right number of clusters
in data tend to detect fewer clusters, so a high-number-cluster structure is hard to reveal
[20]. These algorithms work well at finding k clusters in data when k (the number of
clusters) is much smaller than n (the number of objects). When this condition does not hold, clustering algorithms have difficulty identifying the hidden high-number-cluster
structure. Such high-number-cluster structures are often found in real-life applications (e.g. disease prediction with microarray data [20], clustering of human activity
patterns [9]).
Extracting knowledge from sensor data (sensor data mining) is a new and important
area of big data research (e.g. [1]). Sensor data is increasingly available,
and its sheer volume opens a wide field for analysis.
Sensor data comes from wireless sensor networks, sensor streams,
sensor networks, mobile objects, RFID tags and similar sources.
RFID data is playing an increasingly important role in our lives. Analyzing
RFID data sets and discovering knowledge from them is an urgent and challenging research
field [7]. Although most of the literature employs hierarchical clustering to find natural
groups in data [9], D’Urso and Massari [9] propose a fuzzy approach for clustering
path data, since sequences of human activities are typically characterized by switching
behaviors, which are likely to produce overlapping clusters. They use two modifications of the fuzzy c-medoids algorithm to cluster human path data. Lv et al. [15]
claim that trajectory clustering is usually performed with three approaches: partitioning
(k-means) clustering, density-based clustering and time-based clustering. The same
authors modeled mobile user similarity with a proposed hierarchical clustering
algorithm that uses the cosine distance for measuring similarity.
Based on the literature review on clustering RFID data, this paper uses the most
frequently used clustering approaches, each through one representative, i.e. K-means
[12], hierarchical clustering [12], and OPTICS [2], which represent three clustering
algorithm families (representative-based, hierarchical, and density-based). The three
algorithms were set up (with varying similarity measures and stopping criteria) so that they
produce 40 different algorithm settings for clustering high-number-cluster structures.
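As a rough illustration of how a hierarchical method can expose many small clusters without fixing k in advance, the following sketch cuts a single-linkage hierarchy at a distance threshold. The record layout, the five-minute time bucket, the Jaccard distance, and the 0.5 cutoff are all assumptions made for this example; they are not the similarity measures or stopping criteria of the 40 settings.

```python
from itertools import combinations

# Hypothetical gate entries: (ski_pass_id, lift, minutes since midnight).
ENTRIES = [
    ("A", "lift1", 541), ("B", "lift1", 542), ("A", "lift2", 600),
    ("B", "lift2", 601), ("C", "lift3", 545), ("D", "lift1", 700),
    ("E", "lift1", 701), ("D", "lift3", 760), ("E", "lift3", 762),
]

def event_sets(entries, bucket=5):
    """Map each ski pass to its set of (lift, time bucket) events.

    Fixed buckets are a simplification: two entries one minute apart
    can still fall on either side of a bucket boundary.
    """
    events = {}
    for pass_id, lift, minute in entries:
        events.setdefault(pass_id, set()).add((lift, minute // bucket))
    return events

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def single_linkage_cut(events, threshold):
    """Clusters of a single-linkage hierarchy cut at `threshold`.

    Cutting single linkage at height t yields the connected components
    of the graph that links every pair with distance <= t, so a
    union-find pass over all pairs is enough for small inputs.
    """
    parent = {p: p for p in events}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p, q in combinations(sorted(events), 2):
        if jaccard_distance(events[p], events[q]) <= threshold:
            parent[find(p)] = find(q)

    clusters = {}
    for p in events:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())

groups = single_linkage_cut(event_sets(ENTRIES), threshold=0.5)
print(groups)
```

On this toy input the number of clusters is determined by the threshold rather than by a preset k, which is the property that matters when k is close to n. The all-pairs loop is quadratic in the number of skiers and is meant only to show the idea.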
The 40 algorithm settings were applied to a large real-life ski lift gate entrance
dataset covering a five-year period. A potential application of the produced models
is shown for the case of ski injury prevention. The question of whether skiing in
groups (i.e. |cluster| > 1) increases risk of injury was tested.
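The group question can be made concrete with a small sketch: every skier whose cluster has more than one member is labeled a group skier, and injury rates are then compared between group and solo skiers. The cluster assignments, the injured set, and the use of a plain risk ratio are assumptions for illustration only; the test actually used in the paper may differ.

```python
from collections import Counter

def group_injury_risk(cluster_of, injured):
    """Compare injury rates of group skiers (|cluster| > 1) and solo skiers.

    cluster_of: dict mapping ski pass id -> cluster label (hypothetical input)
    injured:    set of ski pass ids found in the injury records
    Returns (rates, ratio), where ratio = group rate / solo rate.
    """
    sizes = Counter(cluster_of.values())
    counts = {"group": [0, 0], "solo": [0, 0]}  # [injured, total]
    for pass_id, label in cluster_of.items():
        key = "group" if sizes[label] > 1 else "solo"
        counts[key][1] += 1
        if pass_id in injured:
            counts[key][0] += 1
    rates = {
        key: (inj / total if total else float("nan"))
        for key, (inj, total) in counts.items()
    }
    # The ratio is undefined when the solo rate is zero.
    ratio = rates["group"] / rates["solo"] if rates["solo"] else float("nan")
    return rates, ratio

# Toy data: clusters 0 and 2 are groups of two, the rest are solo skiers.
cluster_of = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2, "F": 3, "G": 4, "H": 5}
injured = {"A", "D", "C"}
rates, ratio = group_injury_risk(cluster_of, injured)
print(rates, ratio)
```

A ratio above 1 on real data would only be suggestive; a proper analysis would add a significance test and account for exposure, such as the number of lift rides per skier.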
This paper makes a twofold contribution. On the one hand, it proposes algorithm
settings that can be used for mining high-number-cluster structures (here skiing groups)
in real-life applications. On the other hand, it contributes to ski injury prevention, as
there is currently no research on the influence of skiing groups on ski injury, and
methods for identifying groups from RFID ski lift gate data are missing.
The rest of the paper is structured as follows: In Sect. 2 the data used for analysis
are presented. Section 3 explains the algorithms and the settings used to analyze the data.
Section 4 presents a background on ski injury research, for which the results of the
clustering models could be valuable. Section 5 discusses the results. The conclusion
of the paper and directions for further research are given in Sect. 6.
2 The Data
The data is from the largest Serbian ski resort, Mt. Kopaonik. The data spans five
consecutive seasons (2006–2010). Regulations on Mt. Kopaonik require each person
to buy a ski pass in order to use the ski lifts. The radio-frequency identification (RFID,
i.e. wireless non-contact use of radio-frequency electromagnetic fields to transfer data,
for the purposes of automatically identifying and tracking tags attached to objects) ski
pass is used each time a person wants to enter a ski lift through a ski lift gate. Therefore,
for all skiers (in this paper, skiers is used as a generic term for all persons using
ski lift gates to enter ski lifts, i.e. skiers, snowboarders, etc.), motion data is collected
at the ski lift gates and stored in the central database.
Databases used in this research are the ski lift gates’ entrance database with spatiotemporal data from skiers’ movements, ski patrol injury records, and ski pass sales
data. The following attributes were used for the analysis:
1. From the ski lift gates database:
• Ski pass id,
• Ski lift entered using a ski gate, and
• Date and time of entering the ski gate.
2. From the ski patrol injury records:
• Ski pass id from injured skier.
3. From the ski pass sales database:
• Sales transaction id,
• Date and time of sales transaction,
• Ski pass id(i) (i=1,…,m where m is the number (...truncated)