Bus-OLAP: A Data Management Model for Non-on-Time Events Query Over Bus Journey Data

Data Science and Engineering, Mar 2018

Increasing the on-time rate of bus service can prompt the people’s willingness to travel by bus, which is an effective measure to mitigate the city traffic congestion. Performing queries on the bus arrival can be used to identify and analyze various kinds of non-on-time events that happened during the bus journey, which is helpful for detecting the factors of delaying events, and providing decision support for optimizing the bus schedules. We propose a data management model, called Bus-OLAP, for querying bus journey data, considering the characteristics of bus running and the scenarios of non-on-time analysis. While fulfilling typical requirements of bus journey data queries, Bus-OLAP not only provides a flexible way to manage the data and to implement multiple granularity data query and update, but it also supports distributed queries and computation. The experiments on real-world bus journey data verify that Bus-OLAP is effective and efficient.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:


Bus-OLAP: A Data Management Model for Non-on-Time Events Query Over Bus Journey Data

Data Science and Engineering Bus-OLAP: A Data Management Model for Non-on-Time Events Query Over Bus Journey Data Lei Duan 0 1 2 3 4 Tinghai Pang 0 1 2 3 4 Jyrki Nummenmaa 0 1 2 3 4 Jie Zuo 0 1 2 3 4 Peng Zhang 0 1 2 3 4 Changjie Tang 0 1 2 3 4 Lei Duan 0 1 2 3 4 Jyrki Nummenmaa 0 1 2 3 4 0 Changjie Tang 1 Tinghai Pang 2 Sino-Finnish Centre, Tongji University , Shanghai , China 3 Faculty of Natural Sciences, University of Tampere , Tampere , Finland 4 School of Computer Science, Sichuan University , Chengdu, Sichuan , China Increasing the on-time rate of bus service can prompt the people's willingness to travel by bus, which is an effective measure to mitigate the city traffic congestion. Performing queries on the bus arrival can be used to identify and analyze various kinds of non-on-time events that happened during the bus journey, which is helpful for detecting the factors of delaying events, and providing decision support for optimizing the bus schedules. We propose a data management model, called Bus-OLAP, for querying bus journey data, considering the characteristics of bus running and the scenarios of nonon-time analysis. While fulfilling typical requirements of bus journey data queries, Bus-OLAP not only provides a flexible way to manage the data and to implement multiple granularity data query and update, but it also supports distributed queries and computation. The experiments on real-world bus journey data verify that Bus-OLAP is effective and efficient. Data management; OLAP; Parallel computing 1 Introduction Public bus service plays an important and irreplaceable role in the traffic system of a city. On one hand, public bus is one of the most convenient and cost-efficient ways for people to travel. On the other hand, public bus service, as an alternative to the use of private cars, is an effective way to reduce carbon dioxide emissions. Promoting the bus service not only provides convenience for people but it also improves the urban living conditions and helps in the fight against the climate change. On-time bus is an emerging bus running mode where the bus arrives at each bus stop according to the time table strictly, which is helpful for passengers to avoid wasting too much time at bus stops. There are several countries, e.g., the USA, Finland, and Japan, where efforts have been made to implement on-time bus running mode to improve the bus service quality. Clearly, a reasonable design of bus running routes and time tables is the key to carry out the on-time bus running mode. To initiate this development, practical bus journey data including the arrival information of a bus at every stop should be collected at first. The developments of sensors and Internet of Things enable the automatic collection of bus journey data. For example, the local government of Tampere, Finland, started in 2015 to record and publish as open data the location information of each running bus real-time every second. Particularly, the essential bus journey data include information on, for each running bus at each bus stop location, whether the bus was on schedule at a stop and how long was the delay if it was not on time. Table 1 lists several samples of Tampere bus journey data. There, each record contains the information of a bus arriving at a bus stop. Example 1 In Table 1, ‘‘Stop’’ represents the bus stop ID, ‘‘Name’’ is the name of the bus stop, ‘‘Latitude’’ and ‘‘Longitude’’ indicate the location of a bus stop, ‘‘Line’’ is the bus line number, ‘‘Date’’ and ‘‘Arrival’’ are the date and time of bus arrival, respectively, and ‘‘Delay’’ is the difference between the arrival time and the scheduled time (‘-’ indicates that bus arrives ahead of scheduled time). For instance, from the first record in Table 1, we can see that a bus on Line 13 arrived at Stop Ammattikoulu (ID: 1500, location: 61.498600N, 23.735500E) at 16:40:15 PM on August 3, 2016, being 210 s behind the schedule. A bus non-on-time query provides statistics on the nonon-time arrivals of buses that happened under given conditions. Bus non-on-time queries are useful for the bus route design, online information systems on bus transportation, and the bus running schedule optimization. In general, the non-on-time events can be divided into the early case (the arrival time is ahead of the scheduled time) and the delaying case (the arrival time is later than the scheduled time). Please note that the main concern of bus non-on-time query is the delaying case, since (i) the early case can be easily avoided by slowing or stopping buses in the real-life situation, and (ii) the techniques supporting the query for delaying cases can be used for early cases. Thus, for the sake of simplicity, we just discuss the delaying case hereafter. However, the results generalize to the early case. Typical bus non-on-time queries [ 15, 18 ] include: ‘‘How many bus delaying events happened over a given time period?’’, ‘‘Which are the routes where most delaying events happen?’’, and ‘‘Do the bus delaying events aggregate and where?’’, etc. Based on the query conditions, there are three kinds of bus non-on-time queries: – Temporal queries, in which the query conditions are related to bus running time. For example, how many buses were delayed during 17 : 00 19 : 00? Delay 210 610 - 34 340 – – Spatial queries, in which the query conditions are related to bus stop locations. For example, given a bus stop ID ‘‘560’’, which are the nearest neighboring bus stops where delaying events happen? Spatial–temporal queries, in which the query conditions are related to both bus running time and bus stop locations. For example, which are the spatial–temporal zones with significant aggregation of delaying events? To the best of our knowledge, there is no data management model dealing with the bus non-on-time queries.1 Since non-on-time query over bus journey data can be used to improve bus service, thus improving comfortable travel experience, we should provide an efficient model to manage bus journey data. However, there are some challenges that we need to address: (i) How to manage the bus journey data? (ii) How to fuse data from different sources? (iii) How to perform queries efficiently? In this paper, to tackle these challenges related to non-ontime queries, we propose a model named Bus-OLAP to manage bus journey data. In this paper, we make the following concrete contributions: (i) We build indexes of bus journey data by bitvectors according to the application scenarios, taking into account particularly the temporal and spatial factors. (ii) We introduce efficient index operations to support the nonon-time queries on bus journey data. (iii) We implement the related computations on Spark to support distributed computing for the processing of the queries, thus enabling real-time response for the analysis for massive datasets. The rest of the paper is organized as follows. We review the related work in Sect. 2 and present the critical techniques of Bus-OLAP in Sect. 3. In Sect. 4, we report a systematic empirical study using real-world bus data and we conclude the paper in Sect. 5. 2 Related Work Our study is related to previous studies on urban traffic analysis, spatial–temporal data management, and parallel computing. 1 We tackled the problem of bus non-on-time query in Pang et al [ 11 ], a preliminary version of this paper. The difference between these two works is presented in related works. We will review the related works in Sect. 2. 2.1 Urban Traffic Analysis The analysis on urban traffic plays an important role and attracts extensive attention from public [ 25 ]. Traffic analysis helps to alleviate urban traffic congestion thereby improving environmentally friendly traffic for people to travel. Traffic analysis has been recently under active research. Yuan et al. [ 22 ] analyze the movements of taxis and passengers and provide an easy way for passengers to pick taxis. Pang et al. [ 10 ] detect the anomalous behavior of taxis in Beijing metropolitan area. Chen et al. [ 2 ] propose a model given the past trajectories and predict the next station of an object. Kong et al. [ 8 ] aim to deal with the problem of traffic congestion and propose a method to predict the traffic congestion using the floating car trajectories data. Han and Moutarde [ 6 ] focus on traffic dynamic prediction in urban transportation network and find out the evolution of traffic states, which contributes to the adjustment of traffic management. Zhang et al. [ 24 ] apply deep learning to predict the traffic of crowds in different regions of a city. Bao et al. [ 1 ] propose a data-driven approach to develop bike lane construction plans satisfying the constraints and objectives requested by the user. There are several works on predicting the urban traffic flow by the traffic state [ 4, 13, 17 ]. Wu et al. [ 18 ] detect the temporal– spatial aggregation of bus delay but do not consider the management of bus data. To the best of our knowledge, the existing work about traffic analysis concentrates on the problems with vehicles with stochastic trajectories. However, the analysis on bus data is essentially different from on other vehicles. On one hand, the trajectory and schedule of each bus are fixed. On the other hand, the main concern on bus journey data is non-on-time analysis. Clearly, the existing methods proposed by previous work on traffic analysis are not applicable to online bus journey data analysis, and, in particular, they do not give an efficient method to manage large bus data sets for efficient online analysis. We tackled the problem of bus journey data management in Pang et al. [ 11 ], a preliminary version of this paper. In comparison with that work, we have now extended the paper in the following aspects: (i) adding more introduction to the background of the research on bus journey data collection and analysis; (ii) including the necessary background knowledge about bus journey data preprocessing in Sect. 3.1; (iii) presenting a method of querying data from different sources; (iv) providing a more detailed description of key techniques for bus non-on-time query in Bus-OLAP; and (v) performing more extensive empirical evaluations. 2.2 Spatial–Temporal Data Management Urban traffic data have spatial–temporal characteristics. The bus data management can improve the efficiency of storage, indexing, and querying of the data. Sistla et al. [ 12 ] propose a model, called MOST, to model the moving object with a time-related function, which improves the efficiency of storage and querying of moving objects in a database. Ting et al. [ 16 ] present a simplistic network model for moving objects. Some index structures such as R-tree [ 5 ] and Bþ-tree [ 7 ] are also used to optimize spatial–temporal queries. Yu et al. [ 21 ] propose an algorithm, named YPK-CNN, to monitor KNN queries. The works discussed above can improve the query efficiency by different ways. However, we want to provide a model by which all bus running related factors can be considered. For example, besides the factors listed in Table 1, some external factors, such as weather activity and temperature, also affect the on-time rate of bus service. In particular, we need a data management model that is suitable for a distributed computing environment. 2.3 Parallel Computing Considering the requirement of analyzing large-scale bus journey data, parallel computing techniques are necessary to speed up the efficiency of spatial–temporal analysis [ 9 ]. A lot of work has been done on adapting parallel computing methods to support efficient queries over large-scale data. Xia et al. [ 19 ] design a method to conduct KNN queries based on Map-Reduce and apply it to real-time prediction of traffic flow. Eldawy et al. [ 3 ] have built a system, named GeoSpark, based on Spark to improve the efficiency of spatial–temporal data analysis. Xie et al. [ 20 ] apply R-tree indexing to parallel queries based on Spark, to deal with the efficiency issues of large-scale data. In this work, we use Spark to build a parallel processing model based on bit-vector operations to conduct bus nonon-time queries. 3 Design of Bus-OLAP In this section, we present our Bus-OLAP data management model for bus non-on-time queries. The framework of Bus-OLAP (Fig. 1) consists of data cleaning, transforming, loading, and querying. The preprocessing of bus journey data has been well studied in [ 14 ]. However, for the sake of self-completeness of our presentation, we briefly present the key points for data preprocessing here. Then, we discuss the most important techniques used in Bus-OLAP: Bus Record Transform Data Load Spark Query Clean Bus Stop Data Source Bus Line (i) data indexing, (ii) index operations, and (iii) parallel query computation. 3.1 Data Preprocessing The raw bus journey data suffer from noise, inconsistencies, and missing observations [ 15 ]. This is due to the realworld situations, such as the location measurement inaccuracies of the global navigation satellite system, imprecise knowledge of the real observation time, and errors in data identification. Several methods had been used to clean the data. The noisy data can be identified as outliers by comparing with other data. The inconsistent data can be discarded with the help of known scheduled bus stop sequences. The missing data can be ignored, since there is typically enough data available for the analysis purposes. Millions of observations are stored every day. However, the observations are not useful as such for statistical analysis. As a result, they are grouped into journeys, so that the essential bus running information, e.g., the bus route and the arrival time at each bus stop during one journey, can be studied. Furthermore, all bus journeys are segmented into links between subsequent bus stops, since the bus stop sequences and the location of each bus stop are available, and the segments between every two bus stops are useful for bus traffic analysis. Example 2 In Fig. 2, there are two buses v1 and v2, three bus stops and four observations. Bus v1 arrived at the bus stop ‘‘Rautatieasema’’ at 12:00 and then arrived at ‘‘Tammelan puistokatu’’ 3 min later. Thus, we grouped the first User and third observation together because they are in the same journey of bus v1. Similarly, the second and last observations in Fig. 2 belong to a journey of bus v2. Furthermore, the link between bus stops ‘‘Rautatieasema’’ and ‘‘Tammelan puistokatu’’ is a segment of v1’s journey. 3.2 Data Indexing To support efficient bus non-on-time queries, we implement an index based on attribute domain partition in BusOLAP. Specifically, we divide the domain of each attribute into several disjoint partitions. Then, for each attribute value of a record, we build a bit-vector to indicate the partition it belongs to. Thus, we can apply bit operations to perform bus non-on-time queries, which improves the query efficiency. Next, we introduce the details of the index building. For attribute A, the domain of A, denoted by DðAÞ, is the set of all available values on attribute A. A partition of DðAÞ, denoted by PDðAÞ, is a collection of non-empty subsets of DðAÞ such that (i) DðAÞ ¼ Sdi2PDðAÞ di, and (ii) for 8di; dj 2 PDðAÞ, i 6¼ j, di \ dj ¼ ;. Please note that, in BusOLAP, the order of elements in PDðAÞ is predetermined and fixed during performing queries. We denote by PDðAÞ½i the ith element in PDðAÞ. Furthermore, there may be several different partitions for an attribute. For example, the partitioning of Dð‘‘Date’’Þ can be by month or by season. For a record r, we denote by r.A the value of r on attribute A. Given a partition PDðAÞ on attribute A, the index of r.A is a bit-vector \b1; b2; . . .; bjPDðAÞj [ satisfying bi ¼ ( 1; r:A 2 PDðAÞ½i 0; r:A 62 PDðAÞ½i Example 3 A partition on attribute ‘‘Date’’ according to day of the week is illustrated in Fig. 3. The domain of ‘‘Date’’ is partitioned into seven subdomains, List of bus observa ons ID Bus Arrival 1 2 3 4 v1 v2 v1 v2 12:00 12:01 12:03 12:04 Stop Rauta easema Rauta easema Tammelan puistokatu Itsenäisyydenkatu ð1Þ 2016-08-03 2016-08-04 2016-08-05 2016-08-06 corresponding to the seven days of a week. The day ‘‘201608-03’’ in Table 1 represents August 3, 2016, and it is Wednesday. Therefore, the index of first record in Table 1 is \0; 0; 1; 0; 0; 0; 0 [ under partition PDð‘‘Date’’Þ. In Bus-OLAP, the selection of attribute partitions depends on: The query requirements. For example, we partition the domain of ‘‘Date’’ by day of the week and the domain of ‘‘Arrival’’ by time interval, so that we can find stops where most delaying events occur during the morning rush hour on Monday. The partition size. For example, we do not divide the domain of ‘‘Date’’ by date because the statistics on the non-on-time events within one day has little significance, and the space requirement following from the amount of subdomains (i.e., jPDð‘‘Date’’Þj) would be large. Table 2 lists the domain partition criteria for the main attributes listed in Table 1 by considering the application requirements. For a partition PDðAÞ on attribute A, the index of all records is the set VA ¼ fv1; v2; . . .; vjPDðAÞjg, where vi 2 VA (1 i n) is the column vector with respect to PDðAÞ½i . If the value of the kth element of vi is 1, the value of the kth record on attribute A belongs to PDðAÞ½i . Example 4 In Fig. 3, the column vector corresponding to ‘‘Wed’’ is v ¼ ð1; 0; 0; 0ÞT. The first element of v (i.e., 1) means that the value of the first record in Table 1 on attribute ‘‘Date’’ belongs to ‘‘Wed’’. As discussed above, the non-on-time query over bus journey data can be easily implemented by vector operations. For a bitmap VA ¼ fv1; v2; . . .; vng, we store vi 2 VA as a bit-vector. However, instead of loading all indexes, Bus-OLAP only loads the indexes (i.e., bit-vectors) of query-concerned attributes into memory. Then, a query can be converted into bit-vector operations. Week For a new bus journey record, the index can be efficiently updated by appending a new bit at the end of each bit-vector. 3.3 Index Operations For the sake of efficiency, we introduce some operations on indexes (bit-vectors) in Bus-OLAP. For a partition PDðAÞ, we define the subpartition of PDðAÞ as P^DðAÞ, where PDðAÞ. The set of bit-vectors corresponding to P^DðAÞ PDðAÞ is V^A ¼ fvi 2 VA j P^DðAÞ½i 2 PDðAÞg. We define ^ the OR operation on V^A as ORðV^AÞ ¼ v01 j v02 j . . . j v0m; where v0i 2 V^A (1 i m) and ‘‘j’’ denotes the bitwise OR operation between every two bit-vectors. Clearly, the result of ORðV^AÞ is also a bit-vector. The records satisfying subpartition P^DðAÞ can be obtained by index operations. Example 5 Consider Fig. 3, and suppose the query target is finding the records generated in weekdays (from Monday to Friday). Firstly, we fetch the subpartition P^Dð‘‘Date’’Þ ¼ fMon; Tue; Wed; Thu; Frig from PDð‘‘Date’’Þ. As a result, the bit-vector set with respect to P^Dð‘‘Date’’Þ is V^‘‘Date00 ¼ fv1; v2; v3; v4; v5g, where v1 ¼ ð0; 0; 0; 0ÞT, v2 ¼ ð0; 0; 0; 0ÞT, v3 ¼ ð1; 0; 0; 0ÞT, v4 ¼ ð0; 1; 0; 0ÞT, and v5 ¼ ð0; 0; 1; 0ÞT. Then, ORðV‘‘Date00 Þ ¼ v1 j v2 j v3 j v4 j v5 ¼ ð1; 1; 1; 0ÞT. Thus, the records satisfying the query conditions are the first three records in Table 1. In addition, we define the bitwise AND operation (denoted by&) for two bit-vectors belonging to different subpartitions, so that the query involving conditions on different attributes can be performed by index operations. Example 6 Consider the example in Fig. 4. Suppose that the query target is the set of records whose ‘‘Longitude’’ ranges from 23.73 to 23.74, ‘‘Latitude’’ ranges from 61.48 to 61.50, and ‘‘Date’’ is Wednesday or Thursday. Then, the subpartitions of ‘‘Longitude,’’ ‘‘Latitude,’’ and ‘‘Date’’ are fo1g, fa3; a4g, and fWed; Thug, respectively. The query result illustrated as the red area in Fig. 4 can be obtained by the following index operation: ORðfa3; a4gÞ&ORðfo1gÞ &ORðfWed; ThugÞ. 3.4 Parallel Query Computation Bus non-on-time query is computation intensive and response time sensitive. Figure 5 shows the framework of parallel query computation by bit-vectors using Spark [ 23 ]. Week Thu Wed Tue Mon o1 o2 o3 o4 Lon Lat a4 a3 a2 a1 Fig. 4 An example of attribute combination query Next, we introduce the details of parallel query computation by three typical kinds of queries. Please note that the queries to be discussed below are typical application scenarios for Bus-OLAP. However, it is easy to perform more complex queries using index operations with Spark. All bus non-on-time queries are performed using the framework shown in Fig. 5. Specifically, Bus-OLAP takes the query condition as input. In the query process, Bus-OLAP firstly Start End Load vectors Map conditions to tuples <c1, c2…cn> Output Spark actions Fig. 5 Framework of distributed query and computing. There are four steps: loading vectors into memory according to the query conditions; ` mapping the conditions into tuples \c1; c2; . . .; cn [ ; where ci, 0 i n, is the query condition; ´ partitioning the data into nodes and performing the bit operations on vectors by Spark; ˆ writing the output of the query result maps the conditions into tuples. Secondly, for each condition tuple, Bus-OLAP conducts the bit operations on vectors using parallel computation with Spark. Finally, Bus-OLAP returns the results that satisfy the query conditions. 3.4.1 Temporal Queries The typical temporal queries include querying the number of non-on-time events over a period of time and querying the routes that have the most non-on-time events over a period of time. Consider a query on the number of non-on-time events over a period of time, given time interval t, and non-ontime condition z (ahead of time, on-time, or delayed). The corresponding condition tuple in Fig. 5 is \t; z [ . Please note that there may be many condition tuples due to the number of time intervals. We compute each condition tuple in a parallel, using the computational model of Spark. For each condition tuple, the query result is obtained by index operations on bit-vectors. The number of non-on-time events equals the number of bits set to 1 in the result vector. Similarly, the query on the route that has the most nonon-time events over a period of time can be easily implemented. Given query conditions with respect to time interval t, non-on-time condition z, and bus line l, the computation on each condition tuple \t; z; l [ returns the number of delaying events. Then, the query result is the bus line that has the maximum number of delaying events over all tuple results. 3.4.2 Spatial Queries There are some typical spatial queries, such as KNN and RKNN, that can be applied in the bus data in the following way. Given a bus stop q with a number of delaying events, the analyst may want to know the bus stops close to q and analyze the factors related to the delays. After partitioning attributes ‘‘Longitude’’ and ‘‘Latitude,’’ we get a division of space into a grid. Each bus stop is located in a grid. For a bus stop q, the main steps to find the k nearest bus stops are: (i) search the candidate bus stops by extending a q-centered rectangular space that consists of grids; (ii) once there are at least k bus stops contained in the rectangular space, calculate the distance from q to the kth nearest stop p (denoted by disðq; pÞ); (iii) find k nearest neighboring stops within a q-centered circle space with radius disðp; qÞ. Example 7 Consider bus stop q in Fig. 6. To find five bus stops nearest to q, the first step is extending the q-centered rectangle. As R1 is the first (smallest) rectangle containing 5 bus stops, and p5 is the fifth nearest bus stop to q within R1, stops located in the q-centered circle with radius disðq; p5Þ are the candidates for the query results (5 nearest neighboring stops to q). Considering the characteristics of index operations, it is easy to find bus stops located within a rectangle area compared to a circle one. Thus, in Bus-OLAP, for stop q in Fig. 6, we find the KNN stops of q by checking the stops located within the smallest rectangle containing the qcentered circle with radius disðq; p5Þ, i.e., R2. Please note that it is easy to find stops located in R2 by index operations based on R1. For rectangle R1 in Fig. 6, the index operation on Longitude oplon ¼ ORðfo2; o3; o4; o5; o6gÞ and on Latitude oplat ¼ ORðfa2; a3; a4; a5; a6gÞ. After extending the rectangular area from R1 to R2, two partitions o1 and o7 are added into the Longitude of R1. Similarly, the two partitions a1 and a7 are added into the Latitude of R1. Consequently, the new result on Longitude is calculated as oplon j ORðfo1; o7gÞ and the new result on Latitude is calculated as oplat j ORðfa1; a7gÞ, respectively. Obviously, the search area can be efficiently extended by performing index operations on bit-vectors iteratively. a7 a6 a5 a4 a3 a2 a1 R2 R1 The RKNN query can be used to find all bus stops whose KNN stops include the given bus stop q. We implement the query using parallel computation on Spark. In the framework of Fig. 5, the condition tuple is \q; p; k [ ; where q is the given bus stop. Spark is used to compute each tuple, returning the result indicating whether q is one of the KNN stops of p. The RKNN query result is a set of bus stops that satisfy the query condition. 3.4.3 Spatial–Temporal Queries Detecting the spatial–temporal zone of bus delay aggregation can be done by querying the zone where delay aggregation occurs in a period of time. We measure the significance of spatial–temporal aggregation delay by loglikelihood ratio (LLR). Given a zone S and a time interval T, the log-likelihood of bus delay taking place, denoted as LðS; T Þ, is > > > > : 0; r1 r2 8 >> DðS; TÞ logðr1Þ þ ðDðeS; TeÞ > > < DðeS; TeÞ log DðeS; TeÞ ; r1 [ r2 N ðeS; TeÞ DðS; TÞÞ logðr2Þ ð2Þ LðS; TÞ ¼ where r1 ¼ DðS; TÞ ; r2 ¼ N ðS; TÞ DðeS; Te Þ N ðeS; Te Þ DðS; T Þ N ðS; T Þ where eS and Te are the maximal zone and maximal time interval, respectively; S and T are the observed zone and time interval, respectively; N ðS; TÞ denotes the total number of bus arrivals within S and T; and DðS; TÞ denotes the number of delaying events within S and T. Our target is to find the zone during a time interval that has maximal LLR in bus delay aggregation analysis. Figure 7 illustrates the processing of the delay aggregation query with Spark. Step 1 in Fig. 7 corresponds to Step 2 in Fig. 5, and Steps 2 and 3 in Fig. 7 correspond to Step 3 in Fig. 5. BusOLAP takes query conditions as input. Firstly, Step 1 joins the query conditions into condition tuples \o; a; d; t; de [ , where the variables denote ‘‘Longitude,’’ ‘‘Latitude,’’ ‘‘Date,’’ ‘‘Time,’’ and ‘‘Delay,’’ respectively. Secondly, Step 2 partitions tuples to nodes of the Spark cluster and gets the new tuples of LLR. Finally, Step 3 searches the tuple with maximal LLR, which is the query result. 3.5 Query with Data from Different Sources The on-time rate of bus service is vulnerable to the impact of external factors. For example, the delaying events happen more frequently in rainy days and foggy days. Clearly, <o3, a3, d3, t3, de3> … Join Map it is interesting and useful to consider various external factors, e.g., weather, when performing bus non-on-time analysis. To this end, we introduce a flexible implementation in Bus-OLAP such that queries containing conditions on external factors can be efficiently performed. It is challenging to query with external factors, since the external factors are extracted from different sources of data. For example, the bus journey records shown in Table 1 do not contain any weather information. If users want to investigate the impact of weather on bus delay, weather information should be collected at first. Technically, there are two steps to support queries containing data from different sources. The first step is fusing external factors with bus journey records. In other words, more attributes with respect to – – Stop 3084 external factors are added to bus journey records. The second step is building indexes for the external factors in order to support efficient query. Considering that changes in weather may significantly impact the bus on-time rate, we collect weather data from Weather Underground, which is a Web site sharing weather information of global cities.2 Next, we introduce our method to fuse external factors with bus journey records for bus non-on-time queries by taking weather data as an example. To give an example, we collect Tampere’s local average daily temperature and daily weather activity from Weather Underground. The reasons we collect these two factors are that (i) average daily temperature and daily weather activity are two main features of weather status, and (ii) the data types of these two factors are typical. The domain of 2 https://www.wunderground.com. average daily temperature is continuous, while the domain of daily weather activity containing ‘‘rain,’’ ‘‘fog,’’ and ‘‘thunder’’ is enumerable. (Here, we only consider the weather activities that affect the bus on-time rate negatively.) Please recall that, as listed in Table 1, the bus journey records contain temporal information (i.e., date and bus arrival time). As both average daily temperature and daily weather activity are time related, it is easy to associate each bus journey record with the average temperature and weather activity in that moment. Example 8 On August 5, 2015, at Tampere city, the average daily temperature was 15 C and the weather was rainy. Then, for the 3rd record listed in Table 1, we fuse it with weather factors (in bold font) as follows. For the sake of query efficiency, it is necessary to build indexes for the external factors. Similar to building indexes for bus journey data (Sect. 3.2), for each external factor, we build corresponding indexes based on the domain partition. For an external factor whose data type is continuous (e.g., average daily temperature), the index is built like building the index on ‘‘Delay’’. Specifically, for each bus journey record, we construct a bit-vector indicating the interval that the value of average daily temperature locates in. For an external factor whose data type is enumerable (e.g., daily weather activity), the index is built like building the index on ‘‘day of the week’’. Specifically, we use a set of bit-vectors to indicate the values of daily weather for all bus journey records. 4 Empirical Evaluation In this section, we report a systematic empirical study on a real-world bus data set from Tampere city.3 The data set includes the bus stop information of Tampere, 43 bus routes, 75 bus stops of a line on average, and 6,297,520 records generated from all weekdays from August 1, 2015, to October 30, 2015. We also collected daily weather activity and average daily temperature for each record from Weather Underground Web site. All algorithms are implemented in Java and compiled using JDK 7. The distributed environment is built using Spark 1.6.2 and includes eight nodes. Each node is a computer with an Intel Core i7-6700 3.40 GHz CPU, and 64 GB main memory, running Ubuntu 14.04 operating system. Figure 8 illustrates the quantitative relationships between the numbers of non-on-time arrivals with different time intervals (measured in minutes) ahead or delayed. From Fig. 8, it is interesting to see that most (over 75%) of non-on-time arrivals happened within 3 min compared to 3 http://trafficdata.sis.uta.fi. the scheduled time. Specifically, for the early case, there are over 87% arrivals arriving slightly ahead of schedule. And for the delaying case, there are 77% arrivals arriving slightly behind the schedule. Considering the complex situations in the real world, it is impractical to expect that every bus arrives on-time exactly. Naturally, we treat bus arrivals with slight time ahead or delayed as normal (ontime) cases. Thus, in our empirical study, we label each bus arrival as a non-on-time event including ‘‘early case’’ or ‘‘delaying case’’, if the value of ‘‘Delay’’ is \ 3 min (‘‘early case’’) or [ 3 min (‘‘delaying case’’). Figure 9a illustrates the number of days with respect to certain weather. The daily weather contained in weather data include: rain, fog, and thunder. We can see that the rainy weather is more frequent than other weather activities. As the number of days with thunder (i.e., 3) is too small, we only consider rain and fog in this empirical study. Figure 9b shows the change of daily temperature. The highest temperature is 19 C on August 15, and the lowest temperature is 2 C on October 9, 28, and 30. There are 51 such days that the average daily temperature is above 10 C. 4.1 Effectiveness We verify the effectiveness of Bus-OLAP using three kinds of typical queries: temporal queries, spatial queries, and spatial–temporal queries. For the temporal queries, there are two frequent queries: (1) What is the number of delaying events in every weekday? (2) Which are the routes that have the most delaying events? Figure 10 illustrates the numbers of non-on-time arrival events on weekdays in August, September, and October, respectively. As we stated in Sect. 1, the main concern of bus non-on-time query is the delaying case, and in Fig. 10, early delaying early delaying 6 we can see that the number of delaying cases is much larger than the number of early cases. It is interesting to see that the weekday with the maximum number of non-on-time events keeps changing from August to October. For example, in August, the number of delaying cases on Monday is the largest, in September, the largest number is on Tuesday, while in October, the largest one is on Thursday. Table 3 lists the top 3 routes where delaying events occur frequently during different time intervals. It is interesting to see that Lines 13, 8, and 3 are the routes with the largest numbers of delaying events, especially in time interval [14:00, 18:00), over the three months. We also find that the lines that have frequently delaying events during different periods are different. One possible explanation is that the direction of crowd flow in the morning rush hour is different from that in the evening rush hour. The buses were delayed due to the getting in and getting off time by large numbers of passengers. Figures 11 and 12 illustrate the query results of spatial queries, i.e., KNN and RKNN, on two bus stops, respectively. The two target bus stops are ‘‘Yliopistonkatu’’(ID: 560) and ‘‘Villa Viola’’ (ID: 700). In Fig. 11, the red points are the query bus stops, while the blue points are the eight bus stops nearest to the query stops. Similarly, in Fig. 12, the red points are the query bus stops, while the blue points are the RKNN bus stops for the query stops. One typical bus non-on-time query is detecting bus delay aggregation. Please recall that, as listed in Table 2, we partitioned the space of Tampere city into 2000 1000 equal-sized grids in Bus-OLAP. Let the maximum search range be 100 100 grids. Then, by Bus-OLAP, we can answer following questions with different given time intervals: (1) ‘‘On which weekday the most significant delay aggregations took place?’’, (2) ‘‘In which time interval the most significant delay aggregation occur?’’, (3) ‘‘Which day and time interval has the most significant bus delay aggregations.’’ Table 4 Zone with spatial– temporal delay aggregation of every weekday Weekday Mon Tue Wed Thu Fri Table 4 lists the zones of spatial–temporal delay aggregation with different time intervals. We present a zone in the form of (minLon, maxLon, minLat, maxLat), where minLon and maxLon denote, respectively, the minimal and maximal longitude of a zone, minLat and maxLat denote, respectively, the minimal and maximal latitude of a zone. As listed in Table 4, there are more significant delay aggregations during [14:00, 18:00) compared to other time intervals. Furthermore, period [14:00, 18:00) on Friday is the worst time interval as it has the largest number of significant delay aggregations. We guess it is because the traffic flow is heavy in the afternoon rush hour of Friday. Figure 13 shows the zones with significant bus delay aggregation in the morning (a) and in the afternoon (b), respectively. We find that the zones where bus delay aggregations happened are non-overlapping. We think one possible reason is that the zones where buses are full of people are different in the morning and afternoon. To verify the correctness of the detected zones with bus delay aggregation, we consider the heat delaying cases in Fig. 14. It is easy to see that the detected zones are consistent with the heat distributions. Intuitively, the bus delay aggregation is closely related to the traffic flow, detecting the zones with significant bus delay LLR respectively, indicated by rectangles with red solid line, blue dot line, and dark dash line.), a [07:00, 08:00), b [08:00, 09:00), c [16:00, 17:00), d [17:00, 18:00) aggregation can reflect the change of traffic flow. In addition, it is interesting to investigate the impacts of weather conditions on the bus delay aggregation. To this end, we perform queries taking weather factor into consideration over morning and evening rush hours, respectively. In Fig. 15, we use rectangles with red solid line, blue dot line, and dark dash line to indicate the zones where significant bus delay aggregation happened in normal weather, rainy weather, and foggy weather, respectively. We can observe the flow of traffic changes from the east side of Fig. 16 Zones with the most significant busy delay aggregation w.r.t. average daily temperature in morning and evening rush hours. (Rectangles with red solid line are the zones when the average daily temperature [ 10 C, and rectangles with blue dash line are the zones when the average daily temperature 10 C.), a [07:00, 08:00), b [08:00, 09:00), c [16:00, 17:00), d [17:00, 18:00) Tampere city to the west side in both normal weather and foggy weather from 16:00 to 18:00. We also see that the zone with the most significant bus delay aggregation in foggy weather is clearly different from the ones in normal and rainy weather during 8:00 9:00 am. In addition, we can see that the zone with the most significant bus delay aggregation in rainy weather is clearly different from the ones in normal and foggy weather during 17:00 18:00. We guess the reason lies in that, firstly, the impact of fog on bus delay aggregation mainly exists in the morning, since fog disperses in the afternoon, and secondly, the impact of rain mainly exists in the evening due to the accumulation of rainwater on the road surface. In Fig. 16, we illustrate the zones with the most significant busy delay aggregations when the average daily temperature is above 10 C (rectangle with red solid line) and is not above 10 C (rectangle with blue dash line), respectively. Clearly, the zones detected under different temperature ranges are different. As shown in Figs. 15 and 16, taking weather factors into bus non-on-time analysis is necessary and reasonable. 4.2 Efficiency In Bus-OLAP, we build the indexes based on bit-vectors and perform bus non-on-time queries by index operations. In order to verify the effectiveness of the index building, we first test the runtime for building the indexes with respect to the number of records in Bus-OLAP. Then, we compare the query efficiency using Bus-OLAP against the query using relational database MySQL 5.5. The results of efficiency test are shown in Fig. 17. We test the efficiency by three most frequently used queries, querying the time interval with the most delaying events for temporal query shown in Fig. 17b, searching the RKNN bus stops for spatial query illustrated in Fig. 17c and finding the zones with significant bus delay aggregation from Monday to Friday for temporal–spatial query shown in Fig. 17d. Clearly, as shown in Fig. 17a, we can see that the index building in Bus-OLAP is efficient, and the building time increases linearly with the size of bus journey records. Moreover, performing bus non-on-time query by BusOLAP is significantly more efficient than by MySQL. To satisfy the requirement of large number of queries, we have implement Bus-OLAP on Spark, so that parallel query computation can be supported. Again, we test the query efficiency with respect to the number of computing nodes by finding the zones with significant bus delay Bus−OLAP MySQL 10 100 Records Size (104) (b) 1000 Bus−OLAP MySQL aggregation during a certain time interval in weekday. Figure 18a shows the runtime for computing the LLR of every grid with respect to the number of computing nodes. Figure 18b shows the runtime for querying delay aggregation zones during time interval [16:00, 17:00) on weekdays. It is clear that in both LLR computation and aggregation queries, the runtime of Bus-OLAP decreases when more computing nodes are used. In summary, by our proposed index building method and the parallel query framework, Bus-OLAP is efficient for bus non-on-time query. 5 Conclusions In this paper, we tackled the novel and interesting problem of non-on-time query over bus journey data. We designed a model, named Bus-OLAP, to support non-on-time queries over the data. For the sake of efficiency, we built the index of bus journey data using bit-vectors and introduced index operations to convert the queries into bitwise operations. In addition, we implemented the distributed query computation based on the Spark framework. Our experiments verified the effectiveness and efficiency of Bus-OLAP. There are several interesting issues that deserve research effort in the future. First, we will consider more complex scenario applications of non-on-time analysis and fuse some implicit factors (e.g., road conditions, number of passengers) that affect the bus. Second, there are large bus data sets in reality, and we try to apply the Bus-OLAP to more real data and more data sources of other cities. Then, we will continue on optimizing the frequently used non-ontime queries. It is also interesting to consider dynamic computation of the current traffic situation, that is, analyzing if the current situation is different from ‘‘normal.’’ Moreover, we plan to combine the bus data analysis with the management of urban traffic and study the relationships between them. Acknowledgements The authors are grateful to Dr. Paula Syrja¨rinne for her help on bus journey data preparation and to the editor and the anonymous reviewers for their constructive comments, which have helped to improve this paper. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 1. Bao J , He T , Ruan S , Li Y , Zheng Y ( 2017 ) Planning bike lanes based on sharing-bikes' trajectories . In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining , pp 1377 - 1386 2. Chen M , Liu Y , Yu X ( 2015 ) Predicting next locations with object clustering and trajectory clustering . In: Proceedings of the 19th Pacific-Asia conference on knowledge discovery and data mining, part II , pp 344 - 356 3. Eldawy A , Mokbel MF ( 2013 ) A demonstration of spatialhadoop: an efficient mapreduce framework for spatial data . Proc VLDB Endow 6 ( 12 ): 1230 - 1233 4. Ghosh B , Basu B , O'Mahony M ( 2009 ) Multivariate short-term traffic flow forecasting using time-series analysis . IEEE Trans Intell Transp Syst 10 ( 2 ): 246 - 254 5. Guttman A ( 1984 ) R-trees: a dynamic index structure for spatial searching . In: Proceedings of ACM SIGMOD'84 , pp 47 - 57 6. Han Y , Moutarde F ( 2016 ) Analysis of large-scale traffic dynamics in an urban transportation network using non-negative tensor factorization . Int J Intell Transp Syst Res 14 ( 1 ): 36 - 49 7. Jagadish HV , Ooi BC , Tan KL , Yu C , Zhang R ( 2005 ) iDistance: an adaptive Bþ-tree based indexing method for nearest neighbor search . ACM Trans Database Syst 30 ( 2 ): 364 - 397 8. Kong X , Xu Z , Shen G , Wang J , Yang Q , Zhang B ( 2016 ) Urban traffic congestion estimation and prediction based on floating car trajectory data . Future Gener Comput Syst 61 : 97 - 107 9. Liu D , Chen H , Qi H , Yang B ( 2013 ) Advances in spatiotemporal data mining . J Comput Res Dev 50 ( 2 ): 225 - 239 10. Pang LX , Chawla S , Liu W , Zheng Y ( 2011 ) On mining anomalous patterns in road traffic streams . In: Proceedings of the 7th international conference on advanced data mining and applications , part II , pp 237 - 251 11. Pang T , Duan L , Nummenmaa J , Zuo J , Zhang P ( 2017 ) BusOLAP: a bus journey data management model for non-on-time events query . In: Proceedings of the 1st international joint conference on web and big data , part II , pp 185 - 200 12. Sistla AP , Wolfson O , Chamberlain S , Dao S ( 1997 ) Modeling and querying moving objects . In: Proceedings of the 13th international conference on data engineering , pp 422 - 432 13. Stathopoulos A , Karlaftis MG ( 2003 ) A multivariate state space approach for urban traffic flow modeling and prediction . Transp Res Part C: Emerg Technol 11 ( 2 ): 121 - 135 14. Syrja¨rinne P, Nummenmaa J ( 2015 ) Improving usability of open public transportation data . In: Proceedings of the 22nd ITS world congress , pp 5 - 9 15. Syrja¨rinne P, Nummenmaa J , Thanisch P , Kerminen R , Hakulinen E ( 2015 ) Analysing traffic fluency from bus data . IET Intell Transp Syst 9 ( 6 ): 566 - 572 16. Ting RH , De Almeida T , Ding Z ( 2006 ) Modeling and querying moving objects in networks . VLDB J 15 ( 2 ): 165 - 190 17. Wang Y , Papageorgiou M , Messmer A ( 2007 ) Real-time freeway traffic state estimation based on extended Kalman filter: a case study . Transp Sci 41 ( 2 ): 167 - 181 18. Wu X , Duan L , Pang T , Nummenmaa J ( 2016 ) Detection of statistically significant bus delay aggregation by spatial-temporal scanning . In: Proceedings of APWeb 2016 workshops , pp 277 - 288 19. Xia D , Li H , Wang B , Li Y ( 2016 ) A map reduce-based nearest neighbor approach for big-data-driven traffic flow prediction . IEEE Access 4 : 2920 - 2934 20. Xie X , Xiong Z , Hu X , Zhou G , Ni J ( 2014 ) On massive spatial data retrieval based on spark . In: Proceedings of WAIM 2014 international workshops , pp 200 - 208 21. Yu X , Pu KQ , Koudas N ( 2005 ) Monitoring k-nearest neighbor queries over moving objects . In: Proceedings of the 21st international conference on data engineering , pp 631 - 642 22. Yuan J , Zheng Y , Zhang L , Xie X , Sun G ( 2011 ) Where to find my next passenger . In: Proceedings of the 13th international conference on ubiquitous computing , pp 109 - 118 23. Zaharia M , Xin RS , Wendell P , Das T , Armbrust M , Dave A , Meng X , Rosen J , Venkataraman S , Franklin MJ , Ghodsi A , Gonzalez J , Shenker S , Stoica I ( 2016 ) Apache spark: a unified engine for big data processing . Commun ACM 59 ( 11 ): 56 - 65 24. Zhang J , Zheng Y , Qi D ( 2017 ) Deep spatio-temporal residual networks for citywide crowd flows prediction . In: Proceedings of the 31st AAAI conference on artificial intelligence , pp 1655 - 1661 25. Zheng Y , Capra L , Wolfson O , Yang H ( 2014 ) Urban computing: concepts, methodologies, and applications . ACM Trans Intell Syst Technol 5 ( 3 ): 38

This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs41019-018-0061-9.pdf

Lei Duan, Tinghai Pang, Jyrki Nummenmaa, Jie Zuo, Peng Zhang, Changjie Tang. Bus-OLAP: A Data Management Model for Non-on-Time Events Query Over Bus Journey Data, Data Science and Engineering, 2018, 52-67, DOI: 10.1007/s41019-018-0061-9