A machine learning-based approach for classifying tourists and locals using geotagged photos: the case of Tokyo

Information Technology & Tourism
https://doi.org/10.1007/s40558-021-00208-3

ORIGINAL RESEARCH

Ahmed Derdouri¹ · Toshihiro Osaragi¹

Received: 17 November 2020 / Revised: 16 June 2021 / Accepted: 27 August 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract

In tourism-dependent cities, investigating the spatiotemporal distribution and dynamics of tourist flows is crucial for better urban planning in both steady and perturbed states. In recent years, researchers have started relying more on photo-based, geotagged social data, which offer insights about tourists, popular hotspots, and mobility patterns. However, distinguishing between tourists and locals from this data is problematic since residence information is often not provided. While previous studies rely on heuristic (e.g., period of stay) and probabilistic (Shannon entropy) approaches, this paper proposes a method for classifying tourists and residents based on machine learning (ML) algorithms and considering parameters that could explain the variability between the two (e.g., weather, mobility, and photo content). This approach was applied to Flickr users’ geotagged photos taken in Tokyo’s 23 special wards from July 2008 to December 2019. The results show that stacked ensemble (SE) models are superior to models based on five supervised-learning algorithms, including gradient boosting machine (GBM), generalized linear model (GLM), distributed random forest (DRF), deep learning (DL), and extremely randomized trees (XRT). Temporal entropy (TEN), mobility on workdays, and frequent visits to amusement venues and crowded places influenced how users were classified. While temporal distribution showed similar monthly/hourly patterns, spatial distribution varied.
The proposed approach might pave the way for scholars to carry out future tourism research on different topics and subsequently support policymakers in the decision-making process, specifically in urban settings.

Keywords Classification · Tourists · Locals · Big data · Machine learning · Flickr · Geotagged photos · Tokyo

* Ahmed Derdouri
¹ School of Environment and Society, Tokyo Institute of Technology, 2-12-1-M1-25 Ookayama, Meguro-ku, Tokyo 152-8550, Japan

1 Introduction

In tourism research, big data sources fall into three principal categories: (1) user-generated content (UGC), mainly consisting of online text- and photo-based social media records; (2) device data, including mobile phone and global positioning system (GPS) information; and (3) transaction data, for example, web searches and booking data (Li et al. 2018). With the rapid spread of social media tools and their popularity among travelers for documenting trips, recent literature reviews reveal that social media-based UGC is the most popular data source among researchers (Li et al. 2018; Li and Law 2020), who tend to choose social media UGC over conventional sources of small data to analyze tourists and tourism. Major topics include examining mobility patterns by reconstructing trajectories (Paraskevopoulos and Palpanas 2018; Straumann et al. 2014; Yang et al. 2017; Yuan and Medel 2016; Zeng et al. 2012), identifying tourist landmarks and hotspots (Kim et al. 2017; Samany 2019), analyzing tourists’ sentiments and behaviors (Jang and Moutinho 2019; Zhang et al. 2019, 2020), and recommending routes or planning trips (Kurashima et al. 2013; Lu et al. 2010). The low cost of and easy access to UGC datasets are key factors behind their popularity among researchers. Unlike other data types, UGC is updated regularly and covers a long time span and large geospace, resulting in bigger datasets with rich metadata.
Such data could generally be purchased from telecommunication companies at high prices, depending on the breadth of the target area and the span of the study period. An increasing number of researchers have begun conducting in-depth studies to analyze, or even predict, human mobility patterns based on low-cost, geotagged records collected from social networks (e.g., Twitter, Weibo). These data have proven useful, although records with spatial attributes only account for a small percentage of all social media data (3.33% for Twitter) (Chen et al. 2019a). In addition to text-based social networks, photo-sharing platforms (e.g., Flickr) serve as sources of geotagged data, and they are useful for analyzing tourism issues from different viewpoints. For instance, Xu et al. (2020) note that large-scale datasets with abundant metadata can be used in longitudinal studies concerning unsustainable tourism, principally resulting from the effects of long-term, accumulated behavior. The advantages of UGC data over device and transaction data are multiple. For example, mobile phone data, a type of device data, are usually expensive to obtain, and this cost greatly depends on the spatial scale of the study area or the research period. Moreover, due to privacy concerns, mobile phone data are usually provided as macro-scale aggregated statistics instead of micro-scale samples, and such data do not include useful metadata. Thus, their application to tourism research is limited. Another form of device data is global positioning system (GPS) data. According to Li et al. (2018), two sources of GPS data are recognized in tourism studies: GPS loggers carried voluntarily by participants and GPS-enabled mobile applications owned by third parties. Despite the high accuracy and continuity of collection (Shoval et al. 2014), GPS data, when collected by volunteers, may suffer from biased results due to sample size and choice. 
Distinguishing between tourists and locals is crucial because the groups are dissimilar in many ways, including size and mobility patterns (Hasnat and Hasan 2018) during steady conditions and, most importantly, during perturbed states, such as natural disasters or large events (e.g., the Olympic Games). Previous human geography-related studies have focused mainly on understanding and modeling locals’ travel choices and behaviors, while ignoring those of tourists, either because they based their analysis on official survey data that includes only locals (e.g. Osaragi 2004; Osaragi and Hoshino 2012; Osaragi and Kudo 2019) or because they considered both groups to be homogeneous (e.g. Ma et al. 2020). Consequently, city planners and decision-makers know little about tourists’ travel choices. However, ignoring this population group may lead to serious environmental, economic, and socioeconomic consequences, especially in cities largely reliant on tourism. Saenz-de-Miera and Rosselló (2014) simulate tourists’ contribution to air pollution on the Spanish island of Mallorca, a top Mediterranean destination. They report that particulate (...truncated)