Taking a ‘Big Data’ approach to data quality in a citizen science project (pdf)

Ambio 2015, 44(Suppl. 4):S601–S611
DOI 10.1007/s13280-015-0710-4

Steve Kelling, Daniel Fink, Frank A. La Sorte, Alison Johnston, Nicholas E. Bruns, Wesley M. Hochachka

Abstract

Data from well-designed experiments provide the strongest evidence of causation in biodiversity studies. However, for many species the collection of these data is not scalable to the spatial and temporal extents required to understand patterns at the population level. Only citizen science projects can gather sufficient quantities of data, but data collected from volunteers are inherently noisy and heterogeneous. Here we describe a ‘Big Data’ approach to improve the data quality in eBird, a global citizen science project that gathers bird observations. First, eBird’s data submission design ensures that all data meet high standards of completeness and accuracy. Second, we take a ‘sensor calibration’ approach to measure individual variation in eBird participants’ ability to detect and identify birds. Third, we use species distribution models to fill in data gaps. Finally, we provide examples of novel analyses exploring population-level patterns in bird distributions.

Keywords

Biodiversity monitoring · Citizen science · eBird · Data quality · Species distribution models

INTRODUCTION

The conservation of species begins with an understanding of the patterns of distribution, abundance, and movements of individuals. These patterns are driven by an interacting series of climatic, geological, ecological, and anthropogenic processes operating simultaneously across a range of spatial and temporal scales (Bell 2012). Only by comparing these patterns across a range of spatial and temporal scales can we begin to identify the interacting role of these processes.
For example, if a species–habitat association does not vary across a wide geographical area, we can gather data within a limited spatial extent and make inferences and predictions well outside the area of data collection. When species–habitat associations change across spatial or temporal scales, as they often do (Gaston and Spicer 2013), then making predictions requires a broader spatio-temporal perspective. In general, to study and understand entire ecological systems, data must be collected at fine resolutions over broad spatial and temporal extents, particularly for wide-ranging species.

However, the cost and availability of experts needed to collect sufficient quantities of ecological data do not scale readily across broad spatial or temporal extents. Citizen science projects have emerged as an efficient way to gather such data by engaging a large number of people and compiling their ecological observations, and the fastest growth in species’ distribution data comes from volunteers participating in citizen science projects (Pimm et al. 2014).

Nevertheless, data gathered by citizen science projects are often highly variable due to the opportunistic approach to data collection, which poses several challenges to their analysis and interpretation. First, engaging the large numbers of volunteers needed to collect data across broad extents requires data collection protocols that are straightforward and enjoyable, instead of complex and tedious (Bonney et al. 2009). The drawback to this approach is that it gives volunteers the choice of how, where, and when they make observations. In general, this results in more heterogeneous data that are less informative than data collected under more constrained data collection protocols (Hochachka et al. 2012). Second, open participation of a broad public will attract participants with varied skill levels at detecting and identifying organisms.
Third, many citizen science projects fall into the category of surveillance monitoring, which is motivated by general data collection for many uses and lacks strong a priori hypotheses to shape the data collection protocol; this has, for example, led to criticism of surveillance monitoring for its lack of management-oriented hypotheses (Nichols and Williams 2006). The lack of well-defined hypotheses to define data collection protocols and individual variability in skill levels and data collection processes are major obstacles to accurately interpreting citizen science data; both must be recognized and addressed during analysis.

© The Author(s) 2015. This article is published with open access at Springerlink.com

Recent advances in Big Data, a broadly defined field encompassing the access, management, and computational processing of extremely large data sets to reveal associations, patterns, and trends (Manyika et al. 2011), are increasingly being integrated into ecological studies (Hampton et al. 2013). Big Data is not just about “a lot of data” but includes developing methods to handle the constant acquisition of new data, integrating disparate data from multiple sources, and most importantly addressing issues of data quality across the various sources of data (Lagoze 2014).

One citizen science project that is collecting large volumes of data across broad spatial and temporal extents is eBird (Sullivan et al. 2009), which uses Big Data techniques to curate, access, and analyze data. The goal of collecting information about birds’ distributions across huge regions and throughout the year requires eBird to engage a large number of regular participants; this is only possible when participants are not highly constrained in how they make their observations (Wood et al. 2011).
However, these same protocols impose a cost during data analysis, because data are easier to analyze when they conform to standardized protocols, which remove potential sources of variation in counts of birds by constraining aspects of the observation process such as locations, times of day, and durations of observation. The analysis challenge is the need to identify and model important aspects of the observation process that were not controlled during data collection. Once sources of variation in the observation process can be modeled, they can be predicted, which provides the avenue for post-collection analytical data quality control.

In this paper, we address the data quality challenges inherent in surveillance monitoring projects such as eBird. First, we describe how we improve data quality during data submission. Second, we show how we can model variability among individual participants. Third, we describe how we use species distribution models to fill in data gaps while also modeling the data collection process in order to control for biases in where, when, and how eBird data are collected.

MATERIALS AND METHODS

eBird

Data for this study came from eBird, which engages volunteers via the Internet and mobile apps to collect bird observations (Sullivan et al. 2014). Presently, more than 250 000 participants have submitted more than 17 (...truncated)