Taking a ‘Big Data’ approach to data quality in a citizen science project
Ambio 2015, 44(Suppl. 4):S601–S611
DOI 10.1007/s13280-015-0710-4
Steve Kelling, Daniel Fink, Frank A. La Sorte, Alison Johnston,
Nicholas E. Bruns, Wesley M. Hochachka
Abstract Data from well-designed experiments provide
the strongest evidence of causation in biodiversity studies.
However, for many species the collection of these data is
not scalable to the spatial and temporal extents required to
understand patterns at the population level. Only citizen
science projects can gather sufficient
quantities of data, but data collected by volunteers are
inherently noisy and heterogeneous. Here we describe a
‘Big Data’ approach to improve the data quality in eBird, a
global citizen science project that gathers bird
observations. First, eBird’s data submission design
ensures that all data meet high standards of completeness
and accuracy. Second, we take a ‘sensor calibration’
approach to measure individual variation in eBird
participants’ ability to detect and identify birds. Third,
we use species distribution models to fill in data gaps.
Finally, we provide examples of novel analyses exploring
population-level patterns in bird distributions.
Keywords Biodiversity monitoring · Citizen science ·
eBird · Data quality · Species distribution models
INTRODUCTION
The conservation of species begins with an understanding
of the patterns of distribution, abundance, and movements
of individuals. These patterns are driven by an interacting
series of climatic, geological, ecological, and anthropogenic processes operating simultaneously across a range
of spatial and temporal scales (Bell 2012). Only by comparing these patterns across a range of spatial and temporal
scales can we begin to identify the interacting role of these
processes. For example, if a species–habitat association
does not vary across a wide geographical area, we can
gather data within a limited spatial extent and make
inferences and predictions well outside the area of data
collection. When species–habitat associations change
across spatial or temporal scales, as they often do (Gaston
and Spicer 2013), then making predictions requires a
broader spatio-temporal perspective.
In general, to study and understand entire ecological
systems, data must be collected at fine resolutions over
broad spatial and temporal extents, particularly for wide-ranging species. However, the cost and availability of
experts needed to collect sufficient quantities of ecological
data do not scale readily across broad spatial or temporal
extents. Citizen science projects have emerged as an efficient way to gather such data by engaging a large number
of people and compiling their ecological observations, and
the fastest growth in species’ distribution data comes from
volunteers participating in citizen science projects (Pimm
et al. 2014).
Nevertheless, data gathered by citizen science projects
are often highly variable due to the opportunistic
approach to data collection, which poses several challenges to their analysis and interpretation. First, engaging
the large numbers of volunteers needed to collect data
across broad extents requires data collection protocols
that are straightforward and enjoyable, instead of complex and tedious (Bonney et al. 2009). The drawback to
this approach is that it gives volunteers the choice of
how, where, and when they make observations. In general, this results in more heterogeneous data that are less
informative than data collected under more constrained
data collection protocols (Hochachka et al. 2012). Second, open participation of a broad public will attract
participants with varied skill levels at detecting and
identifying organisms. Third, many citizen science projects fall into the category of surveillance monitoring,
© The Author(s) 2015. This article is published with open access at Springerlink.com
which is motivated by general data collection for many
uses and lacks strong a priori hypotheses to shape the
data collection protocol; this has, for example, led to
criticism of surveillance monitoring for its lack of management-oriented hypotheses (Nichols and Williams
2006). The lack of well-defined hypotheses to shape data
collection protocols and individual variability in skill
levels and data collection processes are major obstacles
to accurately interpreting citizen science data; both must
be recognized and addressed during analysis.
Recent advances in Big Data, a broadly defined field
encompassing the access, management, and computational
processing of extremely large data sets to reveal associations, patterns, and trends (Manyika et al. 2011), are
increasingly being integrated into ecological studies
(Hampton et al. 2013). Big Data is not just about “a lot of
data” but includes developing methods to handle the constant acquisition of new data, integrating disparate data
from multiple sources, and most importantly addressing
issues of data quality across the various sources of data
(Lagoze 2014).
One citizen science project that is collecting large
volumes of data across broad spatial and temporal
extents is eBird (Sullivan et al. 2009), which uses Big
Data techniques to curate, access, and analyze data.
The goal of collecting information about birds’ distributions across huge regions and throughout the year
requires eBird to engage a large number of regular
participants; this is only possible when participants are
not highly constrained in how they make their observations (Wood et al. 2011). However, these same protocols impose a cost during the data analysis because it
is easier to analyze data that conform to more standardized protocols that remove potential sources of
variation in counts of birds by constraining aspects of
the observation process, for example, locations, times
of day, and durations of observation. The analysis
challenge is the need to identify and model important
aspects of the observation process that were not controlled during data collection. Once sources of variation
in the observation process can be modeled, they can be
predicted, which provides the avenue for post-collection analytical data quality control.
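The idea of modeling an uncontrolled aspect of the observation process and then predicting it away can be illustrated with a minimal sketch. It assumes simulated checklists rather than eBird data, and a simple Poisson regression rather than the models used in this paper; the variable names (duration_h, forest), the coefficient values, and the helper fit_poisson are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated checklists (hypothetical, not eBird data).
n = 2000
duration_h = rng.uniform(0.25, 4.0, size=n)  # uncontrolled effort, in hours
forest = rng.uniform(0.0, 1.0, size=n)       # habitat covariate (fraction forest)

# Assumed data-generating process: counts rise with habitat and with effort.
true_beta = np.array([0.5, 0.8, 1.0])        # intercept, log(duration), forest
X = np.column_stack([np.ones(n), np.log(duration_h), forest])
counts = rng.poisson(np.exp(X @ true_beta))


def fit_poisson(X, y, n_iter=25):
    """Poisson regression with a log link, fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())               # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                # current expected counts
        gradient = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


beta_hat = fit_poisson(X, counts)

# Predict every checklist as if it had been a standardized 1-hour count.
X_standard = X.copy()
X_standard[:, 1] = 0.0                       # log(1 hour) = 0
expected_standard = np.exp(X_standard @ beta_hat)
```

In this sketch, the expected counts in expected_standard vary only with habitat, because the estimated effect of observation duration has been held at a common value. The same logic would extend to other modeled aspects of the observation process, such as time of day or distance traveled.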
In this paper, we address the data quality challenges
inherent in surveillance monitoring projects such as eBird.
First, we describe how we improve data quality during data
submission. Second, we show how we can model variability among individual participants. Third, we describe
how we use species distribution models to fill in data gaps
while also modeling the data collection process in order to
control for biases in where, when, and how eBird data are
collected.
MATERIALS AND METHODS
eBird
Data for this study came from eBird, which engages volunteers via the Internet and mobile apps to collect bird
observations (Sullivan et al. 2014). Presently, more than
250 000 participants have submitted more than 17 (...truncated)