Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data
Biodiversity Information Science and Standards 4: e50889
doi: 10.3897/biss.4.50889
Standards
Arthur D Chapman‡, Lee Belbin§, Paula F Zermoglio|, John Wieczorek¶, Paul J Morris#, Miles Nicholls¤,
Emily Rose Rees¤, Allan Koch Veiga«, Alexander Thompson», Antonio Mauro Saraiva«, Shelley A
James˄, Christian Gendreau˅, Abigail Benson¦, Dmitry Schigel˅
‡ Australian Biodiversity Information Services, Ballan, Australia
§ The Atlas of Living Australia, Carlton, Australia
| VertNet, Buenos Aires, Argentina
¶ Museum of Vertebrate Zoology, University of California, Berkeley, United States of America
# Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States of America
¤ Atlas of Living Australia, Canberra, Australia
« University of Sao Paulo, Sao Paulo, Brazil
» iDigBio, Gainesville, United States of America
˄ Department of Biodiversity, Conservation and Attractions, Western Australian Herbarium, Kensington, WA, Australia
˅ Global Biodiversity Information Facility - Secretariat, Copenhagen Ø, Denmark
¦ U.S. Geological Survey, Lakewood, CO, United States of America
Corresponding author: Arthur D Chapman
Academic editor: Gail Kampmeier
Received: 07 Feb 2020 | Accepted: 16 Mar 2020 | Published: 20 Mar 2020
Citation: Chapman AD, Belbin L, Zermoglio PF, Wieczorek J, Morris PJ, Nicholls M, Rees ER, Veiga AK,
Thompson A, Saraiva AM, James SA, Gendreau C, Benson A, Schigel D (2020) Developing Standards for
Improved Data Quality and for Selecting Fit for Use Biodiversity Data. Biodiversity Information Science and
Standards 4: e50889. https://doi.org/10.3897/biss.4.50889
© Chapman A et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC
BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are
credited.
Abstract
The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global
Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated
Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often
questioned, especially by the research community.
The Data Quality Interest Group, established by Biodiversity Information Standards
(TDWG) and GBIF, has been engaged in four main activities: developing a framework for
the assessment and management of data quality using a fitness for use approach; defining
a core set of standardised tests and associated assertions based on Darwin Core terms;
gathering and classifying user stories to form contextual-themed use cases, such as
species distribution modelling, agrobiodiversity, and invasive species; and developing a
standardised format for building and managing controlled vocabularies of values.
Using the developed framework, data quality profiles have been built from use cases to
represent user needs. Quality assertions can then be used to filter data suitable for a
purpose. The assertions can also be used to provide feedback to data providers and
custodians to assist in improving data quality at the source. In a case study, two
different implementations of tests and assertions based around the Darwin Core "Event
Date" terms were also tested against GBIF data, to demonstrate that the tests are
implementation agnostic, can be run on large aggregated datasets, and can make
biodiversity data more fit for typical research uses.
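As a rough illustration of the test-and-assertion idea described above, the following Python sketch runs one simplified validation on the Darwin Core eventDate term and then uses the resulting assertions to filter records for a use case. It is not one of the implementations used in the case study: the function names, the response labels (COMPLIANT, NOT_COMPLIANT, EMPTY), the rule that future dates fail, and the sample records are all assumptions made for this example.

```python
from datetime import date


def validate_event_date(record):
    """Return an illustrative assertion label for a record's eventDate.

    The labels and rules are hypothetical, chosen only for this sketch.
    """
    value = (record.get("eventDate") or "").strip()
    if not value:
        return "EMPTY"
    try:
        # Parse an ISO 8601 date; unparseable strings raise ValueError.
        # Date ranges and times are out of scope for this sketch.
        parsed = date.fromisoformat(value)
    except ValueError:
        return "NOT_COMPLIANT"
    if parsed > date.today():
        # Assumed rule: a collecting event cannot lie in the future.
        return "NOT_COMPLIANT"
    return "COMPLIANT"


def filter_fit_for_use(records, required="COMPLIANT"):
    """Keep only records whose eventDate assertion matches the profile."""
    return [r for r in records if validate_event_date(r) == required]


# Hypothetical occurrence records, not real GBIF data.
records = [
    {"occurrenceID": "a", "eventDate": "2019-05-17"},
    {"occurrenceID": "b", "eventDate": "17/05/2019"},
    {"occurrenceID": "c", "eventDate": ""},
    {"occurrenceID": "d", "eventDate": "2999-01-01"},
]

fit = filter_fit_for_use(records)
print([r["occurrenceID"] for r in fit])  # prints ['a']
```

The same assertion labels could be returned to the data provider rather than used only as a filter, which is the feedback path the abstract describes.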
Keywords
data quality, profile, framework, fitness for use, standards, tests and assertions, data
quality tests, vocabularies, Darwin Core, GBIF
1. Introduction
Biodiversity Information Standards (TDWG) is a not-for-profit volunteer-based scientific
association formed to establish international collaboration among the world's biological
databases (TDWG 2007). TDWG encourages the wider and more effective dissemination
of information about biological organisms for the benefit of the world at large through the
establishment of biodiversity information standards. In recent years, TDWG has focused on
the development of standards for the exchange and dissemination of different types of
biological and biodiversity data—including names, taxa, specimens, observations, images,
geographic locations, ecology, genetics, traits, and animal movements.
The Global Biodiversity Information Facility (GBIF) is an international network and research
infrastructure that aggregates biodiversity data shared by myriad sources around the world.
The volume of aggregated biodiversity data has increased in recent years, with GBIF now
publishing over 1.3 billion records (GBIF 2018, GBIF 2020). Quality varies considerably
within this mass of data (Gaiji et al. 2013, Mesibov 2013, Mesibov 2018) and issues and
variation in quality affect the fitness for use of these data in different contexts (Chrisman
1991, Chapman 2005a, Chapman 2005b).
Recognising the urgent need to address the data quality issue, TDWG, in conjunction with
GBIF, established a Data Quality Interest Group to examine biodiversity data quality and to
make recommendations on ways to address it (Belbin et al. 2013, Saraiva and Chapman
2013).
2. Background
Digital exchange of institutional biological data began in the 1970s (Busby 1979) with small
amounts of data, largely between individual institutions and researchers. It wasn't until the
1990s that biodiversity data began to be digitised on a large scale and made available to a
wider audience (e.g., ERIN (Chapman and Busby 1994), FishGopher (see Wiley and
Peterson 2004, p. 92), and MaNIS (Stein and Wieczorek 2004)). Most data exchange
initially was in support of taxonomic research, such as the description of new taxa and the
writing of floras, faunas, and monographs. Over time, demand has grown for
biological data to be used for other purposes, for example for species distribution
modelling (Longmore 1989, Peterson et al. 1998), biogeographic analysis and
regionalisation (Thackway and Cresswell 1992), phylogenetic studies (Hamilton 2013), and
conservation analysis (Ponder et al. 2001, Graham et al. 2004).
The development and expansion of the Internet has been a major driver of increasing
demand for data from a wider audience, reflected in the development of aggregation
initiatives (Chapman and Busby 1994, Soberon et al. 1996), including some with specific
purposes in mind, such as for species distribution modelling (Stockwell et al. 2006). In
2001, the Global Biodiversity Information Facility (GBIF) was established (Edwards 2004,
Lane 2005) with the aim of aggregating data from the world's biological institutions, initially
focusing on specimen data from museums and herbaria, then grid-based data from
conservation initiatives (Yesson et al. 2007, Landuyt et al. 2012), and data from
observation initiat (...truncated)