Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data

Biodiversity Information Science and Standards, Mar 2020

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. A case study, using two different implementations of tests and assertions based around the Darwin Core "Event Date


Biodiversity Information Science and Standards 4: e50889
doi: 10.3897/biss.4.50889

Standards

Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data

Arthur D Chapman‡, Lee Belbin§, Paula F Zermoglio|, John Wieczorek¶, Paul J Morris#, Miles Nicholls¤, Emily Rose Rees¤, Allan Koch Veiga«, Alexander Thompson», Antonio Mauro Saraiva«, Shelley A James˄, Christian Gendreau˅, Abigail Benson¦, Dmitry Schigel˅

‡ Australian Biodiversity Information Services, Ballan, Australia
§ The Atlas of Living Australia, Carlton, Australia
| VertNet, Buenos Aires, Argentina
¶ Museum of Vertebrate Zoology, University of California, Berkeley, United States of America
# Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States of America
¤ Atlas of Living Australia, Canberra, Australia
« University of Sao Paulo, Sao Paulo, Brazil
» iDigBio, Gainesville, United States of America
˄ Department of Biodiversity, Conservation and Attractions, Western Australian Herbarium, Kensington, WA, Australia
˅ Global Biodiversity Information Facility - Secretariat, Copenhagen Ø, Denmark
¦ U.S. Geological Survey, Lakewood, CO, United States of America

Corresponding author: Arthur D Chapman
Academic editor: Gail Kampmeier
Received: 07 Feb 2020 | Accepted: 16 Mar 2020 | Published: 20 Mar 2020

Citation: Chapman AD, Belbin L, Zermoglio PF, Wieczorek J, Morris PJ, Nicholls M, Rees ER, Veiga AK, Thompson A, Saraiva AM, James SA, Gendreau C, Benson A, Schigel D (2020) Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. Biodiversity Information Science and Standards 4: e50889.
https://doi.org/10.3897/biss.4.50889

© Chapman A et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. In a case study, two different implementations of tests and assertions based around the Darwin Core "Event Date" terms were also tested against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
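The test-and-assertion idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the Interest Group's implementation: the `Assertion` class, the test label `EVENTDATE_VALID`, the status values `PASS`, `FAIL` and `NOT_RUN`, and the `fit_for_use` helper are all hypothetical names chosen for illustration. The sketch also assumes that `eventDate` holds a single full calendar date, whereas real Darwin Core event dates may also be partial dates or intervals.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class Assertion:
    """Outcome of one data quality test on one record (illustrative only)."""
    test: str     # hypothetical test label, not an official identifier
    status: str   # "PASS", "FAIL" or "NOT_RUN" (labels assumed for this sketch)
    comment: str  # human-readable feedback for data providers


def validate_event_date(record: dict, today: Optional[date] = None) -> Assertion:
    """Check that eventDate is a parseable calendar date that is not in the future."""
    today = today or date.today()
    raw = (record.get("eventDate") or "").strip()
    if not raw:
        # Nothing to test: report that the test could not be run.
        return Assertion("EVENTDATE_VALID", "NOT_RUN", "eventDate is empty")
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        return Assertion("EVENTDATE_VALID", "FAIL", f"cannot parse {raw!r} as a date")
    if parsed > today:
        return Assertion("EVENTDATE_VALID", "FAIL", "eventDate is in the future")
    return Assertion("EVENTDATE_VALID", "PASS", "eventDate is a valid date")


def fit_for_use(records: list, profile: list) -> list:
    """Keep only the records that pass every test in the profile."""
    return [r for r in records if all(t(r).status == "PASS" for t in profile)]


if __name__ == "__main__":
    records = [
        {"eventDate": "2020-03-20"},
        {"eventDate": "20/03/2020"},
        {"eventDate": ""},
    ]
    profile: list[Callable[[dict], Assertion]] = [validate_event_date]
    for rec in records:
        print(rec, validate_event_date(rec))
    print(fit_for_use(records, profile))
```

Here a profile is simply a list of test functions, so different use cases could select different tests, and the `comment` field carries the feedback that could be passed back to data providers. Treating `NOT_RUN` as a reason to exclude a record is a design choice of this sketch; a given use case might reasonably keep such records.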
Keywords

data quality, profile, framework, fitness for use, standards, tests and assertions, data quality tests, vocabularies, Darwin Core, GBIF

1. Introduction

Biodiversity Information Standards (TDWG) is a not-for-profit volunteer-based scientific association formed to establish international collaboration among the world's biological databases (TDWG 2007). TDWG encourages the wider and more effective dissemination of information about biological organisms for the benefit of the world at large through the establishment of biodiversity information standards. In recent years, TDWG has focused on the development of standards for the exchange and dissemination of different types of biological and biodiversity data, including names, taxa, specimens, observations, images, geographic locations, ecology, genetics, traits, and animal movements.

The Global Biodiversity Information Facility (GBIF) is an international network and research infrastructure that aggregates biodiversity data shared by myriad sources around the world. The volume of aggregated biodiversity data has increased in recent years, with GBIF now publishing over 1.3 billion records (GBIF 2018, GBIF 2020). Quality varies considerably within this mass of data (Gaiji et al. 2013, Mesibov 2013, Mesibov 2018) and issues and variation in quality affect the fitness for use of these data in different contexts (Chrisman 1991, Chapman 2005a, Chapman 2005b).

Recognising the urgent need to address the data quality issue, TDWG, in conjunction with GBIF, established a Data Quality Interest Group to examine biodiversity data quality and to make recommendations on ways to address it (Belbin et al. 2013, Saraiva and Chapman 2013).

2. Background

Digital exchange of institutional biological data began in the 1970s (Busby 1979) with small amounts of data, largely between individual institutions and researchers.
It wasn't until the 1990s that biodiversity data began to be digitised on a large scale and made available to a wider audience (e.g., ERIN (Chapman and Busby 1994), FishGopher (see Wiley and Peterson 2004, p. 92), and MaNIS (Stein and Wieczorek 2004)). Most data exchange initially was in support of taxonomic research, such as the description of new taxa, the writing of floras and faunas, and the writing of monographs. Over time, demand has grown for biological data to be used for other purposes, for example species distribution modelling (Longmore 1989, Peterson et al. 1998), biogeographic analysis and regionalisation (Thackway and Cresswell 1992), phylogenetic studies (Hamilton 2013), and conservation analysis (Ponder et al. 2001, Graham et al. 2004). The development and expansion of the Internet has been a major driver of increasing demand for data from a wider audience, reflected in the development of aggregation initiatives (Chapman and Busby 1994, Soberon et al. 1996), including some with specific purposes in mind, such as species distribution modelling (Stockwell et al. 2006). In 2001, the Global Biodiversity Information Facility (GBIF) was established (Edwards 2004, Lane 2005) with the aim of aggregating data from the world's biological institutions, initially focusing on specimen data from museums and herbaria, then grid-based data from conservation initiatives (Yesson et al. 2007, Landuyt et al. 2012), and data from observation initiat (...truncated)


This is a preview of a remote PDF: https://biss.pensoft.net/article/50889/download/pdf/
Article home page: https://biss.pensoft.net/article/50889/

Arthur Chapman, Lee Belbin, Paula Zermoglio, John Wieczorek, Paul Morris, Miles Nicholls, Emily Rose Rees, Allan Veiga, Alexander Thompson, Antonio Saraiva, Shelley James, Christian Gendreau, Abigail Benson, Dmitry Schigel. Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data, Biodiversity Information Science and Standards, 2020, Issue 4, DOI: 10.3897/biss.4.50889