Sieve-based relation extraction of gene regulatory networks from biological literature
Žitnik et al. BMC Bioinformatics 2015, 16(Suppl 16):S1
http://www.biomedcentral.com/1471-2105/16/S16/S1
RESEARCH
Open Access
Slavko Žitnik1,3*, Marinka Žitnik1, Blaž Zupan1,2, Marko Bajec1
From BioNLP Shared Task 2013
Sofia, Bulgaria. 9 August 2013
Abstract
Background: Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic
relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual
descriptions of biological entities, their interactions and results of related experiments. To extract them in an
explicit, computer-readable format, these relations were at first extracted manually from databases. Manual curation
was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The
current challenge is the development of information extraction procedures that can directly infer more complex
relational structures, such as gene regulatory networks.
Results: We develop a computational approach for extraction of gene regulatory networks from textual data. Our
method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation
extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium
Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant
relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each
of which is able to extract different relationship types. Following the shared task, we conducted additional analysis
using different system settings, which reduced the reconstruction error of the bacterial sporulation network from
0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all
relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed
by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of
extraction. Analysis of distances between different mention types in the text shows that our choice of transforming
data into skip-mention sequences is appropriate for detecting relations between distant mentions.
Conclusions: Linear-chain conditional random fields, along with appropriate data transformations, can be
efficiently used to extract relations. The sieve-based architecture simplifies the system as new sieves can be easily
added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional
random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction
tasks and data domains.
* Correspondence:
1 Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, Slovenia
Full list of author information is available at the end of the article
© 2015 Žitnik et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
We are witnessing an unprecedented increase in the
number of biomedical abstracts, experimental results,
and phenotype and gene descriptions being deposited to
publicly available databases, such as NCBI’s PubMed.
Collectively, this content represents potential new discoveries
that could be inferred with appropriately designed natural
language processing approaches. Identification of topics that
appear in biomedical research literature was among the first
computational approaches to predict associations between
diseases and genes and has become indispensable to both
researchers in the biomedical field and curators [1-4].
Information from publication repositories is often mined
together with other data sources. Databases that store
relations from integrative mining include, for example, the
OMIM database on human genes and genetic phenotypes [5],
the GeneRIF function annotation database [6], the Gene
Ontology [7] and clinical drug information from the DailyMed
database [8]. Biomedical mining of literature is a compelling
way to identify possible candidate genes through integration
of existing data.
A dedicated set of computational techniques is required
to infer structured relations from plain textual information
stored in large literature databases [9]. Relation extraction
tools [10] can identify semantic relations between entities
found in text. Early relationship extraction systems relied
mostly on manually defined rules to extract a limited
number of relationship types [11]. Later, machine learning-based methods were introduced to address the extraction task by inferring prediction models from sets of
labeled relationship types [12-14]. When no labeled data
were available, unsupervised systems were developed
to extract relationship descriptors based on the language
syntax [10]. Current state-of-the-art systems combine
both machine learning and rule-based approaches to
extract relevant information from narrative summaries
and represent it in a structured form [15,16].
This paper aims at the extraction of gene regulatory
networks of Bacillus subtilis. The reconstruction and
elucidation of gene regulation networks is an important
task that can change our understanding of the processes
and molecular interactions within the cell [17-19]. We
have developed a novel sieve-based computational methodology that builds upon conditional random fields [20]
and specialized rules to extract gene relations from
unstructured text. Extracted relations are assembled into
a multi-relational gene network that is informative of
the type of regulation between pairs of genes and the
directionality of their action. The proposed approach
can consider biological literature on gene interactions
from multiple data sources. The main novelty of our
work here is the construction of a sequential analysis
pipeline for extracting gene relations of various types
from literature data (Figure 1). We demonstrate the
effectiveness and applicability of our recently proposed
coreference resolution system [21]. Our system uses linear-chain conditional random fields in an innovative
way and can detect distant coreferent mentions in text
using a novel transformation of data into skip-mention
sequences.
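The idea behind skip-mention sequences can be illustrated with a minimal sketch. It assumes that a sequence with skip distance s keeps every (s+1)-th mention of a document, so that mentions lying s positions apart become neighbours and fall within reach of a first-order (linear-chain) model. The function name `skip_mention_sequences` and the example mention list are hypothetical and serve illustration only; the sketch does not reproduce the system's actual code.

```python
from typing import List, Sequence


def skip_mention_sequences(mentions: Sequence[str], skip: int) -> List[List[str]]:
    """Split an ordered list of mentions into skip-mention sequences.

    Assumption (illustrative only): for skip distance `skip`, every
    (skip + 1)-th mention is kept, starting at each possible offset,
    which yields up to `skip + 1` sequences. With skip = 0 the original
    sequence is returned unchanged.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    sequences = [list(mentions[offset::step]) for offset in range(step)]
    # Drop empty sequences that arise when there are fewer mentions than offsets.
    return [seq for seq in sequences if seq]


# Hypothetical mentions in document order (gene names used as placeholders).
mentions = ["sigK", "gerE", "spoIVCB", "sigE", "spoIIID"]
for s in range(3):
    print(s, skip_mention_sequences(mentions, s))
```

With skip distance 1 in this example, sigK and spoIVCB, which are separated by one mention in the text, become adjacent in the first sequence, so a model that only looks at neighbouring labels could in principle relate them.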
We evaluate the proposed methodology by measuring
th (...truncated)