FACTA: a text search engine for finding associated biomedical concepts
Yoshimasa Tsuruoka (1,2), Jun'ichi Tsujii (0,1,2) and Sophia Ananiadou (1,2)
Associate Editor: Jonathan Wren
(0) Department of Computer Science, The University of Tokyo, Japan
(1) National Centre for Text Mining (NaCTeM), Manchester, UK
(2) School of Computer Science, The University of Manchester
Summary: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank.
Availability: The system is available at http://www.nactem.ac.uk/software/facta/
Contact:
1 INTRODUCTION
Information about pairwise association between biomedical
concepts, such as genes, proteins, diseases and chemical compounds,
constitutes an important part of biomedical knowledge (see footnote 1). It is
common for a researcher to need answers to questions like "What
diseases are relevant to a particular gene?" or "What chemical
compounds are relevant to a particular disease?" Text mining
complements biomedical databases by providing researchers with
a convenient way to find such information from the literature.

Footnote 1: In this article, a biomedical concept refers to a conceptual entity which is
normally grounded to a record in a biomedical database. In text, the same
concept (e.g. UniProt:O00203) may be represented by different terms (e.g.
"AP-3 complex subunit beta-1" or "Beta3A-adaptin"). Note also that the same
term may represent different concepts depending on the context, although
this problem is currently not resolved in FACTA.
To whom correspondence should be addressed.

There are a number of web-based text mining applications
which can be used for this purpose. EBIMed
(Rebholz-Schuhmann et al., 2007) receives a PubMed-style query from
the user and analyzes the matched documents to recognize
protein/gene names, GO annotations, drugs and species mentioned.
Frequently occurring concepts are shown in a table, and the
user can view the sentences corresponding to the associations.
PolySearch (Cheng et al., 2008) can produce a list of concepts
which are relevant to the user's query by analyzing multiple
information sources including PubMed, OMIM, DrugBank and
Swiss-Prot. It covers many types of biomedical concepts including
diseases, genes/proteins, drugs, metabolites, SNPs, pathways
and tissues. Systems that provide similar functionality include
XplorMed (Perez-Iratxeta et al., 2003), MedlineR (Lin et al.,
2004), LitMiner (Maier et al., 2005) and Anii (Jelier et al.,
2008).
Although these applications are useful for exploring such
information in the literature, few of them provide real-time
responses: users often have to wait for several minutes (or even
hours) before they receive the results. Some of the systems provide
reasonably quick responses by limiting the number of documents
to be analyzed to a very small number (e.g. 500 abstracts), but
such a limitation significantly reduces coverage.
LitMiner and Anii are exceptions in that they can return the
result immediately, presumably thanks to pre-computed association
statistics between the concepts. However, they do not accept
flexible queries (e.g. free keywords or Boolean combinations of
keywords/concepts), so the concepts that can be specified in
the user's query are limited to predefined ones.
To complement existing applications, we have developed
FACTA, which is a text search engine for browsing biomedical
concepts that are potentially relevant to a query. The distinct
advantage of FACTA is that it delivers real-time responses
while being able to accept flexible queries. This is achieved by
online computation of association statistics: FACTA analyzes the
documents retrieved by the query dynamically, using pre-indexed
words and concepts.
2 SOFTWARE FEATURES
FACTA receives a query from the user as the input. A query can
be a word (e.g. p53), a concept ID (e.g. UNIPROT:P04637),
or a combination of these [e.g. (UNIPROT:P04637 AND (lung
OR gastric))]. The system then retrieves all the documents that
match the query from MEDLINE using word/concept indexes. The
concepts contained in the documents are then counted and ranked
according to their relevance to the query. The results are presented
to the user in a tabular format.
Fig. 1. A screenshot of FACTA search results.
Figure 1 shows an example of the search result. For the input
query "apoptosis AND blood", the system retrieved 7734 documents
from MEDLINE in 0.04 s. The relevant concepts of six categories
are displayed in a table and ranked by their frequencies. The
document icon next to each concept name in the table allows the
user to view snippets from MEDLINE and see textual evidence
of the association. The user can also invoke another search by
clicking a concept name in the table. This allows the user to explore
associations between many different concepts in a highly interactive
manner.
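The query-then-count workflow described in this section can be illustrated with a minimal Python sketch. It is not FACTA's implementation (which is written in C++ and is not detailed here); it only assumes that each index maps a word or concept ID to the set of documents mentioning it. The toy postings, the concept ID EXAMPLE:C1 and the function names evaluate and rank_concepts are hypothetical.

```python
from collections import Counter

# Toy data, invented for illustration: each index maps a word or a concept ID
# to the set of IDs of the documents that mention it.
WORD_INDEX = {"apoptosis": {1, 2, 3}, "blood": {2, 3, 4}, "lung": {5}}
CONCEPT_INDEX = {"UNIPROT:P04637": {2, 3, 5}, "EXAMPLE:C1": {3}}


def evaluate(query):
    """Return the set of document IDs matching a query.

    A query is either a string (a word or a concept ID) or a tuple
    (op, left, right) where op is "AND" or "OR".
    """
    if isinstance(query, str):
        # Look the term up in both indexes; unknown terms match nothing.
        return WORD_INDEX.get(query, set()) | CONCEPT_INDEX.get(query, set())
    op, left, right = query
    if op == "AND":
        return evaluate(left) & evaluate(right)
    if op == "OR":
        return evaluate(left) | evaluate(right)
    raise ValueError(f"unsupported operator: {op}")


def rank_concepts(doc_ids):
    """Count, for each concept, how many matched documents mention it."""
    counts = Counter()
    for concept_id, postings in CONCEPT_INDEX.items():
        hits = len(postings & doc_ids)
        if hits:
            counts[concept_id] = hits
    return counts.most_common()  # most frequent concepts first


if __name__ == "__main__":
    matched = evaluate(("AND", "apoptosis", "blood"))
    print(sorted(matched))
    print(rank_concepts(matched))
```

Nested tuples stand in for a parsed Boolean query, so a query such as (UNIPROT:P04637 AND (lung OR gastric)) would be written as ("AND", "UNIPROT:P04637", ("OR", "lung", "gastric")). Parsing the query string itself is left out of the sketch.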
FACTA's real-time responses to queries are made possible by
its own indexing scheme and the implementation of the analysis
engines in C++. It uses two indexes built offline: one for the words
and the other for the concepts. Both indexes are stored in memory to
achieve quick responses, while the actual sentences of MEDLINE
abstracts are stored on external storage. The system runs on a generic
Linux server with 2.2 GHz AMD Opteron processors and 16 GB
memory.
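As a rough illustration of the two-index design, the following Python sketch builds a word index and a concept index offline and intersects posting lists at query time. The actual index layout used by FACTA is not described in this article, so the sorted posting lists and the linear-merge intersection are assumptions (a standard inverted-index technique), and the miniature corpus and function names are hypothetical.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build word and concept indexes from {doc_id: (text, concept_ids)}.

    Each index maps a key to a sorted list of document IDs (a posting list).
    """
    word_index = defaultdict(list)
    concept_index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending IDs keep every posting list sorted
        text, concept_ids = docs[doc_id]
        for word in set(text.lower().split()):
            word_index[word].append(doc_id)
        for concept_id in set(concept_ids):
            concept_index[concept_id].append(doc_id)
    return dict(word_index), dict(concept_index)


def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical miniature corpus (not real MEDLINE records). The concept IDs
# attached to each document are assumed to come from a separate recognition step.
DOCS = {
    1: ("p53 and apoptosis", ["UNIPROT:P04637"]),
    2: ("apoptosis in blood cells", []),
    3: ("p53 in blood", ["UNIPROT:P04637"]),
}

if __name__ == "__main__":
    words, concepts = build_indexes(DOCS)
    print(intersect(words["apoptosis"], words["blood"]))
    print(concepts["UNIPROT:P04637"])
```

Keeping only document IDs in memory, and fetching sentence text from disk when a snippet is requested, mirrors the division between in-memory indexes and externally stored abstracts described above.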
Currently, FACTA covers six categories of biomedical concepts:
human genes/proteins, diseases, symptoms, drugs, enzymes and
chemical compounds. The concepts appearing in the documents are
recognized by dictionary matching. In total, 80 260 unique concepts
are indexed. We used UniProt accession numbers as the concept
IDs for genes/proteins and collected their names and synonyms
from BioThesaurus (Liu et al., 2006). We used UMLS (Humphreys
and Lindberg, 1989) for diseases and symptoms. The concept IDs
and names for drugs, enzymes and chemical compounds were
collected from several databases including HMDB, KEGG and
DrugBank.
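Dictionary matching of this kind can be sketched in a few lines of Python. The sketch below is an assumption about how such a matcher might work, not a description of FACTA's recognizer: it lowercases the text, splits on whitespace, and greedily takes the longest name found in the dictionary. The function names are hypothetical; the example names and ID are those given for UniProt:O00203 in footnote 1. Each name maps to a single concept ID here, so ambiguous terms are not handled.

```python
# Names and synonyms per concept ID (example taken from footnote 1).
ENTRIES = {
    "UniProt:O00203": ["AP-3 complex subunit beta-1", "Beta3A-adaptin"],
}


def build_dictionary(entries):
    """Map each lowercased, whitespace-tokenized name to its concept ID."""
    return {
        tuple(name.lower().split()): concept_id
        for concept_id, names in entries.items()
        for name in names
    }


def match_concepts(text, dictionary):
    """Return (matched name, concept ID) pairs using greedy longest match."""
    tokens = text.lower().split()
    max_len = max((len(key) for key in dictionary), default=0)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            concept_id = dictionary.get(tuple(tokens[i:i + n]))
            if concept_id is not None:
                found.append((" ".join(tokens[i:i + n]), concept_id))
                i += n
                break
        else:
            i += 1  # no dictionary name starts at this token
    return found


if __name__ == "__main__":
    dictionary = build_dictionary(ENTRIES)
    sentence = "AP-3 complex subunit beta-1 is also known as Beta3A-adaptin"
    print(match_concepts(sentence, dictionary))
```

A production matcher would also need proper tokenization (punctuation, hyphenation variants) and a more efficient lookup structure such as a trie; both are omitted to keep the example short.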
Ambiguity causes problems in indexing. For example, the term
"collapse" is not necessarily used as a symptom name in the
documents that produced the results shown in Figure 1, so ideally
such occ (...truncated)