Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language
et al. (2013) Influenza-Like Illness Surveillance on Twitter through Automated
Learning of Naïve Language. PLoS ONE 8(12): e82489. doi:10.1371/journal.pone.0082489
Editor: Alex R Cook
Francesco Gesualdo
Giovanni Stilo
Eleonora Agricola
Michaela V. Gonfiantini
Elisabetta Pandolfi
Paola Velardi
Alberto E. Tozzi
1 Multifactorial Diseases and Complex Phenotypes Research Area, Bambino Gesù Children's Hospital IRCCS, Rome, Italy, 2 Department of Informatics, Sapienza University of Rome, Rome, Italy
Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems.
Introduction
Digital traces left on the Internet by web users, if properly
aggregated and analysed, hold the promise of informing
syndromic surveillance systems with real-time data collected
directly from individuals [1].
A number of studies have focused on measuring the
occurrence of specific health-related and disease-related
search keywords. In some cases, a correlation between search
volumes and disease trends has been identified [2] and, in
2008, a Google service was developed to estimate and
predict influenza activity by aggregating Google search query
volumes [3,4]. Nevertheless, this demand-based approach can
suffer from a high level of noise: web users do search for
health subjects of which they have close experience, but
search peaks can often be completely unrelated to the incidence of a
disease, as search behaviors change over time and discussions
in traditional media may be reflected in search patterns [5,6].
Supply-based infodemiology, on the other hand, looks
directly at what web users talk about, investigating
communication contents and patterns in discussion groups,
blogs and microblogs [7]. In such environments, keywords
occur in context, which allows the use of text mining techniques
for sense disambiguation, topic filtering and mood analysis
[8,9].
Twitter, a popular free networking and microblogging service
which in 2012 counted 500 million users generating over 300 million
tweets daily [10], has also been analysed as a source of
syndromic surveillance data [11,12]. One of the strong
implications of the use of Twitter for infodemiology is that it
provides location indicators [13], potentially allowing a
constant, dynamic and real-time update of disease maps [14].
Previous tweet-mining studies measured the occurrence
of single pre-specified terms, consisting either of the name of a
clinical condition or its synonyms (e.g., H1N1 or swine flu) [11]
or of words, arbitrarily chosen by the authors, related to the
clinical syndrome itself (e.g., flu, vaccine, tamiflu) [12].
This kind of approach may suffer from two major biases.
First, in blogs and forums, people are motivated by a
communication need (possibly among peers), rather than by an
information need, and therefore naïve language is often
preferred to technical language.
Secondly, it is likely that, in their tweets, most users will
describe a combination of symptoms rather than a diagnosis.
An approach that takes into account only disease-related
keywords can miss a large volume of messages in which users
include a mix of signs and symptoms that can actually describe
a clinical syndrome.
In order to address these biases, we first developed a
minimally supervised algorithm to learn pairs of technical and
naïve terms, based on pattern generalization and complete
linkage clustering, and we applied it to a group of technical
terms extracted from the European Centre for Disease
Prevention and Control (ECDC) case definition for ILI.
Subsequently, we built a Boolean query based on the ECDC
case definition for ILI, using both technical and related jargon
terms as identified by the algorithm. Using the available APIs,
we collected two sets of Twitter messages matching the ILI
query, and we compared the trends of these messages with
traditional surveillance data for influenza in the US.
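As a rough illustration of how a symptom-based case definition can be expressed as a Boolean query over tweet text, the following minimal sketch assumes a simplified definition of the form fever AND (cough OR sore throat). The term lists, the function names and the example tweets are hypothetical and chosen only for illustration; they are not the jargon terms learned by the algorithm, and the ECDC-based query actually used in this study is not reproduced here.

```python
import re

# Hypothetical symptom groups: one technical term plus a few naive variants.
# These lists are illustrative assumptions, not the output of the algorithm.
FEVER = {"fever", "pyrexia", "high temperature"}
COUGH = {"cough", "coughing"}
SORE_THROAT = {"sore throat", "pharyngitis", "throat pain"}


def mentions_any(text, terms):
    """Return True if any term occurs in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms
    )


def matches_query(text):
    """Simplified Boolean query: fever AND (cough OR sore throat)."""
    return mentions_any(text, FEVER) and (
        mentions_any(text, COUGH) or mentions_any(text, SORE_THROAT)
    )


if __name__ == "__main__":
    # Made-up example messages, not real tweets.
    examples = [
        "I have a fever and a sore throat",
        "Got my flu vaccine today",
    ]
    for message in examples:
        print(message, "->", matches_query(message))
```

In this sketch, a message that names only the disease or a related keyword (such as the second example) does not satisfy the query, whereas a message combining symptoms does, which is the behaviour a symptom-aggregation approach aims for.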
Materials and Methods
Algorithm development: extraction of naïve medical
jargon
We developed an algorithm that automatically maps all naïve
terms related to a specific medical term from Freebase
(www.freebase.com/view/medicine/disease), exploiting the
abundance of web pages that aim at popularizing medical
topics (e.g., “chills are the frequent name for a feeling of
coldness”, or “sore throat, your doctor would call it
pharyngitis”). The algorithm starts with a small initial learning
set of medical conditions, composed of term pairs (one technical
and one naïve term, e.g., emesis-vomiting), to extract basic
patterns from the web, and then generalizes, clusters and weights
these patterns based on another small set of pairs.
Generalized patterns are learned both for sentence fragments
that relate technical and naïve terms (e.g., “a common term for”
→ “#DT #JJ term for”), and for multi-word expressions
describing medical conditions (e.g., “inflammation of the nose”
→ “inflammation of BODYPART”). Patterns are based on
lexical, syntactic and semantic features. The performance of
the algorithm is evaluated on a gold-standard test set of pairs
extracted from Freebase (www.freebase.com/view/medicine/disease),
and through manual evaluation by domain experts.
For a complete description of the algorithm, see Information
S1.
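The pattern generalization step can be illustrated with a toy example, in which a connecting fragment such as "a common term for" is generalized to "#DT #JJ term for" and an expression such as "inflammation of the nose" to "inflammation of BODYPART". The minimal sketch below assumes small hand-written word lists in place of a real part-of-speech tagger and semantic resource; the lexicons and function names are hypothetical, and the sketch does not cover the clustering and weighting steps of the algorithm described in Information S1.

```python
# Hypothetical toy lexicons; an actual system would rely on a part-of-speech
# tagger and a semantic resource rather than hand-written word lists.
DETERMINERS = {"a", "an", "the"}
ADJECTIVES = {"common", "frequent", "popular", "medical"}
BODY_PARTS = {"nose", "throat", "stomach", "liver"}


def generalize_connector(fragment):
    """Replace determiners and adjectives with #DT and #JJ placeholders."""
    generalized = []
    for token in fragment.lower().split():
        if token in DETERMINERS:
            generalized.append("#DT")
        elif token in ADJECTIVES:
            generalized.append("#JJ")
        else:
            generalized.append(token)
    return " ".join(generalized)


def generalize_condition(expression):
    """Replace a body-part mention (with optional determiner) by BODYPART."""
    tokens = expression.lower().split()
    generalized = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in DETERMINERS and following in BODY_PARTS:
            # Absorb the determiner into the semantic placeholder.
            generalized.append("BODYPART")
            i += 2
        elif token in BODY_PARTS:
            generalized.append("BODYPART")
            i += 1
        else:
            generalized.append(token)
            i += 1
    return " ".join(generalized)


if __name__ == "__main__":
    print(generalize_connector("a common term for"))
    print(generalize_condition("inflammation of the nose"))
```

Under these assumptions, two different surface fragments (for instance "a common term for" and "the frequent term for") collapse onto the same generalized pattern, which is what allows a small seed set of pairs to match many more web sentences.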
Query development
In order to analyse the performance of Twitter as a source of
data for syndromic surveillance, we developed a Boolean query
derived from an ILI case definition. We first considered the
translation of the ILI case definition adopted by the CDC [15]:
“fever and a cough and/or a sore throat without a known cause
other than influenza”. Nevertheless, translating this case
definition into a Boolean query was not straightforward, as the
generic expression “without a known cause other than
influenza” cannot be transformed into an effective search string.
Since the scope of our work was to test a tool for syndromic
surveillance based on an aggregation of symptoms, we
decided to adopt the ECDC case definition [16]: Sudden onset
of symptoms AND at least one of the foll (...truncated)