Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language

et al. (2013) Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language. PLoS ONE 8(12): e82489. doi:10.1371/journal.pone.0082489

Editor: Alex R Cook

Francesco Gesualdo, Giovanni Stilo, Eleonora Agricola, Michaela V. Gonfiantini, Elisabetta Pandolfi, Paola Velardi, Alberto E. Tozzi

1 Multifactorial Diseases and Complex Phenotypes Research Area, Bambino Gesù Children's Hospital IRCCS, Rome, Italy
2 Department of Informatics, Sapienza University of Rome, Rome, Italy

Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems.

Digital traces left on the Internet by web users, if properly aggregated and analysed, hold the promise to inform syndromic surveillance systems with real-time data collected directly from individuals [1]. A number of studies have focused on measuring the occurrence of specific health-related and disease-related search keywords.
In some cases, a correlation between search volumes and disease trends has been identified [2] and, in 2008, a Google service was developed to estimate and predict influenza activity by aggregating Google search query volumes [3,4]. Nevertheless, this demand-based approach can suffer from a high level of noise: web users search for health subjects of which they have close experience, but search peaks can often be completely unrelated to the incidence of a disease, as search behaviors change over time and discussions in traditional media may be reflected in search patterns [5,6]. Supply-based infodemiology, on the other hand, aims straight at what web users speak about, investigating communication contents and patterns in discussion groups, blogs and microblogs [7]. In such environments, keywords occur in context, which allows the use of text mining techniques for sense disambiguation, topic filtering and mood analysis [8,9].

Twitter, a popular free networking and microblogging service that in 2012 counted 500 million users generating over 300 million tweets daily [10], has also been analysed as a source of syndromic surveillance data [11,12]. One of the strengths of Twitter for infodemiology is that it provides location indicators [13], potentially allowing a constant, dynamic and real-time update of disease maps [14]. Previous tweet-mining studies measured the occurrence of single pre-specified terms, consisting either of the name of a clinical condition or its synonyms (e.g., H1N1 or swine flu) [11] or of words, arbitrarily chosen by the authors, related to the clinical syndrome itself (e.g., flu, vaccine, tamiflu) [12].

This kind of approach may suffer from two major biases. First, in blogs and forums, people are motivated by a communication need (possibly among peers) rather than by an information need, and therefore naïve language is often preferred to technical language.
Secondly, it is likely that, in their tweets, most users will describe a combination of symptoms rather than a diagnosis. An approach that takes into account only disease-related keywords can miss a large volume of messages in which users include a mix of signs and symptoms that can actually describe a clinical syndrome.

In order to address these biases, we first developed a minimally supervised algorithm to learn technical term-naïve term pairs, based on pattern generalization and complete linkage clustering, and we applied it to a group of technical terms extracted from the European Centre for Disease Prevention and Control (ECDC) case definition for ILI. Subsequently, we built a Boolean query based on the ECDC case definition for ILI, using both technical and related jargon terms as identified by the algorithm. Using the available APIs, we collected two sets of Twitter messages matching the ILI query, and we compared the trends of these messages with traditional surveillance data for influenza in the US.

Materials and Methods

Algorithm development: extraction of naïve-medical jargon

We developed an algorithm that automatically maps all naïve terms related to a specific medical term from Freebase (www.freebase.com/view/medicine/disease), exploiting the abundance of web pages that aim at popularizing medical topics (e.g., "chills are the frequent name for a feeling of coldness", or "sore throat, your doctor would call it pharyngitis"). The algorithm starts with a small initial learning set of medical conditions, composed of term pairs (one technical and one naïve term, e.g., emesis-vomiting), to extract basic patterns from the web, and then generalizes, clusters and weights these patterns based on another small set of pairs. Generalized patterns are learned both for sentence fragments that relate technical and naïve terms (e.g., a common term for → #DT #JJ term for), and for multi-word expressions describing medical conditions (e.g.,
inflammation of the nose → inflammation of BODYPART). Patterns are based on lexical, syntactic and semantic features. The performance of the algorithm is evaluated on a gold-standard test set of pairs extracted from Freebase (www.freebase.com/view/medicine/disease), and through manual evaluation by domain experts. For a complete description of the algorithm, see Information S1.

Query development

In order to analyse the performance of Twitter as a source of data for syndromic surveillance, we developed a Boolean query derived from an ILI case definition. We first considered translating the ILI case definition adopted by the CDC [15]: "fever and a cough and/or a sore throat without a known cause other than influenza". However, translating this case definition into a Boolean query was not straightforward, as the generic expression "without a known cause other than influenza" cannot be transformed into an effective search string. Since the aim of our work was to test a tool for syndromic surveillance based on an aggregation of symptoms, we decided to adopt the ECDC case definition [16]: Sudden onset of symptoms AND at least one of the foll (...truncated)