Digital Pharmacovigilance and Disease Surveillance: Combining Traditional and Big-Data Systems for Better Public Health
The Journal of Infectious Diseases
SUPPLEMENT ARTICLE
Marcel Salathé
Digital Epidemiology Laboratory, School of Life Sciences and School of Computer and Communication Sciences, EPFL, Geneva, Switzerland
Traditional disease surveillance has been a key ingredient in any
public health portfolio for many decades. Disease surveillance is
widely recognized as one of the most important tools to assess,
predict, and mitigate infectious disease outbreaks. Traditional
disease surveillance is based on data collected by health institutions, and the data typically consist of information such as morbidity and mortality data, laboratory reports, individual case
reports, field investigations, surveys, and demographic data.
They are generally collected by physicians, public health laboratories, hospitals, and other health providers and institutions.
The computer revolution that began in the 1970s has affected
traditional disease surveillance systems by improving the accessibility of data and by increasing the speed at which data are
transmitted between institutions. However, the ongoing Internet and mobile phone revolution has a qualitatively distinct effect: in addition to making epidemiologic data available faster
and more broadly, new data are generated directly by the public,
often on platforms not primarily designed for health purposes.
These streams of user-generated data almost always bypass traditional public health channels. They are the data
streams on which digital epidemiology is generally based [1, 2].
Correspondence: M. Salathé, Digital Epidemiology Lab, School of Life Sciences and School of
Computer and Communication Sciences, EPFL, Geneva, Switzerland.
The Journal of Infectious Diseases® 2016;214(S4):S399–403
© The Author 2016. Published by Oxford University Press for the Infectious Diseases Society of
America. This is an Open Access article distributed under the terms of the Creative Commons
Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc-nd/
4.0/), which permits non-commercial reproduction and distribution of the work, in any
medium, provided the original work is not altered or transformed in any way, and that the
work is properly cited. For commercial re-use, contact .
DOI: 10.1093/infdis/jiw281
One of the first and certainly the most prominent examples
of digital disease surveillance was Google Flu Trends [3]. Google
Flu Trends was essentially an analytical estimate of the level of
weekly influenza activity based on the search queries that Google received. The estimate came from the model that best fit the Centers for Disease
Control and Prevention’s (CDC’s) influenza-like illness (ILI)
data from a number of different US regions. The original
model achieved a mean correlation of 0.9 with the CDC
data. A few years later, in summer 2015, Google decided to shut
down the public website of Google Flu Trends and instead opted
to give select academic and public health institutions access to the
data. This announcement followed numerous reports [4–6] that
systematically assessed Google Flu Trends’ overestimation of influenza activity, attributing it to a combination of a phenomenon
termed “big-data hubris” and algorithm dynamics. The first refers to the assumption that the novel big-data streams are a substitute for, rather than a supplement to, traditional data collection
efforts. The second refers to the observation that, while the Google search algorithm receives updates on a weekly or even daily
basis, the Google Flu Trends model received updates only rarely.
As a result, the model fell out of sync with
the changing nature of the data from which it was supposed to
generate predictions.
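The fitting and drift problems described in the paragraph above can be illustrated with a minimal sketch. It is not Google's actual model: it assumes a single synthetic search signal, an ordinary least-squares fit, and invented numbers, and every function and variable name is hypothetical. The sketch fits a model once against reference ILI values, then applies it unchanged to a later season in which search behavior has shifted.

```python
# Illustrative sketch only: synthetic data and hypothetical names,
# not the Google Flu Trends model or real CDC data.
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(model, xs):
    """Apply a fitted (intercept, slope) pair to new inputs."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


def mean_bias(estimates, reference):
    """Average of (estimate - reference); positive means overestimation."""
    return sum(e - r for e, r in zip(estimates, reference)) / len(reference)


# Season 1 (assumed values): weekly ILI percentage from a reference
# system, and the percentage of searches that are flu related.
ili_s1 = [1.0, 1.5, 2.5, 4.0, 5.0, 3.5, 2.0, 1.2]
search_s1 = [0.10, 0.16, 0.24, 0.41, 0.49, 0.36, 0.19, 0.13]

# Season 2 (assumed): true ILI is unchanged, but flu-related searching
# is 50% higher, e.g. because the search product or user habits changed.
ili_s2 = ili_s1
search_s2 = [1.5 * s for s in search_s1]

# Model fitted once on season 1 and never updated.
static_model = fit_line(search_s1, ili_s1)
r_s1 = pearson(predict(static_model, search_s1), ili_s1)

# Applying the stale model to season 2 keeps the correlation high
# but shifts every estimate upward.
est_s2_static = predict(static_model, search_s2)
r_s2 = pearson(est_s2_static, ili_s2)
bias_static = mean_bias(est_s2_static, ili_s2)

# Refitting against the reference data removes the average bias.
refit_model = fit_line(search_s2, ili_s2)
bias_refit = mean_bias(predict(refit_model, search_s2), ili_s2)

print(f"season 1 correlation:        {r_s1:.3f}")
print(f"season 2 correlation:        {r_s2:.3f}")
print(f"season 2 bias, static model: {bias_static:+.2f}")
print(f"season 2 bias, refit model:  {bias_refit:+.2f}")
```

In this toy setting the stale model still correlates strongly with the reference series while overestimating every week, so correlation alone would not reveal the problem. Detecting and correcting it requires continued access to the traditional surveillance data, which is the sense in which the new data streams supplement rather than replace them.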
Despite the problems of Google Flu Trends, the system was
an important example of the promises of digital epidemiology:
to use novel data streams, often generated for purposes quite
distinct from public health, to extract additional public health
signals, such as those relevant for disease surveillance. But
while Google makes some search pattern data available through
an interface called Google Trends, the raw search-query data
The digital revolution has contributed to very large data sets (ie, big data) relevant for public health. The two major data sources are
electronic health records from traditional health systems and patient-generated data. As the two data sources have complementary
strengths—high veracity in the data from traditional sources and high velocity and variety in patient-generated data—they can be
combined to build more-robust public health systems. However, they also have unique challenges. Patient-generated data in particular are often completely unstructured and highly context dependent, posing essentially a machine-learning challenge. Some recent
examples from infectious disease surveillance and adverse drug event monitoring demonstrate that the technical challenges can be
solved. Despite these advances, the problem of verification remains, and unless traditional and digital epidemiologic approaches are
combined, these data sources will be constrained by their intrinsic limits.
Keywords. digital epidemiology; disease surveillance; pharmacovigilance; Twitter.
undervaccinated populations. Later work on the same data set
investigated how negative and positive sentiments about vaccination spread across the social network, suggesting that negative
sentiments are more susceptible to social contagion than positive sentiments [15]. Last but not least, data from most of these
services are increasingly generated on mobile phones and other
devices, increasing the probability that high-resolution geographic information is associated with the data, a phenomenon
that will become increasingly important, given the spatial dynamics of disease spread.
DIGITAL PHARMACOVIGILANCE
The widespread use of the Internet and of social media in particular has had a dramatic effect not only on infectious disease surveillance, but also on the surveillance of drug use and related
events. Perhaps even more so than traditional infectious disease
surveillance, traditional surveillance of adverse drug reactions
(ADRs) after drug use is slow and patchy. When reported by patients or healthcare professionals, ADRs are typically assessed by
drug experts and pharmaceutical companies, and the results are
then passed on to government agencies. This leads to substantial
data loss and delays. A recent study in the United States showed
that hospital staff did not report 86% of ADRs among patients
[16]. The rate of underreporting in nonclinical settings is arguably even higher. Once government agencies receive the reports,
they often release them with a d (...truncated)