Testing the hypothesis of preferential attachment in social network formation
House et al. EPJ Data Science (2015) 4:13
DOI 10.1140/epjds/s13688-015-0052-2
REGULAR ARTICLE
Open Access
Thomas House1,2*, Jonathan M Read3, Leon Danon4 and Matthew J Keeling2
*Correspondence:
1 School of Mathematics, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
2 Warwick Infectious Disease Epidemiology Research (WIDER), University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
Full list of author information is available at the end of the article
Abstract
The hypothesis of preferential attachment (PA) - whereby better connected
individuals make more connections - is hotly debated, particularly in the context of
epidemiological networks. The simplest models of PA, for example, are incompatible
with the eradication of any disease through population-level control measures such
as random vaccination. Typically, evidence has been sought for the presence or
absence of preferential attachment via asymptotic power-law behaviour. Here, we
present a general statistical method to test directly for evidence of PA in count data
and apply this to data for contacts relevant to the spread of respiratory diseases. We
find that while standard methods for model selection prefer a form of PA, careful
analysis of the best fitting PA models allows for a level of contact heterogeneity that
in fact allows control of respiratory diseases. Our approach is based on a flexible but
numerically cheap likelihood-based model that could in principle be applied to other
integer data where the hypothesis of PA is of interest.
Keywords: MLE; Phase-type distribution; model selection; spectral methods
1 Introduction
1.1 Contact heterogeneity in infectious disease epidemiology
Infectious pathogens that spread via contact between people are a major cause of human
disease, driving attempts to understand their epidemiology []. Much theoretical work on
infectious disease dynamics has been focused on the role of heterogeneity in the human
population [], which is often conceptualised as a network of epidemiologically relevant
contacts [–].
Perhaps the most important quantity in any infectious disease outbreak is the basic reproductive ratio, $R_0$, which is defined verbally as the expected number of secondary cases
generated by an average primary case early in the epidemic. An epidemic is possible exactly
when $R_0 > 1$, and typically the efforts required to control such an outbreak grow monotonically with $R_0$ [, ]. In the simplified scenario where each individual picks an integer
number of contacts $K$ from the same degree distribution, but transmission is otherwise
homogeneous,
$$ R_0 \propto \mathbb{E}[K^2] . \qquad (1) $$
© 2015 House et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.
This dependence of the basic reproductive ratio on the second moment of the degree distribution has been a 'textbook' result for some time []; however, work by Pastor-Satorras
and Vespignani [] and May and Lloyd [] raised the question of what might happen for
large, or asymptotically divergent, second moments. Such questions can be posed and
answered at different levels of mathematical rigour []; however, in the context of epidemiology it is clear that a highly variable degree distribution is associated with the epidemiologically unrealistic scenario that even the most weakly transmissible pathogen can cause
an epidemic and, as a corollary, that control of any infectious disease through untargeted
vaccination would be impossible.
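To make the role of a divergent second moment concrete, the following minimal Python sketch computes E[K] and E[K^2] for a power-law degree distribution truncated at a maximum degree k_max. The function name `truncated_power_law_moments`, the truncation itself, and the exponents 2.5 and 3.5 are illustrative assumptions and are not part of the analysis in this paper.

```python
def truncated_power_law_moments(gamma, k_max):
    """Return (E[K], E[K^2]) when Pr(K = k) is proportional to k**(-gamma), k = 1..k_max.

    Hypothetical helper for illustration only; the cutoff k_max is an assumption
    introduced so that both moments are finite and can be compared.
    """
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    norm = sum(weights)
    mean = sum(k * w for k, w in enumerate(weights, start=1)) / norm
    second = sum(k * k * w for k, w in enumerate(weights, start=1)) / norm
    return mean, second


if __name__ == "__main__":
    # Raise the cutoff and watch how the second moment responds for each exponent.
    for gamma in (2.5, 3.5):
        for k_max in (10**2, 10**3, 10**4, 10**5):
            mean, second = truncated_power_law_moments(gamma, k_max)
            print(f"gamma={gamma}, k_max={k_max}: E[K]={mean:.3f}, E[K^2]={second:.3f}")
```

For exponents at or below 3 the second moment keeps growing as the cutoff is raised, whereas for larger exponents it settles to a finite value; under a proportionality such as the one between the basic reproductive ratio and the second moment, only the former case leads to the unbounded behaviour discussed above.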
1.2 Data
Of course, whether such a theoretical possibility matters for the study of infectious diseases depends on the actual variance in degree for epidemiologically relevant contacts.
While 20th-century models of infectious disease were often based on strong a priori assumptions about mixing patterns [], various methods for measurement of contact patterns now exist and were reviewed by Read et al. []. As well as direct measurement of
individuals through surveys [] it is possible to improve coverage through snowball and
respondent-driven sampling [, ], to make use of the extremely large datasets produced
by electronic sensors [, ], and also to combine aggregate data [, ].
Previous empirical studies have found evidence that for direct (e.g. [, ]) and sexual
(e.g. [, ]) contacts, an approximate power-law relationship may hold such that, for
large $k$, a randomly selected node obeys
$$ \Pr(\text{node has } k \text{ links}) \approx k^{-\gamma} . \qquad (2) $$
As is the case for almost all biological data, there is much more complexity in the data than
such a simple parametric relationship as (2) would imply. For example, Leigh Brown et al.
[] found that while a power-law was a better functional form than the negative binomial
for sexual contacts, the richer Waring distribution was preferable to either. What is hard
to dispute, however, is that as better quality data on epidemiologically relevant contacts
is obtained the evidence consistently points to a very high level of variance; for example,
Read et al. [] were able to validate the high numbers of contacts reported by some study
participants through direct (rather than postal) surveying.
These empirical observations of high heterogeneity in contact number, together with
theoretical results about $R_0$, present a paradox for infectious disease epidemiology: is the
extreme heterogeneity in observed contact patterns indicative of PA, and does that imply that $R_0 > 1$ for almost any finite level of person-to-person transmissibility, meaning
that our theoretical understanding of infectious disease epidemiology is somehow severely
lacking?
1.3 Preferential attachment and power laws in empirical data
Recent years have seen a debate about the level of heterogeneity that exists in a variety of
observed networks. A particularly influential paper by Barabási and Albert [] considered
a model of network formation in which many new nodes are added to a small existing
network. These new nodes connect preferentially to nodes that have more links in the
existing network, leading to the asymptotic result (2) with $\gamma = 3$. In this way preferential
attachment is intimately linked with, but not always equivalent to, asymptotic power-law
behaviour.
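The growth mechanism just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the construction used in any cited work: the function name `preferential_attachment_degrees`, the complete-graph seed network, the number of links per new node `m`, and the fixed random seed are all hypothetical choices made for the example.

```python
import random


def preferential_attachment_degrees(n_nodes, m=2, seed=0):
    """Grow a network by preferential attachment and return the list of node degrees.

    Assumed setup (illustrative only): start from a complete graph on m + 1 nodes,
    then add nodes one at a time, each linking to m distinct existing nodes chosen
    with probability proportional to their current degree.
    """
    rng = random.Random(seed)
    degrees = [m] * (m + 1)
    # Each node appears in `stubs` once per link end, so a uniform draw from
    # `stubs` is a degree-proportional draw over nodes.
    stubs = [node for node in range(m + 1) for _ in range(m)]
    for new_node in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degrees.append(m)
        for target in targets:
            degrees[target] += 1
            stubs.extend((target, new_node))
    return degrees


if __name__ == "__main__":
    degrees = preferential_attachment_degrees(5000, m=2)
    mean_degree = sum(degrees) / len(degrees)
    print(f"mean degree: {mean_degree:.2f}, maximum degree: {max(degrees)}")
```

Drawing uniformly from the list of link ends is a common way to obtain degree-proportional sampling without recomputing probabilities at each step. In runs of this kind the earliest nodes tend to accumulate far more links than the average, which is the heavy-tailed behaviour associated with preferential attachment.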
Simple power-law relationships have been claimed for numerous real-world systems,
and a critical review of these claims by Clauset et al. [] used maximum-likelihood fitting
of distribution tails to power-law distributions to show varying levels of statistical support
for claims in the literature. In the context of discrete dat (...truncated)