Testing the hypothesis of preferential attachment in social network formation

The article PDF can be downloaded here:

https://link.springer.com/content/pdf/10.1140%2Fepjds%2Fs13688-015-0052-2.pdf

House et al. EPJ Data Science (2015) 4:13
DOI 10.1140/epjds/s13688-015-0052-2

REGULAR ARTICLE, Open Access

Thomas House1,2*, Jonathan M Read3, Leon Danon4 and Matthew J Keeling2

* Correspondence:
1 School of Mathematics, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
2 Warwick Infectious Disease Epidemiology Research (WIDER), University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
Full list of author information is available at the end of the article

Abstract
The hypothesis of preferential attachment (PA) - whereby better connected individuals make more connections - is hotly debated, particularly in the context of epidemiological networks. The simplest models of PA, for example, are incompatible with the eradication of any disease through population-level control measures such as random vaccination. Typically, evidence has been sought for the presence or absence of preferential attachment via asymptotic power-law behaviour. Here, we present a general statistical method to test directly for evidence of PA in count data and apply this to data for contacts relevant to the spread of respiratory diseases. We find that while standard methods for model selection prefer a form of PA, careful analysis of the best fitting PA models allows for a level of contact heterogeneity that in fact allows control of respiratory diseases. Our approach is based on a flexible but numerically cheap likelihood-based model that could in principle be applied to other integer data where the hypothesis of PA is of interest.

Keywords: MLE; Phase-type distribution; model selection; spectral methods

1 Introduction
1.1 Contact heterogeneity in infectious disease epidemiology
Infectious pathogens that spread via contact between people are a major cause of human disease, driving attempts to understand their epidemiology [].
Much theoretical work on infectious disease dynamics has focused on the role of heterogeneity in the human population [], which is often conceptualised as a network of epidemiologically relevant contacts [–]. Perhaps the most important quantity in any infectious disease outbreak is the basic reproductive ratio, $R_0$, which is defined verbally as the expected number of secondary cases generated by an average primary case early in the epidemic. An epidemic is possible exactly when $R_0 > 1$, and typically the efforts required to control such an outbreak grow monotonically with $R_0$ [, ]. In the simplified scenario where each individual picks an integer number of contacts $K$ from the same degree distribution, but transmission is otherwise homogeneous,

$$ R_0 \propto \mathbb{E}\left[K^2\right]. \tag{1} $$

This dependence of the basic reproductive ratio on the second moment of the degree distribution has been a 'textbook' result for some time []; however, work by Pastor-Satorras and Vespignani [] and May and Lloyd [] raised the question of what might happen for large, or asymptotically divergent, second moments. Such questions can be posed and answered at different levels of mathematical rigour []; in the context of epidemiology, however, it is clear that a highly variable degree distribution is associated with the epidemiologically unrealistic scenario that even the most weakly transmissible pathogen can cause an epidemic and, as a corollary, that control of any infectious disease through untargeted vaccination would be impossible.

© 2015 House et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
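The proportionality between the basic reproductive ratio and the second moment of K stated above can be made concrete with a small numerical comparison. The following is a minimal sketch, not the authors' code: the two populations, the helper names `poisson_pmf` and `moment`, and the truncation point of 100 are all illustrative assumptions. Both populations have the same mean number of contacts, so any difference in the second moment comes from heterogeneity alone.

```python
import math


def poisson_pmf(lam, k_max):
    """Poisson pmf as {k: Pr(K = k)}, truncated at k_max (assumed large enough
    that the neglected tail mass is negligible)."""
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_max + 1)}


def moment(pmf, order):
    """Return E[K**order] for a degree distribution given as {k: Pr(K = k)}."""
    return sum(k**order * p for k, p in pmf.items())


# Hypothetical population A: Poisson-distributed contacts with mean 10.
homogeneous = poisson_pmf(10.0, 100)

# Hypothetical population B: 90% of people have 1 contact, 10% have 91.
# The mean is 0.9 * 1 + 0.1 * 91 = 10, the same as population A.
heterogeneous = {1: 0.9, 91: 0.1}

for name, pmf in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    print(f"{name}: E[K] = {moment(pmf, 1):.2f}, E[K^2] = {moment(pmf, 2):.2f}")

# With the same constant of proportionality, the ratio of second moments is
# the ratio of the two basic reproductive ratios.
r0_ratio = moment(heterogeneous, 2) / moment(homogeneous, 2)
print(f"R_0 ratio (heterogeneous / homogeneous): {r0_ratio:.2f}")
```

For a Poisson distribution with mean 10 the second moment is 10 + 10^2 = 110, whereas the two-group population has second moment 0.9 + 0.1 * 91^2 = 829. Under the stated assumptions the heterogeneous population would therefore have a basic reproductive ratio about 7.5 times larger, despite the identical mean.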
1.2 Data
Of course, whether such a theoretical possibility matters for the study of infectious diseases depends on the actual variance in degree for epidemiologically relevant contacts. While 20th century models of infectious disease were often based on strong a priori assumptions about mixing patterns [], various methods for measurement of contact patterns now exist and were reviewed by Read et al. []. As well as direct measurement of individuals through surveys [], it is possible to improve coverage through snowball and respondent-driven sampling [, ], to make use of the extremely large datasets produced by electronic sensors [, ], and also to combine aggregate data [, ].

Previous empirical studies have seen evidence that for direct (e.g. [, ]) and sexual (e.g. [, ]) contacts, an approximate power-law relationship may hold such that for large $k$, a randomly selected node obeys

$$ \Pr(\text{node has } k \text{ links}) \approx k^{-\gamma}. \tag{2} $$

As is the case for almost all biological data, there is much more complexity in the data than such a simple parametric relationship as (2) would imply. For example, Leigh Brown et al. [] found that while a power-law was a better functional form than the negative binomial for sexual contacts, the richer Waring distribution was preferable to either. What is hard to dispute, however, is that as better quality data on epidemiologically relevant contacts are obtained, the evidence consistently points to a very high level of variance; for example, Read et al. [] were able to validate the high numbers of contacts reported by some study participants through direct (rather than postal) surveying.
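To see why a power-law degree distribution matters for the second moment, the sketch below computes E[K^2] for a power law truncated at increasingly large cutoffs. This is an illustration under stated assumptions rather than anything taken from the paper: the power law is applied over the whole range from `k_min = 1` (the text only claims it for large k), and the function name, the cutoffs, and the exponents are hypothetical choices.

```python
def truncated_power_law_second_moment(gamma, k_max, k_min=1):
    """E[K^2] when Pr(K = k) is proportional to k**(-gamma) on k_min..k_max."""
    ks = range(k_min, k_max + 1)
    weights = [k ** (-gamma) for k in ks]  # unnormalised probabilities
    normalisation = sum(weights)
    return sum(k * k * w for k, w in zip(ks, weights)) / normalisation


# Illustrative cutoffs and exponents (assumptions, not values from the paper).
cutoffs = [10**3, 10**4, 10**5]
for gamma in (2.5, 3.0, 4.0):
    moments = [truncated_power_law_second_moment(gamma, k_max) for k_max in cutoffs]
    print(gamma, [round(m, 2) for m in moments])
```

For exponents of 3 or below, the second moment keeps growing as the cutoff is raised (roughly like the square root of the cutoff for an exponent of 2.5, and logarithmically for an exponent of 3), while for an exponent of 4 it settles near 1.52. Only in the latter case does the second moment stay bounded as the population of possible degrees grows.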
These empirical observations of high heterogeneity in contact number, together with theoretical results about $R_0$, present a paradox for infectious disease epidemiology: is the extreme heterogeneity in observed contact patterns indicative of PA, and does that imply that $R_0 > 1$ for almost any finite level of person-to-person transmissibility, meaning that our theoretical understanding of infectious disease epidemiology is somehow severely lacking?

1.3 Preferential attachment and power laws in empirical data
Recent years have seen a debate about the level of heterogeneity that exists in a variety of observed networks. A particularly influential paper by Barabási and Albert [] considered a model of network formation in which many new nodes are added to a small existing network. These new nodes connect preferentially to nodes that have more links in the existing network, leading to the asymptotic result (2) with $\gamma = 3$. In this way preferential attachment is intimately linked with, but not always equivalent to, asymptotic power-law behaviour. Simple power-law relationships have been claimed for numerous real-world systems, and a critical review of these claims by Clauset et al. [] used maximum-likelihood fitting of distribution tails to power-law distributions to show varying levels of statistical support for claims in the literature. In the context of discrete dat (...truncated)