The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure

Heredity, Aug 2010

One of the primary goals of population genetics is to succinctly describe genetic relationships among populations, and the computer program STRUCTURE is one of the most frequently used tools for doing so. The mathematical model used by STRUCTURE was designed to sort individuals into Hardy–Weinberg populations, but the program is also frequently used to group individuals from a large number of populations into a small number of clusters that are supposed to represent the main genetic divisions within species. In this study, I used computer simulations to examine how well STRUCTURE accomplishes this latter task. Simulations of populations that had a simple hierarchical history of fragmentation showed that when there were relatively long divergence times within evolutionary lineages, the clusters created by STRUCTURE were frequently not consistent with the evolutionary history of the populations. These difficulties can be attributed to forcing STRUCTURE to place individuals into too few clusters. Simulations also showed that the clusters produced by STRUCTURE can be strongly influenced by variation in sample size. In some circumstances, STRUCTURE simply put all of the individuals from the largest sample in the same cluster. A reanalysis of human population structure suggests that the problems I identified with STRUCTURE in simulations may have obscured relationships among human populations—particularly genetic similarity between Europeans and some African populations.


Heredity (2011) 106, 625–632
© 2011 Macmillan Publishers Limited. All rights reserved 0018-067X/11
ORIGINAL ARTICLE
www.nature.com/hdy

ST Kalinowski
Department of Ecology, Montana State University, Bozeman, MT, USA
Heredity (2011) 106, 625–632; doi:10.1038/hdy.2010.95; published online 4 August 2010

Keywords: population structure; STRUCTURE; population genetics; evolutionary tree; humans

Correspondence: Dr ST Kalinowski, Department of Ecology, 310 Lewis Hall, Montana State University, Bozeman, MT 59717, USA.
Received 22 July 2009; revised 7 May 2010; accepted 17 June 2010; published online 4 August 2010

One of the principal goals of population genetics is to describe the genetic structure of populations. In essence, this means summarizing the genetic similarities and differences among populations in as simple a manner as possible. For some taxa, this is easy. For example, the range-wide population structure of Atlantic salmon has two salient features: Atlantic salmon in Europe and North America are very different from each other, and within each continent, genetic differentiation between populations is proportional to geographic distance (King et al., 2001). The population structure of other species can be difficult to summarize. For example, human population structure is quite complex, and there has been recent debate on the extent to which human genetic diversity is distributed in clusters or along clines (for example, Manica et al., 2005; Rosenberg et al., 2005).

No matter how simple or complex genetic relationships among populations may be, geneticists need to be careful that the statistical methods they use to summarize relationships do not distort the actual relationships among populations. Imposing inappropriate statistical models upon genetic data is all too easy. For example, if populations have an isolation-by-distance population structure, an unweighted pair group method with arithmetic mean (UPGMA) tree could easily provide a misleading depiction of the genetic structure (Kalinowski, 2009). This happens because a UPGMA tree cannot show a population structure that is not hierarchical.
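The distortion described above can be illustrated with a minimal sketch. It is not taken from the paper: the four populations, the distance matrix `line_dist`, and the function `upgma_cophenetic` are hypothetical, chosen only to show how average-linkage clustering imposes a hierarchy on populations whose genetic distance simply grows with geographic distance along a line.

```python
from itertools import combinations


def upgma_cophenetic(dist):
    """Run UPGMA (average linkage) on a symmetric distance matrix and return
    the tree-implied distance (merge height) for every pair of leaves."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaves
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(n), 2)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters (ties broken by lowest ids).
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))]
        # Every leaf pair split across a and b is joined at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        # Distance from the merged cluster to each other cluster is the
        # size-weighted mean of the two old distances.
        na, nb = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c not in (a, b):
                d[frozenset((next_id, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    return coph


# Hypothetical isolation-by-distance example: four populations on a line,
# with genetic distance equal to the number of steps between them.
line_dist = [[abs(i - j) for j in range(4)] for i in range(4)]
coph = upgma_cophenetic(line_dist)

for i, j in combinations(range(4), 2):
    print(f"pops {i},{j}: input distance {line_dist[i][j]}, tree distance {coph[i][j]}")
```

In this toy case the tree joins populations 0 and 1, then 2 and 3, and finally the two pairs. Adjacent populations 1 and 2 therefore end up separated by the same tree distance (2.0) as the two end populations 0 and 3, although their input distances are 1 and 3.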
The computer program STRUCTURE (Pritchard et al., 2000; Falush et al., 2003; Hubisz et al., 2009) is currently one of the most frequently used statistical tools for describing population structure. The program does this by sorting individuals into Hardy–Weinberg/linkage equilibrium populations, which creates clusters of individuals that have distinctive allele frequencies. An important step in this analysis is deciding how many clusters to sort individuals into. This number, K, is selected by the user. If K is equal to the actual number of Hardy–Weinberg populations that the individuals belong to, STRUCTURE will attempt to sort individuals into the populations they came from. This can be very useful when the origin of individuals is unknown. However, STRUCTURE is also frequently used to identify the main genetic clusters within species. In this second type of analysis, individuals are assigned to clusters in the same manner as above, but K is deliberately set to be smaller than the actual number of populations. Rosenberg et al. (2001) argued that such clustering is useful for ‘identification of population relationships, history, and within-species genetic units for conservation’ (last sentence of paper).

In the 10 years since STRUCTURE was created, over 3000 papers have cited the program, and many users of STRUCTURE have used the program to describe genetic relationships among populations. For example, in a landmark study of human population structure, Rosenberg et al. (2002, 2005) used STRUCTURE to sort people from 52 ethnic groups into five clusters. This analysis clustered individuals by continent, and this result has been influential in subsequent discussions of human population structure. However, this result, and other analyses of population-level relationships made by STRUCTURE, may need to be reevaluated.
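The idea of sorting individuals into Hardy–Weinberg/linkage equilibrium clusters can be made concrete with a small sketch. It is not STRUCTURE's actual algorithm, which estimates allele frequencies and cluster memberships jointly by Markov chain Monte Carlo. The sketch shows only the conditional step for one individual, under assumptions introduced here for illustration: no admixture, known and nonzero allele frequencies, and a uniform prior over the K clusters. The function name `assignment_probabilities` and the frequencies in `freqs` are hypothetical.

```python
import math


def assignment_probabilities(genotype, freqs):
    """Posterior probability that one diploid individual belongs to each cluster.

    Assumes Hardy-Weinberg and linkage equilibrium within clusters, known
    nonzero allele frequencies, and a uniform prior over clusters.

    genotype: list of (allele, allele) pairs, one pair per locus
    freqs:    freqs[k][locus][allele] = frequency of that allele in cluster k
    """
    log_liks = []
    for cluster in freqs:
        ll = 0.0
        for locus, (a1, a2) in enumerate(genotype):
            # Under Hardy-Weinberg equilibrium the two allele copies are
            # independent draws; under linkage equilibrium loci multiply.
            # The factor of 2 for heterozygotes is the same in every
            # cluster, so it cancels in the posterior and is omitted.
            ll += math.log(cluster[locus][a1]) + math.log(cluster[locus][a2])
        log_liks.append(ll)
    # Normalize in log space to avoid underflow with many loci.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example with K = 2 clusters and three biallelic loci.
freqs = [
    [{"A": 0.9, "B": 0.1}] * 3,  # cluster 0: allele A is common
    [{"A": 0.1, "B": 0.9}] * 3,  # cluster 1: allele B is common
]
probs = assignment_probabilities([("A", "A"), ("A", "A"), ("A", "A")], freqs)
even = assignment_probabilities([("A", "B")], freqs)
print(probs, even)
```

When K is set below the true number of populations, each cluster's allele frequencies must stand in for several populations at once, which is the situation the simulations in this paper examine.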
The mathematical model used by STRUCTURE was designed for clustering individuals into Hardy–Weinberg/linkage equilibrium populations. It was not designed for clustering individuals into groups of populations, and may not work as its users intend when this is done. A few investigators have evaluated how well STRUCTURE works in different applications, but this testing has shed little light on how well STRUCTURE summarizes relationships among populations. For example, Rosenberg et al. (2001) showed that STRUCTURE could accurately sort individual chickens by breed, but this empirical test did not evaluate how well STRUCTURE could cluster individuals into groups of related populations. Evanno et al. (2005) addressed this latter question using simulated data and showed that STRUCTURE was able to do this successfully. However, Evanno et al. (2005) used a hierarchical island model of gene flow, which made the biologically simplistic assumption that all groups of populations were equally different from each other. Real populations are expected to show more complex relationships, and this may affect the manner in which STRUCTURE assigns individuals to clusters. Lastly, Schwartz and McKelvey (2008) showed that when individuals were distributed continuously on a twodimen (...truncated)


Article home page: https://www.nature.com/articles/hdy201095

S T Kalinowski. The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure, Heredity, 2010, pp. 625-632, Issue: 106, DOI: 10.1038/hdy.2010.95