The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure
Heredity (2011) 106, 625–632
© 2011 Macmillan Publishers Limited. All rights reserved 0018-067X/11
ORIGINAL ARTICLE
www.nature.com/hdy
ST Kalinowski
Department of Ecology, Montana State University, Bozeman, MT, USA
One of the primary goals of population genetics is to
succinctly describe genetic relationships among populations,
and the computer program STRUCTURE is one of the most
frequently used tools for doing so. The mathematical model
used by STRUCTURE was designed to sort individuals into
Hardy–Weinberg populations, but the program is also frequently used to group individuals from a large number of
populations into a small number of clusters that are supposed
to represent the main genetic divisions within species. In this
study, I used computer simulations to examine how well
STRUCTURE accomplishes this latter task. Simulations of
populations that had a simple hierarchical history of fragmentation showed that when there were relatively long divergence
times within evolutionary lineages, the clusters created by
STRUCTURE were frequently not consistent with the evolutionary history of the populations. These difficulties can be
attributed to forcing STRUCTURE to place individuals into too
few clusters. Simulations also showed that the clusters
produced by STRUCTURE can be strongly influenced by
variation in sample size. In some circumstances, STRUCTURE
simply put all of the individuals from the largest sample in the
same cluster. A reanalysis of human population structure
suggests that the problems I identified with STRUCTURE in
simulations may have obscured relationships among human
populations—particularly genetic similarity between Europeans
and some African populations.
Heredity (2011) 106, 625–632; doi:10.1038/hdy.2010.95;
published online 4 August 2010
Keywords: population structure; STRUCTURE; population genetics; evolutionary tree; humans
One of the principal goals of population genetics is to
describe the genetic structure of populations. In essence,
this means summarizing the genetic similarities and
differences among populations in as simple a manner
as possible. For some taxa, this is easy. For example, the
range-wide population structure of Atlantic salmon has
two salient features: Atlantic salmon in Europe and
North America are very different from each other, and
within each continent, genetic differentiation between
populations is proportional to geographic distance (King
et al., 2001). The population structure of other species can
be difficult to summarize. For example, human population structure is quite complex, and there has been recent
debate on the extent to which human genetic diversity is
distributed in clusters or along clines (for example,
Manica et al., 2005; Rosenberg et al., 2005).
Correspondence: Dr ST Kalinowski, Department of Ecology, 310 Lewis
Hall, Montana State University, Bozeman, MT 59717, USA.
E-mail:
Received 22 July 2009; revised 7 May 2010; accepted 17 June 2010;
published online 4 August 2010
No matter how simple or complex genetic relationships
among populations may be, geneticists need to be careful
that the statistical methods they use to summarize relationships do not distort the actual relationships among
populations. Imposing inappropriate statistical models
upon genetic data is all too easy. For example, if populations
have an isolation-by-distance population structure, an
unweighted pair group method with arithmetic mean (UPGMA) tree
could easily provide a misleading depiction of the genetic
structure (Kalinowski, 2009). This happens because a
UPGMA tree
cannot show a population structure that is not hierarchical.
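The point about hierarchical trees can be illustrated with a minimal sketch. It assumes four hypothetical populations spaced evenly along a line, with genetic distance equal to the number of steps between them (an idealized isolation-by-distance pattern that is not taken from the paper). The function name `upgma_merge_distances` and the distance matrix are illustrative only. The code builds an unweighted pair group method with arithmetic mean tree and reports, for each pair of populations, the distance at which the tree joins them.

```python
from itertools import combinations


def upgma_merge_distances(dist):
    """Run UPGMA and return, for every pair of leaves, the cluster
    distance at which the tree first joins them."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaves
    d = {(i, j): float(dist[i][j]) for i, j in combinations(range(n), 2)}
    joined = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        # Closest pair of clusters; ties are broken by the smallest ids.
        (a, b), dab = min(d.items(), key=lambda kv: (kv[1], kv[0]))
        # Every leaf pair spanning the two clusters is joined at this level.
        for x in clusters[a]:
            for y in clusters[b]:
                joined[x][y] = joined[y][x] = dab
        # Size-weighted average distance from the new cluster to the others.
        na, nb = len(clusters[a]), len(clusters[b])
        new = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        # Drop distances involving the merged clusters, then add the new ones.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return joined


# Hypothetical isolation-by-distance data: distance = steps along a line.
dist = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
tree = upgma_merge_distances(dist)
for i, j in combinations(range(4), 2):
    print(f"pops {i}-{j}: observed {dist[i][j]}, tree {tree[i][j]}")
```

With this input the tree pairs populations 0 with 1 and 2 with 3, then joins the two pairs at a distance of 2.0. Neighbouring populations 1 and 2 (observed distance 1) therefore appear just as far apart as the two ends of the line (observed distance 3), so the gradual cline is redrawn as two discrete groups.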
The computer program STRUCTURE (Pritchard et al.,
2000; Falush et al., 2003; Hubisz et al., 2009) is currently one
of the most frequently used statistical tools for describing
population structure. The program does this by sorting
individuals into Hardy–Weinberg/linkage equilibrium
populations, which creates clusters of individuals that
have distinctive allele frequencies. An important step in
this analysis is deciding how many clusters to sort
individuals into. This number, K, is selected by the user.
If K is equal to the actual number of Hardy–Weinberg
populations that the individuals belong to, STRUCTURE
will attempt to sort individuals into the populations they
came from. This can be very useful when the origin of
individuals is unknown. However, STRUCTURE is also
frequently used to identify the main genetic clusters within
species. In this second type of analysis, individuals are
assigned to clusters in the same manner as above, but K is
deliberately set to be smaller than the actual number of
populations. Rosenberg et al. (2001) argued that such
clustering is useful for ‘identification of population
relationships, history, and within-species genetic units for
conservation’ (last sentence of paper).
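The idea of sorting individuals into Hardy–Weinberg/linkage equilibrium populations can be made concrete with a small sketch. This is not STRUCTURE's algorithm: the program estimates allele frequencies and cluster memberships jointly with Bayesian methods, whereas the sketch below assumes the allele frequencies of each cluster are already known and all positive. The function names, cluster names, and frequencies are hypothetical. Under those assumptions, the probability of a diploid genotype in a cluster is the product over loci of p² for homozygotes and 2pq for heterozygotes, and the individual is placed in the cluster where that probability is highest.

```python
import math


def log_likelihood(genotype, freqs):
    """Log-probability of a diploid multilocus genotype in one cluster,
    assuming Hardy-Weinberg proportions and independent loci.

    genotype: list of (allele, allele) tuples, one per locus
    freqs:    list of {allele: frequency} dicts, one per locus
    """
    total = 0.0
    for (a, b), p in zip(genotype, freqs):
        # Homozygote: p_a^2; heterozygote: 2 * p_a * p_b.
        prob = p[a] * p[b] if a == b else 2.0 * p[a] * p[b]
        total += math.log(prob)  # assumes every frequency used is > 0
    return total


def assign(genotype, clusters):
    """Return the best-fitting cluster name and all log-likelihoods."""
    scores = {name: log_likelihood(genotype, f) for name, f in clusters.items()}
    return max(scores, key=scores.get), scores


# Hypothetical allele frequencies for K = 2 clusters at two loci.
clusters = {
    "cluster1": [{"A": 0.9, "B": 0.1}, {"A": 0.8, "B": 0.2}],
    "cluster2": [{"A": 0.2, "B": 0.8}, {"A": 0.3, "B": 0.7}],
}
individual = [("A", "A"), ("A", "B")]  # homozygous, then heterozygous

best, scores = assign(individual, clusters)
print(best, {name: round(math.exp(s), 4) for name, s in scores.items()})
```

Here the genotype has probability 0.81 × 0.32 = 0.2592 in the first cluster and 0.04 × 0.42 = 0.0168 in the second, so the individual is assigned to the first. When K is set below the true number of populations, each cluster must absorb several populations with different allele frequencies, and the Hardy–Weinberg assumption inside this calculation no longer holds exactly.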
In the 10 years since STRUCTURE was created, over
3000 papers have cited the program, and many users of
STRUCTURE have used the program to describe genetic
relationships among populations. For example, in a
landmark study of human population structure, Rosenberg et al. (2002; 2005) used STRUCTURE to sort people
from 52 ethnic groups into five clusters. This analysis
clustered individuals by continent, and this result has
been influential in subsequent discussions of human
population structure. However, this result—and other
analyses of population-level relationships made by
STRUCTURE—may need to be reevaluated. The mathematical model used by STRUCTURE was designed for
clustering individuals into Hardy–Weinberg/linkage
equilibrium populations. It was not designed for
clustering individuals into groups of populations, and
may not work as its users intend when this is done.
A few investigators have evaluated how well STRUCTURE works in different applications, but this testing has
shed little light on how well STRUCTURE summarizes
relationships among populations. For example, Rosenberg
et al. (2001) showed that STRUCTURE could accurately
sort individual chickens by breed, but this empirical test
did not evaluate how well STRUCTURE could cluster
individuals into groups of related populations. Evanno
et al. (2005) addressed this latter question using simulated
data and showed that STRUCTURE was able to do this
successfully. However, Evanno et al. (2005) used a
hierarchical island model of gene flow, which made the
biologically simplistic assumption that all groups of
populations were equally different from each other.
Real populations are expected to show more complex
relationships, and this may affect the manner in which
STRUCTURE assigns individuals to clusters. Lastly,
Schwartz and McKelvey (2008) showed that when
individuals were distributed continuously on a twodimen (...truncated)