The Structure of Simple Satellite Variation in the Human Genome and Its Correlation With Centromere Ancestry
GBE
Iskander Said, Daniel A. Barbash, Andrew G. Clark*
Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
*Corresponding author: E-mail: .
Abstract
Although repetitive DNA forms much of the human genome, its study is challenging due to limitations in assembly and alignment
of repetitive short-reads. We have deployed k-Seek, software that detects tandem repeats embedded in single reads, on
2,504 human genomes from the 1,000 Genomes Project to quantify the variation and abundance of simple satellites (repeat
units <20 bp). We find that the ancestral monomer of Human Satellite 3 makes up the largest portion of simple satellite content
in humans (mean of ∼8 Mb). We discovered ∼50,000 rare tandem repeats that are not detected in the T2T-CHM13v2.0
assembly, including undescribed variants of telomeric and pericentromeric repeats. We find broad homogeneity of the most
abundant repeats across populations, except for AG-rich repeats, which are more abundant in African individuals. We also
find cliques of highly similar AG- and AT-rich satellites that are interspersed and form higher-order structures that covary
in copy number across individuals, likely through concerted amplification via unequal exchange. Finally, we use pericentromeric
polymorphisms to estimate centromeric genetic relatedness between individuals and find a strong predictive relationship
between centromeric lineages and pericentromeric simple satellite abundances. In particular, the abundances of the ancestral
monomers of Human Satellite 2 and Human Satellite 3 correlate with clusters of centromeric ancestry on chromosome 16
and chromosome 9, with some clusters structured by population. These results provide new descriptions of the population
dynamics that underlie the evolution of simple satellites in humans.
Key words: population structure, human evolution, repetitive DNA, heterochromatin.
Significance
Satellite DNAs make up large and occasionally essential portions of the human genome, but the study of the variation
and evolution of these repeats is limited by technical problems of using short-read Illumina sequencing. By using k-Seek,
a method to mine tandem repeats in unassembled short-reads, we circumvent some of these technical problems and
analyze the population variation and structure of simple satellites in 2,504 human genomes. We report previously undescribed
simple satellites, correlated variation in simple satellite abundances, and population structure of centromeric
DNA variation.
Introduction
Repetitive DNA is a near-ubiquitous feature of eukaryotic
genomes and in humans represents ∼54% of the genome
(Hoyt et al. 2022). Repetitive DNA in the human genome is
∼89% interspersed transposable element sequences
and ∼11% satellites, tandemly repeating arrays of homogenized
sequence motifs with individual arrays ranging in
length from a few hundred base pairs to several megabases
(Altemose et al. 2022; Hoyt et al. 2022). These tandem repeats
are diverse in sequence, implicated in human disease
© The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact for reprints
and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information
please contact .
Genome Biol. Evol. 16(8) https://doi.org/10.1093/gbe/evae153 Advance Access publication 17 July 2024
Accepted: July 12, 2024
(Aldrup-MacDonald et al. 2016; Altemose et al. 2022). In
Drosophila melanogaster the variation of both the amount
and sequence of constitutive heterochromatin in the
Y-chromosome affects gene expression and chromatin
state (Zhou et al. 2012; Berloco et al. 2014; Kelsey and
Clark 2017; Delanoue et al. 2023). However, the full extent
of satellite variation in both abundance and sequence and
how much of this variation affects organismal fitness
have yet to be fully explored.
Many satellites are rapidly evolving. For example, Alpha
Satellite has conserved biological function across taxa, yet
shows rapid rates of evolution in sequence and abundance,
as well as varying mutation rates across centromeres
(Henikoff et al. 2001; Logsdon et al. 2024). There are highly
divergent species-specific simple satellites and satellite subfamilies
in closely related species of Drosophila and primates
that have rapidly evolved in sequence and abundance (Waye
and Willard 1989; Haaf and Willard 1997; Jarmuz et al. 2007;
Wei et al. 2018; Cechova et al. 2019). Species-specific satellites
are even implicated in hybrid incompatibility in
Drosophila species (Ferree and Barbash 2009; Bayes and
Malik 2009; Satyaki et al. 2014). There is also considerable
intraspecific variation of simple satellite abundances in natural
and experimental populations of D. melanogaster,
Chlamydomonas reinhardtii and Daphnia pulex (Wei et al.
2014; Flynn et al. 2017, 2018). These observations are somewhat
contradictory to the theoretical models of concerted
satellite evolution, which predict that within a given species
satellite sequences and variation should be homogenized
(Smith 1976; Perelson and Bell 1977; Stephan 1989;
Stephan and Cho 1994). This deviation of the empirical
data from the evolutionary models suggests that the evolutionary
dynamics of simple satellites in populations are not
well described and significant advancements to the population
genetic models must be made to fully capture their evolutionary
rates and variation, particularly if we aim to
understand the selective forces acting on satellites.
Human population-genomics data provide a vast resource
of thousands of high-quality genomes necessary
to advance the evolutionary models of satellites.
However, studies of the population variation of satellites
in humans have been focused primarily on microsatellites
(Payseur et al. 2011; Willems et al. 2014). The variation of
simple and complex satellites has been studied to a lesser
degree, but there is evidence of over 10-fold differences
in HSat3 abundance between human Y-haplogroups, 5- to 10-fold differences in centromere size, and tremendous
diversity in the higher-order structures of centromeric repeats,
pointing to rapid evolution of satellites within human
populations (Altemose et al. 2014, 2022; Miga 2019;
Suzuki et al. 2020). Some satellite variation within the centromeres
affects the formation and the positioning of the
kinetochore, which may affect the integrity of cell division, (...truncated)