The Structure of Simple Satellite Variation in the Human Genome and Its Correlation With Centromere Ancestry

Genome Biology and Evolution, Aug 2024

Although repetitive DNA forms much of the human genome, its study is challenging due to limitations in assembly and alignment of repetitive short reads. We have deployed k-Seek, software that detects tandem repeats embedded in single reads, on 2,504 human genomes from the 1,000 Genomes Project to quantify the variation and abundance of simple satellites (repeat units <20 bp). We find that the ancestral monomer of Human Satellite 3 makes up the largest portion of simple satellite content in humans (mean of ∼8 Mb). We discovered ∼50,000 rare tandem repeats that are not detected in the T2T-CHM13v2.0 assembly, including undescribed variants of telomeric and pericentromeric repeats. We find broad homogeneity of the most abundant repeats across populations, except for AG-rich repeats, which are more abundant in African individuals. We also find cliques of highly similar AG- and AT-rich satellites that are interspersed and form higher-order structures that covary in copy number across individuals, likely through concerted amplification via unequal exchange. Finally, we use pericentromeric polymorphisms to estimate centromeric genetic relatedness between individuals and find a strong predictive relationship between centromeric lineages and pericentromeric simple satellite abundances. In particular, the abundances of the ancestral monomers of Human Satellite 2 and Human Satellite 3 correlate with clusters of centromeric ancestry on chromosome 16 and chromosome 9, with some clusters structured by population. These results provide new descriptions of the population dynamics that underlie the evolution of simple satellites in humans.
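To make the idea of detecting tandem repeats embedded in single reads concrete, the following is a minimal Python sketch. It is not k-Seek's implementation: the function names (canonical_unit, tandem_runs, count_units), the 12 bp minimum span, the restriction to perfect repeats, and the choice to collapse rotations and reverse complements of a unit into one canonical form are all assumptions made here for illustration. Only the unit-length cap (<20 bp) comes from the abstract.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_unit(unit: str) -> str:
    """Pick one representative for a repeat unit (illustrative convention):
    the alphabetically smallest rotation of the unit or its reverse complement."""
    revcomp = unit.translate(_COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (unit, revcomp) for i in range(len(s))]
    return min(rotations)


def tandem_runs(read: str, max_unit: int = 19, min_span: int = 12):
    """Scan one read for perfect tandem runs with unit length 1..max_unit.

    Returns a list of (canonical_unit, whole_copies) tuples. min_span is a
    hypothetical threshold (in bp) that keeps short chance runs out.
    """
    runs = []
    n = len(read)
    i = 0
    while i < n:
        best_span, best_k = 0, 0
        for k in range(1, max_unit + 1):
            if i + 2 * k > n:  # no room left for two copies of this unit
                break
            j = i + k
            # Extend the run while each base matches the base one unit back.
            while j < n and read[j] == read[j - k]:
                j += 1
            whole = ((j - i) // k) * k  # span covered by whole copies only
            # Strict '>' keeps the shortest unit when spans tie (AG, not AGAG).
            if whole >= 2 * k and whole >= min_span and whole > best_span:
                best_span, best_k = whole, k
        if best_k:
            unit = canonical_unit(read[i:i + best_k])
            runs.append((unit, best_span // best_k))
            i += best_span  # continue after the run just recorded
        else:
            i += 1
    return runs


def count_units(reads):
    """Sum whole copies per canonical unit over an iterable of reads."""
    totals = Counter()
    for read in reads:
        for unit, copies in tandem_runs(read):
            totals[unit] += copies
    return totals
```

Under this convention, tandem_runs("CATTC" * 4) and tandem_runs("GGAAT" * 4) both return [('AATGG', 4)], so the two strands of the same repeat are tallied together. Turning such per-read counts into per-genome abundances would additionally require normalization (for example by sequencing depth), which this sketch does not attempt.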

Iskander Said, Daniel A. Barbash, Andrew G. Clark*
Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
*Corresponding author.
Key words: population structure, human evolution, repetitive DNA, heterochromatin.

Significance
Satellite DNAs make up large and occasionally essential portions of the human genome, but the study of the variation and evolution of these repeats is limited by technical problems of using short-read Illumina sequencing. By using k-Seek, a method to mine tandem repeats in unassembled short reads, we circumvent some of these technical problems and analyze the population variation and structure of simple satellites in 2,504 human genomes. We report previously undescribed simple satellites, correlated variation in simple satellite abundances, and population structure of centromeric DNA variation.

Genome Biol. Evol. 16(8) https://doi.org/10.1093/gbe/evae153. Accepted: July 12, 2024. Advance Access publication 17 July 2024.
© The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
Repetitive DNA is a near-ubiquitous feature of eukaryotic genomes and in humans represents ∼54% of the genome (Hoyt et al. 2022). Repetitive DNA in the human genome is ∼89% interspersed transposable element sequences and ∼11% satellites, tandemly repeating arrays of homogenized sequence motifs with individual arrays ranging in length from a few hundred base pairs to several megabases (Altemose et al. 2022; Hoyt et al. 2022). These tandem repeats are diverse in sequence, implicated in human disease (Aldrup-MacDonald et al. 2016; Altemose et al. 2022). In Drosophila melanogaster the variation of both the amount and sequence of constitutive heterochromatin in the Y-chromosome affects gene expression and chromatin state (Zhou et al. 2012; Berloco et al. 2014; Kelsey and Clark 2017; Delanoue et al. 2023). However, the full extent of satellite variation in both abundance and sequence and how much of this variation affects organismal fitness have yet to be fully explored.

Many satellites are rapidly evolving. For example, Alpha Satellite has conserved biological function across taxa, yet shows rapid rates of evolution in sequence and abundance, as well as varying mutation rates across centromeres (Henikoff et al. 2001; Logsdon et al. 2024). There are highly divergent species-specific simple satellites and satellite subfamilies in closely related species of Drosophila and primates that have rapidly evolved in sequence and abundance (Waye and Willard 1989; Haaf and Willard 1997; Jarmuz et al. 2007; Wei et al. 2018; Cechova et al. 2019). Species-specific satellites are even implicated in hybrid incompatibility in Drosophila species (Ferree and Barbash 2009; Bayes and Malik 2009; Satyaki et al. 2014). There is also considerable intraspecific variation of simple satellite abundances in natural and experimental populations of D. melanogaster, Chlamydomonas reinhardtii and Daphnia pulex (Wei et al. 2014; Flynn et al. 2017, 2018). These observations are somewhat contradictory to the theoretical models of concerted satellite evolution, which predict that within a given species satellite sequences and variation should be homogenized (Smith 1976; Perelson and Bell 1977; Stephan 1989; Stephan and Cho 1994).
This deviation of the empirical data from the evolutionary models suggests that the evolutionary dynamics of simple satellites in populations are not well described and significant advancements to the population genetic models must be made to fully capture their evolutionary rates and variation, particularly if we aim to understand the selective forces acting on satellites.

Human population-genomics data provide a vast resource of thousands of high-quality genomes necessary to advance the evolutionary models of satellites. However, studies of the population variation of satellites in humans have been focused primarily on microsatellites (Payseur et al. 2011; Willems et al. 2014). The variation of simple and complex satellites has been studied to a lesser degree, but there is evidence of over 10-fold differences in HSat3 abundance between human Y-haplogroups, 5- to 10-fold differences in centromere size, and tremendous diversity in the higher-order structures of centromeric repeats, pointing to rapid evolution of satellites within human populations (Altemose et al. 2014, 2022; Miga 2019; Suzuki et al. 2020). Some satellite variation within the centromeres affects the formation and the positioning of the kinetochore, which may affect the integrity of cell division, (...truncated)


This is a preview of a remote PDF: https://academic.oup.com/gbe/article-pdf/16/8/evae153/58757634/evae153.pdf
Article home page: https://academic.oup.com/gbe/article/16/8/evae153/7715938

Said, Iskander, Barbash, Daniel A, Clark, Andrew G. The Structure of Simple Satellite Variation in the Human Genome and Its Correlation With Centromere Ancestry, Genome Biology and Evolution, 2024, Volume 16, Issue 8, DOI: 10.1093/gbe/evae153