High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies
Monika Cechova,1 Robert S. Harris,1 Marta Tomaszkiewicz,1 Barbara Arbeithuber,1
Francesca Chiaromonte,*,2,3,4 and Kateryna D. Makova*,1,4
1 Department of Biology, Pennsylvania State University, University Park, PA
2 Department of Statistics, Pennsylvania State University, University Park, PA
3 EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy
4 Center for Medical Genomics, Penn State, University Park, PA
Abstract
Key words: heterochromatin, satellite repeats, long sequencing reads, great apes.
Introduction
Heterochromatin is the gene-poor and highly compacted
portion of the genome. It is typically dominated by satellite
repeats—long arrays of tandemly repeated noncoding DNA
(Kit 1961; Sueoka 1961) that consist of smaller units organized
into higher-order repeat structures. Heterochromatin is abundant, for instance, at telomeres and centromeres of human
chromosomes (Sujiwattanarat et al. 2015). While labeled as
“junk DNA” in the past, heterochromatin was later found to
fulfill important functions in the genome (Walker 1971;
Yunis and Yasmineh 1971; Ferree and Barbash 2009).
Heterochromatin satellite repeat expansions have been associated with changes in gene expression and methylation
(Brahmachary et al. 2014; Quilez et al. 2016). It has also
been proposed that heterochromatin aids in maintaining cellular identity by repressing genes that are not specific to a
particular cell lineage (reviewed in Becker et al. 2016). For
instance, the heterochromatin-associated histone mark
H3K9me3 blocks reprogramming to pluripotency (Soufi
et al. 2012). Additionally, heterochromatin is lost as part of
the normal aging process (Zhang et al. 2015) and is altered
during stress (Gowen and Gay 1933; Jolly et al. 2004; Rizzi et al.
2004; Tittel-Elmer et al. 2010; Seong et al. 2011). Despite a
growing interest in understanding these important functions
of heterochromatin, satellite repeats are frequently underrepresented in genomic studies—due to the difficulties in sequencing and assembling these highly similar sequences
(Chaisson et al. 2015). The lack of information about satellite
repeats is particularly alarming given their high abundance;
© The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://
creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium,
provided the original work is properly cited. For commercial re-use, please contact
Open Access
Mol. Biol. Evol. 36(11):2415–2431 doi:10.1093/molbev/msz156 Advance Access publication July 2, 2019
Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is
known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and
underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of
unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads
(Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs,
frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response)
and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant
repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel
computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific
Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar
sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed
characterization of great ape satellite repeats, and open new avenues for exploring their functions.
*Corresponding authors: Francesca Chiaromonte and Kateryna D. Makova.
Associate editor: Irina Arkhipova
Illumina sequencing reads from 79 great apes were generated as part of the Ape Diversity Project (Prado-Martinez et al. 2013). Sequencing reads
for human populations were generated by Meyer et al. (2012). Additionally, human samples from the Genome in a Bottle
project (Zook et al. 2016; IDs HG002, HG003, and HG004) and two human trios from the 1000 Genomes Project (1000 Genomes Project
Consortium et al. 2015; IDs NA12889, NA12890, NA12877 and NA12891, NA12892, NA12878) were used. The publicly
available PacBio data had the following IDs: SRR2097942 for human, SRR5269473 for chimpanzee, ERR1294100 for gorilla, and
SRR5235143 for Sumatran orangutan. The Nanopore data generated are deposited under the BioProject PRJNA505331. All scripts
are available from the git repository at https://github.com/makovalab-psu/heterochromatin, last accessed July 05, 2019.
repeat was also identified in orangutan, chicken, maize, sea
urchin, and Daphnia (Grady et al. 1992; Flynn et al. 2017);
however, its variation among great ape species has not been studied.
The telomeric (TTAGGG)n satellite functions to maintain
genome stability; telomere loss is correlated with cell division
and aging (Lanza et al. 2000; Rizvi et al. 2014). StSats present in
the genomes of chimpanzee, bonobo, and gorilla (Royle et al.
1994) localize proximal to telomeres (Royle et al. 1994; Koga
et al. 2011; Ventura et al. 2012) and were proposed to play a
role in telomere metabolism (Novo et al. 2013) and meiotic
telomere clustering important for homolog recognition and
pairing (in a process similar to that identified in plants; Bass
et al. 2000; Calderon et al. 2014).
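To make the notion of a satellite array such as (TTAGGG)n concrete, the following minimal Python sketch locates perfect tandem runs of a repeat unit in unassembled reads and summarizes them as a density. It is an illustration under stated assumptions, not the pipeline used in this study: the function names, the `min_copies` threshold, and the definition of density as repeat bases per kilobase of sequence are all hypothetical choices made for the example.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_motif(unit: str) -> str:
    """Lexicographically smallest rotation of the unit or of its reverse
    complement, so that e.g. TTAGGG and CCCTAA share one label.
    (Hypothetical helper; the naming convention is an assumption.)"""
    unit = unit.upper()
    candidates = []
    for s in (unit, reverse_complement(unit)):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def perfect_arrays(read: str, unit: str, min_copies: int = 3):
    """(start, end) spans of perfect tandem runs of `unit` in `read`.
    Only the given strand is searched; partial units at array ends
    are not counted."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit.upper()), min_copies))
    return [m.span() for m in pattern.finditer(read.upper())]


def repeat_density(reads, unit: str, min_copies: int = 3) -> float:
    """Repeat bases per kilobase of sequence (an assumed definition)."""
    total = sum(len(r) for r in reads)
    covered = sum(
        end - start
        for r in reads
        for start, end in perfect_arrays(r, unit, min_copies)
    )
    return 1000.0 * covered / total if total else 0.0


read = "ACGT" + "TTAGGG" * 4 + "ACGT"
spans = perfect_arrays(read, "TTAGGG")
density = repeat_density([read], "TTAGGG")
```

To count arrays sequenced from the opposite strand, the same functions can be called with `reverse_complement(unit)`. A regular expression is used here only for brevity; it detects perfect repeats and would miss arrays interrupted by mismatches or indels.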
In this study, we characterize turnover of satellites with
repeat units ≤50 bp among six great ape species—human, chimpanzee, bonobo, gorilla, Bornean orangutan,
and Sumatran orangutan—which diverged <15 My ago
(Goodman et al. 2005). We focus on repeats that constitute portions of long arrays of satellite DNA and use them
as a proxy for heterochromatin (Wei et al. 2014). This
approximation is needed because of challenges in the direct identification of heterochromatin due to its transient
nature in various cells of individuals throughout their lifetime. In this manuscript, we first identify satellite repeats
in short sequencing reads gene (...truncated)