High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies

Molecular Biology and Evolution, Nov 2019

Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

Monika Cechova,1 Robert S. Harris,1 Marta Tomaszkiewicz,1 Barbara Arbeithuber,1 Francesca Chiaromonte,*,2,3,4 and Kateryna D. Makova*,1,4

1 Department of Biology, Pennsylvania State University, University Park, PA
2 Department of Statistics, Pennsylvania State University, University Park, PA
3 EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy
4 Center for Medical Genomics, Penn State, University Park, PA

Key words: heterochromatin, satellite repeats, long sequencing reads, great apes.

Introduction

Heterochromatin is the gene-poor and highly compacted portion of the genome. It is typically dominated by satellite repeats, long arrays of tandemly repeated noncoding DNA (Kit 1961; Sueoka 1961) that consist of smaller units organized into higher order repeat structures. Heterochromatin is abundant, for instance, at telomeres and centromeres of human chromosomes (Sujiwattanarat et al. 2015). While labeled as “junk DNA” in the past, heterochromatin was later found to fulfill important functions in the genome (Walker 1971; Yunis and Yasmineh 1971; Ferree and Barbash 2009). Heterochromatin satellite repeat expansions have been associated with changes in gene expression and methylation (Brahmachary et al. 2014; Quilez et al. 2016). It has also been proposed that heterochromatin aids in maintaining cellular identity by repressing genes that are not specific to a particular cell lineage (reviewed in Becker et al. 2016). For instance, the heterochromatin-associated histone mark H3K9me3 blocks reprogramming to pluripotency (Soufi et al. 2012). Additionally, heterochromatin loss is part of the normal aging process (Zhang et al. 2015) and changes during stress (Gowen and Gay 1933; Jolly et al. 2004; Rizzi et al. 2004; Tittel-Elmer et al. 2010; Seong et al. 2011).
Despite a growing interest in understanding these important functions of heterochromatin, satellite repeats are frequently underrepresented in genomic studies because of the difficulties in sequencing and assembling these highly similar sequences (Chaisson et al. 2015). The lack of information about satellite repeats is particularly alarming given their high abundance;

© The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Mol. Biol. Evol. 36(11):2415–2431 doi:10.1093/molbev/msz156 Advance Access publication July 2, 2019

*Corresponding authors. Associate editor: Irina Arkhipova

Illumina sequencing reads from 79 great apes were part of the Ape Diversity Project (Prado-Martinez et al. 2013). Sequencing reads for human populations were generated by Meyer et al. (2012). Additionally, human samples from the Genome in a Bottle project (Zook et al. 2016; IDs HG002, HG003, and HG004) and two human trios from the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015; IDs NA12889, NA12890, NA12877 and NA12891, NA12892, NA12878) were used. The publicly available PacBio data had the following IDs: SRR2097942 for human, SRR5269473 for chimpanzee, ERR1294100 for gorilla, and SRR5235143 for Sumatran orangutan. The Nanopore data generated are deposited under the BioProject PRJNA505331. All scripts are available from the git repository at https://github.com/makovalab-psu/heterochromatin, last accessed July 05, 2019.

repeat was also identified in orangutan, chicken, maize, sea urchin, and Daphnia (Grady et al. 1992; Flynn et al. 2017); however, its variation among great ape species had never been studied.
The telomeric (TTAGGG)n satellite functions to maintain genome stability; telomere loss is correlated with cell division and aging (Lanza et al. 2000; Rizvi et al. 2014). StSats present in the genomes of chimpanzee, bonobo, and gorilla (Royle et al. 1994) localize proximal to telomeres (Royle et al. 1994; Koga et al. 2011; Ventura et al. 2012) and were proposed to play a role in telomere metabolism (Novo et al. 2013) and in meiotic telomere clustering, which is important for homolog recognition and pairing (in a process similar to that identified in plants; Bass et al. 2000; Calderon et al. 2014).

In this study, we characterize turnover of satellites with repeat units ≤50 bp among six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan), which diverged <15 My ago (Goodman et al. 2005). We focus on repeats that constitute portions of long arrays of satellite DNA and use them as a proxy for heterochromatin (Wei et al. 2014). This approximation is needed because heterochromatin is difficult to identify directly: it is transient, varying among the cells of an individual and over that individual's lifetime. In this manuscript, we first identify satellite repeats in short sequencing reads gene (...truncated)
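To make the notion of a repeat unit and its density in unassembled reads concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the pipeline used in the study: the function names (`canonical_motif`, `perfect_arrays`, `repeat_density`), the `min_copies` threshold, and the definition of density as repeat bases per million sequenced bases are all hypothetical choices made here. The sketch treats rotations and reverse complements of a unit as the same motif, so that (AATGG)n, (ATGGA)n, and (CCATT)n are counted together, and it detects only perfect back-to-back copies, whereas the arrays described in the abstract also contain diverged copies.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_motif(unit: str) -> str:
    """Pick one representative for a repeat unit: the lexicographically
    smallest string among all rotations of the unit and of its reverse
    complement (an illustrative convention, assumed here)."""
    unit = unit.upper()
    candidates = []
    for s in (unit, reverse_complement(unit)):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def perfect_arrays(read: str, unit: str, min_copies: int = 3):
    """Return (start, end) intervals where the read contains at least
    min_copies perfect tandem copies of the unit on either strand."""
    unit = unit.upper()
    variants = sorted({unit, reverse_complement(unit)})
    pattern = "|".join(
        "(?:%s){%d,}" % (re.escape(v), min_copies) for v in variants
    )
    # finditer yields non-overlapping matches, so no base is counted twice.
    return [(m.start(), m.end()) for m in re.finditer(pattern, read.upper())]


def repeat_density(reads, unit: str, min_copies: int = 3) -> float:
    """Bases inside perfect arrays per million sequenced bases
    (a hypothetical definition of density for this sketch)."""
    total = sum(len(r) for r in reads)
    covered = sum(
        end - start
        for r in reads
        for start, end in perfect_arrays(r, unit, min_copies)
    )
    return 1e6 * covered / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGT" + "AATGG" * 4 + "ACGT", "CCATT" * 3 + "GGGG"]
    print(canonical_motif("CCATT"))
    print(perfect_arrays(reads[0], "AATGG"))
    print(repeat_density(reads, "AATGG"))
```

Computing such a density for each abundant motif in each individual would give a per-individual vector of values; comparing those vectors is one way individuals could be grouped, and comparing male and female values for a motif is one way a sex bias could be spotted.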


This is a preview of a remote PDF: https://academic.oup.com/mbe/article-pdf/36/11/2415/30193281/msz156.pdf
Article home page: https://academic.oup.com/mbe/article/36/11/2415/5526925

Cechova, Monika, Harris, Robert S, Tomaszkiewicz, Marta, Arbeithuber, Barbara, Chiaromonte, Francesca, Makova, Kateryna D. High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies, Molecular Biology and Evolution, 2019, pp. 2415-2431, Volume 36, Issue 11, DOI: 10.1093/molbev/msz156