A genomic scale map of genetic diversity in Trypanosoma cruzi
Alejandro A Ackermann
Leonardo G Panunzi
Raul O Cosentino
Daniel O Sánchez
Fernán Agüero
Equal contributors
Instituto de Investigaciones Biotecnologicas - Instituto Tecnologico de Chascomus (IIB-INTECH), Universidad Nacional de San Martin - Consejo de Investigaciones Cientificas y Tecnicas (UNSAM-CONICET), Sede San Martin, B 1650 HMP, San Martin, Buenos Aires, Argentina
Background: Trypanosoma cruzi, the causal agent of Chagas Disease, affects more than 16 million people in Latin America. The clinical outcome of the disease results from a complex interplay between environmental factors and the genetic background of both the human host and the parasite. However, knowledge of the genetic diversity of the parasite is currently limited to a number of highly studied loci. The availability of a number of genomes from different evolutionary lineages of T. cruzi provides an unprecedented opportunity to look at the genetic diversity of the parasite at a genomic scale.
Results: Using a bioinformatic strategy, we have clustered T. cruzi sequence data available in the public domain and obtained multiple sequence alignments in which one or two alleles from the reference CL-Brener were included. These data cover 4 major evolutionary lineages (DTUs): TcI, TcII, TcIII, and the hybrid TcVI. Using this set of alignments we have identified 288,957 high-quality single nucleotide polymorphisms (SNPs) and 1,480 indels. In a reduced re-sequencing study we were able to validate ~97% of high-quality SNPs identified in 47 loci. Analysis of how these changes affect encoded protein products showed a 0.77 ratio of synonymous to non-synonymous changes in the T. cruzi genome. We observed 113 changes that introduce or remove a stop codon, some causing significant functional changes, and a number of tri-allelic and tetra-allelic SNPs that could be exploited in strain typing assays. Based on an analysis of the observed nucleotide diversity we show that the T. cruzi genome contains a core set of genes that are under apparent purifying selection. Interestingly, orthologs of known druggable targets show statistically significant lower nucleotide diversity values.
Conclusions: This study provides the first look at the genetic diversity of T. cruzi at a genomic scale. The analysis covers an estimated ~60% of the genetic diversity present in the population, providing an essential resource for future studies on the development of new drugs and diagnostics for Chagas Disease. These data are available through the TcSNP database (http://snps.tcruzi.org).
Background
Trypanosoma cruzi is a protozoan parasite of the order
Kinetoplastida, and the causative agent of Chagas
Disease, one of the so-called neglected diseases that
disproportionately affect the poor. The disease is endemic
in most Latin American countries, affecting in excess of
8 million people [1]. Chagas disease has a variable clinical
outcome. In its acute form it can lead to death (mostly in
infants), while in its chronic form, it is a debilitating disease
producing different associated pathologies: mega-colon,
mega-esophagus and cardiomyopathy, among others. These
different clinical outcomes are the result of a complex
interplay between environmental factors, the host genetic
background and the genetic diversity present in the parasite
population. Accordingly, these different clinical
manifestations have been suggested to be due, at least in part, to the
genetic diversity of T. cruzi [2-5].
The T. cruzi species has a structured population, with a
predominantly clonal mode of reproduction [6], and a
considerable phenotypic diversity [7-10]. Through the use of a
number of molecular markers, the population has been
divided into a number of evolutionary lineages, also called
discrete typing units (DTUs). Some markers allow the distinction of
two or three major lineages [11-14], while other
experimental strategies, such as random amplified polymorphic DNA (RAPD)
and multilocus isoenzyme electrophoresis (MLEE), support the distinction of six
subdivisions [15-17] originally designated as DTUs I, IIa, IIb,
IIc, IId, and IIe [16]. Recently, this nomenclature was
revised as follows: TcI, TcII (formerly IIb), TcIII (IIc), TcIV
(IIa), TcV (IId) and TcVI (IIe) [18,19]. Lineages TcV
and TcVI (which include the strain used for the first
genomic sequence of T. cruzi, CL-Brener) have a very high
degree of heterozygosity but otherwise very homogeneous
population structures with low intralineage diversity
[20,21]. The currently favoured hypothesis suggests that
these two lineages originated after either one or two
independent hybridization events between strains of DTUs TcII
and TcIII [21-23].
Knowledge of the genetic variation present in a genome
(i.e. between the two alleles of a diploid individual) or in a
species (i.e. in the population) is of central importance for a
variety of reasons and applications: i) to understand the
evolutionary forces underlying the biological and
phenotypic properties observed in an individual; ii) to detect cases
of apparent horizontal gene transfer; iii) to assess the
potential for development of resistance when validating a target
for drug development; iv) to prioritize targets for
development of diagnostics or vaccines; v) in the design of
constructs for genetic knockout experiments in order to
increase the success rate when targeting specific alleles;
and vi) as genetic markers in association studies or to
further probe the population structure.
The genome sequence of the CL-Brener clone of
T. cruzi was published in 2005 [24], together with
those of two other trypanosomatids of medical
importance: Trypanosoma brucei (Sleeping sickness, African
trypanosomiasis) [25] and Leishmania major
(Leishmaniasis) [26]. However, the genome of T. cruzi was a particular
case for a number of reasons: it was obtained from a hybrid
TcVI strain composed of two divergent parental haplotypes;
and it was sequenced using a whole genome shotgun
strategy [24]. This choice of strain and sequencing strategy
resulted in high sequence coverage from the two parental
haplotypes, which were derived from ancestral TcII and
TcIII strains. Because of the high allelic variation found
within this diploid genome, a significant number of contigs
were found to be present twice in the assembly [24]. These
divergent haplotypes, which were assembled separately in
many cases, were the basis of a recent re-assembly of the
genome [27]. As a consequence, it is now possible to
identify the genetic diversity present within this diploid
genome.
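As an illustration of what identifying diversity within a diploid genome involves, the following minimal Python sketch compares two pre-aligned allele sequences column by column and reports SNPs and indels. It is not the pipeline used in this study: the function name compare_alleles, the gap character and the toy sequences are assumptions made for the example, and ambiguity codes and base quality are ignored.

```python
def compare_alleles(allele_a, allele_b, gap="-"):
    """Compare two pre-aligned allele sequences (hypothetical helper).

    Returns (snps, indels) where
      snps   = [(column, base_a, base_b), ...]
      indels = [(start_column, length, gapped_allele), ...]
    Columns are 0-based positions in the alignment.
    """
    if len(allele_a) != len(allele_b):
        raise ValueError("aligned sequences must have the same length")

    snps, indels = [], []
    run_start, run_side = None, None  # current run of gap columns, if any

    for col, (a, b) in enumerate(zip(allele_a.upper(), allele_b.upper())):
        # Which allele, if any, carries a gap in this column?
        if a == gap and b != gap:
            side = "a"
        elif b == gap and a != gap:
            side = "b"
        else:
            side = None

        # Close the previous gap run when the gap state changes.
        if side != run_side:
            if run_side is not None:
                indels.append((run_start, col - run_start, run_side))
            run_start, run_side = col, side

        # A SNP is a column where both alleles have a base and they differ.
        if side is None and a != b:
            snps.append((col, a, b))

    # Close a gap run that extends to the end of the alignment.
    if run_side is not None:
        indels.append((run_start, len(allele_a) - run_start, run_side))

    return snps, indels


# Toy example with made-up sequences (not real T. cruzi data).
snps, indels = compare_alleles("ATGCC-TAGGA", "ATGTCATA--A")
print(snps)
print(indels)
```

Consecutive gap columns in the same allele are merged into a single indel, so an indel is counted once regardless of its length.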
More recently, a number of whole genome sequencing
datasets have become available from different strains of
T. cruzi: the draft genomic sequence of the Sylvio X10
(TcI) strain [28], high-coverage transcriptomic data
from another TcI strain (Westergaard G, and Vazquez
MP, manuscript in preparation), as well as 2.5X whole genome
shotgun (WGS) data from the Esmeraldo cl3 (TcII) strain.
To take advantage of the hybrid genome of the
CL-Brener strain, and of other genome and transcriptome
da (...truncated)