Cross-species comparison of site-specific evolutionary-rate variation in influenza haemagglutinin
Austin G. Meyer
Eric T. Dawson
Claus O. Wilke
Section of Integrative Biology, Institute for Cellular and Molecular Biology, and Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, TX 78731, USA
We investigate the causes of site-specific evolutionary-rate variation in influenza haemagglutinin (HA) between human and avian influenza, for subtypes H1, H3 and H5. By calculating the evolutionary-rate ratio, ω = dN/dS, as a function of a residue's solvent accessibility in the three-dimensional protein structure, we show that solvent accessibility has a significant but relatively modest effect on site-specific rate variation. By comparing rates within HA subtypes among host species, we derive an upper limit to the amount of variation that can be explained by structural constraints of any kind. Protein structure explains only 20–40% of the variation in ω. Finally, by comparing ω at sites near the sialic-acid-binding region to ω at other sites, we show that ω near the sialic-acid-binding region is significantly elevated in both human and avian influenza, with the exception of avian H5. We conclude that protein structure, HA subtype and host biology all impose distinct selection pressures on sites in influenza HA.
Research
1. Introduction
Viral proteins are highly variable at the sequence level; they accumulate amino
acid substitutions at a rapid pace [1,2]. Yet their structures tend to be fairly
conserved. Highly variable surface regions notwithstanding, most viral proteins
need to maintain a specific structure to carry out their function in the viral
life cycle [3]. The generally accepted picture is that sites in the protein core
maintain the overall protein structure and are, therefore, most conserved.
Sites on the surface are less critical to the protein structure and hence more
free to vary, for example in response to selection pressures imposed by
the immune response. This view is based on the finding, replicated in widely
differing organisms and using many different techniques, that, on average,
sequence variability increases the closer a site is located towards the surface
of a protein [4–13]. More specifically, in influenza, exposed sites in
haemagglutinin (HA) and neuraminidase have been found to evolve faster than buried
sites in these proteins [14,15].
Thus, prior work has clearly established that protein structure influences
site variability. What is less clear, however, is the magnitude of this effect. Is
knowing a site is buried sufficient to predict that the site will be evolutionarily
conserved, or are other factors stronger driving forces for site-specific
evolutionary rates? And similarly, will homologous sites in related but distinct viral
strains evolve at similar rates, or do the nature of the viral strain and the
infected host organism impose stronger influences on site-specific evolutionary
rates than the location of a site in the protein structure?
Here, we address these questions for influenza HA. We compare per-site
sequence evolution for two different host species (human and avian) and
three HA subtypes (H1, H3 and H5), and ask the following questions: (i) To
what extent is rate variation determined by the location of a site in the structure,
as measured by the site's relative solvent accessibility (RSA)? (ii) To what extent
is rate variation conserved within HA subtypes among viruses infecting
different host species? (iii) Are ω = dN/dS ratios elevated near the active site (the
sialic-acid-binding region, SABR) of HA? We find that protein
structure, HA subtype and host biology affect rate variation
in influenza HA.
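As a rough illustration of what question (i) asks, the sketch below normalises a residue's accessible surface area to an RSA and computes the fraction of variance in the per-site rate ratio explained by a simple linear fit on RSA. The function names, the choice of a linear fit and the toy values are assumptions made for illustration only; they are not the data or the code used in this study.

```python
from statistics import mean


def relative_solvent_accessibility(asa, max_asa):
    """Return RSA: accessible surface area divided by the maximum
    accessible surface area for that residue type (same units)."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa


def r_squared(x, y):
    """Squared Pearson correlation between x and y, i.e. the fraction of
    variance in y explained by a simple linear regression on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    return sxy ** 2 / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-site values, invented for this example.
    rsa = [0.0, 0.1, 0.3, 0.5, 0.8]
    omega = [0.05, 0.10, 0.08, 0.30, 0.25]
    print(f"variance explained: {r_squared(rsa, omega):.2f}")
```

A value close to 1 would mean that burial alone predicts the rate ratio at a site; a value well below 1 would leave room for other factors such as subtype or host.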
2. Material and methods
(a) Sequence preparation
We obtained sequences for HA subtypes H1, H3 and H5 for
human and avian hosts from the Influenza Research Database
[16]. Using the built-in curating tools of the database, we carefully
selected subsets of sequences that corresponded as much as
possible to well-defined and distinct viral populations. Sequences were
curated separately for each host species, with criteria that depended on the subtype. In
particular, for each combination of HA subtype and host species,
we considered only sequences that could be linked to a specific
neuraminidase subtype.
Human H1 sequences were obtained from H1N1 strains
isolated between 1977 (after the Fort Dix outbreak) and 2008 (before
the 2009 flu pandemic). H1N1 strains since 2009 are not direct
descendants of H1N1 strains before 2009 and thus were
excluded. We found 2057 distinct H1 sequences. Human H3
sequences were obtained from H3N2 strains isolated between
1968 and 2012. We found 8315 distinct sequences. Human H5
sequences were obtained from H5N1 sequences without date
restriction. We found 297 distinct sequences.
Avian sequences were curated by subtype with no
restrictions placed on the date range; full datasets from FluDB of
H1N1, H3N2 and H5N1 sequences were used. We found 106,
115 and 2684 distinct sequences, respectively.
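The selection described above can be sketched as a filter over sequence records followed by removal of duplicate sequences. This is a minimal sketch and not the database's own curation tools: the `Record` fields, the `select_distinct` function and the inclusive treatment of the year bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical sequence record; the field names are assumptions."""
    strain: str
    host: str      # e.g. "Human" or "Avian"
    subtype: str   # e.g. "H1N1", "H3N2", "H5N1"
    year: int      # year of isolation
    seq: str       # HA sequence


def select_distinct(records: Iterable[Record], host: str, subtype: str,
                    first_year: Optional[int] = None,
                    last_year: Optional[int] = None) -> List[Record]:
    """Keep records matching host and subtype within an optional,
    inclusive year range, retaining the first copy of each sequence."""
    seen = set()
    kept = []
    for rec in records:
        if rec.host != host or rec.subtype != subtype:
            continue
        if first_year is not None and rec.year < first_year:
            continue
        if last_year is not None and rec.year > last_year:
            continue
        if rec.seq in seen:
            continue  # identical sequence already kept
        seen.add(rec.seq)
        kept.append(rec)
    return kept


if __name__ == "__main__":
    # Invented toy records, not real isolates.
    toy = [
        Record("s1", "Human", "H1N1", 1980, "MKAIL"),
        Record("s2", "Human", "H1N1", 1995, "MKAIL"),   # duplicate sequence
        Record("s3", "Human", "H1N1", 2009, "MKTIL"),   # outside date range
        Record("s4", "Avian", "H1N1", 1990, "MKVIL"),
    ]
    human_h1 = select_distinct(toy, "Human", "H1N1", 1977, 2008)
    avian_h1 = select_distinct(toy, "Avian", "H1N1")  # no date restriction
    print(len(human_h1), len(avian_h1))
```

Leaving both year arguments unset corresponds to the datasets for which no date restriction was applied.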
To align sequences and map them (...truncated)