MosaicSolver: a tool for determining recombinants of viral genomes from pileup data
Graham R. Wood 1
Eugene V. Ryabov 0
Jessica M. Fannon 0
Jonathan D. Moore 1
David J. Evans 0
Nigel Burroughs 1
0 School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
1 Warwick Systems Biology Centre, Senate House, University of Warwick, Coventry, CV4 7AL, UK
Viral recombination is a key evolutionary mechanism, aiding escape from host immunity, contributing to changes in tropism and possibly assisting transmission across species barriers. The ability to determine whether recombination has occurred and to locate the associated recombination junctions is thus of major importance in understanding emerging diseases and pathogenesis. This paper describes a method for determining recombinant mosaics (and their proportions) originating from two parent genomes, using high-throughput sequence data. The method involves setting the problem geometrically and using appropriately constrained quadratic programming. To illustrate the method, recombinants of the honeybee deformed wing virus and the Varroa destructor virus-1 are inferred from both siRNAs and reads sampling the viral genome population (cDNA library); our results are confirmed experimentally. Matlab software (MosaicSolver) is available.
INTRODUCTION
Recombination provides a mechanism for the rapid
evolution of viruses, being implicated in the emergence of many
recent pathogenic viral strains in public health and
agriculture. A recombinant event has been implicated as a primary
cause of recent outbreaks of avian influenza (1,2), honeybee
population decline is associated with a deformed wing virus
(DWV) recombinant (3,4) and current global potato crop
devastation is caused by the highly pathogenic Y NTN virus
strain (5,6). Further, human immunodeficiency virus
continues to evolve with recombinants now predominating in
many geographical areas, complicating control measures (7),
whilst recombination has also become a focus as a
potential risk factor in the use of live attenuated virus vaccines
(8). These are all examples of virulence shifts, the
recombined virus acquiring new capabilities such as escape from
the immune system, drug resistance, increased transmission
rates, changes in tissue tropism or acquisition of novel host
tropism allowing cross-species transmission. Despite these
evolutionary advantages, a recent review (9) suggests that
recombination of ribonucleic acid (RNA) viruses may not
be a selected trait but a by-product of the RNA polymerase
mechanism.
Recombination is mediated through co-infection of a cell
and can in principle occur anywhere along the genome,
although recombination points do have preferred hotspots
(10–12). For instance, recombination in poliovirus was
shown to be associated with RNA structure and exhibits
a GC content bias over an infection cycle (11), whilst
protein incompatibility and selection pressure on regulatory,
maturation or associated protein functions are likely to add
a further layer of selection for the location of
recombination points, producing the well-known bias between
structural and nonstructural genes (10). Furthermore, recent
evidence indicates that the recombination mechanism is
biphasic, involving distinct crossover and resolution events (12).
Mapping these locations is vital for identifying the
determinants of recombination and understanding the
characteristics of emergent strains. Identification of recombinants
within a population of mixed viral genomes, together with
their abundance, is thus a problem of fundamental
significance.
Detection of recombinants, especially when there is no
prior knowledge of recombination junctions (which would
allow construction of suitable primers), is difficult,
particularly if more than one recombinant progeny form is present.
Next-generation sequencing (NGS) approaches provide a
new opportunity to perform this task; new challenges arise
however, particularly in the reconstruction of the
underlying genomes from small sequences [typically less than 100
nucleotides (nt)]. In this paper, we present a novel approach
to identify, characterize, quantify and assess the
statistical significance of recombinant genomes in NGS sampling
of population mixtures. Throughout we assume that the
parent viral genomes can be globally aligned and that any
recombination involves exchange of homologous regions.
The current work was motivated by ongoing
investigations into honeybees (Apis mellifera) infested with a
parasitic mite (Varroa destructor). The latter acts as a vector
for a range of pathogenic viruses (13–15), the most
important of which (both in terms of the individual honeybee and
the penetration of colonies in the UK) are viruses related to
the deformed wing virus (DWV-like viruses), which include
DWV itself and its relative Varroa destructor virus-1
(VDV-1), which share 84% nt (95% amino acid) identity. The
latter was first extracted from Varroa mites (16). High levels of
DWV-like viruses are associated primarily with deformed
wings, including atrophied wing development and
abdominal stunting (17). DWV-like viruses are endemic in
honeybees worldwide, usually being asymptomatic, with the virus
presumably being controlled and thus not reaching
harmful levels; however, it has been reported to be responsible
for overwintering colony demise, although the cause of the
shift from a benign to a pathogenic infection is unknown.
Co-infection of either the host honeybee or the mite with
DWV and VDV-1 may result in the formation of
recombinants between the two viruses. Such recombinants could
accumulate to high levels and it is hypothesized that one or a
very limited range of such recombinant forms is responsible
for colony demise (3,4,18). Thus, ascertaining the
recombinant profile within a population is a problem of key
significance to food security. Different recombinants of DWV and
VDV-1 strains have been reported (3,19). This makes the
identification of DWV/VDV-1 recombinants a good system
for the development of methods for recombinant
identification, especially as mixed infections (parental and
recombinant genomes) are present in the same individual.
As part of the analysis of the virological consequences
of infesting Varroa-free colonies with mites we acquired
two types of high-throughput sequence data, specifically
sequencing of small interfering RNAs (siRNA;
single-stranded RNAs that were generated as a result of the action
of several components of the honeybee RNAi pathway) and
short reads from the viral genome population [amplified
complementary deoxyribonucleic acid (cDNA)], both
extracted from Varroa-exposed, high viral load pupae. These
independently generated data sets allowed us to investigate
the development of a method to disentangle recombinant
populations using both short, 21–22-nt reads (siRNA) and
long, around 100-nt (cDNA) reads. These data, arising from
parent genomes and potential recombinants within the
viral population, allow the relative abundance of DWV and
VDV-1 reads to be determined in any continuous (...truncated)