Clotho: addressing the scalability of forward time population genetic simulation
Putnam et al. BMC Bioinformatics
Patrick P. Putnam 0 1 2 3
Philip A. Wilsey 1 3
Ge Zhang 0 2 3
0, 2 Human Genetics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, 45229-3026 Cincinnati, OH, USA
1, 3 Department of Electrical Engineering and Computing Systems, University of Cincinnati, PO Box 210030, 45221-0030 Cincinnati, OH, USA
Background: Forward Time Population Genetic Simulations offer a flexible framework for modeling the various evolutionary processes occurring in nature. This expressiveness often comes at the cost of increased memory usage or computational overhead. As simulation scenarios continue to grow in complexity, the scalability of the underlying simulation framework is a growing concern.
Results: We propose a general method for representing in silico genetic sequences using implicit data structures. We provide a generalized implementation as a C++ template library called Clotho. We compare the performance and scalability of our approach with those of two other simulation frameworks, namely FWDPP and simuPOP.
Conclusions: We show that this technique offers a 4x reduction in memory utilization. Additionally, for larger-scale simulation scenarios, we are able to offer a speedup of 6x to 46x.
Keywords: Population genetic simulation; Data structures; Sequence representation; Scalability
Background
Forward Time Population Genetic Simulations (FTPGS)
are essential tools that aid in the study of complex
interactions that contribute to the evolutionary process. They
enable efficient study of how allele frequencies change
over time under a set of models that reflect naturally
occurring processes such as mutation, recombination,
selection, gene flow, and genetic drift.
Over the years, a plethora of Forward Time
Population Genetic Simulators have been developed [1]. It is not
uncommon to find simulators that perform efficiently for
a very specific subset of scenarios in a given domain but
fail to provide a broad solution suitable for general use
[2, 3]. Several general simulation frameworks [4–6] have
been developed to allow users to build their own
simulator capable of addressing the scenarios they are interested
in. Often, these frameworks do not scale well to
the larger simulation scenarios that
many investigators are pursuing [7].
Scalability refers to the ability of a program to handle an
increased amount of work. In software, this is measured
in terms of both computational runtime and resource
utilization. The scalability of a simulation depends upon
many elements. At a high level, a simulation is dependent
upon the choice of models, configurations of those
models, and the desired scope of the simulation. Fundamental
to all of these is the implementation. For example, if the
models are not implemented with scalability in mind, then
the scalability of the entire simulation suffers.
The design and implementation of a model is often a
challenging problem with potentially many dependencies
interacting in various ways. For example, in FTPGS most
models being explored depend upon a genetic sequence.
As a result, if the representation of a genetic sequence
is not scalable, then the entire simulation becomes less
scalable.
Impact of genetic sequence representation
We refer to a genetic sequence as the in silico
representation of the genetic material specific to each individual in a
population. The aim of a simulation is to, in effect, evolve
a set of genetic sequences. The various models that are
evaluated during a simulation may either work to modify
a genetic sequence, or analyze the set of genetic sequences
to identify specific characteristics. As genetic sequences
are such an integral component of any FTPGS, their in
silico representation plays a significant role in the overall
scalability of a simulation.
Most FTPGS are constructed considering a genetic
sequence as a locus-ordered list of alleles. This design is
intuitive as it mirrors that of genetic structures in nature.
In general, this common data structure is easy to
implement and relatively straightforward to use. Also,
the models can take advantage of the ordering to improve
their efficiency. Although most simulators are built using
this common structure, they often differ in their
computational abstraction of an allele and the subsequent
computational optimizations that may result.
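To make the locus-ordered design concrete, the following is a minimal C++ sketch, not code from Clotho, FWDPP, or simuPOP. It assumes a one-byte allele symbol and uses hypothetical names (Allele, Sequence, crossover, differences). It shows how a model such as recombination can rely on the ordering: a single-point crossover reduces to copying a prefix from one parent and a suffix from the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only.
using Allele = std::uint8_t;           // assumed fixed-width allele symbol
using Sequence = std::vector<Allele>;  // index i holds the allele at locus i

// Single-point crossover: loci [0, breakpoint) come from `a`, the rest from `b`.
// Assumes both parents cover the same loci in the same order.
Sequence crossover(const Sequence& a, const Sequence& b, std::size_t breakpoint) {
    assert(a.size() == b.size() && breakpoint <= a.size());
    const auto offset = static_cast<std::ptrdiff_t>(breakpoint);
    Sequence child(a.begin(), a.begin() + offset);
    child.insert(child.end(), b.begin() + offset, b.end());
    return child;
}

// Count the loci at which two equally long sequences carry different alleles.
std::size_t differences(const Sequence& a, const Sequence& b) {
    assert(a.size() == b.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++count;
    }
    return count;
}
```

In this dense layout every individual stores one symbol per locus, so memory grows with the number of individuals times the number of loci, which is why the choice of representation bears directly on scalability.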
An allele is generally abstracted as a symbol
reflecting a specific state of a locus. From an implementation
perspective, there is a choice of how the state should be
represented. In some cases, it suffices to set an upper limit
on the number of states for every locus. Thus, every locus
can be represented as a fixed-length value, or symbol.
For example, it may suffice (...truncated)