Clotho: addressing the scalability of forward time population genetic simulation

BMC Bioinformatics, Jun 2015

Background: Forward Time Population Genetic Simulations offer a flexible framework for modeling the various evolutionary processes occurring in nature. Often this model expressibility is countered by increased memory usage or computational overhead. With the complexity of simulation scenarios continuing to increase, addressing the scalability of the underlying simulation framework is a growing consideration.

Results: We propose a general method for representing in silico genetic sequences using implicit data structures. We provide a generalized implementation as a C++ template library called Clotho. We compare the performance and scalability of our approach with those taken in other simulation frameworks, namely FWDPP and simuPOP.

Conclusions: We show that this technique offers a 4x reduction in memory utilization. Additionally, with larger scale simulation scenarios we are able to offer a speedup of 6x to 46x.

Patrick P. Putnam, Philip A. Wilsey, Ge Zhang
Human Genetics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, 45229-3026 Cincinnati, OH, USA
Department of Electrical Engineering and Computing Systems, University of Cincinnati, PO Box 210030, 45221-0030 Cincinnati, OH, USA
Keywords: Population genetic simulation; Data structures; Sequence representation; Scalability
Background
Forward Time Population Genetic Simulations (FTPGS) are essential tools that aid in the study of complex interactions which contribute to the evolutionary process. They enable more efficient study of allele frequency change over time as a result of a set of models that reflect naturally occurring processes such as mutation, recombination, selection, gene flow, and genetic drift.
Over the years, a plethora of Forward Time Population Genetic Simulators have been developed [1]. It is not uncommon to find simulators that perform efficiently for a very specific subset of scenarios in a given domain but fail to provide a broad solution suitable for general use [2, 3]. Several general simulation frameworks [4–6] have been developed to allow users to build their own simulator capable of addressing the scenarios they are interested in. Often, these frameworks lack support for scalable performance to study the larger simulation scenarios which many investigators are pursuing [7].
Scalability refers to the ability of a program to handle an increased amount of work. In software, this is measured in terms of both computational runtime and resource utilization. The scalability of a simulation depends upon many elements. At a high level, a simulation is dependent upon the choice of models, the configurations of those models, and the desired scope of the simulation. Fundamental to all of these is the implementation. For example, if the models are not implemented with scalability in mind, then the scalability of the entire simulation suffers.
The design and implementation of a model is often a challenging problem with potentially many dependencies interacting in various ways. For example, in FTPGS most models being explored depend upon a genetic sequence. As a result, if the representation of a genetic sequence is not scalable, then the entire simulation becomes less scalable.
Impact of genetic sequence representation
We refer to a genetic sequence as the in silico representation of the genetic material specific to each individual in a population. The aim of a simulation is, in effect, to evolve a set of genetic sequences. The various models that are evaluated during a simulation may either modify a genetic sequence or analyze the set of genetic sequences to identify specific characteristics.
As genetic sequences are such an integral component of any FTPGS, their in silico representation plays a significant role in the overall scalability of a simulation. Most FTPGS are constructed considering a genetic sequence as a locus-ordered list of alleles. This design is intuitive as it mirrors that of genetic structures in nature. In general, this common data structure is easily implemented and relatively straightforward to use. Also, the models can take advantage of the ordering to improve their efficiency.
Although most simulators are built using this common structure, they often differ in their computational abstraction of an allele and the subsequent computational optimizations that may result. An allele is generally abstracted as a symbol reflecting a specific state of a locus. From an implementation perspective, there is a choice of how the state should be represented. In some cases, it suffices to set an upper limit on the number of states for every locus. Thus, every locus can be represented as a fixed-length value, or symbol. For example, it may suffice (...truncated)
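The common representation described above, a locus-ordered list of fixed-length allele symbols, can be sketched in C++ as follows. This is only an illustration of that general idea, not Clotho's implicit data structure and not the code of any simulator named here. The names Allele, LocusOrderedSequence, and population_bytes are hypothetical, and the 8-bit symbol width (at most 256 states per locus) is an assumption chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: one fixed-width symbol per locus. An 8-bit symbol caps
// every locus at 256 distinct states.
using Allele = std::uint8_t;

// Hypothetical locus-ordered sequence: element i stores the state of locus i.
struct LocusOrderedSequence {
    std::vector<Allele> alleles;

    // Every locus starts in the same (ancestral) state.
    explicit LocusOrderedSequence(std::size_t n_loci, Allele ancestral = 0)
        : alleles(n_loci, ancestral) {}

    // Constant-time lookup, possible because the list is ordered by locus.
    Allele state_at(std::size_t locus) const {
        assert(locus < alleles.size());
        return alleles[locus];
    }

    // A mutation overwrites the symbol stored at one locus.
    void mutate(std::size_t locus, Allele new_state) {
        assert(locus < alleles.size());
        alleles[locus] = new_state;
    }
};

// Allele storage grows with (number of sequences) x (number of loci),
// whether or not the sequences differ from one another.
std::size_t population_bytes(std::size_t n_sequences, std::size_t n_loci) {
    return n_sequences * n_loci * sizeof(Allele);
}
```

Under these assumptions, 1,000 sequences of 10,000 loci occupy 10,000,000 bytes of allele storage. The linear growth in both dimensions is the reason the choice of sequence representation bears on the scalability of the whole simulation.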


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/s12859-015-0631-z.pdf

Patrick Putnam, Philip Wilsey, Ge Zhang. Clotho: addressing the scalability of forward time population genetic simulation, BMC Bioinformatics, 2015, 16:191, DOI: 10.1186/s12859-015-0631-z