Learning directed acyclic graphs from large-scale genomics data

EURASIP Journal on Bioinformatics and Systems Biology, Sep 2017

In this paper, we consider the problem of learning the genetic interaction map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double-knockout (DK) data. Based on a set of well-established biological interaction models, we detect and classify the interactions between genes. We propose a novel linear integer optimization program called the Genetic-Interactions-Detector (GENIE) to identify the complex biological dependencies among genes and to compute the DAG topology that matches the DK measurements best. Furthermore, we extend the GENIE program by incorporating genetic interaction profile (GI-profile) data to further enhance the detection performance. In addition, we propose a sequential scalability technique for large sets of genes under study, in order to provide statistically significant results for real measurement data. Finally, we show via numeric simulations that the GENIE program and the GI-profile data extended GENIE (GI-GENIE) program clearly outperform the conventional techniques and present real data results for our proposed sequential scalability technique.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:

https://link.springer.com/content/pdf/10.1186%2Fs13637-017-0063-3.pdf

Learning directed acyclic graphs from large-scale genomics data

Nikolay et al. EURASIP Journal on Bioinformatics and Systems Biology Learning directed acyclic graphs from large-scale genomics data Fabio Nikolay 0 Marius Pesavento 0 George Kritikos 1 Nassos Typas 1 0 Communication Systems Group, TU Darmstadt , Merckstr. 25, Darmstadt , Germany 1 European Molecular Biology Laboratory , Heidelberg, Meyerhofstraße 1, 69117 Heidelberg , Germany In this paper, we consider the problem of learning the genetic interaction map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double-knockout (DK) data. Based on a set of well-established biological interaction models, we detect and classify the interactions between genes. We propose a novel linear integer optimization program called the Genetic-Interactions-Detector (GENIE) to identify the complex biological dependencies among genes and to compute the DAG topology that matches the DK measurements best. Furthermore, we extend the GENIE program by incorporating genetic interaction profile (GI-profile) data to further enhance the detection performance. In addition, we propose a sequential scalability technique for large sets of genes under study, in order to provide statistically significant results for real measurement data. Finally, we show via numeric simulations that the GENIE program and the GI-profile data extended GENIE (GI-GENIE) program clearly outperform the conventional techniques and present real data results for our proposed sequential scalability technique. Genetic interaction analysis; Large-scale gene networks; Discrete optimization; Graph learning; Big data; Multiple hypothesis test 1 Introduction Genetic interaction analysis aims at uncovering the interactions among a set of genes with respect to a specified cell function of a biological system, e.g., the fitness of a specific bacteria colony. The interactions among the genes under study can be characterized by a directed acyclic graph (DAG) [ 1 ] where the hierarchical relationship among two genes of a DAG describes their hierarchical interaction type [ 2 ]. However, DAGs cannot be observed directly but only the specified cell function under study which yields observable phenotypes. The term phenotype generally describes the specific manifestation of a biological attribute of an organism which can be observed, e.g., for bacteria, a common biological attribute is the growth measured in colony size, where a specific size of the bacteria colony is a phenotype of this biological attribute. The role of the studied genes in the cell machinery and the hierarchical interaction types of the genes, as well as the DAG, which describes the latter ones, can only be learned by means of knockout experiments where a gene or a set of genes is functionally switched off and the phenotype is observed. Traditionally, only single-knockout (SK) experiments have been conducted but those mainly provide evidence on the importance of a single gene for the investigated cell process and do not convey much information about the interaction among the genes under study. Recently, with the technological advances in microarrays and the development of the synthetic genetic array technologies [ 3 ], new approaches have been taken that are based on large-scale knockout experiments of pairs of genes. Such double-knockout (DK) experiments are much more powerful for exploring genetic interactions since a DK phenotype of an arbitrary pair of genes generally differs considerably from the superposition of the corresponding SK phenotypes of this pair of genes. According to [ 2 ], the gene pairs can be classified into one out of five hierarchical relationship classes based on their SK and DK phenotypes. Further, based on the hierarchical relationship classes, the DAG underlying the observed SK and DK phenotypes can be inferred which directly reflects the genetic interactions among the genes. In order to detect the DAG underlying the SK and DK phenotypes, a variety of statistical methods based on scoring the measurements or on thresholding the genetic interaction (GI)-profile data, which is commonly based on Pearson correlation of the SK and DK phenotypes [ 4–9 ], respectively, have been developed. However, methods as presented in [ 4–9 ] have three considerable disadvantages: (D1) they show poor performance in detecting the DAG underlying the observed SK and DK phenotypes; (D2) they have no ability to combine different types of side information, e.g., GI-profile data with SK and DK phenotypes, to enhance the detection quality; and (D3) they cannot make use of prior knowledge in order to enhance the DAG detection quality. Especially, the ability to overcome the disadvantage in (D2) will become more important in the future, since there is a constantly increasing amount of different data types, e.g., SK and DK phenotypes, Pearson correlation-based GIprofile data, and other types of GI-profile data, freely available. Furthermore, the ability to overcome the deficit in (D3), i.e., to incorporate a priori knowledge about the existing results in genomics research into the DAG detection procedure, is also of great significance, since existing functional relationships among genes are increasingly better understood based on a variety of studies that constantly extend the knowledge on the cell machinery and molecular biology. Although exhibiting the abovementioned disadvantages (D1) to (D3), methods as those presented in [ 4–9 ] are the most commonly used methods to detect the DAG underlying the measured SK and DK data. Therefore, we propose the Genetic-InteractionsDetector (GENIE) program, that is an approach based on the biological system model of [2] with which it is possible to overcome the abovementioned shortcomings of the most popular methods as those reported in [ 4–9 ]. Since the hierarchical relationship classes are mutually dependent, classifying each pair of genes to a specific hierarchical relationship class corresponds to a multihypothesis test. Thus, we formulate this multi-hypothesis test as a linear integer optimization program [ 10–15 ] in order to find the set of hierarchical relationship classes, best matching the observed SK and DK phenotypes. Based on the detected set of hierarchical relationship classes, the set of edges of the DAG which reflects the interactions among the genes can be computed. Furthermore, we propose the GI-GENIE program where we advance the proposed GENIE program by incorporating GI-profile data, e.g., GI-profile data based on Pearson correlation of the observed SK and DK phenotypes, into the DAG detection procedure. Due to incomplete knowledge about the true dependencies among the very most sets of genes, i.e., the true DAG of a set of genes with respect to a specific cell function is unknown or only partially known for almost all sets of genes irrespectively of the cell function under study, there is a strong interest in the genomics research community in statistically reliable statements about the topology of the DAGs underlying large sets of genes, i.e., for the empirical probability of a pair of genes to interact with each other. Towards this aim, we propose a sequential technique based on the GENIE/GI-GENIE algorithms that yields statistically significant statements about the interactions among genes from a large set of genes under study. This paper is organized as follows. We first summarize the biological system model of [ 2 ] in Section 2, and then, we present in Section 3 the GENIE program for detecting the set of hierarchical relationship classes, which represents a valid DAG and matches the DK measurements best. In Section 4, we extend the GENIE program with GI-profile data (GI-GENIE). In Section 5, we present our scalability approach in order to obtain statistically significant results for large sets of genes. Following Section 5, we present results for simulated data which demonstrate the performance of the GENIE and the GIGENIE methods in Section 6. Furthermore, in Section 6, we display real data results for the scalability approach described in Section 5. Finally, we summarize in Section 7 the key parts of this paper and give a brief outlook on future work. 2 System model In this section, we provide a mathematical description of a DAG as well as its biological implications. Furthermore, we introduce the common biological terms and provide a compact description of the genetic interaction model of [ 2 ] including simple explanations on how to read and interpret a DAG. 2.1 Graph properties of a DAG According to [ 16 ], a graph A = (V(A), E(A)) is well defined by a set of nodes V(A) = {a1, a2, ..., aA} and a set of edges E(A) = {{a1, aA, } , {a2, aA, } , ..., {aA, a1, }} where ai, aj, for ai, aj ∈ V(A) denotes a directed edge from ai to aj and cardinality |V(A)| = A denotes the number of elements of set V(A). The operators V(·) and E(·) applied to graph A yield the set of nodes V(A) and the set of edges E(A) respectively. We mostly address sets V(A) and E(A) by GA and EA, respectively, for the sake of notational convenience, i.e., A = (GA, EA). Assume that there is a path P from node ai ∈ GA to node aj ∈ GA in graph A, i.e., a directed connection from node ai ∈ GA to node aj ∈ GA. Then, path P is described by the concatenation of nodes being passed through on the way from node ai ∈ GA to node aj ∈ GA, i.e., P = ai...aj and V(P) = ai, ..., aj denotes the set of nodes of path P [ 16 ]. The functional dependencies among a set of genes G = g1, ..., gG , with G = |G| elements, for a given cell process and specie can be characterized by a genetic interaction map (GI map,[ 17–20 ]) which is essentially a DAG with a common root node, i.e., the reporter level R, [21]. In particular, an arbitrary DAG D can be described as a graph D = (GD, ED) with the set of nodes GD = {G ∪ R} and the set of directed edges ED = gi, gj , ..., gj, gl . As the genetic interactions can only be observed through the reporter, all edges are always orientated in such a way that each path parting from any arbitrary gene gi ∈ G always terminates in the root node R and any gene appears on the path at most once, i.e., there exist no cycles in the graph. Hence, the DAG D is always connected via its root node R. For the sake of notational convenience, in most cases, we write gene i when addressing gene gi, [ 21 ]. The reporter node R is an artificial node, i.e., not a gene, in the concept of a DAG and represents the measured phenotype of the specific cell process under study. To provide a better understanding of the information encoded in a DAG, we state a simple example, which is similar to the one in [ 21 ], based on DAG D0 displayed in Fig. 1. In D0, there exists an direct edge from gene i0 to gene j0, i.e., i0, j0 ∈ ED0 , which indicates that the activity of gene i0 controls the activity of gene j0. Hence, gene i0 only affects the phenotype via gene j0 and not directly. We emphasize that in this model, the existence of edge i0, j0 in the DAG only describes the hierarchical functional dependency between genes i0 and j0 and not the quantitative effect of gene i0 on gene j0. 2.2 Biological interaction model Let us denote R(i) ∈ R as the phenotype for a single gene i ∈ G functionally disabled. In the same way, we define the phenotype for the DK of genes i, j ∈ G as R(i, j) ∈ R. Let the datasets Ri = {R(i, 1), ..., R(i, G)} and Rj = R(j, 1), ..., R(j, G) contain all DK phenotypes involving genes i, j ∈ G. The GI-profile data ρ(i, j) for genes i, j ∈ G can be computed as the Pearson correlation between the samples of the datasets Ri and Rj, respectively. We remark that the GI-profile data ρ(i, j) does not have to be separately computed as the Pearson correlation of Ri and Rj, respectively. It is commonly extracted from a database where a priori knowledge about the set of genes under study, i.e., G, is stored. Since the gene pairs i, j and j, i are identical, it is sufficient to consider only gene pairs i, j ∈ G : j > i. Throughout this paper, we mostly omit the specification that j is greater than i for notational convenience. In genomics research, it is a common assumption that if there is an edge between two genes i, j in DAG D, i.e., there is an interaction between genes i, j in DAG D, then the GI-profile ρ(i, j) is very likely to be high. Furthermore, according to [ 2 ], we assume that each pair of genes i, j belongs to exactly one out of five hierarchical relationship classes that are characterized in Fig. 2. The hierarchical relationship classes k ∈ K = {1, ..., 5} are defined according to the model μk(i, j) in which the single-knockout phenotypes R(i) and R(j) are related with the DK phenotype R(i, j). If the gene pair i, j belongs to the hierarchical relationship class k, then the observed DK phenotype R(i, j) is described by the model μk(i, j) provided in Fig. 2. We remark that the five hierarchical dependency graphs in Fig. 2 do not reflect the absolute adjacency relations, but the hierarchical relations between genes i, j in DAG D. Hence, given that two genes i, j of DAG D are in class k, we cannot conclude that genes i, j are directly arranged in DAG D as displayed by the depiction of class k in Fig. 2. This follows from the fact that the description of the hierarchical relationship classes provided in Fig. 2 only contains relative topology information about two genes i, j in DAG D. In the following and in addition to [ 2 ], we provide, for clarity of presentation, a formal description of the hierarchical relationship classes depicted in Fig. 2 using a graph theoretical representation. Assume that there are I paths Pi,τ , forτ ∈ {1, ..., I}, from gene i to the reporter node R in DAG D and the set Pi containing all such paths is defined as Pi = Pi,1, ..., Pi,I . Furthermore, set Pj = Pj,1, ..., Pj,J contains all J paths from gene j to the reporter node R in DAG D. Given gene pair i, j in DAG D, then pair i, j belongs to the hierarchical relationship class k ∈ K if and only if condition Ck as defined below is satisfied: As stated in condition C1 in (1a), two genes i, j in DAG D belong to the hierarchical relationship class k = 1, if all paths from gene i to the reporter node R pass through gene j. Hence, gene j is always an element of the set of nodes of each path Pi,τ ∈ Pi from gene i to the reporter node R, i.e., j ∈ V(Pi,τ ) for all paths Pi,τ from gene i to the reporter node R. With the same line of argument as used in (1a), two genes i, j in DAG D belong to the hierarchical relationship class k = 2 if condition C2 in (1b) is satisfied. Two genes i, j in DAG D belong to the hierarchical relationship class k = 3 and are considered to be independent from each other if condition C3 in (1c) is satisfied which states that there is no path Pi,τ from gene i to the reporter node R that passes through gene j as well as there is no path Pj,τ˜ from gene j to the reporter node R that passes through gene i. As stated in (1d), two genes i, j in DAG D belong to the hierarchical relationship class k = 4 if there is at least one path Pi,τ from gene i to the reporter node R which does not pass through gene j as well as for all paths Pj,τ˜ ∈ Pj, there is always a path Pi,τ ∈ Pi that is a superpath of the respective Pj,τ˜ ∈ Pj. With the same line of argument as used in (1d), two genes i, j in DAG D belong to the hierarchical relationship class k = 5 if condition C5 in (1e) is satisfied. 2.3 Class coupling—example To illustrate this, let us consider the example DAG D0 of Fig. 1. All paths from gene i0 to node R pass through gene j0, i.e., they are in a linear pathway with gene i0 upwards of gene j0. Thus, the pair of genes i0, j0 belongs to class k = 1. Note that with the same line of argument, we conclude that also genes i0 and l0 belong to relationship class k = 1. Since all paths from gene i0 to the reporter level R do not pass through gene t0 and all paths from gene t0 to the reporter level do not pass through gene i0, genes i0 and t0 belong to the hierarchical relationship class k = 3 as given in Fig. 2, which states that genes i0 and t0 are independent of each other and the DK phenotype amounts to R(i0, t0) = μ3(i0, t0). Finally, let us investigate the hierarchical relation between genes t0 and n0 in DAG D0. Obviously, gene t0 has (at least) one path to node R which does not pass through gene n0, i.e., genes only having paths to R that do not pass through gene n0 do not affect the activity of gene n0. Since there is (at least) one other path from gene t0 to R passing through gene n0, we can reason that genes t0 and n0 belong to class k = 4. Generally, there are strong implications among the hierarchical relationship classes of [ 2 ], i.e., if some pairs belong to a specific class, then this has strong implications for all other pairs. Let us consider the case that DAG D0 was not known and only the hierarchical relationship classes for genes i0 and j0, i.e., genes i0 and j0 belong to class k = 1, as well as the hierarchical relationship class for genes i0 and g0, i.e., genes i0 and g0 belong to class k = 1, were available. By definition of the hierarchical dependency graphs in Fig. 2 and the assumptions that genes i0 and j0 belong to class k = 1 as well as that genes i0 and g0 belong to class k = 1, we conclude that all paths from gene i0 to R pass through genes j0 and g0. Thus, either all paths from gene g0 to R pass through gene j0 or all paths from gene j0 to R pass through gene g0. Consequently, genes j0 and g0 either belong to the hierarchical relationship class k = 1, or k = 2. As we have emphazised by the example above, generally, if the hierarchical relationship class is known for two arbitrary genes i, j as well as for another pair i, l ∈ G : l > i, then this has strong logical implications on the hierarchical relationship classes genes j, l ∈ G : l > j can belong to. Since we can interpret the classification of the pairs of genes i, j, based on their observed SK and DK phenotypes R(i), R(j) and R(i, j), respectively, to exactly one out of the five hierarchical relationship classes as a coupled multi-hypothesis test, we address this problem in Section 3 by a linear integer optimization program. The proposed linear integer optimization program identifies the most consistent set of hierarchical relationship classes, i.e., the set of hierarchical relationship classes that represents a valid DAG and matches best the DK measurements with respect to the logical coupling between the classes. Furthermore, in Section 4, we extend the GENIE program proposed in Section 3 by incorporating GI-profile data in order to jointly detect the most consistent set of hierarchical relationship classes and the corresponding DAG topology. 3 GENIE algorithm In this section, we formulate the problem of classifying the gene pairs i, j into the classes of hierarchical relationships based on the observed SK and DK phenotype values as a linear integer optimization program. Furthermore, we translate the logical implications among the hierarchical relationship classes into constraints that ensure that the detected set of hierarchical relationship classes represents a valid graph. That is, the detected set of hierarchical relationship classes represents a graph which is a DAG as defined in Section 2. Finally, we propose a policy to derive an estimate EˆD of the true set of edges ED of DAG D based on the detected set of hierarchical relationship classes. 3.1 Hierarchical relationship class detection In order to quantify the mismatch between the measured DK phenotypes R(i, j) and the phenotype model μk(i, j) of class k ∈ K according to Fig. 2, under the hypothesis that the gene pairs i, j belong to class k given their respective SK values, we propose a simple quadratic score [ 2, 21 ], as given in Eq. (2) sk(i, j) = R(i, j) − μk(i, j) 2 , k ∈ K ∀i, j :∈ G : j > i (2) We remark that every DAG D can be represented by a set of hierarchical relationship classes which directly corresponds to a set of class-selection variables AD = α1D(i, j), ..., α5D(i, j) . The GENIE algorithm of ∀i,j∈G:j>i classifying the gene pairs i, j into the set of hierarchical relationship classes that represents a valid DAG and matches the observed SK and DK phenotypes best can be formulated as (3) (4a) (4b) (4c) (4d) Let us define the following class-selection variables1 αk(i, j) = 1 if i, j are in class k 0 else k ∈ K, ∀i, j :∈ G : j > i OGENIE : min {αk(i,j)} G ∀i, j ∈ G : j > i |K| k=1 αk(i, j) = 1, αOGENIE (i, j) denotes the solution of program OGENIE in 5 (4) and the set of best matching selection variables AOGENIE corresponds to the most consistent pattern of hierarchical relationship classes. Problem OGENIE in (4) is a linear integer program which can be solved efficiently by BB methods [ 22–29 ]. The objective of problem OGENIE is to minimize the overall mismatch in classifying each gene pair i, j to one out of five hierarchical relationship classes. The constraints in (4b) reflect the binary nature of the class-selection variables αk(i, j), ∀k ∈ K, while (4c) represents a multiple choice constraint to enforce that the gene pairs i, j are only classified to one out of the five hierarchical relationship classes. The set L in (4d) is comprised of additional constraints to ensure that the detected set of selection variables AOGENIE always represents a valid graph, i.e., a DAG. In the following, we exemplarily derive topology constraints in set L. In order to identify the numerous logical dependencies among the class-selection variables αk(i, j), k ∈ K for all i, j ∈ G : j > i, we proceed in the following way. We first fix the assumption that genes i, j belong to class k = 1, i.e., α1(i, j) = 1. Further, we assume that genes i, l ∈ G : l > i belong to class k , i.e., αk (i, l) = 1. Then, we derive the set of classes K that genes j, l ∈ G : l > j can belong to under the assumptions made. In the following, we have formulated the logical dependencies among the selection variables for α1(i, j) = 1, i.e., the case that gene i is linearly upstream of gene j, as linear integer inequalities defined in constraints (5a)–(5e) and summarize them as set L1 L1 = α1(j, l) + α2(j, l) ≥ α1(i, j) + α1(i, l) − 1 α2(j, l) ≥ α1(i, j) + α2(i, l) − 1 α2(j, l) + α3(j, l) + α5(j, l) ≥ α1(i, j) + α3(i, l) − 1 α2(j, l) + α4(j, l) ≥ α1(i, j) + α4(i, l) − 1 α5(j, l) + α2(j, l) ≥ α1(i, j) + α5(i, l) − 1 (5a) (5b) (5c) (5d) (5e) ∀i, j, l ∈ G : l > j > i where constraints (5a)–(5e) are linear after the continuous relaxation of the selection variables αk(i, j). To explain the origin and the functionality of the constraints in L1, let us further define a sub-genetic interaction map (SMAP) S, [ 21 ], as given in Fig. 3 according to the following definition where we adopt the graph notation of [ 16 ]: Definition 1 Given a non-empty set of edges Ein and a non-empty set of edges Eout, graph S = (GS , ES ), with set of nodes GS and set of edges ES , is a SMAP if the following conditions are fulfilled: (i) the graph S is acyclic and directed and (ii) there are ∃ein ∈ Ein and eout ∈ Eout such that each path P through graph S incides S via egde ein and leaves graph S via edge eout. DAG D1, as displayed in Fig. 4, consists of genes i, j and SMAPs S1 and S2. Obviously, genes i, j belong to class k = 1, i.e., α1(i, j) = 1. Furthermore, all genes l ∈ GD1 \ {R} : l > j > i for which α1(i, l) = 1 must be either located in SMAP S1 or S2. Thus, it follows from DAG D1 in Fig. 4 that the gene pair j, l is either in hierarchical relationship class k = 1 or k = 2, i.e., α1(j, l) = 1 or α2(j, l) = 1. This logical implication is directly reflected by constraint (5a). Given α1(i, j) = 1 and α1(i, l) = 1, the right-hand side (RHS) of (5a) amounts to 1. In this case also, the left-hand side (LHS) of (5a) becomes 1 to fulfill the inequality (5a). Thus, either α1(j, l) = 1 or α2(j, l) = 1. Reversely, assume that α1(i, j) = 1 and α1(i, l) = 1 does not hold, and then, the RHS of (5a) is less than 1, i.e., 0 or −1, while the LHS of (5a) is always greater than 0. Hence, constraint (5a) is fulfilled irrespectively of the choice of αk(j, l), i.e., constraint (5a) enforces no logical implications. Similarly for DAG D2 in Fig. 4, it is obvious that genes i, j belong to the hierarchical relationship class k = 1, i.e., α1(i, j) = 1. All genes l ∈ GD2 \ {R} : l > j > i which are in a linear pathway upstream of gene i, i.e., α2(i, l) = 1, must be located in SMAP S3. Hence, it directly follows from DAG D2 that also, gene l must be in a linear pathway upstream of gene j, i.e., α2(j, l) = 1. This logical implication is compactly represented in constraint (5b). Under the assumption that α1(i, j) = 1 and α2(i, l) = 1, the RHS of (5b) amounts to 1 enforcing α2(j, l) = 1, so that the LHS of (5b) equals the RHS and the inequality in (5b) is fulfilled. Reversely, assume that α2(i, l) = 0, then the RHS of (5b) is less than 1 and hence the LHS of (5b) is always bigger than or equal to the RHS irrespectively of the choice of αk(j, l), i.e., constraint (5a) enforces no logical implications. We can proceed in the same fashion to explain constraints (5c)–(5e) based on DAGs D3 to D5 as given in Fig. 4, respectively. Note that the DAGs D1 to D5 are sufficient illustrations of Eqs. (5a)– (5e) in order to derive all logical implications for the case that genes i, j are in class 1, i.e., α1(i, j) = 1. In the case of DAG D1, for instance, there can be other DAGs than DAG D1 indeed where genes i, j are in class 1 and genes i, l are in class 1, i.e. α1(i, j) = 1 and α1(i, l) = 1. However, DAG D1 contains all the necessary information in order to derive the logical implications for gene pair j, l, given that α1(i, j) = 1 and α1(i, l) = 1. The same holds for DAGs D2 to D5. Furthermore, with the same line of argument, we can derive the sets Lk for k ∈ K \ 1 which reflect the logical implications among the selection variables under the assumptions that αk(i, j) = 1 for k ∈ K \ 1. However, due to space limitations, we omit the derivation of the full set of logical implications at this point and refer the interested reader to [ 30 ] where we will provide the full set of topology constraints L as well as further supplementary material. The full set of topology constraints L in (4d) can be computed as L = |K| k=1 {Lk} . Finally, a considerable advantage of the presented algorithm is its ability to incorporate prior knowledge into the classification of the gene pairs i, j to the most consistent hierarchical relationship classes. Assume that it is known from existing experimental results that two specific genes i0, j0 ∈ G : j0 > i0 do not interact with each other. Then, we can easily incorporate this prior knowledge into program OGENIE in (4) by adding Eq. (7) as defined below α3(i0, j0) = 1 as a topology constraint to program OGENIE. This property is very valuable since it allows the GENIE algorithm to take advantage of existing results in genetic interaction research to improve the reliability of the classification. 3.2 Edge computation Based on the detected set of selection variables AOGENIE which corresponds to the most consistent pattern of hierarchical relationship classes given the observed SK and DK phenotypes, an estimate EGENIE of the true set of edges ED of DAG D can be computed. It can be theoretically proven that the representation of an arbitrary DAG D by its corresponding set of hierarchical relationship classes is not unique. AD the set of selection variables which directly corresponds to the hierarchical relationship class pattern of DAG D represents not only the true DAG D (6) (7) but also a set of similar DAGs which have minorly different sets of edges compared to the true DAG D. Assume we are only given that α4D(i, j) = 1 for two arbitrary genes i, j of DAG D, then we suffer an information loss on the number of paths from gene i to the reporter node R which are independent of gene j. Similarly, given that α5D(i, j) = 1 for two arbitrary genes i, j ∈ G : j > i of DAG D, we suffer an information loss on the number of paths from gene j to the reporter node R which are independent of gene i. Hence, this information loss yields ambiguities in computing the set of edges ED of DAG D based on the AD. In order to clarify the origin of the ambiguities further, let us turn to a simple example. Given DAG Da = GDa , EDa as displayed on the LHS of Fig. 5 and the corresponding set of hierarchical relationship classes represented by the corresponding set of selection variables ADa . Note that α4Da (1, 2) = αDa (1, 3) = αDa (1, 4) = 1 due to edge 4 4 e0 ∈ EDa and α1Da (1, 5) = 1. Now, assume that we want to compute the topology of DAG Da, i.e., the set of edges EDa , based on ADa . DAG Dˆ a displayed on the RHS of Fig. 5 shows the estimated topology of DAG Da based on ADa . It can be shown that the black edges of DAG Dˆ a are necessary such that DAG Dˆ a is represented by ADa . Edges e1 and e2 in DAG Dˆ a are optional in a sense that their existence has no effect on set ADa . Edges e1 and e2 create two paths from gene g1 to the reporter node R which are independent of gene g2 and gene g3, respectively. However, due to edge e0, gene g1 already has a path to the reporter node R which is independent of genes g2 and g3. Since α4Da (1, 2) = α4Da (1, 3) = α4Da (1, 4) = 1 do not contain information on the number of paths from gene g1 to R that are independent of g2, g3 and g4, edges e1 and e2 do not affect the pattern of hierarchical relationship classes representing DAG Da, i.e., ADa , and hence, this yields ambiguities in computing the topology of DAG Da based on its corresponding set of selection variables ADa . Since it is a common assumption in genomics research that GI maps, i.e., DAGs, are not overly dense but rather sparse, we propose a policy which computes the sparsest DAG topology based on the detected pattern of hierarchical relationship classes. Given the detected pattern of hierarchical relationship classes of a DAG D, i.e., Aˆ D = αˆ 1D(i, j), ..., αˆ 5D(i, j) , we compute an estimate EˆD ∀i,j∈G:j>i of the true topology set ED of DAG D according to the policy depicted in Table 1 where we make use of the symmetry properties αˆ 1D(i, j) = αˆ 2D(j, i), αˆ 2D(i, j) = αˆ 1D(j, i), αˆ 3D(i, j) = αˆ 3D(j, i), αˆ 4D(i, j) = αˆ 5D(j, i), and αˆ 5D(i, j) = αˆ 4D(j, i). Note that we redundantly expand the set of detected class-selection variables αˆ k(i, j) from all pairs i, j ∈ G : j > i to all pairs i, j ∈ G in order to obtain a compact formulation of the mutually exclusive conditions E1 to E4 as stated in Table 1. Assume that either condition E1 or condition E2 is fulfilled, then we conclude that there is an edge from gene i to gene j in DAG D. Given that either condition E3 or condition E4 is fulfilled, we conclude that there exists an edge from gene j to gene i in DAG D. We remark that there cannot be an edge between two genes i, j if they are independent of each other, i.e., αˆ 3D(i, j) = 1. As described by E1, there is an edge from gene i to gene j in DAG D, if gene i is linearly upstream of gene j, i.e., αˆ 1D(i, j) = 1, and there is no gene l in DAG D that is linearly downstream of gene i, i.e., αˆ 1D(i, l) = 1, and linearly upstream of gene j, i.e., αˆ 2D(j, l) = 1. According to condition E2, there is an edge from gene i to gene j in DAG D, if gene i is upstream of gene j with at least one path from gene i to R which is independent of gene j, and furthermore, there is no gene l in DAG D that is either linearly downstream of gene i or downstream of gene i E1: E2: E3: E4: with gene i having at least one path to R that is independent of l and neither gene l is linearly upstream of gene j nor gene l is upstream of gene j with an independent path to R. In order to elucidate the effect of condition E2 onto the edge computation, we briefly turn to DAG Dˆ a in Fig. 5. Condition E2 ensures that the optional edges e1 and e2 are not detected but only the necessary edges displayed in black color. We remark that conditions E3 and E4 can be elucidated by the same line of argument as used for conditions E1 and E2, but due to space limitations, we omit a detailed explanation at this point. Finally, we propose a condition from which all reporter node edges, i.e, all edges that connect gene i ∈ G with reporter node R in DAG D, can be computed. Based on the detected set of hierarchical relationship classes, i.e., AˆD, we follow our policy of computing the necessary edges only. For clarity of presentation and notational compactness, we define set Mi as Mi = l ∈ G \ i| αˆ 4D(i, l) = 1 containing all genes l ∈ G which are in class k = 4 with gene i ∈ G, i.e., αˆ 4D(i, l) = 1. Furthermore, we define set which contains all genes l of set Mi that are independent of at least one other gene of set Mi. Based on sets Mi and Mi, we formulate condition ER as stated in Table 2. We conclude that there is an edge from gene i to reporter node R in DAG D, if condition ER is fulfilled. Given that gene i is linearly upstream of at least a single gene l, i.e., αˆ 1D(i, l) = 1, there cannot exist an edge from gene i to reporter node R in DAG D, since all paths from gene i to R pass through at least one other gene l. Conversely, if there is no such gene l that αˆ 1D(i, l) = 1, then the LHS of ER as given in Table 2 is fulfilled. The RHS of ER accounts for our policy of detecting sparse DAGs only and is fulfilled if either Mi, Mi, or Mi and Mi are empty. Note that given Mi = ∅, it follows that Mi = ∅ as well, whereas the opposite is not true. In order to explain the effect of the RHS of condition ER in an intuitive manner, let us turn to DAG DR as displayed in Fig. 6. Assume that we are given the pattern of hierarchical relationship classes that corresponds to DAG DR, i.e., ADR , and we want to compute all reporter node edges based on ADR , i.e., all edges that directly connect a gene in DAG DR with the reporter node R. Note that α1DR (2, 3) = 1, α5DR (2, 4) = 1, α4DR (3, 4) = 1, and αDR (1, l) = 1 ∀l ∈ GDR \ g1, R . It 4 can be shown that gene g3 fulfills the LHS of ER, i.e., there is no gene which is linearly downstream of g3, and furthermore, M3 = g4 and M3 = ∅. Hence, condition ER is fulfilled and edge en connecting g3 and R is computed in DAG Dˆ R that is the reconstruction of DAG DR based on ADR . Furthermore, for set ADR , the edge en in the reconstructed DAG Dˆ R is necessary, since α1DR (2, 3) = 1, α5DR (2, 4) = 1, and α4DR (3, 4) = 1. In contrast to en, edge eo is not necessary for ADR to represent Dˆ R, since αDR (1, l) = 1 ∀l ∈ GDR \ g1, R irrespectively of edge eo. 4 Hence, eo is not detected, since M1 = ∅ and M1 = ∅. We obtain an estimate EGENIE of the true set of edges ED of DAG D by setting AˆD = AOGENIE and evaluating conditions E1 to E4 and condition ER as stated in Tables 1 and 2, respectively. 4 GI-GENIE algorithm In this section, we present the proposed GI-GENIE algorithm which jointly formulates the gene pair classification and the corresponding DAG topology estimation. Let us define the following edge-selection variables β(i, j) = 1 ∃ edge between i, j 0 no edge ∀i, j ∈ G : j > i (10) where β(i, j) = 1 denotes that there is an edge between genes i, j in DAG D and β(i, j) = 0 denotes that there exists no edge between genes i and j. Note that unlike αk(i, j) = 1 for k ∈ K, β(i, j) = 1 does not capture directionality information about the graph topology, i.e., β(i, j) = 1 states that there is an edge between genes i, j in DAG D without specifying the hierarchy among both genes. The topology ED of any DAG D can be represented by the corresponding set of class-selection variables AD = α1D(i, j), ..., α5D(i, j) together with the corresponding i,j set of undirected edges β(i, j) for all i, j :∈ G : j > i. The set β(i, j) can be viewed as the undirected “skeleton” of the DAG that is represented by its corresponding set of class-selection variables AD. The GIGENIE algorithm yields an estimate EGI of the true DAG topology ED by computing sets AOGI-GENIE and βˆ(i, j) which are estimates of the true set of class-selection variables and edge-selection variables, AD and β(i, j) , respectively. Based on SK, DK, and GI-profile data, the proposed GI-GENIE algorithm is formulated as the following LIP: min {αk(i,j),β(i,j),zl(i,j)} λd − λc + λp pattern of selection variables AOGI-GENIE detected by program OGI-GENIE, is not contradicting with the set of edge selection variables βˆ(i, j) ∀i, j ∈ G : j > i detected by program OGI-GENIE. In particular, given that the detected pattern of selection variables AOGI-GENIE enforces that there is an edge between genes i, j in DAG D, then the auxiliary variables ensure that the corresponding edge selection variable indicates that there is an edge between genes i, j, i.e., βˆ(i, j) = 1. Furthermore, given that the detected pattern of selection variables AOGI-GENIE enforces that there is no edge between genes i, j in DAG D, then the auxiliary variables ensure that the corresponding edge selection variable indicates that there is no edge between genes i, j, i.e., βˆ(i, j) = 0. On the contrary, assume that the detected edge selection variables enforce that there is an edge between genes i, j in DAG D, i.e., βˆ(i, j) = 1, then the zl(i, j) ensure that the detected pattern of selection variables AOGI-GENIE must fulfill one of the conditions stated in Table 1. Consequently, given that the detected edge selection variables enforce that there is no edge between genes i, j in DAG D, i.e., βˆ(i, j) = 0, then the zl(i, j) ensure that the detected pattern of selection variables AOGI-GENIE does not fulfill any of the conditions stated in Table 1. Furthermore, let the auxiliary parameters q(i, j) = 1 ρ(i, j) ≥ λλpc 0 ρ(i, j) < λλpc ∀ i, j ∈ G : j > i (13) where the scalars λd, λs, λc, and λp are non-negative weighting constants to balance the impact of the SK and DK measurements and the GI-profile data, respectively, on the estimates. In particular, the parameter λd is used for dual purpose: (i) to scale the domain of the knockout scores sk(i, j) to the range [ 0, ..., 1] which is comparable to range of the correlation data ρ(i, j) and (ii) to trade-off the impact of the knockout scores sk(i, j) on the estimation outcome. The parameters λc and λp are in the interval [ 0, 1 ] where λc ≥ λp. The GI-profile (GIP) term is given by −λc G G i=1 j=i+1 ρ(i, j)β(i, j) + λp β(i, j). (12) G G i=1 j=i+1 λ The quotient of λc defines the threshold for reward of p the GI-profile (GIP) term in Eq. (12), where setting the edge selection variable β(i, j) = 1 is rewarded if the corresponding GI-profile measurement ρ(i, j) is above the λ quotient λc . p The auxiliary variables zl(i, j)∀i, j, l ∈ G : j > i, l = i, l = j are generally necessary to ensure that the information about the topology of DAG D, which is encoded in the describe the detection of the edges of DAG D based on GI-profile data only, where q(i, j) = 1 denotes that there is an edge between genes i, j and q(i, j) = 0 denotes that there is no edge between genes i, j. Since any pattern of hierarchical relationship classes implies a specific set of edges and any set of edges implies a specific pattern of hierarchical relationship classes, there is a strong coupling of constraints, i.e., there are strong logical implications among the selection variables αk(i, j) and the selection variables β(i, j); the constraints in Eqs. (11e) to (11g) ensure that the detected hierarchical relationship classes and the corresponding edges, i.e., the αk(i, j) and β(i, j), are not mutually contradicting. Given that two genes i, j in DAG D are independent, i.e., α3(i, j) = 1, there cannot exist an edge between those genes in DAG D, i.e., β(i, j) = 0. This logical implication is reflected by (11e). Set Lc in (11f) and the linear integer inequalities in (11g) model conditions E1 to E4 of our proposed edge detection policy as stated in Table 1. Since we do not want to redundantly expand the set of selection variables αk(i, j) to all i, j ∈ G : j = i in order to not increase the complexity of program OGI-GENIE, we have to consider three cases when formulating conditions E1 to E4 of Table 1 as linear integer inequalities, i.e., i, j, l ∈ G : l > j > i, i, j, l ∈ G : j > i > l α1(i, l) + α2(j, l) ≥ α1(i, j) − zl(i, j) 1 − β(i, j) + q(i, j) ≥ α4(i, j) + α1(i, l)+ α4(i, l) + α2(j, l) + α5(j, l) − 2 1 2 α4(i, j) − zl(i, j) − q(i, j) α1(i, l) + α4(i, l) + α2(j, l) + α5(j, l) ≥ 2 − β(i, j) − q(i, j) ≥ α4(i, j) + α1(i, l) + α2(j, l) + α5(j, l) − 2 1 α1(i, l) + α2(j, l) ≥ 2 α4(i, j) − zl(i, j) − 1 + q(i, j) 1 − β(i, j) + q(i, j) ≥ α5(i, j) + α2(i, l)+ α5(i, l) + α1(j, l) + α4(j, l) − 2 α2(i, l) + α5(i, l) + α1(j, l) + α4(j, l) ≥ 2 − β(i, j) − q(i, j) ≥ α5(i, j) + α2(i, l) + α5(i, l) + α1(j, l) − 2 1 α2(i, l) + α1(j, l) ≥ 2 α5(i, j) − zl(i, j) − 1 + q(i, j) ∀i, j, l ∈ G : l > j > i model the logical implications among the selection variables αk(i, j), αk (i, l), αk (j, l), and β(i, j) for k, k , k ∈ K, ∀i, j, l ∈ G : l > j > i. Together with (11g), constraints (14a)–(14b) model condition E1 of our detection policy taking into account the GI-profile information ρ(i, j) via selection variables β(i, j). Assume that based on the SK and DK phenotypes, it is most consistent that α1(i, j) = and i, j, l ∈ G : j > l > i. Then, the constraints in set Lc,1 Lc,1 = 1 − β(i, j) ≥ α1(i, j) + α1(i, l) + α2(j, l) − 2 (14a) (14b) (14c) (14d) (14e) (14f) (14g) (14h) (14i) α1(i, l) = α2(j, l) = 1 for at least one gene l in DAG D which corresponds to condition E1 being violated. Hence, there cannot exist an edge between genes i and j in DAG D. In this case, the RHS of (14a) amounts to 1 which enforces the LHS of (14a) to amount to 1 as well, i.e., β(i, j) = 0. Note that for α1(i, j) = α1(i, l) = α2(j, l) = 1, (14b) makes no restrictions on zl(i, j). Furthermore, assume that for genes i, j, based on the SK and DK phenotypes, it is most consistent that α1(i, j) = 1, but α1(i, l) and α2(j, l) are not jointly 1 for all other genes l ∈ G : l > j > i, i.e., α1(i, l) + α1(j, l) < 2, then there is an edge between genes i, j in DAG D according to condition E1. In this case, it is obvious that (14a) is always fulfilled, i.e., there are no restrictions on β(i, j) by (14a). Since α1(i, j) = 1 and α1(i, l) + α2(j, l) ≤ 1 for all l ∈ G : l > j > i, constraint (14b) can only be fulfilled if zl(i, j) = 1 ∀l ∈ G : l > j > i. Hence, this enforces β(i, j) = 1 due to constraint (11g). In this case, constraint (14b) forces zl(i, j) = 1 ∀l ∈ G : l > j > i. Hence, given that zl(i, j) = 1 ∀l ∈ G : l > j > i, constraint (11g) sets β(i, j) = 1. Given that the GI-profile data strongly supports that there is no edge between genes i, j in DAG D, i.e., β(i, j) = 0, and α1(i, j) = 1 is most consistent based on the SK and DK phenotypes measured, then it follows from (11g) that there must be at least one l ∈ G : l > j > i for which zl(i, j) = 0. In this case, with β(i, j) = 0, α1(i, j) = 1, and zl(i, j) = 0, the RHS of (14b) amounts to 1, forcing the LHS of (14b) to amount to 1 as well, i.e., α1(i, l) = 1 and α2(j, l) = 1, which is together with the assumption of α1(i, j) = 1 a combination that violates the existence of a direct edge between genes i and j. Furthermore, note that (14a) does not have any implications on the selection variables α1(i, j), α1(i, l), and α2(j, l) for the case that β(i, j) = 0 and zl(i, j) = 0. Assume that the GI-profile data strongly supports that there is an edge between genes i, j in DAG D, i.e., β(i, j) = 1, and α1(i, j) = 1 is most consistent based on the SK and DK phenotypes measured, then according to (14a), there cannot be any gene l ∈ G : l > j > i for which α1(i, l) = 1 and α2(j, l) = 1. Hence, Eq. (14b) can only be fulfilled if zl(i, j) = 1 ∀l ∈ G : l > j > i. Thus, (11g) is fulfilled with equality. We remark that given α1(i, j) = 1, constraints (14c) to (14i) are always fulfilled, i.e., they do not pose any implications among the selection variables αk(i, j) and β(i, j). Together with (11g), the two inequalities in (14c) model condition E3 where we can elucidate their functionality in the same fashion as before. Constraints (14d) to (14g) along with (11g) model a minor modification of condition E2 where we detect not only all necessary edges but also optional edges given that their existence is strongly supported by the GI-profile. Given that the existence of an edge between genes i, j in DAG D is not strongly supported by the GI-profile, i.e., q(i, j) = 0, constraints (14d) to (14e) along with (11g) model condition E2 which only allows necessary edges to be detected and we can elucidate their functionality in the same fashion as in (14a) to (14b). Note that (14f) to (14g) are always fulfilled for q(i, j) = 0, i.e., no implications among the selection variables αk(i, j) and β(i, j) are posed. Assuming that the existence of an edge between genes i, j in DAG D is strongly supported by the GI-profile, i.e., q(i, j) = 1, then the constraints in (14d) and (14e) are always fulfilled, i.e., no implications among the selection variables αk(i, j) and β(i, j) are posed by (14d) and (14e). However, constraints (14f) and (14g) pose relaxed logical implications among the selection variables αk(i, j) and β(i, j) compared to constraints (14d) to (14e). Hence, given that q(i, j) = 1 and α4(i, j) = 1, an edge between genes i, j in DAG D is detected if it is allowed by the pattern of hierarchical relationship classes. Constraints (14h) to (14i) along with (11g) model a minor modification of condition E4 where we detect not only all necessary edges but also optional edges given that their existence is strongly supported by the GI-profile. Furthermore, the functionality of constraints (14h) to (14i) can be explained with the same line of argument as used to elucidate constraints (14d) to (14g). Denote Lc,2 and Lc,3 as the sets of topology constraints that model the logical coupling among the selection variables αk(i, j), αk (i, l), αk (j, l), and β(i, j) for k, k , k ∈ K and i, j, l ∈ G : j > i > l and i, j, l ∈ G : j > l > i, respectively. Then, the full set of coupled constraints of the selection variables αk(i, j) and β(i, j) is given by 3 κ=1 Lc = Lc,κ (15) where we again refer the interested reader to [ 30 ] for a detailed description of Lc. We obtain an estimate EGI of the true set of edges ED of DAG D based on the computed set of edge selection variables βˆ(i, j) of program OGI-GENIE where we infer the directionality of the edges according to AOGI-GENIE . Note that all reporter node edges are computed according to our proposed reporter node edge detection policy as given in Table 2. Since the reporter node is an artificial node in the concept of a DAG, there is no GI-profile data ρ(i, R) ∀i ∈ G and thus, no edge selection variable β(i, R) ∀i ∈ G according to (10). 5 Sequential scalability technique Due to the combinatorial nature of problems OGENIE and OGI-GENIE, the GENIE algorithm and GI-GENIE algorithm, respectively, cannot be applied to the data of large sets of genes G, since the number of candidate solutions to problems OGENIE and OGI-GENIE, respectively, grows exponentially with the number of genes. In order to nevertheless obtain statistically significant statements about the interactions among genes in a large set of genes G, we propose the sequential scalability (SEQSCA) technique which is based on the GENIE algorithm and the GI-GENIE algorithm, respectively. In particular, we create a sequence of S subsets {Gs}S1 of the full set of genes G, i.e., Gs ⊂ G, and ∀s ∈ {1, ..., S}, where we estimate the topology ED,s of each DAG Ds, underlying the data of the subset of genes Gs, by the GENIE or GI-GENIE algorithm, respectively, in order to translate the estimated topology ED,s of DAG Ds into the corresponding adjacency matrix Ms for each s ∈ {1, ..., S}. Based on the computed sequence of adjacency matrices S {Ms}1, we iteratively compute the reliability matrix M ∈ [ 0, 1 ]N×N of the full set of genes G in such a way that each entry [M]i,j∈G describes the empirical probability of an edge to exist between genes i, j ∈ G, i.e., the empirical probability that genes i, j ∈ G directly interact with each other, where a value of 0 means that there is an interaction between the considered pair of genes with probability 0 and a value of 1 means that the considered pair of genes interacts with probability 1. In particular, in each iteration s, we consider a subset Gs of size NS |G| of the full set of genes G, where each gene of Gs is selected from G without replacement with equal probability. Based on the selected subset Gs, we compute in each iteration s an estimate ED,s of the true topology of DAG Ds, underlying the observed data of the genes in subset Gs, by the GENIE or the GIGENIE algorithm, respectively. Furthermore, the topology estimate ED,s of DAG Ds is translated into the corresponding adjacency matrix Ms. The update of the reliability matrix for iteration s is computed according to Eq. (16) M(s+1) i,j = M(s) i,j + [Ms]κi,κj ∀i, j ∈ Gs (16) with M(s) being the N × N reliability matrix at iteration s, κi ∈ {1, ..., NS} ∀i ∈ Gs, ∪iκi = {1, ..., NS} and κi < κj for all i < j. Finally, we obtain the reliability matrix M of the full set of genes G by normalizing each entry M(S) i,j i, j ∈ G by ni,j that is the frequency of how often detecting an edge between genes i and j has been considered. Note that the proposed SEQSCA technique does not intend to yield valid DAGs but to provide statistical statements to which empirical probability two genes interact with each other. In Table 3, we have summarized the SEQSCA technique. Finally, by means of the SEQSCA technique, we are able to provide statistically significant statements about the interactions among the genes from a large set G by using the GENIE or GI-GENIE algorithm, respectively, in a sequential fashion. In order to limit the Monte Carlo simulation time, we consider a total of 10 genes amounting to 225 binary variables and 2670 constraints for the GENIE algorithm and 630 binary variables and 9645 constraints for the GIGENIE algorithm, respectively. For the GENIE method without considering the consistency constraints in L, we have 225 binary variables and 270 constraints. Since we infer the edge orientation for the Pearson correlationbased method from the least mismatch scoring model, i.e., from the GENIE method without considering the consistency constraints in L, we have 270 binary variables and 270 constraints. In Fig. 7, we display the false detection percentage of edges Ped in the detected DAG normalized to the true number of edges |ED| as defined in Eq. (17) versus the SNR. In Fig. 8, we display the percentage of undetected edges Pmis in the detected DAG normalized to the true number of edges |ED| as defined in Eq. (18) versus the SNR, i.e., Table 3 Summary of the proposed SEQSCA-algorithm Initialization: M(0) = 0N×N; Ms=0 = 0NS×NS ; frequency counter ni(,0j) = 0 Repeat: 1: Select subset Gs of size NS from G; draw each gene from G with equal probability without replacement 2: Update: ni(,sj+1) = ni(,sj) + 1 for all i, j ∈ Gs 3: Estimate the DAG topology Es of set Gs using GENIE, GI-GENIE, respectively; =⇒ Ms 4: Update reliability matrix M(s) according to Eq. (16) 7: Update iteration number: s ← s + 1 Until: s = S; Set [M]i,j = M(S) i,j /ni(,Sj) ∀i, j ∈ G 6 Simulation results In this section, we first demonstrate the performance of the GENIE algorithm and the GI-GENIE algorithm with respect to conventional techniques for simulated data and further provide statistically significant statements on the interactions among the genes from a large set of genes based on real data using the proposed SEQSCA technique. For the implementation of the proposed algorithms, we used the popular CVX interface [ 31 ] along with the well-known MOSEK solver [ 32 ]. 6.1 Synthetic data results We have generated the ideal SK phenotypes R(i) ∈ R for all i ∈ G as well as the ideal DK phenotypes R(i, j) ∈ R for all i, j ∈ G : j > i according to the model of [ 2 ] as displayed in Fig. 2, where we have corrupted the ideal SK and DK phenotypes by independently and identically distributed zero-mean Gaussian noise with variance σ 2. Furthermore, the GIprofile data ρ(i, j)∀i, j ∈ G : j > i has been generated on the basis of the SK and DK phenotypes. We compare both the GENIE algorithm and the GI-GENIE algorithm with the well-known GI-profile approach [ 2, 33 ], where the Pearson correlation between the GIprofiles of genes i and j is computed and an edge in the DAG is detected if the Pearson correlation is above a pre-defined threshold tcorr, where the directionality is inferred from the selection variable αk(i, j) corresponding to the least mismatch model μk(i, j). Furthermore, we compare our proposed methods with the solution of program OGENIE without considering set L as a constraint, which means simply classifying each pair i, j to the least mismatch scoring hierarchical relationship class based on the SK and DK phenotypes R(i) and R(i, j), respectively, without ensuring that the resulting pattern of hierarchical relationship classes represents a valid DAG. Ped = Pmis = ED ˆ ED \ ED |ED| ED ED \ EˆD ˆ |ED| (17) (18) Fig. 8 Dmis versus SNR; tcorr = 0.6; 200 Monte Carlo runs; λd = 0.05, λc = 1, λp = 0.8 Note that in multi-hypothesis testing problems, it is common to view the diagnostic plots in Figs. 7 and 8 jointly to assess the quality of the proposed algorithms. In Fig. 7, we observe that in the low SNR regime, the Pearson correlation-based method performs best in terms of false detection percentage of edges Ped; however, it fails to improve performance with increasing SNR, because for correct directionality information of the edges, this approach relies on the hierarchical relationship classes detected by method OGENIE without considering L. Especially in the high SNR regime, the proposed GENIE and GI-GENIE methods clearly outperform program OGENIE without the topology rule set L and approach and respectively reach the performance of the Pearson correlation method. However, the very good performance of the Pearson correlation method in terms of false detection percentage of edges Ped according to Eq. (17) comes at the cost of a rather poor performance in terms of the percentage of undetected edges Pmis according to Eq. (18) as can be seen in Fig. 8. In particular, in terms of the percentage of undetected edges Pmis, all of the proposed methods outperform the Pearson correlation method. Note that in the high SNR regime, the GI-GENIE of combining SK, DK, and GI-profile data yields the best of both worlds, i.e., it shows an equivalent performance as the Pearson correlation method in terms of false detection percentage of edges Ped, as well as an improvement of the strong performance of the GENIE method in terms of the percentage of undetected edges Pmis. 6.2 Real data results Since discovering genetic interaction maps, i.e., DAGs, for specific organisms is an ongoing field of research and the knowledge on genetic interactions is far away from being complete, there is generally no ground truth to directly compare with, even not for yeast which is one of the best understood organisms in this aspect. Therefore, we base the evaluation of the detection performance of the GENIE and the GI-GENIE methods on the biological knowledge that genetic interactions are generally rare and furthermore on the successful detection of known interactions provided by the well-known yeast database of [ 34 ]. We remark that to obtain statistically significant statements about large sets of genes, we have applied the proposed GENIE and GI-GENIE algorithms, respectively, along with the SEQSCA technique presented above. To demonstrate the benefit of using multiple data types instead of only one data type, we compare the reliability matrix results for SEQSCA and GI-GENIE with SEQSCA and GENIE which only utilizes SK/DK data. We have applied the abovementioned algorithms to the dataset reported in [ 35 ] to obtain the reliability matrices for the GENIE-based SEQSCA as well as for the GI-GENIE-based SEQSCA, MG and MGI, respectively. The phenotypes reported in [ 35 ] are colony size measurements normalized to the wild-type size for each particular SK and DK, respectively. Typically, the colony size serves as a proxy for the fitness of the organism under study, which is the actual cell function of interest that cannot be observed. Therefore, the phenotypes in [ 35 ] are non-negative real numbers within the range [0, Cmax], where Cmax denotes the maximum size dictated by the experiment setup. For computational reasons, we only considered the first 200 genes, i.e., |G| = 200, of the query gene list of [ 35 ]. Figure 9 shows MG obtained by the GENIE-based SEQSCA. In Fig. 10, we have displayed MGI obtained by the GIGENIE-based SEQSCA. For both results, we decomposed G into a sequence of S = 5e4 subsets Gs of equal size Ns = 10. In Fig. 9, 78% of the gene pairs i, j considered by MG of the GENIE-based SEQSCA interact with each other with an empirical probability of less than 20%, i.e., [MG]i,j ≤ .2. Hence, the GENIE-based SEQSCA yields approximately sparse results. This is a good performance in terms of sparsity, since it is known from biology that genetic interactions are generally very rare. However, we can clearly see by the reliability matrix MGI that the proposed GI-GENIE algorithm predicts genetic interactions with a much lower frequency as compared to the GENIE algorithm, which means a very good performance Fig. 9 Reliability matrix MG; S = 5e4 subsets considered; subset size NS = 10 in terms of sparsity. We have computed the acceptance ratio = Nr Nt (19) where Nr is the number of interactions found with high significance ([MG]i,j , [MGI]i,j ≥ 1 − ) and which are deposited in the database of [ 34 ] as well. Nt is the total number of highly significant interactions. Given the confirmed interactions at [ 34 ] for our set of genes under study, we remark that evaluating the number of confirmed interactions, that we have also found with our proposed method, would not be a reasonable performance metric, since [ 34 ] combines knowledge and experimental results of numerous sources. In contrast to that, we only had the dataset of [ 35 ] which only considers a particular phenotype, i.e., colony growth. As depicted in Table 4, we have computed for both the GI-GENIE-based SEQSCA and the GENIE-based SEQSCA. It is obvious that the GI-GENIE-based SEQCA outperforms the GENIE-based SEQSCA, since the acceptance ratio for the GI-GENIEbased SEQSCA is significantly higher than the one of the GENIE-based SEQSCA. Fig. 10 Reliability matrix MGI; S = 5e4 subsets considered; subset size NS = 10; λd = 1e3, λc = 1, λp = 0.85 7 Conclusions In this paper, we have considered the problem of learning the interactions between genes in a genetic network. We have proposed the GENIE algorithm and the GIGENIE algorithm to reconstruct the DAG underlying the observed data. The GENIE method is purely based on SK and DK data whereas the GI-GENIE method combines SK and DK data with GI-profile data in order to compute an estimate of the true DAG topology. In Section 5, we have presented the SEQSCA technique in order to obtain statistically significant statements about the interactions among a large set of genes under study. Furthermore, we have shown by simulations that the GI-GENIE algorithm outperforms the conventional techniques and the GENIE algorithm due to the combination of multiple data types, i.e., SK/DK and GI-profile data. Finally, based on the SEQSCA technique, we have presented real data results for the GENIE and the GI-GENIE algorithm, respectively, Table 4 Acceptance ratios; = 0.05 Method: SEQSCA and GENIE where we have confirmed that the GI-GENIE method outperforms the GENIE method. Endnote 1 In a discrete optimization context, the class-selection variables defined in (3) are denoted as SOS-1 type variables. However, for the sake of readability, we will mostly omit this optimization context-based annotation and refer to the variables defined in (3) as class-selection variables. Authors’ contributions FN and MP contributed to the main idea, designed and implemented the proposed algorithms, carried out the simulations, and analyzed the results. FN wrote the initial draft of the paper. MP revised the paper. GK and NT brought in the biological background. All authors read and approved the final manuscript. Competing interests The authors declare that they have no competing interests. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 1. A Shojaie , G Michailidis , Discovering graphical Granger causality using the truncating lasso penalty . 26 ECCB 2010 , i517 - i523 ( 2010 ). Department of Statistics, University of Michigan, ECCB, Vol. 26 2. A Battle , MC Jonikas , P Walter , JS Weissman, D Koller , Automated identification of pathways from quantitative genetic interaction data . Mol.Syst. Biol . 6 , 379 - 391 ( 2010 ) 3. AHY Tong , et al, Systematic genetic analysis with ordered arrays of yeast deletion mutants . Science . 294 , 2364 - 2368 ( 2001 ) 4. B Snijder , P Liberali , M Frechin, T Stoeger , L Pelkmans , Predicting functional gene interactions with the hierarchical interaction score . Nat. Methods . 10 ( 11 ), 1089 - 1094 ( 2013 ) 5. A Baryshinkova , et al, Quantitative analysis of fitness and genetic interactions in yeast on a genome scale . Nat. Methods . 7 , 1017 - 1024 ( 2010 ) 6. SR Collins , A Roguev , NJ Krogan, Quantitative genetic interaction mapping using the E-MAP approach . Methods Enzymol . 470 , 205 - 231 ( 2010 ) 7. RO Linden , VP Eronen, T Aittokallio , Quantitative maps of genetic interactions in yeast-comparative evaluation and integrative analysis . BMC Syst. Biol . 5 , 45 - 58 ( 2011 ) 8. SJ Dixon , M Constanzo , A Baryshinkova , B Andrews , C Boone , Systematic mapping of genetic interaction networks . Annu.Rev. Genet . 43 , 601 - 625 ( 2009 ) 9. GN Brock , et al, Methods for detecting gene gene interaction in multiplex extended pedigrees . BMC Genet . 6 , 144 - 149 ( 2005 ) 10. TC Hu, AB Kahng, Linear and integer programming in practice . (Springer International Publishing, Schweiz, 2016 ). ISBN- 10 : 3319239996 11. G Sierksma , Linear and integer programming: theory and practice, second edition . (CRC Press, Boca Raton, 2001 ). ISBN- 10 : 0824706730 12. G Sierksma , Y Zwols , Linear and integer optimization: theory and practice, third edition . (CRC Press, Boca Raton, 2015 ). ISBN- 10 : 1498710166 13. E Demirel , N Demirel , H Gökcen , A mixed integer linear programming model to optimize reverse logistics activities of end-of-life vehicles in Turkey . J. Clean. Prod . 112 , 1813 - 2144 ( 2016 ) 14. CH Antunes, MJ Alves, J Climaco , Multiobjective linear and integer programming . (Springer International Publishing, Schweiz, 2016 ). ISBN- 13 : 9783319287447 15. M Diaby , MH Karwan, Advances in combinatorial optimization . (World Scientific Publishing Co. Pte. Ltd., Singapore , 2016 ). ISBN- 10 : 9814704873 16. R Diestel , Graphentheorie. (Springer-Verlag, Heidelberg, 2012 ). ISBN 978-3-642-14911-5 17. A Jaimovich , et al, Modularity and directionality in genetic interaction maps . Nat. Methods . 26 , 38 - 45 ( 2010 ) 18. A Baryshinkova , M Constanzo , CL Myers , B Andrews , C Boone, Genetic interaction networks: toward an understanding of heritability . Annu.Rev. Genomics Hum. Genet . 14 , 111 - 133 ( 2013 ) 19. A Rogueav , et al, Quantitative genetic-interaction mapping in mammalian cells . Nat. Methods . 10 , 432 - 437 ( 2013 ) 20. M Constanzo , et al, The genetic landscape of a cell . Science . 327 , 425 - 431 ( 2010 ) 21. F Nikolay , M Pesavento , Learning directed-acyclic-graphs from large-scale double-knockout experiments. (C, Communications System Group , TU Darmstadt, EUSIPCO, 2016 ). Budapest, August - September 2016 22. V Balakrishnan , S Boyd , S Balemi , Branch and bound algorithm for computing the minimum stability degree of parameter-dependent linear systems . Int. J. Robust Nonlinear Control . 1 ( 4 ), 295 - 317 ( 1991 ) 23. EL Lawler, DE Wood, Branch-and-bound methods: a survey . Oper. Res . 14 , 699 - 719 ( 1966 ) 24. RE Moore, Global optimization to prescribed accuracy . Comput. Math. Appl . 21 ( 6 /7), 25 - 39 ( 1991 ) 25. Y Cheng, M Pesavento, Joint rate adaptation and downlink beamforming using mixed integer conic programming . IEEE Trans. Signal Process . 63 , 1750 - 1764 ( 2013 ) 26. Y Cheng, M Pesavento , An optimal iterative algorithm for codebook-based downlink beamforming . IEEE Signal Process. Lett . 20 , 775 - 778 ( 2013 ) 27. Y Cheng, M Pesavento, Joint optimization of source power allocation and distributed relay beamforming in multiuser peer-to-peer relay networks . IEEE Trans. Signal Process . 60 ( 6 ), 2395 - 2404 ( 2012 ) 28. Y Cheng, M Pesavento , A Philipp , Joint network optimization and downlink beamforming for CoMP transmissions using mixed integer conic programming . IEEE Trans. Signal Process . 61 , 3972 - 3987 ( 2013 ) 29. CH Papadimitriou , K Steiglitz , Combinatorial optimization: algorithms and complexity . (Dover Publications, Mineola NY , 1998 ). ISBN 0486402584 30. Supplementary Material. https://www2.spg.tu-darmstadt.de/fnikolay/ supp_journal.pdf 31. CVX - A Matlab based convex modeling framework . http://cvxr.com 32. MOSEK Solver. https://www.mosek.com/ 33. M Babu , et al, Quantitative genome-wide genetic interaction screens reveal global epistatic relationships of protein complexes in Escherichia coli . PLoS Genet . 10 , 400 - 414 ( 2014 ) 34. SGD - Saccharomyces genome database . http://www.yeastgenome.org 35. M Costanzo , et al, DRYGIN - Data repository of yeast genetic interactions . Terence Donnelly Centre for Cellular and Biochemical Research , University of Toronto. http://drygin.ccbr.utoronto.ca/~costanzo2009/


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1186%2Fs13637-017-0063-3.pdf

Fabio Nikolay, Marius Pesavento, George Kritikos, Nassos Typas. Learning directed acyclic graphs from large-scale genomics data, EURASIP Journal on Bioinformatics and Systems Biology, 2017, 10, DOI: 10.1186/s13637-017-0063-3