A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling
ITCS
Sepehr Assadi, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
Michael Kapralov, School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
Sanjeev Khanna, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
In the subgraph counting problem, we are given a (large) input graph G(V, E) and a (small) target graph H (e.g., a triangle); the goal is to estimate the number of occurrences of H in G. Our focus here is on designing sublinear-time algorithms for approximately computing the number of occurrences of H in G in the setting where the algorithm is given query access to G. This problem has been studied in several recent papers, which primarily focused on specific families of graphs H such as triangles, cliques, and stars. However, not much is known about approximate counting of arbitrary graphs H. This is in sharp contrast to the closely related subgraph enumeration problem, which has received significant attention in the database community as the database join problem. The AGM bound shows that the maximum number of occurrences of any arbitrary subgraph H in a graph G with m edges is $O(m^{\rho(H)})$, where $\rho(H)$ is the fractional edge-cover number of H, and enumeration algorithms with matching runtime are known for any H. We bridge this gap between subgraph counting and subgraph enumeration by designing a simple sublinear-time algorithm that can estimate the number of occurrences of any arbitrary graph H in G, denoted by #H, to within a $(1 \pm \varepsilon)$-approximation with high probability in $O(m^{\rho(H)}/\#H) \cdot \mathrm{poly}(\log n, 1/\varepsilon)$ time. Our algorithm is allowed the standard set of queries for general graphs, namely degree queries, pair queries and neighbor queries, plus an additional edge-sample query that returns an edge chosen uniformly at random. The performance of our algorithm matches that of Eden et al. [FOCS 2015, STOC 2018] for counting triangles and cliques and extends it to all choices of subgraph H under the additional assumption of edge-sample queries.
2012 ACM Subject Classification Theory of computation → Streaming, sublinear and near linear time algorithms
Related Version A full version is available on arXiv [4], https://arxiv.org/abs/1811.07780.
Acknowledgements We are thankful to the anonymous reviewers of ITCS 2019 for many valuable comments.
Keywords and phrases Sublinear-time algorithms; Subgraph counting; AGM bound

1 Introduction
Counting (small) subgraphs in massive graphs is a fundamental algorithmic problem, with a
wide range of applications in bioinformatics, social network analysis, spam detection and
graph databases (see, e.g. [36, 8, 11]). In social network analysis, the ratio of the number of
triangles in a network to the number of length 2 paths (known as the clustering coefficient) as
well as subgraph counts for larger subgraphs H have been proposed as important metrics for
analyzing massive networks [42]. Similarly, motif counting is a popular method for analyzing
protein-protein interaction networks in bioinformatics (e.g., [36]). In this paper we consider
designing efficient algorithms for this task.
Formally, we consider the following problem: Given a (large) graph G(V, E) with m edges, a (small) subgraph $H(V_H, E_H)$ (e.g., a triangle), and a precision parameter $\varepsilon \in (0, 1)$, output a $(1 \pm \varepsilon)$-approximation to the number of occurrences of H in G. Our goal is to design an algorithm that runs in time sublinear in the number m of edges of G, and in particular makes a sublinear number of the following types of queries to the graph G:
Degree query v: the degree $d_v$ of any vertex $v \in V$;
Neighbor query (v, i): what vertex is the i-th neighbor of the vertex $v \in V$ for $i \le d_v$;
Pair query (u, v): test, for a pair of vertices $u, v \in V$, whether or not (u, v) belongs to E;
Edge-sample query: sample an edge e uniformly at random from E.
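To make the access model concrete, the four query types can be sketched as a small oracle object. This is an illustrative sketch only: the class name `GraphOracle`, its method names, the in-memory adjacency lists, and the `query_count` counter are assumptions introduced here, not part of the paper.

```python
import random


class GraphOracle:
    """Query access to a simple undirected graph; counts every query made."""

    def __init__(self, edges, seed=None):
        self._edges = [tuple(e) for e in edges]
        self._adj = {}
        for u, v in self._edges:
            self._adj.setdefault(u, []).append(v)
            self._adj.setdefault(v, []).append(u)
        self._edge_set = {frozenset(e) for e in self._edges}
        self._rng = random.Random(seed)
        self.query_count = 0

    def degree(self, v):
        """Degree query: return d_v."""
        self.query_count += 1
        return len(self._adj.get(v, []))

    def neighbor(self, v, i):
        """Neighbor query: i-th neighbor of v (1-indexed), or None if i > d_v."""
        self.query_count += 1
        nbrs = self._adj.get(v, [])
        return nbrs[i - 1] if 1 <= i <= len(nbrs) else None

    def pair(self, u, v):
        """Pair query: is (u, v) an edge of G?"""
        self.query_count += 1
        return frozenset((u, v)) in self._edge_set

    def edge_sample(self):
        """Edge-sample query: an edge chosen uniformly at random from E."""
        self.query_count += 1
        return self._rng.choice(self._edges)
```

Counting queries inside the oracle makes it easy to check empirically that an algorithm built on top of it stays within a sublinear query budget.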
The first three queries are the standard baseline queries (see Chapter 10 of Goldreich's book [23]) assumed by nearly all sublinear-time algorithms for counting small subgraphs such as triangles or cliques [16, 18] (see [25] for the necessity of using pair queries for counting subgraphs besides stars). The last query is somewhat less standard but has also been considered in the literature prior to our work, for example in [2] for counting stars in sublinear time, and in [19] in the context of lower bounds for subgraph counting problems.
1.1 Our Contributions
For the sake of clarity, we suppress any dependencies on the approximation parameter $\varepsilon$, $\log n$ terms, and the size of graph H, using the $\tilde{O}(\cdot)$ notation. Our results are parameterized by the fractional edge-cover number $\rho(H)$ of the subgraph H (see Section 3 for the formal definition). Our goal in this paper is to approximately compute the number of occurrences #H of H in G, formally defined as the number of subgraphs $H'$ of G such that H and $H'$ are isomorphic.
Theorem 1. There exists a randomized algorithm that, given $\varepsilon \in (0, 1)$, a subgraph H, and query access to the input graph G, with high probability outputs a $(1 \pm \varepsilon)$-approximation to the number of occurrences of H in G, denoted by #H, using
$$\tilde{O}\left(\min\left\{m, \frac{m^{\rho(H)}}{\#H}\right\}\right) \text{ queries and } \tilde{O}\left(\frac{m^{\rho(H)}}{\#H}\right) \text{ time.}$$
The algorithm uses degree, neighbor, pair, and edge-sample queries.
Since the fractional edge-cover number of any k-clique $K_k$ is k/2, as a corollary of Theorem 1, we obtain sublinear algorithms for counting triangles, and in general k-cliques, using
$$\tilde{O}\left(\min\left\{m, \frac{m\sqrt{m}}{\#K_3}\right\}\right) \quad \text{and} \quad \tilde{O}\left(\min\left\{m, \frac{m^{k/2}}{\#K_k}\right\}\right)$$
queries, respectively. These bounds match the previous results of Eden et al. [16, 18] modulo an additive term of $\tilde{O}(n/(\#K_3)^{1/3})$ for triangles in [16] and $\tilde{O}(n/(\#K_k)^{1/k})$ for arbitrary cliques in [18], which is needed in the absence of edge-sample queries (these queries are not used by [16, 18]). Our bounds settle a conjecture of Eden and Rosenbaum [19] in the affirmative by showing that one can avoid the aforementioned additive terms depending on n in the query complexity by allowing edge-sample queries. We now elaborate more on different aspects of Theorem 1.
AGM Bound and Database Joins. The problem of enumerating all occurrences of a graph H in a graph G and, more generally, the database join problem, has been considered extensively in the literature. A fundamental question here is the following: given a database with m entries (e.g., a graph G with m edges) and a conjunctive query H (e.g., a small graph H), what is the maximum possible size of the output of the query (e.g., the number of occurrences of H in G)? The AGM bound of Atserias, Grohe and Marx [5] provides a tight upper bound of $m^{\rho(H)}$ (up to constant factors), where $\rho(H)$ is the fractional edge-cover number of H. The AGM bound applies to databases with several relations, and the fractional edge cover in question is weighted according to the sizes of the different relations. A similar bound on the number of occurrences of a hypergraph H inside a hypergraph G with m hyperedges was proved earlier by Friedgut and Kahn [22], and the bound for graphs is due to Alon [3]. Recent work of Ngo et al. [37] gave algorithms for evaluating database joins in time bounded by the worst-case output size for a database with the same number of entries. When instantiated for the subgraph enumeration problem, their result gives an algorithm for enumerating all occurrences of H in a graph G with m edges in time $O(m^{\rho(H)})$.
Our Theorem 1 is directly motivated by the connection between the subgraph counting and subgraph enumeration problems and the AGM bound. In particular, Theorem 1 provides a natural analogue of the AGM bound for estimation algorithms by stating that if the number of occurrences of H is $\#H \le m^{\rho(H)}$, then a $(1 \pm \varepsilon)$-approximation to #H can be obtained in $\tilde{O}(m^{\rho(H)}/\#H)$ time. Additionally, as we show in Section 4.3, Theorem 1 can be easily extended to the more general problem of database join size estimation (for binary relations). This problem corresponds to a subgraph counting problem in which the graphs G and H are both edge-colored and our goal is to count the number of copies of H in G with the same colors on edges. Our algorithm can also solve this problem in $\tilde{O}(m^{\rho(H)}/\#H_c)$ time, where $\#H_c$ denotes the number of copies of H with the same colors in G.
Optimality of Our Bounds. Our algorithm in Theorem 1 is optimal from different points of view. Firstly, by a lower bound of [19] (building on [16, 18]), the bounds achieved by our algorithm when H is any k-clique are optimal among all algorithms with the same query access (including the edge-sample query). In Theorem 15, we further prove a lower bound showing that for odd cycles as well, the bounds achieved by Theorem 1 are optimal. These results hence suggest that Theorem 1 is existentially optimal: there exist several natural choices for H such that Theorem 1 achieves the optimal bounds. However, there also exist choices of H for which the bounds in Theorem 1 are suboptimal. In particular, Aliakbarpour et al. [2] presented an algorithm for estimating occurrences of any star $S_\ell$ for $\ell \ge 1$ using $\tilde{O}(m/(\#S_\ell)^{1/\ell})$ queries in our query model (including edge-sample queries), which is always at least as good as our bound in Theorem 1, but potentially can be better. On the other hand, in the full version of the paper [4], we show that our current algorithm, with almost no further modification, in fact achieves this stronger bound using a different analysis.
Additionally, as we pointed out before, our algorithm can solve the more general database join size estimation problem for binary relations, or equivalently the subgraph counting problem with colors on edges. In Theorem 16, we prove that for this more general problem, our algorithm in Theorem 1 indeed achieves optimal bounds for all choices of the subgraph H.
Edge-Sample Queries. The edge-sample query that we assume is not part of the standard access model for sublinear algorithms, namely the "general graph" query model (see, e.g., [32]). Nonetheless, we find allowing for this query "natural" owing to the following factors:
Theoretical implementation. Edge sampling queries can be implemented with an $\tilde{O}(n/\sqrt{m})$ multiplicative overhead in query and time using the recent result of [20], or with an O(n) additive preprocessing time (which is still sublinear in m) by querying degrees of all vertices. Hence, we can focus on designing algorithms by allowing these queries and later replacing them by either of the above implementations in a black-box way at a certain additional cost.
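The O(n)-preprocessing option can be sketched as follows: query all degrees once, build prefix sums, then map a uniform index in [0, 2m) to a vertex and one of its neighbor slots. The class `DegreeEdgeSampler` and the callables `degree` and `neighbor` are hypothetical stand-ins for the degree and neighbor queries; this is one possible realization, not the construction of [20].

```python
import bisect
import itertools
import random


class DegreeEdgeSampler:
    """Uniform edge sampling from degree and neighbor queries only."""

    def __init__(self, vertices, degree, neighbor, seed=None):
        self._vertices = list(vertices)
        self._neighbor = neighbor
        # Preprocessing: one degree query per vertex, n queries in total.
        self._degs = [degree(v) for v in self._vertices]
        self._prefix = list(itertools.accumulate(self._degs))
        self.total = self._prefix[-1] if self._prefix else 0  # equals 2m
        self._rng = random.Random(seed)

    def edge_at(self, r):
        """Map slot r in [0, 2m) to the directed edge (v, j-th neighbor of v)."""
        idx = bisect.bisect_right(self._prefix, r)  # first vertex whose prefix sum exceeds r
        v = self._vertices[idx]
        j = r - (self._prefix[idx] - self._degs[idx])  # 0-based offset inside N(v)
        return v, self._neighbor(v, j + 1)

    def sample(self):
        """Each undirected edge owns exactly two slots, so it is returned with probability 1/m."""
        if self.total == 0:
            raise ValueError("graph has no edges")
        return self.edge_at(self._rng.randrange(self.total))
```

After preprocessing, each sample costs one binary search and one neighbor query.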
Practical implementation. Edge sampling is a common practice in analyzing social networks [34, 33] or biological networks [1]. Another scenario in which random edge sampling is possible is when we can access a random location of the memory that is used to store the graph. To quote [2]: "because edges normally take most of the space for storing graphs, an access to a random memory location where the adjacency list is stored, would readily give a random edge." Hence, assuming edge sampling queries can be considered valid in many scenarios.
Understanding the power of random edge queries. Edge sampling is a critical component of various sublinear-time algorithms for graph estimation [16, 17, 2, 18, 20]. However, except for [2], which also assumed edge-sample queries, all these other algorithms employ different workarounds for these queries. As we show in this paper, decoupling these workarounds from the rest of the algorithm by allowing edge-sample queries results in considerably simpler and more general algorithms for subgraph counting and is hence worth studying on its own. We also mention that studying the power of edge-sample queries has been cast as an open question in [19] as well.
Applications to Streaming Algorithms. Subgraph counting is also one of the most studied problems in the graph streaming model (see, e.g., [6, 28, 10, 31, 9, 27, 41, 35, 13, 7] and references therein). In this model, the edges of the input graph are presented one by one in a stream; the algorithm makes a single or a small number of passes over the stream and outputs the answer after the last pass. The goal here is to minimize the memory used by the algorithm (similar in spirit to minimizing the query complexity in the query model).
Our algorithm in Theorem 1 can be directly adapted to the streaming model, resulting in an algorithm for subgraph counting that makes O(1) passes over the stream and uses a memory of size $\tilde{O}(\min\{m, m^{\rho(H)}/\#H\})$. For the case of counting triangles and cliques, the space complexity of our algorithm matches the best known algorithms of McGregor et al. [35] and Bera and Chakrabarti [7], which are known to be optimal [7]. To the best of our knowledge, the only previous streaming algorithms for counting arbitrary subgraphs H are those of Kane et al. [31] and Bera and Chakrabarti [7] that use, respectively, one pass and $\tilde{O}(m^{2|E(H)|}/(\#H)^2)$ space, and two passes and $\tilde{O}(m^{\beta(H)}/\#H)$ space, where $\beta(H)$ is the integral edge-cover number of H. As $\rho(H) \le \beta(H) \le |E(H)|$ by definition and $\#H \le m^{\rho(H)}$ by the AGM bound, the space complexity of our algorithm is always at least as good as the ones in [31, 7] but potentially can be much smaller.
1.2 Main Ideas in Our Algorithm
Our starting point is the AGM bound, which implies that the number of "potential copies" of H in G is at most $m^{\rho(H)}$. Our goal of estimating #H then translates to counting how many of these potential copies form an actual copy of H in G. A standard approach at this point is the Monte Carlo method: sample a potential copy of H in G uniformly at random and check whether it forms an actual copy of H or not; a simple exercise in concentration inequalities then implies that we only need $O(m^{\rho(H)}/\#H)$ many independent samples to get a good estimate of #H.
This approach, however, immediately runs into a technical difficulty. Given only query access to G, it is not at all clear how to sample a potential copy of H from the list of all potential copies. Our first task is then to design a procedure for sampling potential copies of H from G. In order to do so, we again consider the AGM bound and the optimal fractional edge-cover that is used to derive this bound. We first prove a simple structural result stating that an optimal fractional edge-cover of H can be supported only on edges that form a disjoint union of odd cycles and stars (in H). This allows us to decompose H into a collection of odd cycles and stars and treat any arbitrary subgraph H as a collection of these simpler subgraphs that are suitably connected together.
The above decomposition reduces the task of sampling a potential copy of H to sampling a collection of odd cycles and stars. Sampling an odd cycle $C_{2k+1}$ on 2k + 1 edges is as follows: sample k edges $e_1, \ldots, e_k$ uniformly at random from G; pick one of the endpoints of $e_1$ and sample a vertex v from the neighborhood of this endpoint uniformly at random. With some additional care, one can show that the tuple $(e_1, \ldots, e_k, v)$ sampled here is enough to identify an odd cycle of length 2k + 1 uniquely. To sample a star $S_\ell$ with $\ell$ petals, we sample a vertex v from G with probability proportional to its degree (by sampling a random edge and picking one of the two endpoints uniformly), and then sample $\ell$ vertices $w_1, \ldots, w_\ell$ from the neighborhood of v. Again, with some care, this allows us to sample a potential copy of a star $S_\ell$. We remark that these sampling procedures are related to sampling triangles in [16] and stars in [2]. Finally, to sample a potential copy of H, we simply sample all its odd cycles and stars in the decomposition using the method above. We should note right away, however, that this does not result in a uniformly random sample of potential copies of H, as various parameters of the graph G, in particular degrees of vertices, alter the probability of sampling each potential copy.
The next and paramount step is then how to use the samples above to estimate the value of #H. Obtaining an unbiased estimator of #H based on these samples is not hard, as we can identify the probability with which each potential copy is sampled in this process (which is a function of degrees of vertices of the potential copy in G) and reweigh each sample accordingly. Nevertheless, the variance of a vanilla variant of this sampling and reweighing approach is quite large for our purpose. To fix this, we use an idea similar to that of [16] for counting triangles: sample a "partial" potential copy of H first and fix it; sample multiple "extensions" of this partial potential copy to a complete potential copy and use the average of estimates based on each extension to reduce the variance. More concretely, this translates to sampling multiple copies of the first cycle for the decomposition and, for each sampled cycle, recursively sampling multiple copies of the remainder of H as specified by the decomposition. A careful analysis of this recursive process, which is the main technical part of the paper, allows us to bound the variance of the estimator by $O(m^{\rho(H)}) \cdot (\#H)$. Repeating such an estimator $O(m^{\rho(H)}/\#H)$ times independently and taking the average value then gives us a $(1 \pm \varepsilon)$-approximation to #H by a simple application of Chebyshev's inequality.
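The final Chebyshev step can be spelled out as a short calculation. In the sketch below, Z denotes one run of the unbiased estimator, $\bar{Z}$ the average of t independent runs, and c the constant hidden in the variance bound; the symbols Z, $\bar{Z}$, t, c and the target failure probability 1/3 are notation assumed here for illustration.

```latex
% Z: one run of the estimator, with E[Z] = \#H and Var[Z] \le c \, m^{\rho(H)} \cdot \#H.
% \bar{Z}: average of t independent runs of Z.
\begin{align*}
\operatorname{Var}[\bar{Z}]
  &= \frac{\operatorname{Var}[Z]}{t}
  \le \frac{c \, m^{\rho(H)} \cdot \#H}{t}, \\
\Pr\big( |\bar{Z} - \#H| \ge \varepsilon \cdot \#H \big)
  &\le \frac{\operatorname{Var}[\bar{Z}]}{\varepsilon^{2} (\#H)^{2}}
  \le \frac{c \, m^{\rho(H)}}{t \, \varepsilon^{2} \cdot \#H}
  \le \frac{1}{3}
  \quad \text{whenever } t \ge \frac{3c}{\varepsilon^{2}} \cdot \frac{m^{\rho(H)}}{\#H}.
\end{align*}
```

A constant success probability of this kind is typically boosted to high probability by taking the median of $O(\log n)$ independent averages, which is consistent with the poly(log n, 1/ε) factors hidden by the $\tilde{O}$ notation.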
1.3 Further Related Work
In addition to the previous work in [16, 18, 2] that is already discussed above, sublinear-time algorithms for estimating subgraph counts and related parameters such as average degree and degree distribution moments have also been studied in [21, 24, 25, 17]. Similarly, sublinear-time algorithms have also been studied for estimating other graph parameters such as the weight of the minimum spanning tree [15, 12, 14] or the size of a maximum matching or a minimum vertex cover [40, 38, 43, 26, 39] (this is by no means a comprehensive summary of previous results).
Subgraph counting has also been studied extensively in the graph streaming model (see, e.g., [6, 28, 10, 31, 9, 27, 41, 35, 13, 7, 30, 29] and references therein). In this model, the edges of the input graph are presented one by one in a stream; the algorithm makes a single or a small number of passes over the stream and outputs the answer after the last pass. The goal in this model is to minimize the memory used by the algorithm, similar in spirit to minimizing the query complexity in our query model. However, streaming algorithms typically require reading the entire graph in the stream, which is different from our goal in sublinear-time algorithms.
2 Preliminaries
Notation. For any integer $t \ge 1$, we let $[t] := \{1, \ldots, t\}$. For any event E, $\mathbb{I}(E) \in \{0, 1\}$ is an indicator denoting whether E happened or not. For a graph G(V, E), V(G) := V denotes the vertices and E(G) := E denotes the edges. For a vertex $v \in V$, N(v) denotes the neighbors of v, and $d_v := |N(v)|$ denotes the degree of v.
To any edge e = {u, v} in G, we assign two directed edges $\vec{e}_1 = (u, v)$ and $\vec{e}_2 = (v, u)$, called the directed copies of e, and let $\vec{E}$ be the set of all these directed edges. We also fix a total ordering $\prec$ on vertices whereby for any two vertices $u, v \in V$, $u \prec v$ iff $d_u < d_v$, or $d_u = d_v$ and u appears before v in the lexicographic order. To avoid confusion, we use letters a, b and c to denote the vertices in the subgraph H, and letters u, v and w to denote the vertices of G.
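The degree-based ordering can be sketched as a sort key. The helper names `order_key` and `precedes` are illustrative, and the sketch assumes vertex labels are mutually comparable (e.g., all integers or all strings), which stands in for the lexicographic tie-break.

```python
def order_key(v, degree):
    """Sort key realizing the ordering: smaller degree first, ties broken by label."""
    return (degree(v), v)


def precedes(u, v, degree):
    """Return True iff u comes strictly before v in the degree-based ordering."""
    return order_key(u, degree) < order_key(v, degree)
```

Deciding whether u precedes v this way costs two degree queries, which is why the samplers can orient an edge from its lower-degree endpoint cheaply.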
We use the following standard variant of Chebyshev's inequality.
Proposition 2. For any random variable X and integer $t \ge 1$, $\Pr(|X - \mathbb{E}[X]| \ge t) \le \frac{\operatorname{Var}[X]}{t^2}$.
We also recall the law of total variance, which states that for two random variables X and Y,
$$\operatorname{Var}[Y] = \mathbb{E}_x\big[\operatorname{Var}[Y \mid X = x]\big] + \operatorname{Var}_x\big[\mathbb{E}[Y \mid X = x]\big]. \tag{1}$$
Assumption on Size of Subgraph H. Throughout the paper, we assume that the size of the subgraph H is a fixed constant independent of the size of the graph G and hence we suppress the dependency on the size of H in various bounds in our analysis using O-notation.
3 A Graph Decomposition Using Fractional Edge-Covers
In this section, we give a simple decomposition of the subgraph H using fractional edge-covers. We start by defining fractional edge-covers formally (see also Figure 1).
Definition 3 (Fractional Edge-Cover Number). A fractional edge-cover of $H(V_H, E_H)$ is a mapping $\psi : E_H \to [0, 1]$ such that for each vertex $a \in V_H$, $\sum_{e \in E_H, a \in e} \psi(e) \ge 1$. The fractional edge-cover number $\rho(H)$ of H is the minimum value of $\sum_{e \in E_H} \psi(e)$ among all fractional edge-covers $\psi$.
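[Figure 1, panel (a): The subgraph H.]
$$\rho(H) = \text{minimize } \sum_{e \in E_H} x_e \quad \text{subject to } \sum_{e \in E_H : a \in e} x_e \ge 1 \ \ \forall a \in V_H, \qquad x_e \ge 0 \ \ \forall e \in E_H. \tag{2}$$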
Lemma 4. Any subgraph H admits an optimal fractional edge-cover $x^*$ such that the support of $x^*$, denoted by $\mathrm{supp}(x^*)$, is a collection of vertex-disjoint odd cycles and star graphs, and,
1. for every odd cycle $C \in \mathrm{supp}(x^*)$, $x^*_e = 1/2$ for all $e \in C$;
2. for every edge $e \in \mathrm{supp}(x^*)$ that does not belong to any odd cycle, $x^*_e = 1$.
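Since Lemma 4 guarantees an optimal cover with weights in {0, 1/2, 1}, the fractional edge-cover number of a small H can be found by exhaustive search over these three values. The sketch below is a brute-force illustration (exponential in the number of edges of H, which is a constant here); the function name and the input format are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product


def fractional_edge_cover_number(vertices, edges):
    """Return (rho(H), cover) by exhaustive search over half-integral weights.

    Weights are stored in units of 1/2, so h in {0, 1, 2} means x_e = h/2.
    Restricting the search to these values relies on Lemma 4.
    """
    best_total, best_halves = None, None
    for halves in product((0, 1, 2), repeat=len(edges)):
        load = dict.fromkeys(vertices, 0)
        for (a, b), h in zip(edges, halves):
            load[a] += h
            load[b] += h
        # Feasible iff every vertex receives total weight at least 1 (two halves).
        if all(x >= 2 for x in load.values()):
            total = sum(halves)
            if best_total is None or total < best_total:
                best_total, best_halves = total, halves
    if best_total is None:
        raise ValueError("no edge cover exists (H has an isolated vertex)")
    cover = {e: Fraction(h, 2) for e, h in zip(edges, best_halves)}
    return Fraction(best_total, 2), cover
```

For a triangle, for example, the search settles on weight 1/2 for each edge, matching the odd-cycle case of Lemma 4.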
3.1 The Decomposition
We now present the decomposition of H using Lemma 4. Let $x^*$ be an optimal fractional edge-cover as in Lemma 4, let $C_1, \ldots, C_o$ be the odd cycles in the support of $x^*$, and let $S_1, \ldots, S_s$ be the stars. We define $D(H) := \{C_1, \ldots, C_o, S_1, \ldots, S_s\}$ as the decomposition of H (see Figure 1 for an illustration).
For every $i \in [o]$, let the length of the odd cycle $C_i$ be $2k_i + 1$ (i.e., $C_i = C_{2k_i+1}$); we define $\rho^C_i := k_i + 1/2$. Similarly, for every $j \in [s]$, let the number of petals in $S_j$ be $\ell_j$ (i.e., $S_j = S_{\ell_j}$); we define $\rho^S_j := \ell_j$. By Lemma 4,
$$\rho(H) = \sum_{i=1}^{o} \rho^C_i + \sum_{j=1}^{s} \rho^S_j. \tag{3}$$
Recall that by the AGM bound, the total number of copies of H possible in G is at most $m^{\rho(H)}$. We also use the following simple lemma, which is a direct corollary of the AGM bound.
Lemma 5. Let $I := \{i_1, \ldots, i_o\}$ and $J := \{j_1, \ldots, j_s\}$ be subsets of [o] and [s], respectively. Suppose $\tilde{H}$ is the subgraph of H on vertices of the odd cycles $C_{i_1}, \ldots, C_{i_o}$ and stars $S_{j_1}, \ldots, S_{j_s}$. Then the total number of copies of $\tilde{H}$ in G is at most $m^{\rho(\tilde{H})}$ for $\rho(\tilde{H}) \le \sum_{i \in I} \rho^C_i + \sum_{j \in J} \rho^S_j$.
Proof. Let $x^*$ denote the optimal solution of LP (2) used in D(H). Define $y^*$ as the projection of $x^*$ to edges present in $\tilde{H}$. It is easy to see that $y^*$ is a feasible solution for LP (2) of $\tilde{H}$ with value $\sum_{i \in I} \rho^C_i + \sum_{j \in J} \rho^S_j$. The lemma now follows from the AGM bound for $\tilde{H}$. ∎
3.2 Profiles of Cycles, Stars, and Subgraphs
We conclude this section by specifying the representation of the potential occurrences of the subgraph H in G based on the decomposition D(H).
Odd cycles: We represent a potential occurrence of an odd cycle $C_{2k+1}$ in G as follows. Let $\mathbf{e} = (\vec{e}_1, \ldots, \vec{e}_k) \in \vec{E}^k$ be an ordered tuple of k directed copies of edges in G and suppose $\vec{e}_i := (u_i, v_i)$ for all $i \in [k]$. Define $u^*_{\mathbf{e}} = u_1$ and let w be any vertex in $N(u^*_{\mathbf{e}})$. We refer to any such collection $(\mathbf{e}, w)$ as a profile of $C_{2k+1}$ in G. We say that "the profile $(\mathbf{e}, w)$ forms a cycle $C_{2k+1}$ in G" iff (i) $u_1$ is the smallest vertex on the cycle according to $\prec$, (ii) $v_1 \prec w$, and (iii) the edges $(u_1, v_1), (v_1, u_2), \ldots, (u_k, v_k), (v_k, w), (w, u_1)$ all exist in G and hence there is a copy of $C_{2k+1}$ on vertices $\{u_1, v_1, u_2, v_2, \ldots, u_k, v_k, w\}$ in G. Note that under this definition and our definition of $\#C_{2k+1}$, each copy of $C_{2k+1}$ corresponds to exactly one profile $(\mathbf{e}, w)$ and vice versa. As such,
$$\#C_{2k+1} = \sum_{\mathbf{e} \in \vec{E}^k} \sum_{w \in N(u^*_{\mathbf{e}})} \mathbb{I}\big((\mathbf{e}, w) \text{ forms a cycle } C_{2k+1} \text{ in } G\big). \tag{4}$$
Stars: We represent a potential occurrence of a star $S_\ell$ in G by $(v, \mathbf{w})$ where v is the center of the star and $\mathbf{w} = (w_1, \ldots, w_\ell)$ are the $\ell$ petals. We refer to $(v, \mathbf{w})$ as a profile of $S_\ell$ in G. We say that "the profile $(v, \mathbf{w})$ forms a star $S_\ell$ in G" iff (i) $|\mathbf{w}| > 1$, or (ii) $(\ell =)\ |\mathbf{w}| = 1$ and $v \prec w_1$; in both cases there is a copy of $S_\ell$ on vertices $v, w_1, \ldots, w_\ell$. Under this definition, each copy of $S_\ell$ corresponds to exactly one profile $(v, \mathbf{w})$. As such,
$$\#S_\ell = \sum_{v \in V} \sum_{\mathbf{w} \in N(v)^\ell} \mathbb{I}\big((v, \mathbf{w}) \text{ forms a star } S_\ell \text{ in } G\big). \tag{5}$$
Arbitrary subgraphs: We represent a potential occurrence of H in G by an (o + s)-tuple $R := ((\mathbf{e}_1, w_1), \ldots, (\mathbf{e}_o, w_o), (v_1, \mathbf{w}_1), \ldots, (v_s, \mathbf{w}_s))$ where $(\mathbf{e}_i, w_i)$ is a profile of the cycle $C_i$ in D(H) and $(v_j, \mathbf{w}_j)$ is a profile of the star $S_j$. We refer to R as a profile of H and say that "the profile R forms a copy of H in G" iff (i) each profile forms a corresponding copy of $C_i$ or $S_j$ in D(H), and (ii) the remaining edges of H between vertices specified by R are all present in G (note that by definition of the decomposition D(H), all vertices of H are specified by R). As such,
$$\#H = \sum_{R} \mathbb{I}\big(R \text{ forms a copy of } H \text{ in } G\big) \cdot f(H), \tag{6}$$
for a fixed constant f(H) depending only on H as defined below. Let $\pi : V_H \to V_H$ be an automorphism of H. Let $C_1, \ldots, C_o, S_1, \ldots, S_s$ denote the cycles and stars in the decomposition of H. We say that $\pi$ is decomposition preserving if for every $i = 1, \ldots, o$ cycle $C_i$ is mapped to a cycle of the same length and for every $i = 1, \ldots, s$ star $S_i$ is mapped to a star with the same number of petals. Let the number of decomposition preserving automorphisms of H be denoted by Z, and define f(H) = 1/Z. Define the quantity $\widetilde{\#H} := \sum_{R} \mathbb{I}(R \text{ forms a copy of } H \text{ in } G)$, which is equal to #H modulo the scaling factor f(H). It is immediate that estimating #H and $\widetilde{\#H}$ are equivalent to each other and hence in the rest of the paper, with a slight abuse of notation, we use #H and $\widetilde{\#H}$ interchangeably.
4 A Sublinear-Time Algorithm for Subgraph Counting
We now present our sublinear-time algorithm for approximately counting the number of occurrences of any given subgraph H in an underlying graph G and prove Theorem 1. The main component of our algorithm is an unbiased estimator random variable for #H with low variance. The algorithm in Theorem 1 is then obtained by simply repeating this unbiased estimator in parallel a sufficient number of times (based on the variance) and outputting the average value of these estimators.
4.1 A Low-variance Unbiased Estimator for #H
We present a low-variance unbiased estimator for #H in this section. Our algorithm is a sampling-based algorithm. In the following, we first introduce two separate subroutines for sampling odd cycles (odd-cycle-sampler) and stars (star-sampler), and then use these components in conjunction with the decomposition we introduced in Section 3 to present our full algorithm. We should clarify right away that odd-cycle-sampler and star-sampler are not exactly sampling a cycle or a star, but rather sampling a set of vertices and edges (in a non-uniform way) that can potentially form a cycle or star in G, i.e., they sample a profile of these subgraphs as defined in Section 3.2.
The odd-cycle-sampler Algorithm
We start with the following algorithm for sampling an odd cycle $C_{2k+1}$ for some $k \ge 1$. This algorithm outputs a simple data structure, named the cycle-sampler tree, that provides a convenient representation of the samples taken by our algorithm (see Definition 6 immediately after the description of the algorithm). This data structure can be easily avoided when designing a cycle counting algorithm, but will be quite useful for reasoning about the recursive structure of our sampling algorithm for general graphs H.
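Algorithm 1 odd-cycle-sampler(G, $C_{2k+1}$).
1. Sample k directed edges $\mathbf{e} := (\vec{e}_1, \ldots, \vec{e}_k)$ uniformly at random (with replacement) from G with the constraint that for $\vec{e}_1 = (u_1, v_1)$, $u_1 \prec v_1$.
2. Let $u^*_{\mathbf{e}} := u_1$ and let $d^*_{\mathbf{e}} := d_{u^*_{\mathbf{e}}}$.
3. For i = 1 to $t_{\mathbf{e}} := \lceil d^*_{\mathbf{e}}/\sqrt{m} \rceil$: Sample a vertex $w_i$ uniformly at random from $N(u^*_{\mathbf{e}})$.
4. Let $\mathbf{w} := (w_1, \ldots, w_{t_{\mathbf{e}}})$. Return the cycle-sampler tree $T(\mathbf{e}, \mathbf{w})$ (see Definition 6).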
Definition 6 (Cycle-Sampler Tree). The cycle-sampler tree $T(\mathbf{e}, \mathbf{w})$ for the tuple $(\mathbf{e}, \mathbf{w})$ sampled by odd-cycle-sampler(G, $C_{2k+1}$) is the following 2-level tree T:
Each node $\alpha$ of the tree contains two attributes: label[$\alpha$], which consists of some of the edges and vertices in $(\mathbf{e}, \mathbf{w})$, and an integer value[$\alpha$].
For the root $\alpha_r$ of T, label[$\alpha_r$] := $\mathbf{e}$ and value[$\alpha_r$] := $(2m)^k/2$ (value[$\alpha_r$] is equal to the inverse of the probability that $\mathbf{e}$ is sampled by odd-cycle-sampler).
The root $\alpha_r$ has $t_{\mathbf{e}}$ child-nodes in T for the parameter $t_{\mathbf{e}} = \lceil d^*_{\mathbf{e}}/\sqrt{m} \rceil$ (consistent with Line (3) of odd-cycle-sampler(G, $C_{2k+1}$) above).
For the i-th child-node $\alpha_i$ of the root, $i \in [t_{\mathbf{e}}]$, label[$\alpha_i$] := $w_i$ and value[$\alpha_i$] := $d^*_{\mathbf{e}}$ (value[$\alpha_i$] is equal to the inverse of the probability that $w_i$ is sampled by odd-cycle-sampler, conditioned on $\mathbf{e}$ being sampled).
Moreover, for each root-to-leaf path $P_i := (\alpha_r, \alpha_i)$ (for $i \in [t_{\mathbf{e}}]$), define label[$P_i$] := label[$\alpha_r$] $\cup$ label[$\alpha_i$] and value[$P_i$] := value[$\alpha_r$] $\cdot$ value[$\alpha_i$] (label[$P_i$] is a profile of the cycle $C_{2k+1}$ as defined in Section 3.2).
odd-cycle-sampler can be implemented in our query model by using k edge-sample queries (and picking the correct direction for $\vec{e}_1$ based on $\prec$ and one of the two directions uniformly at random for the other edges) in Line (1), two degree queries in Line (2), and one neighbor query in Line (3). This results in O(k) queries in total for one iteration of the for-loop in Line (3). As such, the total query complexity of odd-cycle-sampler is $O(t_{\mathbf{e}})$ (recall that k is a constant). It is also straightforward to verify that we can compute the cycle-sampler tree T of an execution of odd-cycle-sampler with no further queries and in $O(t_{\mathbf{e}})$ time. We bound the query complexity of this algorithm by bounding the expected number of iterations in the for-loop. The proof is postponed to the full version [4].
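Algorithm 1 and the tree of Definition 6 can be sketched together as follows. The function name, the dictionary layout of the returned tree, and the in-memory inputs `edges` (a list of vertex pairs) and `adj` (vertex to neighbor list) are assumptions of this sketch; in the query model, the lookups would be edge-sample, degree, and neighbor queries.

```python
import math
import random


def odd_cycle_sampler(edges, adj, k, rng):
    """Return a cycle-sampler tree for C_{2k+1} as a dict (illustrative layout)."""
    m = len(edges)

    def key(x):
        # The degree-based ordering from Section 2 (labels break ties).
        return (len(adj[x]), x)

    # Line 1: k uniform edges; orient e_1 from its smaller endpoint,
    # and pick a uniformly random direction for the remaining edges.
    e = []
    for i in range(k):
        a, b = rng.choice(edges)
        flip = key(b) < key(a) if i == 0 else rng.random() < 0.5
        e.append((b, a) if flip else (a, b))

    # Line 2: the start vertex u* and its degree d*.
    u_star = e[0][0]
    d_star = len(adj[u_star])

    # Line 3: t_e = ceil(d* / sqrt(m)) uniform neighbors of u*.
    t_e = math.ceil(d_star / math.sqrt(m))
    w = [rng.choice(adj[u_star]) for _ in range(t_e)]

    # Line 4: root value (2m)^k / 2, child values d* (inverse sampling probabilities).
    root = {"label": tuple(e), "value": (2 * m) ** k // 2}
    children = [{"label": w_i, "value": d_star} for w_i in w]
    return {"root": root, "children": children}
```

Each root-to-leaf path then carries the value `root["value"] * child["value"]`, the inverse of the probability of sampling that particular profile.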
Lemma 7. For the parameter $t_{\mathbf{e}}$ in Line (3) of odd-cycle-sampler, $\mathbb{E}[t_{\mathbf{e}}] = O(1)$.
We now define a process for estimating the number of odd cycles in a graph using the information stored in the cycle-sampler tree and the odd-cycle-sampler algorithm. While we do not use this process in a black-box way in our main algorithm, abstracting it out makes the analysis of our main algorithm simpler to follow and more transparent, and serves as a warm-up for our main algorithm.
Warm-up: An Estimator for Odd Cycles. Let T := odd-cycle-sampler(G, $C_{2k+1}$) be the output of an invocation of odd-cycle-sampler. Note that the cycle-sampler tree T is a random variable depending on the randomness of odd-cycle-sampler. We define the random variable $X_i$ such that $X_i$ := value[$P_i$] for the i-th root-to-leaf path iff label[$P_i$] forms a copy of $C_{2k+1}$ in G and otherwise $X_i := 0$ (according to the definition of Section 3.2). We further define $Y := \frac{1}{t_{\mathbf{e}}} \cdot \sum_{i=1}^{t_{\mathbf{e}}} X_i$ (note that $t_{\mathbf{e}}$ is also a random variable). Our estimator algorithm can compute the value of these random variables using the information stored in the tree T plus additional O(k) = O(1) queries for each of the $t_{\mathbf{e}}$ root-to-leaf paths $P_i$ to detect whether $(\mathbf{e}, w_i)$ forms a copy of $C_{2k+1}$ or not. Thus, the query complexity and runtime of the estimator is still $O(t_{\mathbf{e}})$ (which in expectation is O(1) by Lemma 7). The expectation and variance of the estimator can be bounded as follows (the proof is in the full version [4]).
Lemma 8. For the random variable Y associated with odd-cycle-sampler(G, $C_{2k+1}$),
$$\mathbb{E}[Y] = (\#C_{2k+1}), \qquad \operatorname{Var}[Y] \le (2m)^k \sqrt{m} \cdot \mathbb{E}[Y].$$
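The warm-up estimator can be sketched end to end. The sketch assumes in-memory inputs `edges` (list of vertex pairs) and `adj` (vertex to neighbor list); all function names are illustrative. `count_cycle_profiles` enumerates every profile and is only a sanity check for tiny graphs: each profile is sampled with probability 1/value[P], so the count of cycle-forming profiles equals E[Y].

```python
import itertools
import math
import random


def forms_cycle(adj, e, w):
    """Check conditions (i)-(iii) of Section 3.2 for the profile (e, w)."""
    def key(x):
        return (len(adj[x]), x)  # degree-based ordering, labels break ties

    cycle = [x for edge in e for x in edge] + [w]  # u1, v1, ..., uk, vk, w
    if len(set(cycle)) != len(cycle):  # a copy needs 2k+1 distinct vertices
        return False
    u1, v1 = e[0]
    if min(cycle, key=key) != u1 or not key(v1) < key(w):
        return False
    n = len(cycle)
    return all(cycle[(i + 1) % n] in adj[cycle[i]] for i in range(n))


def odd_cycle_estimate(edges, adj, k, rng):
    """One draw of the estimator Y for the number of copies of C_{2k+1}."""
    m = len(edges)

    def key(x):
        return (len(adj[x]), x)

    e = []
    for i in range(k):
        a, b = rng.choice(edges)
        flip = key(b) < key(a) if i == 0 else rng.random() < 0.5
        e.append((b, a) if flip else (a, b))
    u_star = e[0][0]
    d_star = len(adj[u_star])
    t_e = math.ceil(d_star / math.sqrt(m))
    path_value = ((2 * m) ** k // 2) * d_star  # value[P_i], identical for every i
    hits = sum(forms_cycle(adj, e, rng.choice(adj[u_star])) for _ in range(t_e))
    return path_value * hits / t_e


def count_cycle_profiles(edges, adj, k):
    """Exact number of profiles that form a cycle (exhaustive, tiny graphs only)."""
    directed = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    return sum(
        forms_cycle(adj, e, w)
        for e in itertools.product(directed, repeat=k)
        for w in adj[e[0][0]]
    )
```

Averaging many independent calls to `odd_cycle_estimate` should concentrate around the exact count, in line with Lemma 8.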
The star-sampler Algorithm
We now give an algorithm for sampling a star $S_\ell$ with $\ell$ petals. Similar to odd-cycle-sampler, this algorithm also outputs a simple data structure, named the star-sampler tree, that provides a convenient representation of the samples taken by our algorithm (see Definition 9, immediately after the description of the algorithm). This data structure can be easily avoided when designing a star counting algorithm, but will be quite useful for reasoning about the recursive structure of our sampling algorithm for general graphs H.
Algorithm 2 star-sampler(G, $S_\ell$).
1. Sample a vertex $v \in V$ chosen with probability proportional to its degree in G (i.e., for any vertex $u \in V$, $\Pr(u \text{ is chosen as the vertex } v) = d_u/2m$).
2. Sample $\ell$ vertices $\mathbf{w} := (w_1, \ldots, w_\ell)$ from N(v) uniformly at random (without replacement).
3. Return the star-sampler tree $T(v, \mathbf{w})$ (see Definition 9).
Definition 9 (Star-Sampler Tree). The star-sampler tree $T(v, \mathbf{w})$ for the tuple $(v, \mathbf{w})$ sampled by star-sampler(G, $S_\ell$) is the following 2-level tree T (with the same attributes as in Definition 6) with only two nodes:
For the root $\alpha_r$ of T, label[$\alpha_r$] := v and value[$\alpha_r$] := $2m/d_v$ (value[$\alpha_r$] is equal to the inverse of the probability that v is sampled by star-sampler).
The root $\alpha_r$ has exactly one child-node $\alpha_l$ in T with label[$\alpha_l$] = $\mathbf{w} = (w_1, \ldots, w_\ell)$ and value[$\alpha_l$] = $d_v^{\ell}$ (value[$\alpha_l$] is equal to the inverse of the probability that $\mathbf{w}$ is sampled by star-sampler, conditioned on v being sampled).
Moreover, for the root-to-leaf path $P := (\alpha_r, \alpha_l)$, we define label[P] := label[$\alpha_r$] $\cup$ label[$\alpha_l$] and value[P] := value[$\alpha_r$] $\cdot$ value[$\alpha_l$] (label[P] is a representation of the star $S_\ell$ as defined in Section 3.2).
star-sampler can be implemented in our query model by using one edge-sample query
in Line (1) and then picking one of the endpoints uniformly at random, a degree query to
determine the degree of $v$, and $\ell$ neighbor queries in Line (2), resulting in $O(\ell)$ queries in
total. It is also straightforward to verify that we can compute the star-sampler tree $T$ of an
execution of star-sampler with no further queries and in $O(1)$ time.
We again define a process for estimating the number of stars in a graph using the
information stored in the star-sampler tree and the star-sampler algorithm, as a warm-up
to our main result in the next section.
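The query-model implementation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is held in memory as an adjacency dictionary `adj` with comparable (e.g. integer) vertex names, the edge-sample query is simulated by listing all edges (which is of course not sublinear), the leaf value is read as $\binom{d_v}{\ell}$, and the handling of a vertex with fewer than $\ell$ neighbors is our own convention. The name `star_sampler` and its return format are hypothetical.

```python
import random
from math import comb

def star_sampler(adj, ell, rng=random):
    """One run of a star-sampler-style procedure on an in-memory graph.

    adj maps each vertex to the list of its neighbors (simple undirected graph).
    Returns (v, w, value_root, value_leaf), mirroring the two-node tree.
    """
    # Edge-sample query, simulated here by listing every edge exactly once.
    edges = [(u, x) for u in adj for x in adj[u] if u < x]
    m = len(edges)
    # A uniform edge followed by a uniform endpoint picks v w.p. d_v / 2m.
    v = rng.choice(rng.choice(edges))
    d_v = len(adj[v])                   # degree query
    value_root = 2 * m / d_v            # inverse of Pr[v is sampled]
    if d_v < ell:
        # Assumed convention: v cannot be the center of a star with ell petals.
        return v, None, value_root, 0
    w = tuple(rng.sample(adj[v], ell))  # ell neighbor queries, no repeats
    value_leaf = comb(d_v, ell)         # inverse of Pr[this petal set | v]
    return v, w, value_root, value_leaf
```

For example, on the star with center 0 and leaves 1, 2, 3, calling `star_sampler(adj, 2)` returns the center with root value 2 and leaf value 3, or a leaf with root value 6 and no petals.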
Warm-up: An Estimator for Stars. The star-sampler tree $T$ is a random variable depending
on the randomness of star-sampler. We define the random variable $X$ such that $X :=
\mathrm{value}[P]$ for the root-to-leaf path $P$ of $T$ iff $\mathrm{label}[P]$ forms a copy of $S_\ell$ in $G$, and otherwise
$X := 0$. Our estimator algorithm can compute the value of this random variable using only
the information stored in the tree $T$ with no further queries to the graph (by simply checking
if all $w_i$'s in $w$ are distinct). As such, the query complexity and runtime of the estimator
algorithm is still $O(1)$. The proof of the following lemma is postponed to the full version [4].
▶ Lemma 10. For the random variable $X$ associated with star-sampler$(G, S_\ell)$,
\[
\mathbb{E}[X] = (\#S_\ell), \qquad \mathrm{Var}[X] \le 2m^{\ell} \cdot \mathbb{E}[X].
\]
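The proof is deferred to the full version; the following short calculation is only a sketch of why the expectation comes out as claimed. It assumes that $\mathrm{value}[\alpha_l] = \binom{d_v}{\ell}$ and that a copy of $S_\ell$ is identified with a center $v$ together with an unordered set $w$ of $\ell$ petals. The indicator notation $\mathbf{1}[\cdot]$ is ours.

```latex
\begin{aligned}
\mathbb{E}[X]
  &= \sum_{v \in V} \frac{d_v}{2m}
     \sum_{\substack{w \subseteq N(v) \\ |w| = \ell}} \binom{d_v}{\ell}^{-1}
     \cdot \underbrace{\frac{2m}{d_v} \cdot \binom{d_v}{\ell}}_{\mathrm{value}[P]}
     \cdot \mathbf{1}\big[(v, w) \text{ forms a copy of } S_\ell\big] \\
  &= \sum_{v \in V} \sum_{\substack{w \subseteq N(v) \\ |w| = \ell}}
     \mathbf{1}\big[(v, w) \text{ forms a copy of } S_\ell\big]
   \;=\; (\#S_\ell).
\end{aligned}
```

Each sampling probability is cancelled by the matching value on the path, which is the same mechanism used level by level in the proof of Lemma 12.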
The Estimator Algorithm for Arbitrary Subgraphs
We now present our main estimator for the number of occurrences of an arbitrary subgraph
$H$ in $G$, denoted by $(\#H)$. Recall the decomposition $D(H) := \{C_1, \ldots, C_o, S_1, \ldots, S_s\}$ of $H$
introduced in Section 3. Our algorithm creates a subgraph-sampler tree $T$ (a generalization
of cycle-sampler and star-sampler trees in Definitions 6 and 9) and uses it to estimate $(\#H)$.
We define the subgraph-sampler tree $T$ and the algorithm subgraph-sampler$(G, H)$ that
creates it simultaneously:
Subgraph-Sampler Tree. The subgraph-sampler tree $T$ is a $z$-level tree for $z := (2o + 2s)$
returned by subgraph-sampler$(G, H)$. The algorithm constructs $T$ as follows.
Sampling Odd Cycles. In subgraph-sampler$(G, H)$, we run odd-cycle-sampler$(G, C_1)$
and initiate $T$ to be its output cycle-sampler tree. For every (current) leaf-node $\alpha$ of $T$,
we run odd-cycle-sampler$(G, C_2)$ independently to obtain a cycle-sampler tree $T_\alpha$ (we say
that $\alpha$ started the sampling of $T_\alpha$). We then extend the tree $T$ with two new layers by
connecting each leaf-node $\alpha$ to the root of the tree $T_\alpha$ whose sampling it started. This creates a
4-level tree $T$. We continue like this for $o$ steps, each time appending the tree obtained by
odd-cycle-sampler$(G, C_j)$ for $j \in [o]$ to the (previous) leaf-node that started this sampling.
This results in a $(2o)$-level tree. Note that the nodes in the tree $T$ can have different degrees,
as the number of leaf-nodes in a cycle-sampler tree is not always the same
(not even for two different trees associated with one single $C_j$ through different calls to
odd-cycle-sampler$(G, C_j)$).
Sampling Stars. Once we have iterated over all odd cycles of $D(H)$, we switch to processing
the stars $S_1, \ldots, S_s$. The approach is identical to the previous part. Let $\alpha$ be a (current) leaf-node
of $T$. We run star-sampler$(G, S_1)$ to obtain a star-sampler tree $T_\alpha$ and connect $\alpha$ to $T_\alpha$ to
extend the tree by 2 more levels. We continue like this for $s$ steps, each time appending
the tree obtained by star-sampler$(G, S_j)$ for $j \in [s]$ to the (former) leaf-node that started
this sampling. This results in a $z$-level tree $T$. Note that all nodes added when sampling
stars have exactly one child-node (except for the leaf-nodes) since, by Definition 9, star-sampler
trees always contain only two nodes.
Labels and Values. Each node $\alpha$ of $T$ is again given two attributes, $\mathrm{label}[\alpha]$ and $\mathrm{value}[\alpha]$,
which are defined to be exactly the same attributes as in the corresponding cycle-sampler or
star-sampler tree that was used to define these nodes (recall that each node of $T$ is "copied"
from a node in either a cycle-sampler or a star-sampler tree). Finally, for each
root-to-leaf path $P$ in $T$, we define $\mathrm{label}[P] := \bigcup_{\alpha \in P} \mathrm{label}[\alpha]$ and $\mathrm{value}[P] := \prod_{\alpha \in P} \mathrm{value}[\alpha]$. In
particular, $\mathrm{label}[P] := ((e_1, w_1), \ldots, (e_o, w_o), (v_1, w_1), \ldots, (v_s, w_s))$ by definition of labels of
cycle-sampler and star-sampler trees. As such, $\mathrm{label}[P]$ is a representation of the subgraph $H$
as defined in Section 3.2. By making $O(1)$ additional pair queries to query all the remaining
edges of this representation of $H$, we determine if $\mathrm{label}[P]$ forms a copy of $H$.
This concludes the description of subgraph-sampler$(G, H)$ and its output
subgraph-sampler tree $T$. We bound the query complexity of the algorithm in the following lemma
(the proof is postponed to the full version [4]).
▶ Lemma 11. The expected query complexity/running time of subgraph-sampler is $O(1)$.
We are now ready to present our estimator algorithm using subgraph-sampler and the
subgraph-sampler tree $T$ it outputs.
An Estimator for Arbitrary Subgraphs. Note that, as before, the subgraph-sampler tree
$T$ itself is a random variable depending on the randomness of subgraph-sampler. For
any root-to-leaf path $P_i := \alpha_1, \ldots, \alpha_z$ of $T$, we define the random variable $X_i$ such that
$X_i := \mathrm{value}[P_i]$ iff $\mathrm{label}[P_i]$ forms a copy of $H$ in $G$, and otherwise $X_i := 0$. We further
define $Y := \big(\frac{1}{t} \sum_{i=1}^{t} X_i\big)$, where $t$ is the number of leaf-nodes of $T$ (which itself is a random
variable). These random variables can all be computed from $T$ and subgraph-sampler with
at most $O(1)$ further pair queries for each root-to-leaf path $P$ of the tree to determine whether
$\mathrm{label}[P]$ indeed forms a copy of $H$ in $G$. As such, the query complexity and runtime of this
algorithm is proportional to that of subgraph-sampler (which in expectation is $O(1)$ by Lemma 11).
In the following two lemmas, we show that $Y$ is a low-variance unbiased estimator of $(\#H)$.
Notation. For any node $\alpha$ in $T$, we use $T_\alpha$ to denote the subtree of $T$ rooted at $\alpha$. For a
leaf-node $\alpha$, we define a random variable $Y_\alpha$ which is $\mathrm{value}[\alpha]$ iff for the root-to-leaf path $P$
ending in $\alpha$, $\mathrm{label}[P]$ forms a copy of $H$ in $G$, and otherwise $Y_\alpha$ is 0. For an internal node $\alpha$ in
$T$ with $t$ child-nodes $\alpha_1, \ldots, \alpha_t$, we define $Y_\alpha = \mathrm{value}[\alpha] \cdot \big(\frac{1}{t} \cdot \sum_{i=1}^{t} Y_{\alpha_i}\big)$. It is easy to verify
that $Y_{\alpha_r}$ for the root $\alpha_r$ of $T$ is the same as the estimator random variable $Y$ defined earlier.
Furthermore, for a node $\alpha$ in level $\ell$ of $T$, we define $L_\alpha := (\mathrm{label}[\alpha_1], \mathrm{label}[\alpha_2], \ldots, \mathrm{label}[\alpha_{\ell-1}])$,
where $\alpha_1, \ldots, \alpha_{\ell-1}$ forms the path from the root of $T$ to the parent of $\alpha$.
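The recursive definition of $Y_\alpha$ in the Notation paragraph can be sketched as a bottom-up evaluation of a sampler tree. The `Node` class below is a hypothetical in-memory stand-in for the tree: it stores only $\mathrm{value}[\alpha]$ and, on leaves, a precomputed flag `forms_copy` that replaces the $O(1)$ pair queries used to test whether the root-to-leaf path forms a copy of $H$.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    value: float                        # value[alpha]
    children: List["Node"] = field(default_factory=list)
    forms_copy: Optional[bool] = None   # leaves only: does label[P] form a copy of H?

def estimate(node: Node) -> float:
    """Return Y_alpha for the subtree rooted at `node`."""
    if not node.children:
        # Leaf: value[alpha] if the root-to-leaf path is a copy of H, else 0.
        return node.value if node.forms_copy else 0.0
    # Internal node: value[alpha] times the average of the children's Y.
    avg = sum(estimate(child) for child in node.children) / len(node.children)
    return node.value * avg
```

For instance, a root of value 4 with two children of value 2, each having one leaf of value 3 of which only the first closes a copy of $H$, evaluates to $4 \cdot (2 \cdot 3 + 0)/2 = 12$.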
We analyze the expected value and the variance of the estimator.
▶ Lemma 12. For $Y$ in subgraph-sampler$(G, H)$, $\mathbb{E}[Y] = (\#H)$.
Proof. We prove this inductively by showing that for any node $\alpha$ in an odd layer of $T$,
$\mathbb{E}[Y_\alpha \mid L_\alpha] = (\#H \mid L_\alpha)$, where $(\#H \mid L_\alpha)$ denotes the number of copies of $H$ in $G$ that
contain the vertices and edges specified by $L_\alpha$ (according to the decomposition $D(H)$).
$\mathbb{E}[Y_\alpha \mid L_\alpha]$ measures the value of $Y_\alpha$ after we fix the rest of the tree $T$ and let the subtree
$T_\alpha$ be chosen randomly as in subgraph-sampler.
The base case of the induction, i.e., for nodes in the last odd layer of $T$, follows exactly
as in the proofs of Lemmas 8 and 10 (as will also become evident shortly) and hence we do
not repeat it here. We now prove the induction step. Fix a node $\alpha$ in an odd layer $\ell$.
We consider two cases based on whether $\ell < 2o$ (hence $\alpha$ is the root of a cycle-sampler tree) or
$\ell > 2o$ (hence $\alpha$ is the root of a star-sampler tree).
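Case of $\ell < 2o$. In this case, the subtree $T_\alpha$ in the next two levels is a cycle-sampler tree, and
\[
\begin{aligned}
\mathbb{E}[Y_\alpha \mid L_\alpha]
&= \sum_{e} \Pr(\mathrm{label}[\alpha] = e) \cdot \mathrm{value}[\alpha] \cdot \Big( \frac{1}{t_e} \sum_{i=1}^{t_e} \mathbb{E}[Y_{\alpha_i} \mid L_\alpha, e] \Big)
&& \text{(here, the $\alpha_i$'s are the child-nodes of $\alpha$)} \\
&= \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \mathbb{E}[Y_{\alpha_i} \mid L_\alpha, e]
&& \text{(as by definition, $\mathrm{value}[\alpha] = \Pr(\mathrm{label}[\alpha] = e)^{-1}$).}
\end{aligned}
\]
Note that each $\alpha_i$ has exactly one child-node, denoted by $\beta_i$. As such,
\[
\begin{aligned}
\mathbb{E}[Y_\alpha \mid L_\alpha]
&= \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \mathbb{E}[Y_{\alpha_i} \mid L_\alpha, e] \\
&= \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \sum_{w} \Pr(\mathrm{label}[\alpha_i] = w) \cdot \mathrm{value}[\alpha_i] \cdot \mathbb{E}[Y_{\beta_i} \mid L_\alpha, e, w] \\
&= \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \sum_{w} \mathbb{E}[Y_{\beta_i} \mid L_{\beta_i}]
&& \text{(by definition, $\mathrm{value}[\alpha_i] = \Pr(\mathrm{label}[\alpha_i] = w)^{-1}$ and $L_{\beta_i} = L_\alpha, (e, w)$)} \\
&= \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \sum_{w} (\#H \mid L_{\beta_i})
 = \sum_{e} \frac{1}{t_e} \sum_{i=1}^{t_e} \sum_{w} (\#H \mid L_\alpha, (e, w))
&& \text{(by induction hypothesis for the odd-layer nodes $\beta_i$)} \\
&= \sum_{e} \sum_{w} (\#H \mid L_\alpha, (e, w)) = (\#H \mid L_\alpha).
\end{aligned}
\]
This concludes the proof of the induction step in this case.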
Case of $\ell > 2o$. In this case, the subtree $T_\alpha$ in the next two levels is a star-sampler tree.
By the same analogy made in the proof of the previous part and Lemma 8, the proof of
this part also follows directly from the proof of Lemma 10 for star-sampler trees.
We can now finalize the proof of Lemma 12 by noting that for the root $\alpha_r$ of $T$, $L_{\alpha_r}$ is
the empty set and hence $\mathbb{E}[Y] = \mathbb{E}[Y_{\alpha_r} \mid L_{\alpha_r}]$, which by induction is equal to $(\#H)$. ◀
▶ Lemma 13. For $Y$ in subgraph-sampler$(G, H)$, $\mathrm{Var}[Y] = O(m^{\rho(H)}) \cdot \mathbb{E}[Y]$.
Proof. We bound $\mathrm{Var}[Y]$ using a similar inductive proof as in Lemma 12. Recall the
parameters $\rho^C_1, \ldots, \rho^C_o$ and $\rho^S_1, \ldots, \rho^S_s$ associated respectively with the cycles $C_1, \ldots, C_o$ and
stars $S_1, \ldots, S_s$ of the decomposition $D(H)$. For simplicity of notation, for any $i \in [o + s]$,
we define $\rho_{i+}$ as follows:
\[
\begin{aligned}
\text{for all } i \le o, \quad & \rho_{i+} := \sum_{j=i}^{o} \rho^C_j + \sum_{j=1}^{s} \rho^S_j, \\
\text{for all } o < i \le o + s, \quad & \rho_{i+} := \sum_{j=i-o}^{s} \rho^S_j.
\end{aligned}
\]
We inductively show that, for any node $\alpha$ in an odd layer $2\ell - 1$ of $T$,
\[
\mathrm{Var}[Y_\alpha \mid L_\alpha] \le 2^{2z-2\ell} \cdot m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha),
\]
where $(\#H \mid L_\alpha)$ denotes the number of copies of $H$ in $G$ that contain the vertices and edges
specified by $L_\alpha$ (according to the decomposition $D(H)$).
The induction is from the leaf-nodes of the tree to the root. The base case of the induction,
i.e., for nodes in the last odd layer of $T$, follows exactly as in the proofs of Lemmas 8 and 10
(as will also become evident shortly) and hence we do not repeat it here. We now prove the
induction step. Fix a node $\alpha$ in an odd layer $2\ell - 1$. We consider two cases based
on whether $\ell \le o$ (hence $\alpha$ is the root of a cycle-sampler tree) or $\ell > o$ (hence $\alpha$ is the root of a
star-sampler tree).
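Case of $\ell \le o$. In this case, the subtree $T_\alpha$ in the next two levels is a cycle-sampler tree
corresponding to the odd cycle $C_\ell$ of $D(H)$. Let the number of edges in $C_\ell$ be $(2k + 1)$
(i.e., $C_\ell = C_{2k+1}$), and let $e$ denote the label of $\alpha$. By the law of total variance in Eq. (1),
\[
\mathrm{Var}[Y_\alpha \mid L_\alpha] = \mathbb{E}\big[\mathrm{Var}[Y_\alpha \mid e] \mid L_\alpha\big] + \mathrm{Var}\big[\mathbb{E}[Y_\alpha \mid e] \mid L_\alpha\big]. \tag{7}
\]
We start by bounding the second term in Eq. (7), which is easier. By the inductive proof
of Lemma 12, we also have $\mathbb{E}[Y_\alpha \mid L_\alpha, e] = (\#H \mid L_\alpha, e)$. As such,
\[
\begin{aligned}
\mathrm{Var}\big[\mathbb{E}[Y_\alpha \mid e] \mid L_\alpha\big]
&= \mathrm{Var}\big[(\#H \mid L_\alpha, e) \mid L_\alpha\big] \le \mathbb{E}\big[(\#H \mid L_\alpha, e)^2 \mid L_\alpha\big] \\
&= \sum_{e} \Pr(\mathrm{label}[\alpha] = e) \cdot (\#H \mid L_\alpha, e)^2 = \frac{1}{m^k} \sum_{e} (\#H \mid L_\alpha, e)^2
&& \text{($\Pr(\mathrm{label}[\alpha] = e) = 1/m^k$ by definition of odd-cycle-sampler)} \\
&\le \frac{1}{m^k} \Big( \sum_{e} (\#H \mid L_\alpha, e) \Big)^2 = \frac{1}{m^k} (\#H \mid L_\alpha)^2 \\
&\le m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha).
\end{aligned} \tag{8}
\]
The reason behind the last inequality is that $(\#H \mid L_\alpha)$ is at most equal to the number of
copies of the subgraph of $H$ consisting of $C_\ell, \ldots, C_o, S_1, \ldots, S_s$, which by Lemma 5 is at
most $m^{\rho_{\ell+}}$ by definition of $\rho_{\ell+}$. We now bound the first and the main term in Eq. (7),
\[
\begin{aligned}
\mathbb{E}\big[\mathrm{Var}[Y_\alpha \mid e] \mid L_\alpha\big]
&= \sum_{e} \Pr(\mathrm{label}[\alpha] = e) \cdot \mathrm{Var}[Y_\alpha \mid e, L_\alpha] \\
&= \sum_{e} \frac{1}{m^k} \cdot m^{2k} \cdot \frac{1}{t_e^2} \cdot \sum_{i=1}^{t_e} \mathrm{Var}[Y_{\alpha_i} \mid e, L_\alpha],
\end{aligned}
\]
where the final equality holds because the $Y_{\alpha_i}$'s are independent conditioned on $e, L_\alpha$ and
since $Y_\alpha$ is by definition $m^k$ times the average of the $Y_{\alpha_i}$'s. Moreover, note that the distributions
of all the $Y_{\alpha_i}$'s are the same. Hence, by canceling the terms,
\[
\mathbb{E}\big[\mathrm{Var}[Y_\alpha \mid e] \mid L_\alpha\big] = m^k \sum_{e} \frac{1}{t_e} \cdot \mathrm{Var}[Y_{\alpha_1} \mid e, L_\alpha]. \tag{9}
\]
We thus only need to bound $\mathrm{Var}[Y_{\alpha_1} \mid e, L_\alpha]$. Recall that $\alpha_1$ corresponds to a leaf-node
in a cycle-sampler tree and hence its label is a vertex $w$ from the neighborhood of $u^*_e$
as defined in odd-cycle-sampler. We again use the law of total variance in Eq. (1) to
obtain
\[
\mathrm{Var}[Y_{\alpha_1} \mid e, L_\alpha] = \mathbb{E}\big[\mathrm{Var}[Y_{\alpha_1} \mid w] \mid e, L_\alpha\big] + \mathrm{Var}\big[\mathbb{E}[Y_{\alpha_1} \mid w] \mid e, L_\alpha\big]. \tag{10}
\]
For the first term,
\[
\begin{aligned}
\mathbb{E}\big[\mathrm{Var}[Y_{\alpha_1} \mid w] \mid e, L_\alpha\big]
&= \sum_{w \in N(u^*_e)} \Pr(\mathrm{label}[\alpha_1] = w) \cdot \mathrm{Var}[Y_{\alpha_1} \mid w, e, L_\alpha] \\
&= \sum_{w} \frac{1}{d^*_e} \cdot (d^*_e)^2 \cdot \mathrm{Var}[Y_{\beta_1} \mid w, e, L_\alpha],
\end{aligned}
\]
where $\beta_1$ is the unique child-node of $\alpha_1$ and so $Y_{\alpha_1} = \mathrm{value}[\alpha_1] \cdot Y_{\beta_1}$, while conditioned
on $e$, $\mathrm{value}[\alpha_1] = d^*_e$. Moreover, as $L_{\beta_1} = (L_\alpha, e, w)$, and by canceling the terms,
\[
\begin{aligned}
\mathbb{E}\big[\mathrm{Var}[Y_{\alpha_1} \mid w] \mid e, L_\alpha\big]
&= \sum_{w} d^*_e \cdot \mathrm{Var}[Y_{\beta_1} \mid L_{\beta_1}] \\
&\le \sum_{w} d^*_e \cdot 2^{2z-2\ell-2} \cdot m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}),
\end{aligned} \tag{11}
\]
where the inequality is by the induction hypothesis for the odd-level node $\beta_1$. We now bound
the second term in Eq. (10) as follows,
\[
\begin{aligned}
\mathrm{Var}\big[\mathbb{E}[Y_{\alpha_1} \mid w] \mid e, L_\alpha\big]
&\le \mathbb{E}\big[ (\mathbb{E}[Y_{\alpha_1} \mid w])^2 \mid e, L_\alpha \big] \\
&= \sum_{w} \Pr(\mathrm{label}[\alpha_1] = w) \cdot \big(\mathbb{E}[Y_{\alpha_1} \mid w, e, L_\alpha]\big)^2 \\
&= \sum_{w} \frac{1}{d^*_e} \cdot (d^*_e)^2 \cdot \big(\mathbb{E}[Y_{\beta_1} \mid w, e, L_\alpha]\big)^2 \\
&= \sum_{w} d^*_e \cdot \big(\mathbb{E}[Y_{\beta_1} \mid L_{\beta_1}]\big)^2 \\
&= \sum_{w} d^*_e \cdot (\#H \mid L_{\beta_1})^2 \\
&\le \sum_{w} d^*_e \cdot m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}).
\end{aligned} \tag{12}
\]
Here, the second to last equality holds by the inductive proof of Lemma 12, and the
last inequality is because $(\#H \mid L_{\beta_1}) \le m^{\rho_{(\ell+1)+}}$ by Lemma 5, as $(\#H \mid L_{\beta_1})$ is at most
equal to the total number of copies of a subgraph of $H$ on $C_{\ell+1}, \ldots, C_o, S_1, \ldots, S_s$ (and
by definition of $\rho_{(\ell+1)+}$). We now plug in Eq. (11) and Eq. (12) in Eq. (10),
\[
\mathrm{Var}[Y_{\alpha_1} \mid e, L_\alpha] \le \sum_{w} d^*_e \cdot \Big( 2^{2z-2\ell-2} \cdot m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}) + m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}) \Big).
\]
We now in turn plug this in Eq. (9),
\[
\begin{aligned}
\mathbb{E}\big[\mathrm{Var}[Y_\alpha \mid e] \mid L_\alpha\big]
&\le m^k \sum_{e} \frac{1}{t_e} \sum_{w} d^*_e \cdot \Big( 2^{2z-2\ell-2} \cdot m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}) + m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1}) \Big) \\
&\le m^k \sqrt{m} \cdot \sum_{e} \sum_{w} 2^{2z-2\ell-1} \cdot m^{\rho_{(\ell+1)+}} \cdot (\#H \mid L_{\beta_1})
&& \text{(as $t_e \ge d^*_e/\sqrt{m}$)} \\
&\le 2^{2z-2\ell-1} \cdot m^{\rho_{\ell+}} \cdot \sum_{e} \sum_{w} (\#H \mid L_{\beta_1})
&& \text{(as $\rho^C_\ell = k + 1/2$ and $\rho_{\ell+} = \rho^C_\ell + \rho_{(\ell+1)+}$ by definition)} \\
&= 2^{2z-2\ell-1} \cdot m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha)
&& \text{(as $L_{\beta_1} = (L_\alpha, e, w)$).}
\end{aligned}
\]
Finally, by plugging in this and Eq. (8) in Eq. (7),
\[
\begin{aligned}
\mathrm{Var}[Y_\alpha \mid L_\alpha] &\le 2^{2z-2\ell-1} \cdot m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha) + m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha) \\
&\le 2^{2z-2\ell} \cdot m^{\rho_{\ell+}} \cdot (\#H \mid L_\alpha),
\end{aligned}
\]
finalizing the proof of the induction step in this case. We again remark that this proof closely
follows the proof for the variance of the estimator for the cycle-sampler tree in Lemma 8.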
Case of $\ell > o$. In this case, the subtree $T_\alpha$ in the next two levels is a star-sampler tree. By
the same analogy made in the proof of the previous case and Lemma 8, the proof of this
part also follows the proof of Lemma 10 for star-sampler trees. We hence omit the details.
To conclude, we have that $\mathrm{Var}[Y] = \mathrm{Var}[Y_{\alpha_r} \mid L_{\alpha_r}] = O(m^{\rho(H)}) \cdot (\#H) = O(m^{\rho(H)}) \cdot \mathbb{E}[Y]$,
as $Y = Y_{\alpha_r}$ for the root $\alpha_r$ of $T$, $L_{\alpha_r} = \emptyset$, $(\#H) = \mathbb{E}[Y]$ by Lemma 12, and $z = O(1)$. ◀
4.2 An Algorithm for Estimating Occurrences of Arbitrary Subgraphs
We now use our estimator algorithm from the previous section to design our algorithm for
estimating the occurrences of an arbitrary subgraph H in G. In the following theorem, we
assume that the algorithm has knowledge of m and also a lower bound on the value of #H;
these assumptions can be lifted easily as we describe afterwards.
▶ Theorem 14. There exists a sublinear-time algorithm that uses degree, neighbor, pair, and
edge-sample queries and, given a precision parameter $\varepsilon \in (0, 1)$, explicit access to a
constant-size graph $H(V_H, E_H)$, query access to the input graph $G(V, E)$, the number of edges $m$ in
$G$, and a lower bound $h \le \#H$, with high probability outputs a $(1 \pm \varepsilon)$-approximation to $\#H$
using $O\big(\min\big\{m, \frac{m^{\rho(H)}}{h} \cdot \frac{\log n}{\varepsilon^2}\big\}\big)$ queries and $O\big(\frac{m^{\rho(H)}}{h} \cdot \frac{\log n}{\varepsilon^2}\big)$ time, in the worst case.
Proof. Fix a sufficiently large constant $c > 0$. We run subgraph-sampler$(G, H)$
$k := \frac{c \cdot m^{\rho(H)}}{\varepsilon^2 \cdot h}$ times independently in parallel to obtain estimates $Y_1, \ldots, Y_k$ and let $Z := \frac{1}{k} \sum_{i=1}^{k} Y_i$.
By Lemma 12, $\mathbb{E}[Z] = (\#H)$. Since the $Y_i$'s are independent, we also have
\[
\mathrm{Var}[Z] = \frac{1}{k^2} \sum_{i=1}^{k} \mathrm{Var}[Y_i] \le \frac{1}{k} \cdot O(m^{\rho(H)}) \cdot \mathbb{E}[Z] \le \frac{\varepsilon^2}{10} \cdot \mathbb{E}[Z]^2,
\]
by Lemma 13, and by choosing the constant $c$ sufficiently larger than the constant in the
$O$-notation of this lemma, together with the fact that $h \le (\#H) = \mathbb{E}[Z]$. By Chebyshev's
inequality (Proposition 2), $\Pr\big(|Z - \mathbb{E}[Z]| \ge \varepsilon \cdot \mathbb{E}[Z]\big) \le \frac{\mathrm{Var}[Z]}{\varepsilon^2 \cdot \mathbb{E}[Z]^2} \le \frac{1}{10}$, by the bound above
on the variance. This means that with probability 0.9, this algorithm outputs a
$(1 \pm \varepsilon)$-approximation of $\#H$. Moreover, the expected query complexity and running time of this
algorithm is $O(k)$ by Lemma 11, which is $O\big(\frac{m^{\rho(H)}}{\varepsilon^2}\big)$ (if $k \ge m$, we simply query all edges of
the graph and solve the problem using an offline enumeration algorithm). To extend this
result to a high-probability bound and also make the guarantee on query complexity and
runtime hold in the worst case, we simply run this algorithm $O(\log n)$ times in parallel and stop
each execution that uses more than 10 times the expected number of queries. ◀
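The averaging and amplification steps of this proof can be sketched as follows. This is a minimal illustration, not the algorithm's actual implementation: `sample_y` is a hypothetical callable returning one draw of the estimator $Y$, the default `c = 10.0` is an arbitrary placeholder for the "sufficiently large constant", and combining the $O(\log n)$ runs by their median is a standard choice that we assume here since the proof does not spell out the aggregation rule. The cutoff on runs exceeding 10 times the expected number of queries is not modeled.

```python
import math
import statistics
from typing import Callable

def num_repetitions(m: int, rho: float, eps: float, h: float, c: float = 10.0) -> int:
    """k = c * m^rho / (eps^2 * h), rounded up."""
    return math.ceil(c * m ** rho / (eps ** 2 * h))

def mean_estimate(sample_y: Callable[[], float], k: int) -> float:
    """Z = average of k independent draws of the basic estimator Y."""
    return sum(sample_y() for _ in range(k)) / k

def boosted_estimate(sample_y: Callable[[], float], k: int, runs: int) -> float:
    """Median of `runs` independent copies of Z (assumed aggregation rule)."""
    return statistics.median(mean_estimate(sample_y, k) for _ in range(runs))
```

With $m = 10$, $\rho(H) = 2$, $\varepsilon = 0.5$, $h = 5$ and $c = 10$, the first function gives $k = 10 \cdot 100 / (0.25 \cdot 5) = 800$ repetitions.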
The algorithm in Theorem 14 assumes the knowledge of $h$, which is a lower bound on
$(\#H)$. However, this assumption can be easily removed by making a geometric search on
$h$, starting from $m^{\rho(H)}/2$, which is (approximately) the largest value for $(\#H)$, all the way
down to 1 in factors of 2, and stopping the search once the estimates returned for a guess of
$h$ become consistent with $h$ itself. This only increases the query complexity and runtime of
the algorithm by $\mathrm{polylog}(n)$ factors. As this part is quite standard, we omit the details and
instead refer the interested reader to [16, 18]. This concludes the proof of our main result in
Theorem 1 from the introduction.
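The geometric search on $h$ can be sketched as below. The details are omitted in the text, so this is only one plausible reading: `estimate_with_guess` is a hypothetical callable that runs the algorithm of Theorem 14 with lower bound $h$ and returns its estimate, and "consistent" is interpreted, as an assumption, to mean that the returned estimate is at least the current guess.

```python
from typing import Callable

def geometric_search(upper: float, estimate_with_guess: Callable[[float], float]) -> float:
    """Halve the guess h until the estimate computed for h is at least h."""
    h = upper
    while h >= 1:
        z = estimate_with_guess(h)  # run the Theorem 14 algorithm with lower bound h
        if z >= h:                  # the estimate is consistent with the guess
            return z
        h /= 2
    return 0.0                      # no guess down to 1 was consistent
```

Since the guesses shrink by a factor of 2, the loop runs for at most about $\log_2(\text{upper})$ iterations, which is the source of the extra logarithmic factor.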
Extension to the Database Join Size Estimation Problem
The database join size estimation problem for binary relations can be modeled by the subgraph
estimation problem where the subgraph $H$ and the underlying graph $G$ are additionally
edge-colored and we are only interested in counting the copies of $H$ in $G$ with matching colors
on the edges. In this abstraction, the edges of the graph $G$ correspond to the entries of the
database, and the color of an edge determines the relation of the entry.
We formalize this variant of the subgraph counting problem in the following. In the
colorful subgraph estimation problem, we are given a subgraph $H(V_H, E_H)$ with a coloring
function $c_H : E_H \to \mathbb{N}$ and query access to a graph $G(V, E)$ along with a coloring function
$c_G : E \to \mathbb{N}$. The set of allowed queries to $G$ contains the degree queries, pair queries,
neighbor queries, and edge-sample queries as before, with a simple change that whenever we
query an edge (through the last three types of queries), the color of the edge according to $c_G$
is also revealed to the algorithm. Our goal is to estimate the number of copies of $H$ in $G$
with matching colors, i.e., the colorful copies of $H$.
It is immediate to verify that our algorithm in this section can be directly applied to the
colorful subgraph estimation problem, with the only difference that when testing whether a
subgraph forms a copy of $H$ in $G$, we in fact check whether this subgraph forms a colorful
copy of $H$ in $G$ instead. The analysis of this new algorithm is exactly as in the case of
the original algorithm, with the only difference that we switch the parameter $\#H$ to $\#H_c$,
which only counts the number of copies of $H$ with the same colors in $G$. To summarize, we
obtain an algorithm with $O^*\big(\frac{m^{\rho(H)}}{\#H_c}\big)$ query and time complexity for the colorful subgraph
counting problem, which in turn solves the database join size estimation problem for
binary relations.
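The only change relative to the uncolored algorithm is the test applied to each sampled candidate, which can be sketched as follows. The representation is an assumption made for illustration: `h_edges` maps each edge of $H$ to its color under $c_H$, `g_color` maps each edge of $G$ (as a `frozenset` of its two endpoints) to its color under $c_G$ and stands in for pair queries, and `mapping` is the candidate assignment of vertices of $H$ to vertices of $G$. The function name is hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Tuple

def is_colorful_copy(
    h_edges: Dict[Tuple[Hashable, Hashable], int],
    g_color: Dict[FrozenSet, int],
    mapping: Dict[Hashable, Hashable],
) -> bool:
    """Check that `mapping` sends every edge of H to an edge of G of the same color."""
    if len(set(mapping.values())) != len(mapping):
        return False                                 # images of H's vertices must be distinct
    for (a, b), color in h_edges.items():
        edge = frozenset((mapping[a], mapping[b]))   # simulated pair query on G
        if g_color.get(edge) != color:               # edge missing or wrongly colored
            return False
    return True
```

Dropping the color comparison (checking only that the edge is present) recovers the uncolored test used in the rest of this section.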
5 Lower Bounds
We present two lower bounds that demonstrate the optimality of Theorem 1 in different
scenarios. Our first lower bound establishes tight bounds for counting odd cycles.
▶ Theorem 15. For any $k \ge 1$, any algorithm $A$ that can output any
multiplicative-approximation to the number of copies of the odd cycle $C_{2k+1}$ in a given graph $G(V, E)$ with
probability at least 2/3 requires $\Omega\big(\frac{m^{k+1/2}}{\#C_{2k+1}}\big)$ queries to $G$.
Theorem 15 implies that in addition to cliques (which were previously proved [19]; see
also [16, 18]), our algorithm in Theorem 1 also achieves optimal bounds for odd cycles.
Our next lower bound targets the more general problem of database join size estimation
for which we argued that our Theorem 1 continues to hold. We show that for this more
general problem, our algorithm in Theorem 1 is in fact optimal for all choices of subgraph $H$.
▶ Theorem 16. For any subgraph $H(V_H, E_H)$ which contains at least one edge, suppose
$A$ is an algorithm for the colorful subgraph estimation problem that given $H$, a coloring
$c_H : E_H \to \mathbb{N}$, and query access to $G(V, E)$ with $m$ edges and coloring function $c_G : E \to \mathbb{N}$,
can output a multiplicative-approximation to the number of colorful copies of $H$ in $G$ with
probability at least 2/3. Then, $A$ requires $\Omega\big(\frac{m^{\rho(H)}}{\#H_c}\big)$ queries, where $\#H_c$ is the number of
colorful copies of $H$ in $G$. The lower bound continues to hold even if the number of colors
used by $c_H$ and $c_G$ is at most two.
The proofs of Theorems 15 and 16 are postponed to the full version of the paper [4].
[1] Nesreen K. Ahmed, Jennifer Neville, and Ramana Rao Kompella. Network Sampling: From Static to Streaming Graphs. TKDD, 8(2):7:1-7:56, 2013.
[2] Maryam Aliakbarpour, Amartya Shankha Biswas, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling. Algorithmica, 80(2):668-697, 2018.
[3] Noga Alon. On the number of subgraphs of prescribed type of graphs with a given number of edges. Israel Journal of Mathematics, 1981.
[4] Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling. arXiv, abs/1811.07780, 2018. arXiv:1811.07780.
[5] Albert Atserias, Martin Grohe, and Dániel Marx. Size Bounds and Query Plans for Relational Joins. In 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, pages 739-748. IEEE Computer Society, 2008. doi:10.1109/FOCS.2008.43.
[6] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 623-632, 2002.
[7] Suman K. Bera and Amit Chakrabarti. Towards Tighter Space Bounds for Counting Triangles and Other Substructures in Graph Streams. In 34th Symposium on Theoretical Aspects of Computer Science, STACS 2017, March 8-11, 2017, Hannover, Germany, pages 11:1-11:14, 2017.
[8] E. Bloedorn, N. Rothleder, D. DeBarr, and L. Rosen. Relational Graph Analysis with Real-World Constraints: An Application in IRS Tax Fraud Detection. In AAAI, 2005.
[9] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How Hard Is Counting Triangles in the Streaming Model? In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 244-254, 2013.
[10] Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26-28, 2006, Chicago, Illinois, USA, pages 253-262, 2006.
[11] S. Burt. Structural Holes and Good Ideas. The American Journal of Sociology, 110(2):349-399, 2004. doi:10.2307/3568221.
[12] Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. Approximating the Minimum Spanning Tree Weight in Sublinear Time. SIAM J. Comput., 34(6):1370-1379, 2005.
[13] Graham Cormode and Hossein Jowhari. A second look at counting triangles in graph streams (corrected). Theor. Comput. Sci., 683:22-30, 2017.
[14] Artur Czumaj, Funda Ergün, Lance Fortnow, Avner Magen, Ilan Newman, Ronitt Rubinfeld, and Christian Sohler. Approximating the Weight of the Euclidean Minimum Spanning Tree in Sublinear Time. SIAM J. Comput., 35(1):91-109, 2005.
[15] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 175-183, 2004.
[16] Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri. Approximately Counting Triangles in Sublinear Time. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 614-633, 2015.
[17] Talya Eden, Dana Ron, and C. Seshadhri. Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, pages 7:1-7:13, 2017.
[18] Talya Eden, Dana Ron, and C. Seshadhri. On approximating the number of k-cliques in sublinear time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 722-734, 2018.
[19] Talya Eden and Will Rosenbaum. Lower Bounds for Approximating Graph Parameters via Communication Complexity. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2018, August 20-22, 2018, Princeton, NJ, USA, pages 11:1-11:18, 2018.
[20] Talya Eden and Will Rosenbaum. On Sampling Edges Almost Uniformly. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA, pages 7:1-7:9, 2018.
[21] Uriel Feige. On sums of independent random variables with unbounded variance, and estimating the average degree in a graph. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 594-603, 2004.
[22] Israel Journal of Mathematics, 1998.
[23] Oded Goldreich. Introduction to Property Testing. Cambridge University Press, 2017.
[24] Oded Goldreich and Dana Ron. Approximating average parameters of graphs. Random Struct. Algorithms, 32(4):473-493, 2008.
[25] Mira Gonen, Dana Ron, and Yuval Shavitt. Counting Stars and Other Small Subgraphs in Sublinear Time. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 99-116, 2010.
[26] Avinatan Hassidim, Jonathan A. Kelner, Huy N. Nguyen, and Krzysztof Onak. Local Graph Partitions for Approximation and Testing. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 22-31, 2009.
[27] Madhav Jha, C. Seshadhri, and Ali Pinar. A space efficient streaming algorithm for triangle counting using the birthday paradox. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, pages 589-597, 2013.
[28] Hossein Jowhari and Mohammad Ghodsi. New Streaming Algorithms for Counting Triangles in Graphs. In Computing and Combinatorics, 11th Annual International Conference, COCOON 2005, Kunming, China, August 16-29, 2005, Proceedings, pages 710-716, 2005.
[29] John Kallaugher, Michael Kapralov, and Eric Price. The Sketching Complexity of Graph and Hypergraph Counting. CoRR, abs/1808.04995. To appear in FOCS 2018, 2018.
[30] John Kallaugher and Eric Price. A Hybrid Sampling Scheme for Triangle Counting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 1778-1797, 2017.
[31] Daniel M. Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting Arbitrary Subgraphs in Data Streams. In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part II, pages 598-609, 2012.
[32] Tali Kaufman, Michael Krivelevich, and Dana Ron. Tight Bounds for Testing Bipartiteness in General Graphs. SIAM J. Comput., 33(6):1441-1483, 2004.
[33] Ju-Sung Lee and Jürgen Pfeffer. Estimating Centrality Statistics for Complete and Sampled Networks: Some Approaches and Complications. In 48th Hawaii International Conference on System Sciences, HICSS 2015, Kauai, Hawaii, USA, January 5-8, 2015, pages 1686-1695, 2015.
[34] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pages 631-636, 2006.
[35] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better Algorithms for Counting Triangles in Data Streams. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 401-411, 2016.
[36] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824-827, October 2002.
[37] Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case Optimal Join Algorithms. J. ACM, 65(3):16:1-16:40, 2018.
[38] Huy N. Nguyen and Krzysztof Onak. Constant-Time Approximation Algorithms via Local Improvements. In 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, pages 327-336, 2008.
[39] Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1123-1131, 2012.
[40] Michal Parnas and Dana Ron. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci., 381(1-3):183-196, 2007.
[41] Olivia Simpson, C. Seshadhri, and Andrew McGregor. Catching the Head, Tail, and Everything in Between: A Streaming Algorithm for the Degree Distribution. In 2015 IEEE International Conference on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14-17, 2015, pages 979-984, 2015.
[42] Johan Ugander, Lars Backstrom, and Jon Kleinberg. Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 1307-1318, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee. URL: http://dl.acm.org/citation.cfm?id=2488388.2488502.
[43] Yuichi Yoshida, Masaki Yamamoto, and Hiro Ito. An improved constant-time approximation algorithm for maximum matchings. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 225-234, 2009.