Phase Transitions in Rate Distortion Theory and Deep Learning
Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-021-09546-4
Philipp Grohs (1, 2) · Andreas Klotz (1) · Felix Voigtlaender (3, 4)
Received: 29 September 2020 / Revised: 29 July 2021 / Accepted: 6 September 2021
© The Author(s) 2021
Abstract
Rate distortion theory is concerned with optimally encoding signals from a given signal
class $\mathcal{S}$ using a budget of $R$ bits, as $R \to \infty$. We say that $\mathcal{S}$ can be compressed at rate
$s$ if we can achieve an error of at most $\mathcal{O}(R^{-s})$ for encoding the given signal class; the
supremal compression rate is denoted by $s^*(\mathcal{S})$. Given a fixed coding scheme, there
usually are some elements of $\mathcal{S}$ that are compressed at a higher rate than $s^*(\mathcal{S})$ by the
given coding scheme; in this paper, we study the size of this set of signals. We show that
for certain “nice” signal classes $\mathcal{S}$, a phase transition occurs: We construct a probability
measure $\mathbb{P}$ on $\mathcal{S}$ such that for every coding scheme $\mathcal{C}$ and any $s > s^*(\mathcal{S})$, the set of
signals encoded with error $\mathcal{O}(R^{-s})$ by $\mathcal{C}$ forms a $\mathbb{P}$-null-set. In particular, our results
apply to all unit balls in Besov and Sobolev spaces that embed compactly into $L^2(\Omega)$
for a bounded Lipschitz domain $\Omega$. As an application, we show that several existing
sharpness results concerning function approximation using deep neural networks are
in fact generically sharp. In addition, we provide quantitative and non-asymptotic
bounds on the probability that a random $f \in \mathcal{S}$ can be encoded to within accuracy
$\varepsilon$ using $R$ bits. This result is subsequently applied to the problem of approximately
representing $f \in \mathcal{S}$ to within accuracy $\varepsilon$ by a (quantized) neural network with at most
$W$ nonzero weights. We show that for any $s > s^*(\mathcal{S})$ there are constants $c, C$ such that,
no matter what kind of “learning” procedure is used to produce such a network, the
probability of success is bounded from above by
$\min\bigl\{ 1,\; 2^{C \cdot W \lceil \log_2(1+W) \rceil^2 - c \cdot \varepsilon^{-1/s}} \bigr\}$.
Communicated by Francis Bach.
AK acknowledges funding from the FWF projects I 3403 and P 31887–N32.
FV acknowledges support by the German Research Foundation (DFG) in the context of the Emmy
Noether junior research group VO 2594/1–1.
✉ Felix Voigtlaender
Philipp Grohs
Andreas Klotz
1 Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
2 Research Platform Data Science, University of Vienna, Vienna, Austria
3 Department of Mathematics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany
4 Present Address: Catholic University of Eichstätt–Ingolstadt, Mathematisch–Geographische Fakultät, Ostenstraße 26, 85072 Eichstätt, Germany
Keywords Rate distortion theory · Phase transition · Approximation rates · Sobolev
spaces · Besov spaces · Neural network approximation
Mathematics Subject Classification 41A46 · 28C20 · 68P30 · 68T07
1 Introduction
Let $\mathcal{S}$ be a signal class, that is, a relatively compact subset of a Banach space $(X, \|\cdot\|_X)$.
Rate distortion theory is concerned with the question of how well the elements of $\mathcal{S}$
can be encoded using a prescribed number $R$ of bits. In many cases of interest, the
best achievable coding error scales like $R^{-s^*}$, where $s^*$ is the optimal compression
rate of the signal class $\mathcal{S}$. We show that a phase transition occurs: the set of elements
$x \in \mathcal{S}$ that can be encoded using a strictly larger exponent than $s^*$ is thin; precisely,
it is a null-set with respect to a suitable probability measure $\mathbb{P}$. Crucially, the measure
$\mathbb{P}$ is independent of the chosen coding scheme.
In order to rigorously formulate these results, we first review the needed notions of
rate distortion theory; see also [3,4,13,15]. For later use, we state the definitions here
in the setting of general Banach spaces, although our main results only focus on the
Hilbert space $L^2(\Omega)$.
1.1 A Crash Course in Rate Distortion Theory
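To formalize the notion of encoding a signal class $\mathcal{S} \subset X$, we define the set $\mathrm{Enc}^R_{\mathcal{S},X}$
of encoding/decoding pairs $(E, D)$ of code-length $R \in \mathbb{N}$ as
$$\mathrm{Enc}^R_{\mathcal{S},X} := \bigl\{ (E, D) \,:\, E : \mathcal{S} \to \{0,1\}^R \text{ and } D : \{0,1\}^R \to X \bigr\}.$$
We are interested in choosing $(E, D) \in \mathrm{Enc}^R_{\mathcal{S},X}$ so as to minimize the (maximal)
distortion $\delta_{\mathcal{S},X}(E, D) := \sup_{x \in \mathcal{S}} \|x - D(E(x))\|_X$.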
The intuition behind these definitions is that the encoder $E$ converts any signal
$x \in \mathcal{S}$ into a bitstream of code-length $R$ (i.e., consisting of $R$ bits), while the decoder
$D$ produces from a given bitstream $b \in \{0,1\}^R$ a signal $D(b) \in X$. The goal of
rate distortion theory is to determine the minimal distortion that can be achieved
by any encoder/decoder pair of code-length $R \in \mathbb{N}$. Typical results concerning the
relation between code-length and distortion are formulated in an asymptotic sense: One
assumes that for every code-length $R \in \mathbb{N}$, one is given an encoding/decoding pair
$(E_R, D_R) \in \mathrm{Enc}^R_{\mathcal{S},X}$ and then studies the asymptotic behavior of the corresponding
distortion $\delta_{\mathcal{S},X}(E_R, D_R)$ as $R \to \infty$.
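As a toy illustration of the encoder, decoder, and distortion defined above (not taken from the paper), one can take the signal class $[0,1]$ inside the Banach space $\mathbb{R}$ and use a uniform quantizer with $R$ bits. The function names `encode`, `decode`, and `max_distortion` are hypothetical, and the supremum over the signal class is only approximated on a finite grid.

```python
def encode(x: float, R: int) -> tuple:
    """Encoder E: map x in [0, 1] to R bits (index of its dyadic cell)."""
    n_cells = 2 ** R
    idx = min(int(x * n_cells), n_cells - 1)  # x = 1.0 falls into the last cell
    # Most significant bit first.
    return tuple((idx >> (R - 1 - i)) & 1 for i in range(R))


def decode(bits: tuple) -> float:
    """Decoder D: map a bitstream to the midpoint of the corresponding cell."""
    R = len(bits)
    idx = 0
    for b in bits:
        idx = (idx << 1) | b
    return (idx + 0.5) / 2 ** R


def max_distortion(R: int, n_samples: int = 10_001) -> float:
    """Approximate sup_{x in [0, 1]} |x - D(E(x))| on a uniform grid."""
    grid = (i / (n_samples - 1) for i in range(n_samples))
    return max(abs(x - decode(encode(x, R))) for x in grid)


# Distortion of the quantizer for code-lengths R = 1, ..., 8.
distortions = {R: max_distortion(R) for R in range(1, 9)}
for R, delta in distortions.items():
    print(R, delta)
```

For this quantizer the distortion is $2^{-(R+1)}$, which decays faster than any power $R^{-s}$. Such exponential decay is typical of bounded sets in finite-dimensional spaces; the polynomial rates studied here arise for infinite-dimensional classes such as balls in Besov or Sobolev spaces.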
We refer to a sequence $(E_R, D_R)_{R \in \mathbb{N}}$ of encoding/decoding pairs as a codec, so
that the set of all codecs is
$$\mathrm{Codecs}_{\mathcal{S},X} := \prod_{R \in \mathbb{N}} \mathrm{Enc}^R_{\mathcal{S},X}.$$
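For a given signal class $\mathcal{S}$ in a Banach space $X$, it is of great interest to find an
asymptotically optimal codec; that is, a sequence $(E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X}$ such
that the asymptotic decay of $\bigl(\delta_{\mathcal{S},X}(E_R, D_R)\bigr)_{R \in \mathbb{N}}$ is, in a sense, maximal. To formalize
this, for each $s \in [0, \infty)$ define the class of subsets of $X$ that admit compression rate
$s$ as
$$\mathrm{Comp}^s_X := \bigl\{ \mathcal{S} \subset X \,:\, \exists\, (E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X} : \delta_{\mathcal{S},X}(E_R, D_R) \lesssim R^{-s} \bigr\}.$$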
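For a given (bounded) signal class $\mathcal{S} \subset X$, we aim to determine the optimal compression rate for $\mathcal{S}$ in $X$, that is,
$$s^*_X(\mathcal{S}) := \sup\bigl\{ s \in [0, \infty) \,:\, \mathcal{S} \in \mathrm{Comp}^s_X \bigr\} \in [0, \infty]. \tag{1.1}$$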
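To get a numerical feeling for the exponent in (1.1), one can fit the decay rate of a finite list of distortions of a single codec on a log-log scale. The following is only a heuristic sketch: the helper `estimate_rate` and the synthetic data are hypothetical, and a finite fit for one codec can neither prove an asymptotic rate nor capture the supremum over all codecs that defines the optimal compression rate.

```python
import math


def estimate_rate(code_lengths, distortions):
    """Negated least-squares slope of log(distortion) against log(R)."""
    xs = [math.log(R) for R in code_lengths]
    ys = [math.log(d) for d in distortions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# Synthetic (hypothetical) distortions decaying exactly like 3 * R^(-1.5).
Rs = [2 ** k for k in range(4, 14)]
deltas = [3.0 * R ** -1.5 for R in Rs]

s_hat = estimate_rate(Rs, deltas)
print(f"estimated compression rate: {s_hat:.4f}")
```

On this synthetic input the fitted exponent agrees with the chosen value 1.5 up to rounding; the constant factor 3 only shifts the intercept and does not affect the slope, mirroring the fact that the definition of the compression rate ignores multiplicative constants.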
Although the calculation of the quantity $s^*_X(\mathcal{S})$ may appear daunting for a given
signal class $\mathcal{S}$, there exists in fact a large body of literature addressing this topic. A
landmark result in this area states that the JPEG2000 compression standard represents
an optimal codec for the compression of piecewise smooth signals [26]. This optimality
is typically stated more generally for the signal class $\mathcal{S} = \mathrm{Ball}\bigl(0, 1; B^{\alpha}_{p,q}(\Omega)\bigr)$, the
unit ball in the Besov space $B^{\alpha}_{p,q}(\Omega)$, considered as a subset of $X = H = L^2(\Omega)$, for
“sufficiently nice” bounded domains $\Omega \subset \mathbb{R}^d$; see [10].
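For a codec $\mathcal{C} = (E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X}$, instead of considering the maximal
distortion of $\mathcal{C}$ over the entire signal class $\mathcal{S}$, one can also measure the approximation
rate that the codec $\mathcal{C}$ achieves for each individual $x \in \mathcal{S}$. Precisely, the class of elements
with compression rate $s$ under $\mathcal{C}$ is
$$\mathcal{A}^s_{\mathcal{S},X}(\mathcal{C}) := \Bigl\{ x \in \mathcal{S} \,:\, \sup_{R \in \mathbb{N}} R^s \cdot \|x - D_R(E_R(x))\|_X < \infty \Bigr\}. \tag{1.2}$$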
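The membership condition in (1.2) can be explored numerically by tracking the running supremum of $R^s$ times the per-signal error. The sketch below uses a hypothetical error sequence $R^{-2}$ for a single signal (it is not produced by any codec from the paper), and the helper `running_sup` is likewise an illustrative name; a finite range of code-lengths can only suggest, not prove, whether the supremum is finite.

```python
def running_sup(errors, s):
    """Return sup_{R <= N} R^s * errors[R - 1] for N = 1, ..., len(errors)."""
    sups, current = [], 0.0
    for R, err in enumerate(errors, start=1):
        current = max(current, R ** s * err)
        sups.append(current)
    return sups


# Hypothetical per-signal errors ||x - D_R(E_R(x))||_X = R^(-2), R = 1..10_000.
errors = [R ** -2.0 for R in range(1, 10_001)]

below = running_sup(errors, s=1.5)  # R^(-0.5): supremum attained at R = 1
above = running_sup(errors, s=2.5)  # R^(0.5): keeps growing with R

print("s = 1.5:", below[99], below[-1])
print("s = 2.5:", above[99], above[-1])
```

For the exponent 1.5 the running supremum stays at its initial value, consistent with the signal having compression rate 1.5 under the codec, whereas for the exponent 2.5 it grows like $\sqrt{R}$, so the condition in (1.2) fails for this error sequence.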
If the signal class S is “sufficiently regular”—for instance if S is compact and convex—
then one can prove (see Proposition 3) that the following dichotomy is valid:
s < sX∗ (S) ⇒ ∃ C ∈ CodecsS ,X ∀ x ∈ S :
s > sX∗ (S) ⇒ ∀ C ∈ CodecsS ,X ∃ x∗ ∈ S :
x (...truncated)