Phase Transitions in Rate Distortion Theory and Deep Learning

Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-021-09546-4

Philipp Grohs (1,2) · Andreas Klotz (1) · Felix Voigtlaender (3,4)

Received: 29 September 2020 / Revised: 29 July 2021 / Accepted: 6 September 2021
© The Author(s) 2021

(1) Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
(2) Research Platform Data Science, University of Vienna, Vienna, Austria
(3) Department of Mathematics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany
(4) Present Address: Catholic University of Eichstätt–Ingolstadt, Mathematisch–Geographische Fakultät, Ostenstraße 26, 85072 Eichstätt, Germany

Communicated by Francis Bach.

AK acknowledges funding from the FWF projects I 3403 and P 31887–N32. FV acknowledges support by the German Research Foundation (DFG) in the context of the Emmy Noether junior research group VO 2594/1–1.

Abstract

Rate distortion theory is concerned with optimally encoding signals from a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R \to \infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of at most $O(R^{-s})$ for encoding the given signal class; the supremal compression rate is denoted by $s^*(\mathcal{S})$. Given a fixed coding scheme, there usually are some elements of $\mathcal{S}$ that are compressed at a higher rate than $s^*(\mathcal{S})$ by the given coding scheme; in this paper, we study the size of this set of signals. We show that for certain “nice” signal classes $\mathcal{S}$, a phase transition occurs: We construct a probability measure $\mathbb{P}$ on $\mathcal{S}$ such that for every coding scheme $\mathcal{C}$ and any $s > s^*(\mathcal{S})$, the set of signals encoded with error $O(R^{-s})$ by $\mathcal{C}$ forms a $\mathbb{P}$-null-set. In particular, our results apply to all unit balls in Besov and Sobolev spaces that embed compactly into $L^2(\Omega)$ for a bounded Lipschitz domain $\Omega$. As an application, we show that several existing sharpness results concerning function approximation using deep neural networks are in fact generically sharp. In addition, we provide quantitative and non-asymptotic bounds on the probability that a random $f \in \mathcal{S}$ can be encoded to within accuracy $\varepsilon$ using $R$ bits. This result is subsequently applied to the problem of approximately representing $f \in \mathcal{S}$ to within accuracy $\varepsilon$ by a (quantized) neural network with at most $W$ nonzero weights. We show that for any $s > s^*(\mathcal{S})$ there are constants $c, C$ such that, no matter what kind of “learning” procedure is used to produce such a network, the probability of success is bounded from above by

$$\min\Big\{ 1,\; 2^{\,C \cdot W \lceil \log_2(1+W) \rceil^2 \,-\, c \cdot \varepsilon^{-1/s}} \Big\}.$$

Keywords: Rate distortion theory · Phase transition · Approximation rates · Sobolev spaces · Besov spaces · Neural network approximation

Mathematics Subject Classification: 41A46 · 28C20 · 68P30 · 68T07

1 Introduction

Let $\mathcal{S}$ be a signal class, that is, a relatively compact subset of a Banach space $(X, \|\cdot\|_X)$. Rate distortion theory is concerned with the question of how well the elements of $\mathcal{S}$ can be encoded using a prescribed number $R$ of bits. In many cases of interest, the best achievable coding error scales like $R^{-s^*}$, where $s^*$ is the optimal compression rate of the signal class $\mathcal{S}$. We show that a phase transition occurs: the set of elements $x \in \mathcal{S}$ that can be encoded using a strictly larger exponent than $s^*$ is thin; precisely, it is a null-set with respect to a suitable probability measure $\mathbb{P}$. Crucially, the measure $\mathbb{P}$ is independent of the chosen coding scheme.

In order to rigorously formulate these results, we first review the needed notions of rate-distortion theory, see also [3,4,13,15]. For later use, we state the definitions here in the setting of general Banach spaces, although our main results only focus on the Hilbert space $L^2(\Omega)$.
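To make the idea of encoding with a budget of $R$ bits concrete, the following minimal Python sketch builds a toy encoder/decoder pair. It is an illustration only and not a construction from the paper: it assumes the one-dimensional signal class $[0, 1] \subset \mathbb{R}$ with the absolute value as norm, and the names `encode`, `decode` and `grid_distortion` are hypothetical. The encoder stores the index of the dyadic cell containing $x$; the decoder returns the midpoint of that cell.

```python
def encode(x: float, R: int) -> str:
    """Map x in [0, 1] to a bit string of length R (requires R >= 1)."""
    if R < 1:
        raise ValueError("code-length R must be at least 1")
    cells = 2 ** R
    # Index of the cell [idx / 2^R, (idx + 1) / 2^R) containing x;
    # the endpoint x = 1.0 is clamped into the last cell.
    idx = min(int(x * cells), cells - 1)
    return format(idx, f"0{R}b")


def decode(bits: str) -> float:
    """Map a bit string back to the midpoint of the cell it indexes."""
    R = len(bits)
    idx = int(bits, 2)
    return (idx + 0.5) / 2 ** R


def grid_distortion(R: int, samples: int = 10_001) -> float:
    """Estimate the maximal distortion sup_x |x - D(E(x))| on a finite grid.

    Assumption: the supremum over [0, 1] is replaced by a maximum over
    equally spaced sample points, so this is a lower estimate of the true
    supremum.
    """
    grid = (k / (samples - 1) for k in range(samples))
    return max(abs(x - decode(encode(x, R))) for x in grid)


if __name__ == "__main__":
    for R in (1, 2, 4, 8):
        print(R, grid_distortion(R), 2.0 ** -(R + 1))
```

For this toy pair the distortion is at most $2^{-(R+1)}$, half the width of a cell. Because the class is one-dimensional, the error decays exponentially in $R$; the polynomial decay $R^{-s^*}$ discussed above is characteristic of infinite-dimensional signal classes such as the function-space unit balls treated in the paper.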
1.1 A Crash Course in Rate Distortion Theory

To formalize the notion of encoding a signal class $\mathcal{S} \subset X$, we define the set $\mathrm{Enc}^R_{\mathcal{S},X}$ of encoding/decoding pairs $(E, D)$ of code-length $R \in \mathbb{N}$ as

$$\mathrm{Enc}^R_{\mathcal{S},X} := \big\{ (E, D) \,:\, E : \mathcal{S} \to \{0,1\}^R \ \text{and} \ D : \{0,1\}^R \to X \big\}.$$

We are interested in choosing $(E, D) \in \mathrm{Enc}^R_{\mathcal{S},X}$ so as to minimize the (maximal) distortion

$$\delta_{\mathcal{S},X}(E, D) := \sup_{x \in \mathcal{S}} \| x - D(E(x)) \|_X .$$

The intuition behind these definitions is that the encoder $E$ converts any signal $x \in \mathcal{S}$ into a bitstream of code-length $R$ (i.e., consisting of $R$ bits), while the decoder $D$ produces from a given bitstream $b \in \{0,1\}^R$ a signal $D(b) \in X$. The goal of rate distortion theory is to determine the minimal distortion that can be achieved by any encoder/decoder pair of code-length $R \in \mathbb{N}$.

Typical results concerning the relation between code-length and distortion are formulated in an asymptotic sense: One assumes that for every code-length $R \in \mathbb{N}$, one is given an encoding/decoding pair $(E_R, D_R) \in \mathrm{Enc}^R_{\mathcal{S},X}$ and then studies the asymptotic behavior of the corresponding distortion $\delta_{\mathcal{S},X}(E_R, D_R)$ as $R \to \infty$. We refer to a sequence $(E_R, D_R)_{R \in \mathbb{N}}$ of encoding/decoding pairs as a codec, so that the set of all codecs is

$$\mathrm{Codecs}_{\mathcal{S},X} := \prod_{R \in \mathbb{N}} \mathrm{Enc}^R_{\mathcal{S},X} .$$

For a given signal class $\mathcal{S}$ in a Banach space $X$, it is of great interest to find an asymptotically optimal codec; that is, a sequence $(E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X}$ such that the asymptotic decay of $\big(\delta_{\mathcal{S},X}(E_R, D_R)\big)_{R \in \mathbb{N}}$ is, in a sense, maximal. To formalize this, for each $s \in [0, \infty)$ define the class of subsets of $X$ that admit compression rate $s$ as

$$\mathrm{Comp}^s_X := \big\{ \mathcal{S} \subset X \,:\, \exists\, (E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X} \,:\, \delta_{\mathcal{S},X}(E_R, D_R) \lesssim R^{-s} \big\}.$$

For a given (bounded) signal class $\mathcal{S} \subset X$, we aim to determine the optimal compression rate for $\mathcal{S}$ in $X$, that is

$$s^*_X(\mathcal{S}) := \sup \big\{ s \in [0, \infty) \,:\, \mathcal{S} \in \mathrm{Comp}^s_X \big\} \in [0, \infty].$$
(1.1)

Although the calculation of the quantity $s^*_X(\mathcal{S})$ may appear daunting for a given signal class $\mathcal{S}$, there exists in fact a large body of literature addressing this topic. A landmark result in this area states that the JPEG2000 compression standard represents an optimal codec for the compression of piecewise smooth signals [26]. This optimality is typically stated more generally for the signal class $\mathcal{S} = \mathrm{Ball}\big(0, 1; B^{\alpha}_{p,q}(\Omega)\big)$, the unit ball in the Besov space $B^{\alpha}_{p,q}(\Omega)$, considered as a subset of $X = \mathcal{H} = L^2(\Omega)$, for “sufficiently nice” bounded domains $\Omega \subset \mathbb{R}^d$; see [10].

For a codec $\mathcal{C} = (E_R, D_R)_{R \in \mathbb{N}} \in \mathrm{Codecs}_{\mathcal{S},X}$, instead of considering the maximal distortion of $\mathcal{C}$ over the entire signal class $\mathcal{S}$, one can also measure the approximation rate that the codec $\mathcal{C}$ achieves for each individual $x \in \mathcal{S}$. Precisely, the class of elements with compression rate $s$ under $\mathcal{C}$ is

$$\mathcal{A}^s_{\mathcal{S},X}(\mathcal{C}) := \Big\{ x \in \mathcal{S} \,:\, \sup_{R \in \mathbb{N}} \big[ R^s \cdot \| x - D_R(E_R(x)) \|_X \big] < \infty \Big\}. \qquad (1.2)$$

If the signal class $\mathcal{S}$ is “sufficiently regular” (for instance, if $\mathcal{S}$ is compact and convex), then one can prove (see Proposition 3) that the following dichotomy is valid: $s < s^*_X(\mathcal{S}) \Rightarrow \exists\, \mathcal{C} \in \mathrm{Codecs}_{\mathcal{S},X} \ \forall\, x \in \mathcal{S} :$ $s > s^*_X(\mathcal{S}) \Rightarrow \forall\, \mathcal{C} \in \mathrm{Codecs}_{\mathcal{S},X} \ \exists\, x^* \in \mathcal{S} :$ x (...truncated)