Separation of scales and a thermodynamic description of feature learning in some CNNs


https://www.nature.com/articles/s41467-023-36361-y.pdf


Article https://doi.org/10.1038/s41467-023-36361-y

Inbar Seroussi¹, Gadi Naveh² & Zohar Ringel²

Received: 23 March 2022
Accepted: 25 January 2023

Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of interdependent parameters, render direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow variables that average the erratic behavior of the fast microscopic variables. Here, we identify a similar separation of scales occurring in fully trained finitely overparameterized deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only through the second cumulants (kernels) of their activations and pre-activations. Moreover, the latter fluctuate in a nearly Gaussian manner. For infinite-width DNNs, these kernels are inert, while for finite ones they adapt to the data and yield a tractable data-aware Gaussian Process. The resulting thermodynamic theory of deep learning yields accurate predictions in various settings. In addition, it provides new ways of analyzing and understanding DNNs in general.

Identifying slow or relevant variables is an essential step in analyzing large-scale non-linear systems. In the context of deep neural networks (DNNs), these should be some combinations of the individual weights that are weakly fluctuating and obey a closed set of equations. One potential set of such variables is the DNNs’ outputs themselves. Indeed, in the limit of infinitely over-parameterized DNNs these provide an elegant picture of deep learning [1–3] based on a mapping to Gaussian Processes (GPs).
However, these GP limits miss out on several qualitative aspects, such as feature learning [4,5] and the fact that real-world DNNs are not nearly as overparameterized as required for the GP description to hold [1,3,6,7]. Obtaining a useful set of slow variables for describing deep learning at finite over-parameterization is thus an important open problem in the field.

Several works provide guidelines for this search. Noting that GP limits can have surprisingly good performance [8] and that overparameterization is natural to deep learning [9,10], we are inclined to keep some elements of the GP picture. One such element is to work in function space and study pre-activations and outputs instead of weights, whose posterior distribution becomes complicated even in the GP limit [11,12]. Another element is the layer-wise composition of hidden layer kernels [13], which yields the output kernel of the GP [14]. Such a layer-wise picture is also harmonious with the idea that DNN layers should not correlate strongly, to prevent co-adaptation [15].

Recently, it was shown that in some limited settings, making the GP kernel “dynamical” or flexible, so that it adapts to the dataset, can account for differences between infinite and finite DNNs [3,16–19]. Still, the task of finding an explicit set of equations describing this flexibility in deep non-linear DNNs remains unsolved. Specifically, in the GP limit we find tractable algebraic expressions, involving only basic matrix manipulations, for the DNN’s prediction, whereas in the feature learning regime similar expressions exist only for deep linear networks [16,20] or non-linear networks with one trainable layer [21–23]. In a related manner, while the DNN’s outputs provide a complete set of slow variables in the GP limit, in the finite-width feature learning regime it is not clear which subset of variables governs the trained DNN’s behavior other than the entire set of weights.
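To make the "basic matrix manipulations" of the GP limit concrete, the following is a minimal sketch of GP regression with a kernel built by layer-wise composition. It is an illustration under stated assumptions, not the authors' code: it assumes a fully connected network with erf activations, unit weight variances and no biases, and uses the standard arcsine formula for the erf kernel. The function names (`linear_kernel`, `erf_post_kernel`, `solve`, `gp_mean`) and the toy data are hypothetical, and a plain Gaussian elimination is used only to keep the example free of dependencies.

```python
import math


def linear_kernel(X, sigma_w2=1.0):
    """First-layer pre-activation kernel: sigma_w2 * <x, z> / d (assumed scaling)."""
    d = len(X[0])
    return [[sigma_w2 * sum(a * b for a, b in zip(x, z)) / d for z in X] for x in X]


def erf_post_kernel(K):
    """Post-kernel Q = E[erf(h) erf(h')] for h ~ N(0, K), via the arcsine formula."""
    n = len(K)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            norm = math.sqrt((1 + 2 * K[i][i]) * (1 + 2 * K[j][j]))
            Q[i][j] = (2 / math.pi) * math.asin(2 * K[i][j] / norm)
    return Q


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_mean(X_train, y_train, X_test, depth=2, noise=0.1):
    """GP regression mean: Q(test, train) (Q(train, train) + noise * I)^{-1} y."""
    X = X_train + X_test          # build the kernel on the joint set, then slice
    n = len(X_train)
    K = linear_kernel(X)
    for _ in range(depth):
        # With unit weight variance, the next pre-kernel equals this post-kernel.
        K = erf_post_kernel(K)
    A = [[K[i][j] + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(A, y_train)
    return [sum(K[n + t][j] * alpha[j] for j in range(n)) for t in range(len(X_test))]


# Toy data (hypothetical): three training points and two antipodal test points.
X_train = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.8]]
y_train = [1.0, -1.0, 0.5]
X_test = [[0.5, 0.5], [-0.5, -0.5]]
pred = gp_mean(X_train, y_train, X_test, depth=2, noise=0.1)
```

In this sketch the kernel depends only on the inputs and the architecture, never on the targets, which is the sense in which the infinite-width kernels are inert. Because erf is odd, the predictions at a point and at its negative are negatives of each other.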
¹Weizmann Institute of Science, Department of Mathematics, Rehovot 7610001, Israel. ²Hebrew University, Racah Institute of Physics, Jerusalem 9190401, Israel.

Nature Communications | (2023) 14:908

[Figure 1. Left panel: GP limit, width → ∞. Right panel: feature learning, width < ∞.]

Fig. 1 | Feature learning regime versus Gaussian process infinite limit. Learning as described by our effective theory for antisymmetric activation functions. Left: for infinite width, the pre-activations ($h^{(l)}$) and the output ($f$) fluctuate according to a Gaussian distribution with fixed post-kernels $Q^{(l)}_{\infty}$ and $Q_{f,\infty}$, respectively. The complete set of slow variables consists of the outputs with fixed kernels. Right: for large but finite width and number of samples, we obtain an approximately Gaussian distribution for the pre-activations and outputs with learned pre-kernels $K^{(l)}$ and $K_f = Q_f + \sigma^2 I_n$, respectively. The outputs follow a Gaussian Process Regression (GPR) with kernel $Q_f$. The complete set of slow variables here consists of both the outputs and the pre- and post-kernels. For general activation, an additional slow variable, corresponding to the mean of the pre-activation, needs to be tracked.

In this work, we identify such slow variables and use these to derive an effective theory for deep learning capable of capturing various finite channel/width (C/N) effects (such as feature learning) in convolutional neural networks (CNNs) and fully connected neural networks (FCNs). We argue that:

1. For C, N ≫ 1, the erratic behavior of specific channels/neurons averages out and hidden layers couple to each other only through two “slow” variables per layer: the second cumulant of the pre-activations (pre-kernel), $K^{(l)}$, and the second cumulant of the activations (post-kernel), $Q^{(l)}$, of the $l$th layer.
Furthermore, for mean square error (MSE) loss, FCNs in the so-called mean-field (MF) scaling (where the last-layer weights are scaled down) or CNNs with a large read-out layer fan-in behave effectively as a GP with a data-aware kernel determined by the second cumulant of pre-activations in the penultimate layer [21]. For a comparison between the classical GP limit and the data-aware GP we find, see Fig. 1.

2. In settings where the kernels have a large density of dominant eigenvalues, the posterior (or trained) pre-activations fluctuate in a nearly Gaussian manner. Following this, we use a multivariate Gaussian variational approximation for the posterior pre-activations and derive explicit matrix equations (Equations of State) for the covariance matrices governing these pre-activations and the DNN’s predictions.

3. We identify an emergent feature learning scale (FLS) denoted by χ, proportional to the train MSE times $n^2$ over C (or N). This scale controls the difference between the finite C, N output kernel ($Q_f$) and its C, N → ∞ limit and in this sense reflects feature learning. Due to the $n^2$ factor, χ can be O(1) or larger even for C ≫ 1, e.g., for CNN architectures (see Fig. 2 panel c). The same holds, with C replaced by N, for FCNs in the MF scaling [21]. Unlike perturbation theory [3,6,23,24], our theory tracks all orders of (...truncated)