Separation of scales and a thermodynamic description of feature learning in some CNNs

Nature Communications, Feb 2023

Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent parameters, render direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow variables that average the erratic behavior of the fast microscopic variables. Here, we identify a similar separation of scales occurring in fully trained finitely over-parameterized deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only through the second cumulant (kernels) of their activations and pre-activations. Moreover, the latter fluctuates in a nearly Gaussian manner. For infinite width DNNs, these kernels are inert, while for finite ones they adapt to the data and yield a tractable data-aware Gaussian Process. The resulting thermodynamic theory of deep learning yields accurate predictions in various settings. In addition, it provides new ways of analyzing and understanding DNNs in general.



Article https://doi.org/10.1038/s41467-023-36361-y

Inbar Seroussi (1), Gadi Naveh (2) & Zohar Ringel (2)

Received: 23 March 2022
Accepted: 25 January 2023

Identifying slow or relevant variables is an essential step in analyzing large-scale non-linear systems. In the context of deep neural networks (DNNs), these should be some combinations of the individual weights that are weakly fluctuating and obey a closed set of equations. One potential set of such variables is the DNNs’ outputs themselves. Indeed, in the limit of infinitely over-parameterized DNNs these provide an elegant picture of deep learning [1–3] based on a mapping to Gaussian Processes (GPs).
However, these GP limits miss out on several qualitative aspects, such as feature learning [4,5] and the fact that real-world DNNs are not nearly as over-parameterized as required for the GP description to hold [1,3,6,7]. Obtaining a useful set of slow variables for describing deep learning at finite over-parameterization is thus an important open problem in the field.

Several works provide guidelines for this search. Noting that GP limits can have surprisingly good performance [8] and that over-parameterization is natural to deep learning [9,10], we are inclined to keep some elements of the GP picture. One such element is to work in function space and study pre-activations and outputs instead of weights, whose posterior distribution becomes complicated even in the GP limit [11,12]. Another element is the layer-wise composition of hidden layer kernels [13], which yields the output kernel of the GP [14]. Such a layer-wise picture is also harmonious with the idea that DNN layers should not correlate strongly, to prevent co-adaptation [15].

Recently, it was shown that in some limited settings, making the GP kernel “dynamical” or flexible, so that it adapts to the dataset, can account for differences between infinite and finite DNNs [3,16–19]. Still, the task of finding an explicit set of equations describing this flexibility in deep non-linear DNNs remains unsolved. Specifically, while in the GP limit we find tractable algebraic expressions, involving only basic matrix manipulations, for the DNN’s prediction, in the feature learning regime similar expressions only exist for deep linear networks [16,20] or non-linear networks with one trainable layer [21–23]. In a related manner, while the DNN’s outputs provide a complete set of slow variables in the GP limit, in the finite-width feature learning regime it is not clear which subset of variables governs the trained DNN’s behavior other than the entire set of weights.
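The layer-wise composition of kernels in the GP limit can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a fully connected network with erf activations (one example of an antisymmetric activation), no biases, and a single weight variance `sigma_w2`, and it uses the known closed form for the erf kernel (Williams, 1997). The function names `erf_post_kernel` and `gp_limit_kernels` are hypothetical.

```python
import numpy as np

def erf_post_kernel(K):
    """Post-kernel Q = E[erf(h) erf(h)^T] for h ~ N(0, K), via the erf closed form."""
    d = np.sqrt(1.0 + 2.0 * np.diag(K))
    return (2.0 / np.pi) * np.arcsin(2.0 * K / np.outer(d, d))

def gp_limit_kernels(X, depth, sigma_w2=1.0):
    """Compose kernels layer by layer for an infinite-width FCN (assumed: no biases)."""
    K = sigma_w2 * X @ X.T / X.shape[1]   # pre-kernel of the first hidden layer
    kernels = []
    for _ in range(depth):
        Q = erf_post_kernel(K)            # post-kernel of this layer
        kernels.append((K, Q))
        K = sigma_w2 * Q                  # pre-kernel of the next layer
    return kernels

# Toy usage: 5 inputs in 10 dimensions, 3 hidden layers.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))
for layer, (K, Q) in enumerate(gp_limit_kernels(X, depth=3), start=1):
    print(layer, K.shape, Q.shape)
```

In this limit each kernel is a fixed function of the previous one and does not depend on the training targets, which is the sense in which infinite-width kernels do not adapt to the data.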
(1) Weizmann Institute of Science, Department of Mathematics, Rehovot 7610001, Israel. (2) Hebrew University, Racah Institute of Physics, Jerusalem 9190401, Israel.

Nature Communications | (2023) 14:908

[Figure 1: two panels, "GP limit (width → ∞)" and "Feature learning (width < ∞)".]

Fig. 1 | Feature learning regime versus Gaussian process infinite limit. Learning as described by our effective theory for antisymmetric activation functions. Left: for infinite width, the pre-activations ($h^{(l)}$) and the output ($f$) fluctuate according to a Gaussian distribution with fixed post-kernels $Q^{(l)}_{\infty}$ and $Q_{f,\infty}$, respectively. The complete set of slow variables are the outputs with fixed kernels. Right: for large but finite width and number of samples, we obtain an approximately Gaussian distribution for the pre-activations and outputs with learned pre-kernels $K^{(l)}$ and $K_f = Q_f + \sigma^2 I_n$, respectively. The outputs follow a Gaussian Process Regression (GPR) with kernel $Q_f$. The complete set of slow variables here are both the outputs and the pre- and post-kernels. For general activation, an additional slow variable, corresponding to the mean of the pre-activation, needs to be tracked.

In this work, we identify such slow variables and use these to derive an effective theory for deep learning capable of capturing various finite channel/width ($C$/$N$) effects (such as feature learning) in convolutional neural networks (CNNs) and fully connected neural networks (FCNs). We argue that:

1. For $C, N \gg 1$, the erratic behavior of specific channels/neurons averages out and hidden layers couple to each other only through two “slow” variables per layer: the second cumulant of the pre-activations (pre-kernel), $K^{(l)}$, and the second cumulant of the activations (post-kernel), $Q^{(l)}$, of the $l$th layer.
Furthermore, for mean square error (MSE) loss, FCNs in the so-called mean-field (MF) scaling (where the last layer weights are scaled down) or CNNs with a large read-out layer fan-in behave effectively as a GP with a data-aware kernel determined by the second cumulant of pre-activations in the penultimate layer [21]. For a comparison between the classical GP limit and the data-aware GP we find, see Fig. 1.

2. In settings where the kernels have a large density of dominant eigenvalues, the posterior (or trained) pre-activations fluctuate in a nearly Gaussian manner. Following this, we use a multivariate Gaussian variational approximation for the posterior pre-activations and derive explicit matrix equations (Equations of State) for the covariance matrices governing these pre-activations and the DNN’s predictions.

3. We identify an emergent feature learning scale (FLS) denoted by $\chi$, proportional to the train MSE times $n^2$ over $C$ (or $N$). This scale controls the difference between the finite $C, N$ output kernel ($Q_f$) and its $C, N \to \infty$ limit and in this sense reflects feature learning. Due to the $n^2$ factor, $\chi$ can be $O(1)$ or larger even for $C \gg 1$, e.g., for CNN architectures (see Fig. 2 panel c). The same holds, with $C$ replaced by $N$, for FCNs in the MF scaling [21]. Unlike perturbation theory [3,6,23,24], our theory tracks all orders of (...truncated)
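
The GPR statement in the Fig. 1 caption and the scale in point 3 can be illustrated with a small sketch. It assumes the standard GP regression posterior mean with kernel Q_f and noise variance σ², which is one reading of K_f = Q_f + σ² I_n; the kernel here is a placeholder supplied by the caller, not the learned kernel of the paper. The excerpt gives χ only up to proportionality, so the hypothetical helper `unnormalized_fls` returns just the product (train MSE) · n² / width with no prefactor, and should not be read as the paper's definition.

```python
import numpy as np

def gpr_predict(Q_train, Q_test_train, y, sigma2):
    """GP regression posterior mean: Q_* (Q_f + sigma^2 I_n)^{-1} y."""
    n = Q_train.shape[0]
    K_f = Q_train + sigma2 * np.eye(n)    # K_f = Q_f + sigma^2 I_n
    alpha = np.linalg.solve(K_f, y)       # avoids forming an explicit inverse
    return Q_test_train @ alpha

def unnormalized_fls(Q_train, y, sigma2, width):
    """(train MSE) * n^2 / width; the prefactor of chi is NOT given in this excerpt."""
    n = Q_train.shape[0]
    residual = y - gpr_predict(Q_train, Q_train, y, sigma2)
    train_mse = np.mean(residual ** 2)
    return train_mse * n ** 2 / width

# Toy usage with a placeholder linear kernel on random data (an assumption).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = np.tanh(X[:, 0])
Q = X @ X.T / X.shape[1]
print(unnormalized_fls(Q, y, sigma2=0.1, width=1000))
```

Because of the n² factor, this product need not be small even when the width is much larger than one, which mirrors the remark in point 3.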


This is a preview of a remote PDF: https://www.nature.com/articles/s41467-023-36361-y.pdf
Article home page: https://www.nature.com/articles/s41467-023-36361-y

Seroussi, Inbar, Naveh, Gadi, Ringel, Zohar. Separation of scales and a thermodynamic description of feature learning in some CNNs, Nature Communications, DOI: 10.1038/s41467-023-36361-y