Separation of scales and a thermodynamic description of feature learning in some CNNs
Article
https://doi.org/10.1038/s41467-023-36361-y
Inbar Seroussi1, Gadi Naveh2 & Zohar Ringel2
Received: 23 March 2022
Accepted: 25 January 2023
Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of interdependent parameters, render direct microscopic analysis difficult. Under
such circumstances, a common strategy is to identify slow variables that
average the erratic behavior of the fast microscopic variables. Here, we identify
a similar separation of scales occurring in fully trained finitely overparameterized deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only
through the second cumulants (kernels) of their activations and pre-activations.
Moreover, the pre-activations fluctuate in a nearly Gaussian manner. For infinite-width
DNNs, these kernels are inert, while for finite ones they adapt to the data and
yield a tractable data-aware Gaussian Process. The resulting thermodynamic
theory of deep learning yields accurate predictions in various settings. In
addition, it provides new ways of analyzing and understanding DNNs in
general.
Identifying slow or relevant variables is an essential step in analyzing
large-scale non-linear systems. In the context of deep neural networks
(DNNs), these should be some combinations of the individual weights
that are weakly fluctuating and obey a closed set of equations. One
potential set of such variables is the DNNs’ outputs themselves.
Indeed, in the limit of infinitely over-parameterized DNNs
these provide an elegant picture of deep learning1–3 based on a
mapping to Gaussian Processes (GPs). However, these GP limits
miss out on several qualitative aspects, such as feature learning4,5
and the fact that real-world DNNs are not nearly as overparameterized as required for the GP description to hold1,3,6,7.
Obtaining a useful set of slow variables for describing deep
learning at finite over-parameterization is thus an important open
problem in the field.
Several works provide guidelines for this search. Noting that GP
limits can have surprisingly good performance8 and that overparameterization is natural to deep learning9,10, we are inclined to
keep some elements of the GP picture. One such element is to work in
function space and study pre-activations and outputs instead of weights,
whose posterior distribution becomes complicated even in the GP
limit11,12. Another element is the layer-wise composition of hidden layer
kernels13 which yields the output kernel of the GP14. Such a layer-wise
picture is also harmonious with the idea that DNN layers should not
correlate strongly, to prevent co-adaptation15. Recently, it was shown
that in some limited settings, making the GP kernel “dynamical” or
flexible, so that it adapts to the dataset, can account for differences
between infinite and finite DNNs3,16–19. Still, the task of finding an
explicit set of equations describing this flexibility in deep non-linear
DNNs remains unsolved. Specifically, while in the GP limit we find
tractable algebraic expressions for the DNN’s prediction, involving only basic matrix manipulations, in the feature learning regime similar
expressions exist only for deep linear networks16,20 or non-linear networks with
one trainable layer21–23. In a related manner, while the DNN’s outputs
provide a complete set of slow variables in the GP limit, in the finite-width feature learning regime it is not clear which subset of variables
governs the trained DNN’s behavior other than the entire set of
weights.
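As a minimal sketch of what "basic matrix manipulations" means on the GP-limit side, the following assumes a fully connected network with erf activations (an antisymmetric choice), no biases, and weight variance `sigma_w2` divided by the fan-in. These choices and all function names are illustrative assumptions, not the setup used in this work. The sketch composes kernels layer by layer using the standard arcsine formula for erf and then computes the GP regression mean with a single linear solve.

```python
import numpy as np

def erf_post_kernel(K):
    """Post-kernel Q = E[erf(h) erf(h')] for h ~ N(0, K) (arcsine formula for erf)."""
    diag = np.diag(K)
    norm = np.sqrt((1.0 + 2.0 * diag)[:, None] * (1.0 + 2.0 * diag)[None, :])
    # The ratio lies in [-1, 1] analytically; clip only guards against round-off.
    return (2.0 / np.pi) * np.arcsin(np.clip(2.0 * K / norm, -1.0, 1.0))

def infinite_width_output_kernel(X, depth=2, sigma_w2=1.0):
    """Compose kernels layer by layer: pre-kernel -> post-kernel -> next pre-kernel."""
    d = X.shape[1]
    K = sigma_w2 * X @ X.T / d          # pre-kernel of the first hidden layer
    for _ in range(depth):
        Q = erf_post_kernel(K)          # post-kernel of this layer (fixed, data-independent map)
        K = sigma_w2 * Q                # pre-kernel of the next layer (assumed scaling)
    return K                            # kernel of the output in the infinite-width limit

def gp_mean_prediction(X_train, y_train, X_test, noise_var=0.1, **kernel_args):
    """GP regression mean: Q(test, train) [Q(train, train) + noise_var * I]^{-1} y."""
    n = X_train.shape[0]
    Q = infinite_width_output_kernel(np.vstack([X_train, X_test]), **kernel_args)
    Q_train = Q[:n, :n]
    Q_cross = Q[n:, :n]
    alpha = np.linalg.solve(Q_train + noise_var * np.eye(n), y_train)
    return Q_cross @ alpha
```

In this limit the kernel map never looks at the targets `y_train`; the finite-width theory discussed here differs precisely in that the kernels become data-aware.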
In this work, we identify such slow variables and use these to
derive an effective theory for deep learning capable of capturing various finite channel/width (C/N) effects (such as feature learning) in
convolutional neural networks (CNNs) and fully connected neural
networks (FCNs). We argue that:
1Weizmann Institute of Science, Department of Mathematics, Rehovot 7610001, Israel. 2Hebrew University, Racah Institute of Physics, Jerusalem 9190401, Israel.
Nature Communications | (2023)14:908
[Fig. 1 schematic. Left panel: GP limit, width → ∞. Right panel: feature learning, width < ∞.]
Fig. 1 | Feature learning regime versus Gaussian process infinite limit. Learning
as described by our effective theory for antisymmetric activation functions. Left: for
infinite width, the pre-activations ($h^{(l)}$) and the output ($f$) fluctuate according to a
Gaussian distribution with fixed post-kernels $Q^{(l)}_{\infty}$ and $Q_{f,\infty}$, respectively. The complete
set of slow variables consists of the outputs with fixed kernels. Right: for large but finite
width and number of samples, we obtain an approximately Gaussian distribution
for the pre-activations and outputs with learned pre-kernels $K^{(l)}$ and $K_f = Q_f + \sigma^2 I_n$,
respectively. The outputs follow a Gaussian Process Regression (GPR) with kernel
$Q_f$. The complete set of slow variables here consists of the outputs and the pre- and
post-kernels. For general activations, an additional slow variable, corresponding to
the mean of the pre-activations, needs to be tracked.
1. For $C, N \gg 1$, the erratic behavior of specific channels/neurons
averages out and hidden layers couple to each other only
through two “slow” variables per layer: the second cumulant of
the pre-activations (pre-kernel), $K^{(l)}$, and the second cumulant of
the activations (post-kernel), $Q^{(l)}$, of the $l$th layer. Furthermore, for
mean square error (MSE) loss, FCNs in the so-called mean-field
(MF) scaling (where the last layer weights are scaled down) or
CNNs with a large read-out layer fan-in behave effectively as a GP
with a data-aware kernel determined by the second cumulant of
pre-activations in the penultimate layer21. For a comparison between
the classical GP limit and the data-aware GP we find, see Fig. 1.
2. In settings where the kernels have a large density of dominant
eigenvalues, the posterior (or trained) pre-activations fluctuate in
a nearly Gaussian manner. Following this, we use a multivariate
Gaussian variational approximation for the posterior pre-activations and derive explicit matrix equations (Equations of
State) for the covariance matrices governing these pre-activations
and the DNN’s predictions.
3. We identify an emergent feature learning scale (FLS), denoted by $\chi$,
proportional to the train MSE times $n^2$ over $C$ (or $N$), where $n$ is the number of training samples. This scale
controls the difference between the finite-$C, N$ output kernel ($Q_f$)
and its $C, N \to \infty$ limit and in this sense reflects feature learning.
Due to the $n^2$ factor, $\chi$ can be $O(1)$ or larger even for $C \gg 1$, e.g., for
CNN architectures (see Fig. 2 panel c). The same holds, with $C$
replaced by $N$, for FCNs in the MF scaling21. Unlike perturbation
theory3,6,23,24, our theory tracks all orders of (...truncated)