fromage
🧀 Pytorch code for the Fromage optimiser.
How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound distills a clear dependence on depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent, or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on stability of both the spectra and Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained as compared to Adam and SGD and is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over 10^8 parameters. Please find code and reproducibility instructions at: https://github.com/jxbz/fromage.
Suppose that a teacher wishes to assess a student’s learning. Traditionally, they will assign the student homework and track their progress. What if, instead, they could peer inside the student’s head and observe change directly in the synapses—would that not be better for everyone?
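Neural networks are usually trained by (stochastic) gradient descent. The basic premise is that gradient descent solves:
$$\Delta W = \arg\min_{\Delta W} \left[ \mathcal{L}(W) + \nabla_W \mathcal{L}(W)^\top \Delta W + \mathrm{penalty}(\Delta W) \right].$$
That is, gradient descent chooses the parameter perturbation $\Delta W$ to minimise a local linear approximation to the objective function $\mathcal{L}$, where we add the penalty to prevent $\Delta W$ from straying beyond the region where the gradient is trusted (Nocedal and Wright, 2006). For gradient descent, the penalty takes the form:
$$\mathrm{penalty}(\Delta W) = \frac{1}{2\eta} \|\Delta W\|_2^2.$$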
We refer to this model as Euclidean trust since a quadratic penalty is akin to assuming a Euclidean structure on the parameter space. We perform a theoretical analysis and experimental study to test this model and find evidence that for multilayer perceptrons, trust is lost not quadratically but rather quasi-exponentially in the perturbation size. Figure 1 illustrates the difference.
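As a small illustration of Euclidean trust, the sketch below checks numerically that the step $-\eta g$ minimises a linear model plus a quadratic penalty of the form $\|\Delta W\|_2^2 / (2\eta)$. The function names and the toy gradient are hypothetical, and the penalty form is the standard one assumed here rather than quoted from the text.

```python
def penalised_linear_model(grad, delta, eta):
    """Linear term g . delta plus Euclidean penalty ||delta||^2 / (2 * eta)."""
    linear = sum(g * d for g, d in zip(grad, delta))
    penalty = sum(d * d for d in delta) / (2.0 * eta)
    return linear + penalty


def gradient_descent_step(grad, eta):
    """Closed-form minimiser of the penalised linear model: delta = -eta * g."""
    return [-eta * g for g in grad]


grad = [0.5, -2.0, 1.5]   # toy gradient (assumption, for illustration only)
eta = 0.1
step = gradient_descent_step(grad, eta)
best = penalised_linear_model(grad, step, eta)

# Nudging any coordinate of the step should never lower the objective.
is_minimum = True
for i in range(len(step)):
    for eps in (-1e-3, 1e-3):
        trial = list(step)
        trial[i] += eps
        if penalised_linear_model(grad, trial, eta) < best:
            is_minimum = False
```

The minimum value equals $-\tfrac{\eta}{2}\|g\|_2^2$, which is the decrease that the quadratic model promises for a step of size $\eta$.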
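Our analysis exposes the following mathematical structure for the trust region of a broad family of deep neural networks with $L$ layers indexed $l = 1, \dots, L$:
$$\text{deep relative trust} := \prod_{l=1}^{L} \left(1 + \frac{\|\Delta W_l\|_F}{\|W_l\|_F}\right) - 1.$$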
Deep relative trust has two essential features: the first is a dependence on the relative magnitude of perturbations; the second is a product over the network’s layers, reflecting the product structure of the network itself. These features are both absent from Euclidean trust. In our model, relative perturbations across layers compound.
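To make the two features concrete, here is a minimal sketch that evaluates a product-over-layers quantity of the form $\prod_l (1 + \|\Delta W_l\|_F / \|W_l\|_F) - 1$ and compares it with the Euclidean distance between parameters. The formula is the assumed form of deep relative trust, and the helper names and toy matrices are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def deep_relative_trust(weights, perturbations):
    """Product over layers of (1 + relative perturbation size), minus one."""
    product = 1.0
    for w, dw in zip(weights, perturbations):
        product *= 1.0 + frobenius_norm(dw) / frobenius_norm(w)
    return product - 1.0


def euclidean_distance(perturbations):
    """Euclidean norm of the perturbation across all layers."""
    return math.sqrt(sum(frobenius_norm(dw) ** 2 for dw in perturbations))


identity = [[1.0, 0.0], [0.0, 1.0]]
small = [[0.1, 0.0], [0.0, 0.1]]   # a 10% relative perturbation per layer

trust_2 = deep_relative_trust([identity] * 2, [small] * 2)
trust_16 = deep_relative_trust([identity] * 16, [small] * 16)
dist_2 = euclidean_distance([small] * 2)
dist_16 = euclidean_distance([small] * 16)
```

With the same 10% relative perturbation in every layer, the Euclidean distance grows like the square root of the depth, whereas the product form compounds: `trust_2` is about 0.21 and `trust_16` is about 3.59.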
The main contributions of this paper are:
proposing that deep relative trust is an appropriate notion of distance between neural networks based on both theoretical analysis and experimental evidence.
developing an optimisation theory based on deep relative trust, and using the tools of matrix perturbation theory to study the stability of learning.
deriving a neural network optimiser called Fromage (Algorithm 1) that exploits the new theory. The algorithm has one hyperparameter with a clear meaning.
benchmarking Fromage on popular machine learning problems such as image classification, generative adversarial networks and natural language transformers, revealing often favourable performance compared to standard optimisers such as Adam and SGD.
…so we understand each other.
The goal of this section is to review a few basics of deep learning, including heuristics commonly used in algorithm design and areas where current optimisation theory falls short. We shall also review generative adversarial learning.
We shall see that, whilst it is central to both optimisation and generative adversarial learning, finding an appropriate notion of functional distance for deep networks is not a solved problem.
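Deep learning seeks to fit a neural network function $f(W; x)$ with parameters $W$ to a dataset of $N$ input-output pairs $\{(x_i, y_i)\}_{i=1}^N$. If we let $\mathcal{L}_i$ measure the discrepancy between prediction $f(W; x_i)$ and target $y_i$, then learning proceeds by gradient descent on the loss: $\mathcal{L} := \sum_{i=1}^N \mathcal{L}_i$.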
Though various neural network architectures exist, we shall focus our theoretical effort on the multilayer perceptron, which already contains the most striking features of general neural networks: matrices, nonlinearities, and layers.
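A multilayer perceptron is a function $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ composed of $L$ layers.
The $l$th layer is a linear map $W_l: \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ followed by a nonlinearity $\varphi: \mathbb{R} \to \mathbb{R}$ that is applied elementwise.
The multilayer perceptron may be described recursively in terms of the $l$th hidden layer $h_l$ as:
$$h_l(W; x) = \varphi\big(W_l\, h_{l-1}(W; x)\big), \qquad h_0(W; x) = x, \qquad f(W; x) = h_L(W; x).$$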
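The layered recursion, in which each hidden layer applies a linear map to the previous hidden layer and then an elementwise nonlinearity, can be sketched in a few lines. This is a minimal illustration only: the helper names, the choice of leaky relu with slope 0.1, and the toy weights are assumptions.

```python
def leaky_relu(x, slope=0.1):
    """Elementwise nonlinearity (slope 0.1 is an arbitrary illustrative choice)."""
    return x if x >= 0 else slope * x


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mlp_forward(weights, x, phi=leaky_relu):
    """Apply h_l = phi(W_l h_{l-1}) for each layer, starting from h_0 = x."""
    h = list(x)
    for w in weights:
        h = [phi(z) for z in matvec(w, h)]
    return h


weights = [
    [[1.0, -1.0], [0.0, 2.0]],   # layer 1 maps R^2 -> R^2
    [[1.0, 1.0]],                # layer 2 maps R^2 -> R^1
]
output = mlp_forward(weights, [1.0, 2.0])
```

For this input the first layer produces $(-0.1, 4)$ after the nonlinearity, so the network output is approximately 3.9.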
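Since we wish to fit the network via gradient descent, we shall be interested in the gradient of the loss with respect to the $l$th parameter matrix. Schematically, via the chain rule:
$$\nabla_{W_l} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial f} \cdot \frac{\partial f}{\partial h_l} \cdot \frac{\partial h_l}{\partial W_l}. \tag{1}$$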
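Let us zoom in on the second term on the right-hand side, following the treatment of Pennington et al. (2017).
Consider a multilayer perceptron with $L$ layers. For $l < L$, the layer-$l$-to-output Jacobian is given by:
$$\frac{\partial f}{\partial h_l} = \Phi'_L W_L \cdot \Phi'_{L-1} W_{L-1} \cdots \Phi'_{l+1} W_{l+1},$$
where $\Phi'_k := \mathrm{diag}\big[\varphi'(W_k h_{k-1})\big]$.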
A key observation is that the network function and Jacobian share a common mathematical structure—a deep, layered composition. We shall exploit this in our theory.
For the $l$th layer, gradient descent prescribes the update:
$$W_l \to W_l - \eta\, \nabla_{W_l} \mathcal{L}, \tag{2}$$
where $\eta$ is a small perturbation parameter or learning rate chosen independent of layer.
Practitioners quickly run into a problem with this formulation known as the vanishing and exploding gradient problem, where the scale of updates becomes miscalibrated with the scale of parameters in different layers of the network. Common tricks to ameliorate the problem include careful choice of weight initialisation (Glorot and Bengio, 2010), dividing out the gradient scale (Kingma and Ba, 2015) and gradient clipping (Pascanu et al., 2013). Each of the techniques has been adopted in numerous deep learning applications.
Still, there is a cost to using heuristic techniques. For instance, techniques that rely on careful initialisation may break down by the end of training, leading to instabilities that are difficult to trace. Gradient clipping involves introducing and tuning a new parameter: the clipping threshold.
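Euclidean trust, as set up in the introduction, is commonly justified by assuming that the loss function has Lipschitz continuous gradients, meaning that:
$$\|\nabla \mathcal{L}(W + \Delta W) - \nabla \mathcal{L}(W)\|_2 \le \frac{1}{\eta}\, \|\Delta W\|_2.$$
By a standard argument (Bottou et al., 2016), this implies a quadratic or Euclidean upper bound on the loss function:
$$\mathcal{L}(W + \Delta W) \le \mathcal{L}(W) + \nabla \mathcal{L}(W)^\top \Delta W + \frac{1}{2\eta}\, \|\Delta W\|_2^2.$$
Gradient descent as in (2) iteratively minimises this bound.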
The gradient-Lipschitz assumption is ubiquitous to the point that it is often just referred to as smoothness (Hardt et al., 2016). The assumption is a natural starting point for theory and it is used by: Hardt et al. (2016), Lee et al. (2016), Du et al. (2017) and Allen-Zhu (2018) in the context of deep learning optimisation; Bernstein et al. (2018) in the context of distributed training; Schaefer and Anandkumar (2019) in the context of generative adversarial networks.
The Lipschitz assumption played a central role in classical optimisation (Nesterov, 2014, Chapter 1). However, it is unclear how applicable the assumption is to deep learning. In a comprehensive review on deep learning optimization, Sun (2019) writes that “neural network optimization problems do not have a global gradient Lipschitz constant” and that “the lack of global Lipschitz constants is a general challenge for non-linear optimization”.
The surest way to see that neural networks are not gradient-Lipschitz for all practical purposes is to measure the gradient empirically. We do this for a 16 layer multilayer perceptron, and find that the gradient grows roughly exponentially in the size of a perturbation (Figure 2). For more work of that ilk, Benjamin et al. (2019) empirically test the use of Euclidean distance as a proxy for functional distance, and find the relationship non-trivial and difficult to interpret.
Several classical optimisation frameworks study non-Euclidean models of functional distance. For example, mirror descent (Nemirovsky and Yudin, 1983) replaces the Euclidean penalty by a Bregman divergence appropriate to the geometry of the problem. This framework was studied in relation to deep learning (Azizan and Hassibi, 2019; Azizan et al., 2019), but the design of good divergence measures remains an area of active research.
Another classical technique is natural gradient descent (Amari, 2016), which replaces the Euclidean penalty by a quadratic form defined by a Riemannian metric. The Riemannian metric should capture the geometry of the $d$-dimensional function class. Unfortunately, this technique is computationally heavy since just writing down the metric takes $O(d^2)$ space, and for neural networks $d$ is very large. Whilst Martens and Grosse (2015) explore more efficient surrogates, natural gradient descent is fundamentally a quadratic model of trust. Our results suggest that trust is lost far more catastrophically in deep networks (Figure 1).
A final line of related work studies the effect of architectural decisions on signal propagation through the network (Saxe et al., 2014; Pennington et al., 2017; Yang and Schoenholz, 2017; Xiao et al., 2018; Anil et al., 2019), which inspired aspects of our work. Though these works neglect theoretical study of functional distance and curvature of the loss surface, they do carry out direct analyses of the deep neural network structure. Pennington and Bahri (2017), on the other hand, do study curvature of the loss surface, though they rely on random matrix models to make progress.
Neural networks can learn to generate samples from complex distributions. Generative adversarial learning (Goodfellow et al., 2014) trains a discriminator network $D$ to classify data as real or fake, and a generator network $G$ is trained to fool $D$. Competition drives learning in both networks. Letting $V(G, D)$ denote the success rate of the discriminator, the learning process is described as:
$$\min_G \max_D V(G, D).$$
Defining the optimal discriminator for a given generator as $D^*(G) := \arg\max_D V(G, D)$, generative adversarial learning reduces to a straightforward minimisation over the parameters of the generator:
$$\min_G V\big(G, D^*(G)\big).$$
In practice this is solved as an inner-loop, outer-loop optimisation procedure where $N$ steps of gradient descent are performed on the discriminator, followed by 1 step on the generator. For example, Miyato et al. (2018) take $N = 5$ and Brock et al. (2019) take $N = 2$.
For small $N$, this procedure is only well founded if the perturbation $\Delta G$ to the generator is small so as to induce a small perturbation in the optimal discriminator. In symbols, we hope that
$$\|\Delta G\| \ll 1 \implies \|D^*(G + \Delta G) - D^*(G)\| \ll 1.$$
But what does $\|\cdot\|$ mean? In what sense should it be small? Again, we realise that we are lacking an appropriate notion of functional distance for neural networks.
We would like to establish a meaningful notion of functional distance for neural networks. The main pitfall of the Euclidean distance on parameters is that it does not reflect the product structure of the network.
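To guide intuition, consider a simple network that multiplies its input by two non-zero scalars $a, b$. That is $f(x) = a \cdot b \cdot x$. Also consider perturbed function $\tilde{f}(x) = \tilde{a} \cdot \tilde{b} \cdot x$ where $\tilde{a} = a + \Delta a$ and $\tilde{b} = b + \Delta b$. By expanding the square and bounding the cross-terms with Young’s inequality, we find that the relative difference obeys:
$$\frac{|\tilde{f}(x) - f(x)|^2}{|f(x)|^2} \le 3 \left[ \left(1 + \frac{\Delta a^2}{a^2}\right) \left(1 + \frac{\Delta b^2}{b^2}\right) - 1 \right].$$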
We flesh out this important derivation in the appendix. The following theorem, also proved in the appendix, generalises this argument to the deep, nonlinear case.
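As a sanity check on the two-scalar example, the sketch below samples random scalars and perturbations and counts violations of a product-form bound with constant 3. The constant is what one obtains by applying Young's inequality to each of the three cross-terms of $(u + v + uv)^2$ with $u = \Delta a / a$ and $v = \Delta b / b$; it is derived here and should be treated as an assumption about the exact form of the bound. Function names are hypothetical.

```python
import random


def relative_sq_difference(a, b, da, db):
    """Squared relative change of f(x) = a * b * x (the input x cancels)."""
    f = a * b
    f_tilde = (a + da) * (b + db)
    return ((f_tilde - f) / f) ** 2


def product_bound(a, b, da, db):
    """3 * [(1 + da^2/a^2) * (1 + db^2/b^2) - 1], from Young's inequality."""
    return 3.0 * ((1.0 + (da / a) ** 2) * (1.0 + (db / b) ** 2) - 1.0)


rng = random.Random(0)
violations = 0
for _ in range(10000):
    a, b = rng.uniform(0.1, 2.0), rng.uniform(0.1, 2.0)
    da, db = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    lhs = relative_sq_difference(a, b, da, db)
    rhs = product_bound(a, b, da, db)
    if lhs > rhs * (1.0 + 1e-9):   # small slack for floating-point error
        violations += 1
```

The bound is tight when both relative perturbations equal one: the left-hand side is then $(1 + 1 + 1)^2 = 9$ and the right-hand side is $3 \cdot (2 \cdot 2 - 1) = 9$.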
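Let $f$ be a multilayer perceptron with nonlinearity $\varphi$ and $L$ weight matrices $\{W_l\}_{l=1}^L$. Likewise consider perturbed network $\tilde{f}$ with weight matrices $\{\tilde{W}_l\}_{l=1}^L$. For convenience, we define perturbation matrices $\Delta W_l := \tilde{W}_l - W_l$.
Let the dimension of the $l$th hidden layer be $n_l$, meaning that $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$. We define the maximum width $n := \max_l n_l$.
Suppose that the following conditions hold:
Fixed point. The nonlinearity satisfies $\varphi(0) = 0$.
Transmission. There exist $\alpha, \beta \ge 0$ such that $\forall x, x'$:
$$\alpha\, \|x - x'\|_2 \le \|\varphi(x) - \varphi(x')\|_2 \le \beta\, \|x - x'\|_2.$$
Conditioning. Each of the unperturbed weight matrices has condition number bounded by $\kappa$.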
For all non-zero inputs we have:
where we have defined .
In words, Theorem 1 says that the change of a multilayer perceptron in function space is controlled by deep relative trust (Definition 2). As deep relative trust goes to zero, the relative change in function space goes to zero too.
Bounding the relative change in function in terms of the relative change in parameters is reminiscent of a concept from numerical analysis known as the relative condition number. The relative condition number of a numerical technique measures the sensitivity of the technique to input perturbations. This suggests that we may think of Theorem 1 as defining the relative condition number of a neural network with respect to parameter perturbations.
We must discuss the plausibility of the assumptions. The first two conditions are on the nonlinearity and are both satisfied by the “leaky relu” function, where for $0 < a < 1$:
$$\varphi(x) = \begin{cases} x & \text{if } x \ge 0, \\ a x & \text{if } x < 0. \end{cases}$$
Setting $a = 0$ yields the “relu” function, which only satisfies the second condition with $\alpha = 0$ for which the bound diverges. We may suspect that for inputs that occur in practice, the second assumption may hold for relu with an $\alpha > 0$. We leave detailed investigation for future work.
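The difference between leaky relu and relu can be probed numerically. The sketch below assumes the transmission condition takes the two-sided form $\alpha |x - x'| \le |\varphi(x) - \varphi(x')| \le \beta |x - x'|$, which is an assumption about its exact statement, and tests it on random pairs of scalar inputs. The function names are hypothetical.

```python
import random


def leaky_relu(x, a):
    """Leaky relu with negative slope a; a = 0 recovers relu."""
    return x if x >= 0 else a * x


def transmission_holds(phi, alpha, beta, trials=10000, seed=0):
    """Check alpha*|x-y| <= |phi(x)-phi(y)| <= beta*|x-y| on random pairs."""
    rng = random.Random(seed)
    tol = 1e-12
    for _ in range(trials):
        x, y = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
        gap = abs(x - y)
        out = abs(phi(x) - phi(y))
        if out < alpha * gap - tol or out > beta * gap + tol:
            return False
    return True


leaky_ok = transmission_holds(lambda x: leaky_relu(x, 0.1), alpha=0.1, beta=1.0)
relu_positive_alpha = transmission_holds(lambda x: leaky_relu(x, 0.0), alpha=0.1, beta=1.0)
relu_zero_alpha = transmission_holds(lambda x: leaky_relu(x, 0.0), alpha=0.0, beta=1.0)
```

Under this form of the condition, leaky relu with slope 0.1 passes with $\alpha = 0.1$ and $\beta = 1$, while relu fails for any positive $\alpha$ because two distinct negative inputs map to the same output.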
As for the third condition, in general $\kappa$ may be infinite, rendering the bound vacuous. However, we know by smoothed analysis of the condition number (Sankar et al., 2006; Bürgisser and Cucker, 2010) that $\kappa$ is finite with probability 1 for an iid Gaussian initialisation, and continues to be so throughout training provided a small amount of iid Gaussian noise is added to the updates.
In the last section we studied the relative functional difference between two neural networks and found that it depends on deep relative trust. Here we will focus on the relative difference in gradient, so that we may establish a trust region for optimisation. We shall see that the relative functional difference and relative gradient difference are connected.
We are interested in the relative change in the gradient expression (1). Tackling the product of the three terms on the right-hand side directly is challenging, not least because the loss function is unknown and arbitrary. As a result, we will tackle each term individually.
We will argue that both the first term and the third term depend on the output of a hidden layer, and since a hidden layer is itself the last layer of a sub-network, these terms are connected to deep relative trust via Theorem 1.
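To realise this argument, observe that the first term $\partial \mathcal{L} / \partial f$ depends on the network output $f$. For example, for the squared error loss we have $\mathcal{L} = \frac{1}{2} \|f - y\|_2^2$ and $\partial \mathcal{L} / \partial f = f - y$. Similarly, the third term depends on the output of layer $l - 1$. To see this, note that $h_l = \varphi(W_l h_{l-1})$ and therefore schematically we have that $\partial h_l / \partial W_l = \varphi'(W_l h_{l-1}) \cdot h_{l-1}$.
The final term to tackle is the middle term in (1): $\partial f / \partial h_l$. This is the layer-$l$-to-output Jacobian. As detailed in Proposition 1, it is a product of matrices. We proffer the following theorem to bound its relative change: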
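Let $f$ be a multilayer perceptron with nonlinearity $\varphi$ and $L$ weight matrices $\{W_l\}_{l=1}^L$. Likewise consider perturbed network $\tilde{f}$ with weight matrices $\{\tilde{W}_l\}_{l=1}^L$. For convenience, we define perturbation matrices $\Delta W_l := \tilde{W}_l - W_l$.
Let the dimension of the $l$th hidden layer be $n_l$, so that $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$. We define the maximum width $n := \max_l n_l$.
Suppose that the following conditions hold:
Transmission. There exist $\alpha, \beta \ge 0$ such that $\forall x, x'$:
$$\alpha\, \|x - x'\|_2 \le \|\varphi(x) - \varphi(x')\|_2 \le \beta\, \|x - x'\|_2.$$
Conditioning. Each of the unperturbed weight matrices has condition number bounded by $\kappa$.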
Then we have that:
where we have defined constants:
Notice that the assumptions are a subset of those made in Theorem 1. The proof is given in the appendix.
Up until this point in the paper, we have introduced the concept of deep relative trust and shown theoretically how it connects to both the relative functional difference and relative gradient difference for a broad class of neural networks. What significance does this have for optimisation?
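The most striking prediction of the theory is that for large depth $L$, a neural network diverges quasi-exponentially in the relative size of the parameter perturbation. To see this, we compare deep relative trust to the product form of $\exp x$:
$$\prod_{l=1}^{L} \left(1 + \frac{\|\Delta W_l\|_F}{\|W_l\|_F}\right) - 1 \qquad \text{versus} \qquad \exp x = \lim_{L \to \infty} \prod_{l=1}^{L} \left(1 + \frac{x}{L}\right).$$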
We visualise this prediction in Figure 1. We test it by comparing the loss and gradient along parameter slices for a 2-layer and 16-layer multilayer perceptron. The results are given in Figure 2 and seem to support the idea of a catastrophic breakdown in trust.
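The quasi-exponential behaviour is easy to tabulate. Assuming every layer receives the same relative perturbation $\eta$, the product form reduces to $(1 + \eta)^L - 1$, which sits between the linear estimate $L\eta$ and the exponential $e^{L\eta} - 1$. The sketch below prints the three quantities for a few depths; the function name and the chosen values of $\eta$ and $L$ are illustrative assumptions.

```python
import math


def trust_uniform(depth, eta):
    """Product form (1 + eta)^depth - 1 for equal relative perturbations."""
    return (1.0 + eta) ** depth - 1.0


eta = 0.1
rows = []
for depth in (2, 16, 50):
    linear = depth * eta                      # first-order estimate
    trust = trust_uniform(depth, eta)         # product over layers
    exponential = math.exp(depth * eta) - 1   # upper envelope
    rows.append((depth, linear, trust, exponential))
    print(f"L={depth:3d}  linear={linear:8.3f}  product={trust:8.3f}  exp={exponential:8.3f}")
```

At depth 2 the three numbers nearly coincide, whereas at depth 50 the product form exceeds the linear estimate by more than an order of magnitude.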
The time has come to derive algorithms. We wish to solve:
(3) |
Solving (3) exactly is challenging because of the coupling across layers. Whilst one can imagine various approximation schemes such as a mean-field theory in depth, a solution via perturbation series or even a numerical solution, we prefer to keep matters simple in this work.
We introduce a surrogate to deep relative trust to decouple the effect of perturbations across layers for tractability.
To understand the use of this surrogate, observe first that it depends on the relative size of the perturbations, second it is a polynomial of the same order as deep relative trust, and third for large perturbations of constant relative size across layers, the two concepts of trust are the same. To see this, consider perturbations of relative size , meaning that for all layers . Then as :
We compare deep relative trust and its surrogate in Figure 3. The comparison is for a 20 layer network assuming a fixed perturbation size across layers.
Then let us replace (3) by its surrogate. We define and obtain the following optimisation problem:
Notice that the optimisation problem conveniently decouples over layers. For each layer , we have:
For the th layer, it is clear that the minimiser is of the form for some , since the gradient is the only direction in the problem, and would be inappropriate. We substitute in and minimise over to obtain:
A natural way to obtain a depth-independent algorithm is to let the depth . We adopt the scaling so that is kept in the limit. We arrive at:
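$$W_l \to W_l - \eta\, \frac{\|W_l\|_F}{\|\nabla_{W_l} \mathcal{L}\|_F}\, \nabla_{W_l} \mathcal{L}. \tag{4}$$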
We see that our theoretical arguments have recovered a special form of “gradient clipping”. You et al. (2017) proposed a similar update rule based on empirical observations. Unfortunately, there is still an issue with this update rule, in that the update tends to increase weight norms. To see this, consider an update $\Delta W_l$ that is orthogonal to the matrix $W_l$. Then, by (4), the norm of the updated weights is given by:
$$\|W_l + \Delta W_l\|_F^2 = \|W_l\|_F^2 + \|\Delta W_l\|_F^2 = (1 + \eta^2)\, \|W_l\|_F^2.$$
This is just Pythagoras’ theorem, as visualised in the inset figure. We see that the Frobenius norm of the parameters tends to grow by a factor $\sqrt{1 + \eta^2}$ per update.
This effect can be serious when the model class is invariant to the parameter scale as is the case for common weight normalisation schemes (Ioffe and Szegedy, 2015; Miyato et al., 2018). Under these schemes, the loss function provides no incentive to control the parameter scale and the norm will grow without bound.
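Algorithm 1 is not reproduced in this text, so the following is only a sketch of one layer's update under stated assumptions: the step rescales the gradient by the ratio of Frobenius norms as in (4), then divides the result by $\sqrt{1 + \eta^2}$ to cancel the norm growth described above. The function names, the fallback to a plain gradient step when a norm is zero, and the toy matrices are all hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def fromage_step(weight, grad, eta):
    """One assumed Fromage-style update for a single layer."""
    w_norm = frobenius_norm(weight)
    g_norm = frobenius_norm(grad)
    # Assumption: fall back to a plain gradient step if either norm is zero.
    scale = eta * w_norm / g_norm if w_norm > 0 and g_norm > 0 else eta
    shrink = 1.0 / math.sqrt(1.0 + eta ** 2)
    return [
        [shrink * (w - scale * g) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weight, grad)
    ]


weight = [[3.0, 0.0], [0.0, 4.0]]   # Frobenius norm 5
grad = [[0.0, 1.0], [1.0, 0.0]]     # orthogonal to weight in the Frobenius inner product
eta = 0.1

updated = fromage_step(weight, grad, eta)
norm_after = frobenius_norm(updated)

# Without the shrink factor, the same step grows the norm by sqrt(1 + eta^2).
unshrunk = [[v * math.sqrt(1.0 + eta ** 2) for v in row] for row in updated]
norm_unshrunk = frobenius_norm(unshrunk)
```

For a gradient orthogonal to the weights, the rescaled update leaves the Frobenius norm at 5, while the uncorrected update raises it to $5\sqrt{1.01}$.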
One of the attractive features of Algorithm 1 is that there is only one hyperparameter and its meaning is obvious. Neglecting the second order correction, we have that for every layer $l$, the algorithm’s update satisfies:
$$\frac{\|\Delta W_l\|_F}{\|W_l\|_F} = \eta. \tag{5}$$
In words: the algorithm induces a relative change of $\eta$ in each layer of the neural network. If we set $\eta = 0.01$, then the weight matrices are allowed to change by 1% per iteration. In practice, we find this value to be a good default.
The contrast to SGD and Adam is stark. For these algorithms, the learning rate has little intrinsic meaning, and the effective perturbation strength depends on a complicated interplay between four factors: initial weight scale, weight decay hyperparameter, weight growth during training and the user-prescribed learning rate hyperparameter.
We may say more about Fromage by appealing to Mirsky’s theorem—a basic result in matrix perturbation theory.
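Let $A$ and $\tilde{A}$ be two matrices in $\mathbb{R}^{m \times n}$ with $q := \min(m, n)$. Let $\sigma_1 \ge \dots \ge \sigma_q$ and $\tilde{\sigma}_1 \ge \dots \ge \tilde{\sigma}_q$ respectively denote their ordered singular values. Then we have that
$$\sqrt{\sum_{i=1}^{q} (\tilde{\sigma}_i - \sigma_i)^2} \le \|\tilde{A} - A\|_F.$$
We apply this result to the $l$th network layer. Let $\sigma_i$ denote the singular values of $W_l$ and $\tilde{\sigma}_i$ denote the singular values of $W_l + \Delta W_l$. Then dividing Theorem 3 through by the root mean square singular value $\sqrt{\frac{1}{q} \sum_i \sigma_i^2} = \|W_l\|_F / \sqrt{q}$, we obtain:
$$\frac{\sqrt{\frac{1}{q} \sum_{i=1}^{q} (\tilde{\sigma}_i - \sigma_i)^2}}{\sqrt{\frac{1}{q} \sum_{i=1}^{q} \sigma_i^2}} \le \frac{\|\Delta W_l\|_F}{\|W_l\|_F} = \eta,$$
where we have substituted in (5). In words: the learning rate controls a relative notion of spectral shift.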
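Mirsky's inequality, in its Frobenius-norm form, can be checked by hand for small matrices. The sketch below uses the closed form for the singular values of a 2×2 matrix to avoid any linear algebra dependency, then compares the relative spectral shift with the relative Frobenius size of the perturbation. The helper names and the example matrices are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def singular_values_2x2(matrix):
    """Ordered singular values of a 2x2 matrix via trace and determinant."""
    (a, b), (c, d) = matrix
    t = a * a + b * b + c * c + d * d          # sum of squared singular values
    det = a * d - b * c                        # product of singular values, up to sign
    disc = math.sqrt(max(t * t - 4.0 * det * det, 0.0))
    return math.sqrt((t + disc) / 2.0), math.sqrt(max((t - disc) / 2.0, 0.0))


w = [[2.0, 0.0], [0.0, 1.0]]
dw = [[0.0, 0.1], [0.1, 0.0]]
w_new = [[wv + dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, dw)]

sigma = singular_values_2x2(w)
sigma_new = singular_values_2x2(w_new)

spectral_shift = math.sqrt(sum((s1 - s0) ** 2 for s0, s1 in zip(sigma, sigma_new)))
relative_shift = spectral_shift / frobenius_norm(w)
relative_update = frobenius_norm(dw) / frobenius_norm(w)
```

In this example the relative spectral shift is roughly ten times smaller than the relative update size, so the inequality holds with room to spare.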
Spectral instabilities were found by Brock et al. (2019) in the context of large-scale generative adversarial network training with the Adam algorithm. Fromage’s natural ability to control spectral shift therefore seems desirable.
Detailed instructions to reproduce these experiments are here: https://github.com/jxbz/fromage.
To test the main prediction of our theory—that the function and gradient of a deep network break down quasi-exponentially in the size of the perturbation—we directly study the behaviour of a multilayer perceptron trained on the MNIST dataset (Lecun et al., 1998) under parameter perturbations. Perturbing along the gradient direction, we find that the change in gradient and objective function is indeed quasi-exponential for a deep network (see Figure 2).
The theory also predicts that the geometry of trust for a deep network becomes increasingly pathological as the network gets deeper, and Fromage is specifically designed to account for this. In Figure 4, we find that Adam and SGD are unable to train multilayer perceptrons over 25 layers deep whereas Fromage was able to train up to at least depth 50.
To test the predictions about the Frobenius norm stability of Fromage, we train a class-conditional generative adversarial network (Miyato et al., 2018) on the CIFAR-10 dataset (Krizhevsky, 2009). We find (Figure 5) that Fromage almost perfectly stabilises the Frobenius norms, whereas when training with Adam the norms wander significantly.
Next, we benchmark Fromage on three canonical deep learning tasks: generative adversarial image generation, image classification and natural language processing.
We find that Fromage outperforms Adam for training a class-conditional generative adversarial network on the CIFAR-10 dataset. The results are given in Figure 5. Next, when training a resnet50 network to classify the Imagenet dataset (Deng et al., 2009), Fromage outperforms SGD without weight decay and matches SGD with weight decay (Figure 6), meaning that Fromage requires less tuning in this setting. Finally, when fine-tuning a transformer on SQuAD1.0 (Rajpurkar et al., 2016), Fromage marginally outperforms Adam and SGD in evaluation score (Figure 7).
It is common practice in deep learning to randomly subsample data to evaluate the gradient. Our theory is limited in that it neglects this stochasticity entirely. In one of our experiments (Figure 8) we witnessed an instability in Fromage at small batch size. Whilst we found that introducing a form of momentum fixed the problem, future work could investigate the theory of stochastic Fromage more thoroughly.
Our theory is also limited in that it only applies to the multilayer perceptron, the model organism of deep learning theory. Neural networks found in the wild depart from this basic structure in several key ways. Residual connections (He et al., 2016) and batch normalisation (Ioffe and Szegedy, 2015) have been found to stabilise deep network training in numerous applications. Using our tools to analyse these techniques could be a fruitful direction in which to head.
We have written down a distance on deep neural networks and studied the implications of this distance for optimisation. We are optimistic that deep relative trust may also help in studying convergence and generalisation in deep learning.
The authors would like to thank Dillon Huff, Jeffrey Pennington and Florian Schaefer for useful conversations. They made heavy use of a codebase built by Jiahui Yu. They are much obliged to Sivakumar Arayandi Thottakara, Jan Kautz, Sabu Nadarajan and Nithya Natesan for infrastructure support. JB is supported by an NVIDIA fellowship.
International Conference on Artificial Intelligence and Statistics.
On the difficulty of training recurrent neural networks. In International Conference on Machine Learning.
Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning.
We begin by fleshing out the analysis of the two-layer scalar network, since this example already goes a long way to exposing the relevant mathematical structure.
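Consider $f: \mathbb{R} \to \mathbb{R}$ defined by $f(x) = a \cdot b \cdot x$ for non-zero scalars $a, b \in \mathbb{R}$. Also consider perturbed function $\tilde{f}(x) = \tilde{a} \cdot \tilde{b} \cdot x$ where $\tilde{a} = a + \Delta a$ and $\tilde{b} = b + \Delta b$. The relative difference obeys:
$$\frac{|\tilde{f}(x) - f(x)|^2}{|f(x)|^2} = \left( \frac{\Delta a}{a} + \frac{\Delta b}{b} + \frac{\Delta a\, \Delta b}{a b} \right)^2.$$
We already see the presence of strong interactions between the two layers. But let us simplify the expression by using Young’s inequality on the cross-terms. Writing $u = \Delta a / a$ and $v = \Delta b / b$, each of the three cross-terms of $(u + v + uv)^2$ is bounded by the sum of the two squares it couples, for instance $2uv \le u^2 + v^2$. We obtain:
$$\frac{|\tilde{f}(x) - f(x)|^2}{|f(x)|^2} \le 3 \left( u^2 + v^2 + u^2 v^2 \right) = 3 \left[ \left(1 + \frac{\Delta a^2}{a^2}\right) \left(1 + \frac{\Delta b^2}{b^2}\right) - 1 \right].$$
Our two main theorems generalise this argument to far more involved cases. See Theorem 1.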
To aid in the proof of this result, we shall first state and prove two useful lemmas.
Let be a matrix in with singular values . Assume that has bounded condition number . Then for all ,
Observe that
Since , we have that and , from which the result follows. ∎
Under the same conditions as Theorem 1, we have that for the th hidden layer :
First observe that a trivial consequence of the first two assumptions is that for any . Now recall that we have defined the maximum width of the network as . Then we may relax Lemma 1 to:
This fact will prove its worth in the following argument:
(assumption on ) | ||||
(Lemma 1) | ||||
The lemma follows from an obvious induction on depth. ∎
With these tools in hand, let us proceed to Theorem 1.