Why do neural networks generalise in the overparameterised regime?
Ard Louis, University of Oxford, 12:00 EDT
Abstract: One of the most surprising properties of deep neural networks (DNNs) is that they typically perform best in the overparameterised regime. Physicists are taught from a young age that having more parameters than datapoints is a terrible idea. This intuition can be formalised in standard learning theory approaches, based for example on model capacity, which also predict that DNNs should heavily over-fit in this regime, and therefore not generalise at all. So why do DNNs work so well? We use a version of the coding theorem from Algorithmic Information Theory to argue that DNNs are generically biased towards simple solutions. Such an inbuilt Occam’s razor means that they are biased towards solutions that typically generalise well. We further explore the interplay between this simplicity bias and the error spectrum on a dataset to develop a detailed Bayesian theory of training and generalisation that explains why and when SGD-trained DNNs generalise, and when they should not. This picture also allows us to derive tight PAC-Bayes bounds that closely track DNN learning curves and can be used to rationalise differences in performance across architectures. Finally, we will discuss some deep analogies between the way DNNs explore function space and biases in the arrival of variation that explain certain trends observed in biological evolution.
Based on a sequence of papers:
- Generic predictions of output probability based on complexities of inputs and outputs
- Neural networks are a priori biased towards Boolean functions with low entropy
- Deep learning generalizes because the parameter-function map is biased towards simple functions
- Input–output maps are strongly biased towards simple outputs
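
The bias towards simple functions described in the abstract can be illustrated with a toy experiment. The sketch below is not taken from the talk or the papers: it assumes a bias-free perceptron on 5 Boolean inputs with Gaussian weights, and the names `sample_perceptron_function` and `function_frequencies` are hypothetical. It samples random weights, records which Boolean function each draw computes, and compares the frequency of the all-zeros function with the probability it would have if all 2^32 functions were equally likely.

```python
import itertools
import random
from collections import Counter


def sample_perceptron_function(inputs, rng):
    """Draw Gaussian weights for a bias-free perceptron (an illustrative
    assumption) and return the Boolean function it computes on `inputs`,
    encoded as a string with one output bit per input."""
    weights = [rng.gauss(0.0, 1.0) for _ in range(len(inputs[0]))]
    return "".join(
        "1" if sum(w * x for w, x in zip(weights, point)) > 0 else "0"
        for point in inputs
    )


def function_frequencies(n_inputs=5, n_samples=3200, seed=0):
    """Count how often each Boolean function appears when the perceptron's
    parameters are sampled at random."""
    rng = random.Random(seed)
    inputs = list(itertools.product([0, 1], repeat=n_inputs))
    return Counter(
        sample_perceptron_function(inputs, rng) for _ in range(n_samples)
    )


counts = function_frequencies()
n_samples = sum(counts.values())

# The all-zeros function is produced exactly when every weight is
# non-positive, which has probability 2**-5 under symmetric weights.
all_zero = "0" * 32
observed = counts[all_zero] / n_samples

# Probability of any single function if all 2**32 were equally likely.
uniform_probability = 2.0 ** -32

print(f"distinct functions seen: {len(counts)}")
print(f"all-zeros frequency:     {observed:.4f}")
print(f"uniform probability:     {uniform_probability:.2e}")
```

In this toy setting the parameter-function map concentrates its probability on a small set of functions, with a very simple one (the constant zero function) appearing many orders of magnitude more often than a uniform prior over functions would suggest. Whether and how this carries over to deep networks is the subject of the papers listed above.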