Neural Tangent Kernel
[1] proposes a new way to describe how neural networks behave during training: the network function follows a descent along the kernel gradient with respect to the Neural Tangent Kernel (NTK).
The paper studies the network function $f_\theta$ of an artificial neural network (ANN) with parameters $\theta$.
Neural networks
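Setting: a fully-connected ANN with
- layers numbered from $0$ (input) to $L$ (output), each containing $n_0, \dots, n_L$ neurons;
- a Lipschitz, twice differentiable nonlinearity function $\sigma: \mathbb{R} \to \mathbb{R}$, with bounded second derivative;
- the ANN realization function $F^{(L)}: \mathbb{R}^P \to \mathcal{F}$, mapping parameters $\theta$ to functions $f_\theta$ in a space $\mathcal{F}$;
- for a fixed distribution $p^{in}$ on the input space $\mathbb{R}^{n_0}$, the function space $\mathcal{F}$ is defined as $\{f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}\}$;
- the seminorm $\|\cdot\|_{p^{in}}$, defined in terms of the bilinear form $\langle f, g \rangle_{p^{in}} = \mathbb{E}_{x \sim p^{in}}\left[f(x)^T g(x)\right]$;
- the network function is defined by $f_\theta(x) := \tilde{\alpha}^{(L)}(x; \theta)$, where the functions $\tilde{\alpha}^{(\ell)}$ (preactivations) and $\alpha^{(\ell)}$ (activations) are defined recursively from the input layer to the output layer, with weight matrices $W^{(\ell)} \in \mathbb{R}^{n_{\ell+1} \times n_\ell}$, bias vectors $b^{(\ell)} \in \mathbb{R}^{n_{\ell+1}}$ and a fixed constant $\beta > 0$:

$$\alpha^{(0)}(x; \theta) = x, \qquad \tilde{\alpha}^{(\ell+1)}(x; \theta) = \frac{1}{\sqrt{n_\ell}} W^{(\ell)} \alpha^{(\ell)}(x; \theta) + \beta b^{(\ell)}, \qquad \alpha^{(\ell)}(x; \theta) = \sigma\left(\tilde{\alpha}^{(\ell)}(x; \theta)\right).$$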
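The recursive definition of the network function can be sketched in plain Python. This is a minimal illustration rather than the code of [1]: it assumes the parametrization of [1], in which each preactivation is $\frac{1}{\sqrt{n_\ell}} W^{(\ell)} \alpha^{(\ell)} + \beta b^{(\ell)}$ and every parameter is drawn i.i.d. from $\mathcal{N}(0, 1)$. The layer widths, `beta=0.1`, the `tanh` nonlinearity and the helper names `init_params` and `network_function` are hypothetical choices made for the example.

```python
import math
import random


def init_params(widths, rng):
    """Draw every weight and bias i.i.d. from N(0, 1): one (W, b) pair per layer."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        params.append((W, b))
    return params


def network_function(params, x, beta=0.1, sigma=math.tanh):
    """Return f_theta(x), i.e. the preactivation of the last layer."""
    alpha = list(x)  # alpha^(0) = x
    pre = alpha
    for W, b in params:
        scale = 1.0 / math.sqrt(len(alpha))  # the 1/sqrt(n_l) factor
        pre = [scale * sum(w * a for w, a in zip(row, alpha)) + beta * b_i
               for row, b_i in zip(W, b)]
        alpha = [sigma(v) for v in pre]  # activations fed to the next layer
    return pre  # no nonlinearity is applied to the output layer


rng = random.Random(0)
widths = [3, 64, 64, 2]  # n_0, n_1, n_2, n_3 (hypothetical sizes)
params = init_params(widths, rng)
output = network_function(params, [0.5, -1.0, 2.0])
```

The $1/\sqrt{n_\ell}$ factor sits in the forward pass rather than in the initialization, which is what keeps the preactivations of order one as the widths grow.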
Kernel gradient
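During training, the network function $f_\theta$ is optimized in the function space $\mathcal{F}$.
Setting:
- a functional cost $C: \mathcal{F} \to \mathbb{R}$; even when $C$ is convex, the composite cost $C \circ F^{(L)}: \mathbb{R}^P \to \mathbb{R}$ is in general highly non-convex;
- a multi-dimensional kernel $K$ is a function $\mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \to \mathbb{R}^{n_L \times n_L}$ such that $K(x, x') = K(x', x)^T$;
- this kernel defines a bilinear map on $\mathcal{F}$, taking the expectation over independent $x, x' \sim p^{in}$: $\langle f, g \rangle_K := \mathbb{E}_{x, x' \sim p^{in}}\left[f(x)^T K(x, x') g(x')\right]$;
- $\mathcal{F}^*$ is the dual of $\mathcal{F}$ with respect to $p^{in}$: the set of linear forms $\mu: \mathcal{F} \to \mathbb{R}$ of the form $\mu = \langle d, \cdot \rangle_{p^{in}}$ for some $d \in \mathcal{F}$;
- the partial applications $K_{i,\cdot}(x, \cdot)$, for $i = 1, \dots, n_L$, belong to $\mathcal{F}$ and define a map $\Phi_K: \mathcal{F}^* \to \mathcal{F}$ mapping a dual element $\mu = \langle d, \cdot \rangle_{p^{in}}$ to the function $f_\mu = \Phi_K(\mu)$ with values $f_{\mu,i}(x) = \mu K_{i,\cdot}(x, \cdot) = \langle d, K_{i,\cdot}(x, \cdot) \rangle_{p^{in}}$.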
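### Gradient
A finite dataset $x_1, \dots, x_N$: the cost $C$ then depends only on the values of $f$ at the data points, so its functional derivative at a point $f_0$ can be viewed as an element of $\mathcal{F}^*$, written $\partial_f^{in} C|_{f_0} = \langle d|_{f_0}, \cdot \rangle_{p^{in}}$, with $p^{in}$ the empirical distribution on the dataset.
Kernel gradient: $\nabla_K C|_{f_0} := \Phi_K\left(\partial_f^{in} C|_{f_0}\right)$. Unlike $\partial_f^{in} C$, it extends to inputs $x$ outside the dataset through $\nabla_K C|_{f_0}(x) = \frac{1}{N} \sum_{j=1}^{N} K(x, x_j)\, d|_{f_0}(x_j)$. A time-dependent function $f(t)$ follows the kernel gradient descent with respect to $K$ if $\partial_t f(t) = -\nabla_K C|_{f(t)}$.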
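On a finite dataset the kernel gradient is a kernel-weighted average of the cost derivative at the data points, $\frac{1}{N} \sum_j K(x, x_j)\, d(x_j)$, which makes it easy to evaluate anywhere in the input space. The sketch below assumes scalar inputs and outputs, a least-squares cost (so that $d = f - f^*$) and an RBF kernel. The kernel, the toy data and the function names are illustrative assumptions, not taken from [1].

```python
import math


def rbf_kernel(x, y, length_scale=1.0):
    """Scalar RBF kernel, used here only as an example of a kernel K."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def kernel_gradient(x, xs, d_values, kernel=rbf_kernel):
    """Evaluate (1/N) * sum_j K(x, x_j) * d(x_j) at an arbitrary input x."""
    return sum(kernel(x, xj) * dj for xj, dj in zip(xs, d_values)) / len(xs)


xs = [-1.0, 0.0, 1.0]        # toy dataset
targets = [1.0, 0.0, 1.0]    # values of the goal function on the dataset
f0 = [0.0, 0.0, 0.0]         # current function values on the dataset
d_values = [f - y for f, y in zip(f0, targets)]  # d = f - f* for least squares

grad_on_data = [kernel_gradient(x, xs, d_values) for x in xs]
grad_off_data = kernel_gradient(0.5, xs, d_values)  # a point outside the dataset
```

Although the residual at $x = 0$ is zero, the kernel gradient there is not: the neighbouring data points contribute through the kernel.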
Random functions approximation
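Setting: A kernel $K$ can be approximated by $P$ random functions $f^{(p)}$ sampled independently from any distribution on $\mathcal{F}$ whose (non-centered) covariance is given by $K$: $\mathbb{E}\left[f_k^{(p)}(x) f_{k'}^{(p)}(x')\right] = K_{kk'}(x, x')$.
These functions define a random linear parametrization $F^{lin}: \mathbb{R}^P \to \mathcal{F}$, $\theta \mapsto f_\theta^{lin} = \frac{1}{\sqrt{P}} \sum_{p=1}^{P} \theta_p f^{(p)}$.
The partial derivatives of the parametrization are given by $\partial_{\theta_p} F^{lin}(\theta) = \frac{1}{\sqrt{P}} f^{(p)}$.
Optimizing the cost $C \circ F^{lin}$ through gradient descent, the parameters follow the ODE $\partial_t \theta_p(t) = -\partial_{\theta_p}(C \circ F^{lin})(\theta(t)) = -\frac{1}{\sqrt{P}} \left\langle d|_{f^{lin}_{\theta(t)}}, f^{(p)} \right\rangle_{p^{in}}$. As a result, $f^{lin}_{\theta(t)}$ follows the kernel gradient descent with respect to the tangent kernel $\tilde{K} = \frac{1}{P} \sum_{p=1}^{P} f^{(p)} \otimes f^{(p)}$, which approximates $K$ by the law of large numbers.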
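The approximation of a kernel by random functions can be checked numerically. The sketch below uses random Fourier features as the random functions, which is an assumption made for illustration and not a choice from [1]: with $w \sim \mathcal{N}(0, 1)$ and $b$ uniform on $[0, 2\pi]$, the functions $f^{(p)}(x) = \sqrt{2} \cos(w_p x + b_p)$ have covariance $\exp(-(x - x')^2 / 2)$. The tangent kernel of the linear parametrization, $\frac{1}{P} \sum_p f^{(p)}(x) f^{(p)}(x')$, should then be close to that RBF kernel for large $P$. All function names are hypothetical.

```python
import math
import random


def sample_random_functions(num_functions, rng):
    """Sample (w_p, b_p) defining f_p(x) = sqrt(2) * cos(w_p * x + b_p)."""
    return [(rng.gauss(0.0, 1.0), rng.uniform(0.0, 2.0 * math.pi))
            for _ in range(num_functions)]


def evaluate(features, x):
    """Values f_1(x), ..., f_P(x) of all random functions at x."""
    return [math.sqrt(2.0) * math.cos(w * x + b) for w, b in features]


def f_lin(theta, features, x):
    """Random linear parametrization: (1/sqrt(P)) * sum_p theta_p * f_p(x)."""
    values = evaluate(features, x)
    return sum(t * v for t, v in zip(theta, values)) / math.sqrt(len(features))


def tangent_kernel(features, x, y):
    """Sum over p of d f_lin(x)/d theta_p * d f_lin(y)/d theta_p."""
    fx, fy = evaluate(features, x), evaluate(features, y)
    return sum(a * b for a, b in zip(fx, fy)) / len(features)


rng = random.Random(0)
features = sample_random_functions(20000, rng)
approx = tangent_kernel(features, 0.3, 1.1)
exact = math.exp(-((0.3 - 1.1) ** 2) / 2.0)  # RBF kernel being approximated
```

Because `f_lin` is linear in `theta`, its tangent kernel does not depend on `theta` at all, so it stays fixed during training.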
Neural tangent kernel
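During training, the network function $f_\theta$ evolves along the (negative) kernel gradient $\partial_t f_{\theta(t)} = -\nabla_{\Theta^{(L)}} C|_{f_{\theta(t)}}$
with respect to the neural tangent kernel (NTK) $\Theta^{(L)}(\theta) = \sum_{p=1}^{P} \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta)$.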
The next two sections show that, in the infinite-width limit, the NTK becomes deterministic at initialization and stays constant during training.
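At finite width the NTK is simply the sum, over all parameters, of products of partial derivatives of the network output at two inputs. The sketch below computes it for a network with one hidden layer and a scalar output, $f_\theta(x) = \frac{1}{\sqrt{n_1}} W^{(1)} \tanh\left(\frac{1}{\sqrt{n_0}} W^{(0)} x + \beta b^{(0)}\right) + \beta b^{(1)}$, with the derivatives written out by hand. The architecture, `tanh`, `beta=0.1`, the widths and all function names are assumptions chosen to keep the example small.

```python
import math
import random


def init_params(n0, n1, rng):
    """All parameters are drawn i.i.d. from N(0, 1)."""
    return {
        "W0": [[rng.gauss(0.0, 1.0) for _ in range(n0)] for _ in range(n1)],
        "b0": [rng.gauss(0.0, 1.0) for _ in range(n1)],
        "W1": [rng.gauss(0.0, 1.0) for _ in range(n1)],
        "b1": rng.gauss(0.0, 1.0),
    }


def param_gradient(params, x, beta=0.1):
    """Flat list of d f_theta(x) / d theta_p, in a fixed parameter order."""
    n0, n1 = len(x), len(params["W1"])
    grads = []
    for row, b, w1 in zip(params["W0"], params["b0"], params["W1"]):
        pre = sum(w * xi for w, xi in zip(row, x)) / math.sqrt(n0) + beta * b
        act = math.tanh(pre)
        back = w1 * (1.0 - act * act) / math.sqrt(n1)  # d f / d preactivation
        grads.extend(back * xi / math.sqrt(n0) for xi in x)  # entries of W0
        grads.append(back * beta)                            # entry of b0
        grads.append(act / math.sqrt(n1))                    # entry of W1
    grads.append(beta)                                       # b1
    return grads


def empirical_ntk(params, x, y, beta=0.1):
    """Theta(x, y) = sum_p d f(x)/d theta_p * d f(y)/d theta_p."""
    gx = param_gradient(params, x, beta)
    gy = param_gradient(params, y, beta)
    return sum(a * b for a, b in zip(gx, gy))


rng = random.Random(0)
params = init_params(n0=2, n1=50, rng=rng)
x, y = [1.0, 0.0], [0.6, 0.8]
ntk_xy = empirical_ntk(params, x, y)
ntk_xx = empirical_ntk(params, x, x)
```

At this width the value of `ntk_xy` depends on the random draw of `params`; the results summarized in the following sections concern what happens as the hidden width grows.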
Initialization
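Proposition 1. For a network of depth $L$ at initialization, with a Lipschitz nonlinearity $\sigma$, and in the limit as $n_1, \dots, n_{L-1} \to \infty$, the output functions $f_{\theta,k}$, for $k = 1, \dots, n_L$, tend (in law) to i.i.d. centered Gaussian processes of covariance $\Sigma^{(L)}$, where $\Sigma^{(L)}$ is defined recursively by $\Sigma^{(1)}(x, x') = \frac{1}{n_0} x^T x' + \beta^2$ and $\Sigma^{(L+1)}(x, x') = \mathbb{E}_{f \sim \mathcal{N}(0, \Sigma^{(L)})}\left[\sigma(f(x))\, \sigma(f(x'))\right] + \beta^2$.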
The first key result: in the same limit, the Neural Tangent Kernel (NTK) converges in probability to an explicit deterministic limit.
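Theorem 1. For a network of depth $L$ at initialization, with a Lipschitz nonlinearity $\sigma$, and in the limit as the layer widths $n_1, \dots, n_{L-1} \to \infty$, the NTK $\Theta^{(L)}$ converges in probability to a deterministic limiting kernel: $\Theta^{(L)} \to \Theta^{(L)}_\infty \otimes Id_{n_L}$.
The scalar kernel $\Theta^{(L)}_\infty: \mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \to \mathbb{R}$ is defined recursively by $\Theta^{(1)}_\infty(x, x') = \Sigma^{(1)}(x, x')$ and $\Theta^{(L+1)}_\infty(x, x') = \Theta^{(L)}_\infty(x, x')\, \dot{\Sigma}^{(L+1)}(x, x') + \Sigma^{(L+1)}(x, x')$, where $\dot{\Sigma}^{(L+1)}(x, x') = \mathbb{E}_{f \sim \mathcal{N}(0, \Sigma^{(L)})}\left[\dot{\sigma}(f(x))\, \dot{\sigma}(f(x'))\right]$ and $\dot{\sigma}$ denotes the derivative of $\sigma$.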
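The recursion for the limiting kernel, $\Theta^{(1)}_\infty = \Sigma^{(1)}$ and $\Theta^{(L+1)}_\infty = \Theta^{(L)}_\infty \dot{\Sigma}^{(L+1)} + \Sigma^{(L+1)}$, can be evaluated exactly when the Gaussian expectations have closed forms. The sketch below assumes a ReLU nonlinearity, which is Lipschitz and for which the two expectations are given by the well-known arc-cosine formulas; this choice, `beta=0.1` and the function names are assumptions for the example and are not prescribed by [1]. It also assumes the inputs have non-zero variance under $\Sigma^{(1)}$, so that the angle between them is defined.

```python
import math


def relu_expectations(var_x, var_y, cov):
    """For (u, v) ~ N(0, [[var_x, cov], [cov, var_y]]), return
    E[relu(u) * relu(v)] and E[relu'(u) * relu'(v)] in closed form."""
    norm = math.sqrt(var_x * var_y)
    cos_phi = max(-1.0, min(1.0, cov / norm))  # clamp against rounding
    phi = math.acos(cos_phi)
    e_sigma = norm * (math.sin(phi) + (math.pi - phi) * cos_phi) / (2.0 * math.pi)
    e_sigma_dot = (math.pi - phi) / (2.0 * math.pi)
    return e_sigma, e_sigma_dot


def limiting_ntk(x, y, depth, beta=0.1):
    """Theta_infinity^(depth)(x, y) for a ReLU network, by the recursion above."""
    n0 = len(x)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Sigma^(1) on the three pairs (x, x), (y, y) and (x, y).
    s_xx = dot(x, x) / n0 + beta ** 2
    s_yy = dot(y, y) / n0 + beta ** 2
    s_xy = dot(x, y) / n0 + beta ** 2
    theta = s_xy  # Theta^(1) = Sigma^(1)
    for _ in range(depth - 1):
        new_xy, sigma_dot_xy = relu_expectations(s_xx, s_yy, s_xy)
        new_xx, _ = relu_expectations(s_xx, s_xx, s_xx)
        new_yy, _ = relu_expectations(s_yy, s_yy, s_yy)
        s_xx = new_xx + beta ** 2
        s_yy = new_yy + beta ** 2
        s_xy = new_xy + beta ** 2  # Sigma^(l+1)(x, y)
        theta = theta * sigma_dot_xy + s_xy
    return theta


ntk_depth_3 = limiting_ntk([1.0, 0.0], [0.6, 0.8], depth=3)
```

The diagonal values of $\Sigma^{(\ell)}$ have to be carried along with the off-diagonal one, because the expectation at the next layer depends on the full $2 \times 2$ covariance of $(f(x), f(x'))$.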
Training
The second key result is that the NTK stays asymptotically constant during training.
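Theorem 2. Assume that $\sigma$ is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. Let the parameters be updated according to a training direction $d_t \in \mathcal{F}$ (for gradient descent, $d_t = -d|_{f_{\theta(t)}}$). For any $T$ such that the integral $\int_0^T \|d_t\|_{p^{in}}\, dt$ stays stochastically bounded, as $n_1, \dots, n_{L-1} \to \infty$, we have, uniformly for $t \in [0, T]$, $\Theta^{(L)}(t) \to \Theta^{(L)}_\infty \otimes Id_{n_L}$.
As a consequence, in this limit, the dynamics of $f_\theta$ is described by the differential equation $\partial_t f_{\theta(t)} = \Phi_{\Theta^{(L)}_\infty \otimes Id_{n_L}}\left(\langle d_t, \cdot \rangle_{p^{in}}\right)$.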
Least-squares regression
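Given a goal function $f^*$ and an input distribution $p^{in}$, the least-squares regression cost is $C(f) = \frac{1}{2} \|f - f^*\|^2_{p^{in}} = \frac{1}{2} \mathbb{E}_{x \sim p^{in}}\left[\|f(x) - f^*(x)\|^2\right]$.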
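With a constant kernel and the least-squares cost, kernel gradient descent reduces on the training points to a linear update of the residual $f - f^*$. The sketch below integrates $\partial_t f = -\nabla_K C$ with explicit Euler steps on a toy dataset. It assumes scalar outputs and an RBF kernel standing in for the limiting NTK; the step size, the number of steps, the data and the function names are likewise illustrative assumptions.

```python
import math


def rbf_kernel(x, y):
    """Stand-in for a fixed kernel (an assumption, not the NTK itself)."""
    return math.exp(-((x - y) ** 2) / 2.0)


def least_squares_cost(f_values, targets):
    """C(f) = (1/2) * mean over the dataset of (f(x_j) - f*(x_j))^2."""
    return 0.5 * sum((f - y) ** 2 for f, y in zip(f_values, targets)) / len(targets)


def kernel_gradient_descent(xs, targets, kernel, step=0.5, num_steps=200):
    """Euler discretization of d f / d t = -(1/N) * K (f - f*) on the dataset."""
    n = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    f_values = [0.0] * n  # start from the zero function
    costs = [least_squares_cost(f_values, targets)]
    for _ in range(num_steps):
        d = [f - y for f, y in zip(f_values, targets)]
        grad = [sum(gram[i][j] * d[j] for j in range(n)) / n for i in range(n)]
        f_values = [f - step * g for f, g in zip(f_values, grad)]
        costs.append(least_squares_cost(f_values, targets))
    return f_values, costs


xs = [-1.0, 0.0, 1.0]
targets = [1.0, 0.0, 1.0]
f_final, costs = kernel_gradient_descent(xs, targets, rbf_kernel)
```

Each eigen-direction of the Gram matrix decays at its own rate, so components of the residual along large eigenvalues are fitted first and those along small eigenvalues last.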
References
[1] Arthur Jacot, Franck Gabriel, and Clément Hongler. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." In Advances in Neural Information Processing Systems 31, pages 8571-8580, 2018.