0%

Neural Tangent Kernel

Neural Tangent Kernel

[1] proposes a new explanation for neural network. During training, the network function follows a descent along the kernel gradient w.r.t. the Neural Tangent Kernel (NTK).

The paper sutdy the network function of an ANN. In the limit as the widths of the hidden layers tend to infinity, the network function at initialization, converges to a Gaussian distribution. The behavior of ANNs during training is described by a related kernel, which we call the neural tangent kernel (NTK).

Neural networks

Setting: ANN

  • layers numbered from (input) to (output), each containing neurons;
  • a Lipschitz, twice differentiable nonlinearity function , with bounded second derivative;
  • ANN realization function , mapping parameters to functions in a space ;
  • for a fixed distribution on the input space , the function space is defined as ;
  • The seminorm , defined in terms of the bilinear form

  • the network function is defined by , where the functions (preactivations) and (activations)

Kernel gradient

During training, the network function follows a descent along the kernel gradient with respect to the Neural Tangent Kernel (NTK). This makes it possible to study the training of ANNs in the function space , on which the cost is convex.

setting

  • functional cost ; the composite cost is in general highly non-convex;

  • a multi-dimensional kernel is a function ;

    • this kernel defines a bilinear map on , taking the expectation over independent :

  • the dual of with respect to ; the set of linear forms of the form for some ; , and .

  • , define a map mapping a dual element to the function with values:

###gradient

A finite dataset , the cost functional only depends on the values of at the data points. The (functional) derivative of the cost at a point can be viewed as an element of , which we write . We denote by , a corresponding dual element, such that .

Kernel gradient is defined as . The kernel gradient generalize to values outside the dataset thanks to the kernel :

Random functions approximation

Setting: A kernel can be approximated by a choice of random functions sampled independently from any distribution on whose covariance is given by the kernel :

These functions define a random linear parametrization

The partial derivatives of the parametrization are given by

Optimizing the cost lin through gradient descent, the parameters follow the ODE:

Neural tangent kernel

During training, the network function evolves along the negative kernel gradient

with respect to the neural tangent kernel (NTK)

Next will show that in the infinite-width limit, the NTK becomes deterministic at initialization and stays constant during training.

Initialization

Proposition 1. For a network of depth at initialization, with a Lipschitz nonlinearity , and in the limit as , the output functions , for , tend (in law) to iid centered Gaussian processes of covariance .

The firstly key result: in the same limit, the Neural Tangent Kernel (NTK) converges in probability to an explicit deterministic limit.

Theorem 1. For a network of depth at initialization, with a Lipschitz nonlinearity , an in the limit as the layers width , the NTK converges in probability to a deterministic limiting kernel:

The scalar kernel is defined recursively.

Training

Second key result is that the NTK stays asymptotically constant during training.

Theorem 2. Assume that is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. For any such that the integral stays stochastically bounded, as , we have, uniformly for ,

As a consequence, in this limit, the dynamics of is described by the differential equation

Least-squares regression

Given a goal function and input distribution , the least-squares regression cost is

References

[1] @inproceedings{jacot2018neural, title="Neural Tangent Kernel: Convergence and Generalization in Neural Networks", author="Arthur {Jacot} and Franck {Gabriel} and Clément {Hongler}", booktitle="Advances in Neural Information Processing Systems", volume="31", pages="8571--8580", notes="Sourced from Microsoft Academic - https://academic.microsoft.com/paper/2809090039", year="2018" }

[2]