Large-width asymptotics and training dynamics of $\alpha$-Stable ReLU neural networks
There is a recent and growing literature on large-width properties of Gaussian neural networks (NNs), namely NNs whose weights are Gaussian distributed. In this context, two popular results are: i) the characterization of the large-width asymptotic behavior of NNs in terms of Gaussian stochastic processes; ii) the characterization of the large-width training dynamics of NNs in terms of the so-called neural tangent kernel (NTK), showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. We present large-width asymptotics and training dynamics of $\alpha$-Stable NNs, namely NNs whose weights are distributed according to $\alpha$-Stable distributions, with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that, as the NN's width goes to infinity, a rescaled NN converges weakly to an $\alpha$-Stable stochastic process, which generalizes the Gaussian process. Unlike in the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN: to achieve the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling compared with sub-linear activations. Then, we characterize the large-width training dynamics of $\alpha$-Stable ReLU-NNs in terms of a random kernel, referred to as the $\alpha$-Stable NTK, showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference with respect to the Gaussian setting: within the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training. An extension of our results to deep $\alpha$-Stable NNs is discussed.
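To make the setting concrete, the following is a minimal illustrative sketch, not the authors' code, of a shallow ReLU network with symmetric $\alpha$-Stable weights evaluated at initialization. The function names `sample_sas` and `shallow_relu_nn` are hypothetical. The weights are drawn with the standard Chambers-Mallows-Stuck method, and the output is rescaled by $(n \log n)^{-1/\alpha}$, where $n$ is the width. The abstract only says that ReLU requires an additional logarithmic term, so this exact form of the scaling is an assumption of the sketch.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable samples (unit scale) for alpha in (0, 2].

    Uses the Chambers-Mallows-Stuck representation. With this convention,
    alpha = 2 gives N(0, 2) and alpha = 1 gives a standard Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def shallow_relu_nn(x, alpha, width, rng):
    """Evaluate one random shallow ReLU network at the columns of x.

    x has shape (d, k): k inputs of dimension d. Returns an array of shape (k,).
    """
    d = x.shape[0]
    w1 = sample_sas(alpha, (width, d), rng)  # input-to-hidden weights
    w2 = sample_sas(alpha, width, rng)       # hidden-to-output weights
    hidden = np.maximum(w1 @ x, 0.0)         # ReLU activation
    # ASSUMPTION: the logarithmic correction enters as (n log n)^(-1/alpha).
    scale = (width * np.log(width)) ** (-1.0 / alpha)
    return scale * (w2 @ hidden)


rng = np.random.default_rng(0)
inputs = np.array([[1.0, 0.5], [-0.3, 2.0]])  # two inputs in dimension 2
outputs = shallow_relu_nn(inputs, alpha=1.5, width=10_000, rng=rng)
```

Repeating the last call over many independent draws and increasing widths gives an empirical view of the heavy-tailed limiting distribution of the rescaled output, under the scaling assumed above.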
Area: CS24 - Neural Networks at initialization (Michele Salvi)
Keywords: $\alpha$-Stable stochastic process; gradient descent; infinitely wide limit; large-width training dynamics; neural network; neural tangent kernel; ReLU activation function