Gradient Descent

Consider optimization problem:

Stochastic Gradient Descent

Suppose is the parameter. The Stochastic Gradient Descent method is simplied as this:

Theorem Suppose the function is convex and differentiable, and that its gradient is Lipschitz continuous with constant , i.e. we have that for any . Then if we run gradient descent for iterations with a fixed step size , it will yield a solution which satisfies

where is the optimal value. Intuitively, this means that gradient descent is guaranteed to converge and that it converges with rate .

Proof: Our assumption that is Lipschitz continuous with constant implies , or equivalently that is a negative semidefinite matrix. Using this fact, we can perform a quadratic expansion of around and obtain the following inequality:

Now let's plug in the gradient descent update by letting . We then get:

Using , we know that . Plugging this in to above, we can conclude the following:

Since will always be positive unless , this inequality implies that the objective function value strictly decreases with each iteration of gradient descent until it reaches the optimal value . Note that this convergence result only holds when we choose to be small enough, i.e., . This explains why we observe in practice that gradient descent diverges when the step size is too large.

Next, we can bound , the objective value at the next iteration, in terms of , the optimal objective value. Since is convex, we can write

where the first inequality yields the second through simple rearrangement of terms. Plugging this in to above, we obtain:

where the final inequality is obtained by observing that expanding the square of yields . Notice that by definition we have . Plugging this in to above, yields:

This inequality holds for on every iteration of gradient descent. Summing over iterations, we get:

where the summation on the right-hand side disappers because it is a telescoping sum. Finally, using the fact that decreasing on every iteration, we can conclude that

where in the final step, we plug in it to get the inequality from - that we were trying to prove.

Theorem Convergence of SGD for strongly convex problem; fixed stepsizes. Under the assumption of -strongly convex, -smooth, if , then SGD achieves

Momentum-based Gradient Descent

Nesterov Accelerated Gradient Descent

References

[1] https://towardsdatascience.com/manifold-learning-the-theory-behind-it-c34299748fec

[2]

Ye Yuan

Convergence of Gradient Descent Methods

Gradient Descent

Stochastic Gradient Descent

Momentum-based Gradient Descent

Nesterov Accelerated Gradient Descent

References