Suppose is the parameter. The
Stochastic Gradient Descent method is simplied as this:
Theorem Suppose the function
is convex and differentiable, and that its gradient is Lipschitz
continuous with constant ,
i.e. we have that for any . Then if we run gradient descent for
iterations with a fixed step size
, it will yield a solution which satisfies
where is the optimal
value. Intuitively, this means that gradient descent is guaranteed to
converge and that it converges with rate .
Proof: Our assumption that is Lipschitz continuous with
constant implies , or
equivalently that is a negative semidefinite matrix. Using this fact, we can
perform a quadratic expansion of
around and obtain the
following inequality:
Now let's plug in the gradient descent update by letting . We then get:
Using , we know that
. Plugging this in to
above, we can conclude the following:
Since will always be positive unless , this inequality implies
that the objective function value strictly decreases with each iteration
of gradient descent until it reaches the optimal value . Note that this
convergence result only holds when we choose to be small enough, i.e., . This explains why we observe
in practice that gradient descent diverges when the step size is too
large.
Next, we can bound , the
objective value at the next iteration, in terms of , the optimal objective value.
Since is convex, we can write
where the first inequality yields the second through simple
rearrangement of terms. Plugging this in to above, we obtain:
where the final inequality is obtained by observing that expanding
the square of yields . Notice that by definition we have . Plugging this in
to above, yields:
This inequality holds for on
every iteration of gradient descent. Summing over iterations, we
get:
where the summation on the right-hand side disappers because it is a
telescoping sum. Finally, using the fact that decreasing on every iteration, we can
conclude that
where in the final step, we plug in it to get the inequality from -
that we were trying to prove.
Theorem Convergence of SGD for strongly convex
problem; fixed stepsizes. Under the assumption of -strongly convex, -smooth, if , then
SGD achieves