Optimization Using (Stochastic) Gradient Descent
Gradient Descent is applied to optimization problems with (locally) differentiable objective functions. The algorithm performs an iterative descent toward a minimum of the objective function
f(x); if needed, it can trivially be adapted to search for a maximum. Starting from a more or less well-educated initial guess
x₀, we follow the negative gradient
-∇f(x) to step in the direction of steepest descent.
xₙ₊₁ = xₙ - γ∇f(xₙ)
The step size along the negative gradient is defined by
γ. This parameter determines how fast the algorithm can approach the minimum, especially when
x₀ is far from it. Unfortunately, too high a
γ can lead to oscillation effects. Since the choice of γ is crucial for the success of the algorithm, one common variation is to let γ slowly decay with each training iteration.
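The update rule with a decaying step size can be sketched as follows; a minimal illustration in plain Python, where the parameter names (gamma, decay, steps) and the quadratic toy objective are assumptions for the example, not part of the original text:

```python
def gradient_descent(grad, x0, gamma=0.1, decay=0.99, steps=100):
    """Minimize f via x_{n+1} = x_n - gamma * grad(x_n).

    gamma is multiplied by a decay factor after every iteration,
    mitigating the oscillation effects a fixed large step size can cause.
    """
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)  # step in the direction of steepest descent
        gamma *= decay           # slowly decay the step size
    return x

# Toy objective (assumed for illustration): f(x) = x², so ∇f(x) = 2x,
# with the minimum at x = 0. Start far from the minimum at x₀ = 5.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
```

With these settings the iterate contracts toward zero at every step, so `x_min` ends up very close to the true minimizer.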
Stochastic Gradient Descent (SGD)
The stochastic extension of Gradient Descent is well suited for optimization problems where the objective function is composed of a large number
m of summed subfunctions.
f(x) = m⁻¹∑ᵢ₌₁...ₘ fᵢ(x)
To reduce the high computational cost, the gradient is calculated on a smaller random subset of the subfunctions. When training neural networks, SGD is often used to minimize the training error. Error functions composed of a large number of training instances lead to an optimization problem well suited for SGD.
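The mini-batch idea can be sketched as follows; a minimal example where the batch size, step size, and the toy subfunctions fᵢ(x) = (x - tᵢ)² are assumptions chosen for illustration (for that choice, the full objective is minimized at the mean of the targets tᵢ):

```python
import numpy as np

def sgd(grad_i, x0, m, batch_size=10, gamma=0.05, epochs=200, seed=0):
    """Each step uses the averaged gradient of a random subset of the
    m subfunctions instead of the full sum, cutting the per-step cost."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(epochs):
        batch = rng.choice(m, size=batch_size, replace=False)
        x = x - gamma * sum(grad_i(i, x) for i in batch) / batch_size
    return x

# Toy subfunctions (assumed): f_i(x) = (x - t_i)², i.e. ∇f_i(x) = 2(x - t_i).
# The full objective f(x) = m⁻¹ Σ f_i(x) has its minimum at the mean of the t_i.
targets = np.linspace(-1.0, 1.0, 100)
grad_i = lambda i, x: 2 * (x - targets[i])
x_min = sgd(grad_i, x0=3.0, m=100)
```

Because each step sees only a random batch, the iterate fluctuates around the true minimizer rather than converging to it exactly; this residual noise is one reason the decaying step size discussed above is commonly combined with SGD.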