
Optimization Algorithms

Better optimization algorithms

Each of the following improves progressively on simple gradient descent.

In the following, the superscript $i$ on $v^i$ denotes the $i$th component of the vector $v$.

1. Simple gradient descent

\begin{eqnarray}
w_n^i = w_{n-1}^i - \left. \varepsilon \frac{\partial C}{\partial w^i}\right\rvert_{w=w_{n-1}}
\end{eqnarray}

Disadvantages are

  • Gets stuck at local minima.
  • Oscillates in elongated local minima, since gradient descent does not store any memory about its previous behaviour.
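
A minimal NumPy sketch of this update rule, assuming a toy quadratic cost $C(w) = \lVert w \rVert^2$; the cost, its gradient, and the step size below are illustrative assumptions, not part of the text above:

```python
import numpy as np

# Toy quadratic cost C(w) = ||w||^2, so dC/dw = 2w (illustrative assumption).
def grad_C(w):
    return 2.0 * w

w = np.array([3.0, -1.5])    # initial parameters w_0
eps = 0.1                    # learning rate (epsilon)

for _ in range(100):
    w = w - eps * grad_C(w)  # w_n = w_{n-1} - eps * dC/dw, evaluated at w_{n-1}

print(w)                     # ends up close to the minimum at the origin
```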

2. Momentum

\begin{eqnarray}
q_n^i &=& \alpha q_{n-1}^i + (1-\alpha) \left.  \frac{\partial C}{\partial w^i}\right\rvert_{w=w_{n-1}} \\
w_n^i &=& w_{n-1}^i - \varepsilon q_n^i
\end{eqnarray}

  • Applies a moving-average filter to the gradient values, smoothing out fast oscillations.
  • Good for jumping over shallow local minima.
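
A sketch of the momentum update with the same toy quadratic cost as before; the values of $\alpha$ and $\varepsilon$ are illustrative:

```python
import numpy as np

def grad_C(w):               # toy quadratic cost: dC/dw = 2w (assumption)
    return 2.0 * w

w = np.array([3.0, -1.5])
q = np.zeros_like(w)         # filtered gradient, initialised to zero
alpha, eps = 0.9, 0.1

for _ in range(100):
    g = grad_C(w)
    q = alpha * q + (1 - alpha) * g  # moving average of the gradient
    w = w - eps * q                  # step along the filtered gradient

print(w)
```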

3. RMSProp

Adapts the learning rate for each parameter individually.

Consider the gradient, i.e., the derivative of the cost function with respect to the parameters. The cost may be sensitive to some parameters, yielding large gradient components, while being insensitive to others, yielding small gradient components. Applying the same learning rate to every component then makes some updates too large and others too small.

\begin{eqnarray}
v_n^i &=& \beta v_{n-1}^i + (1-\beta) \left.  \left( \frac{\partial C}{\partial w^i} \right)^2 \right\rvert_{w=w_{n-1}} \\
w_n^i &=& w_{n-1}^i - \frac{\varepsilon}{\sqrt{v_n^i + \delta}} \left. \frac{\partial C}{\partial w^i}  \right\rvert_{w=w_{n-1}}
\end{eqnarray}
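
A sketch of the RMSProp update with the same toy quadratic cost; the values of $\beta$, $\varepsilon$ and $\delta$ are illustrative:

```python
import numpy as np

def grad_C(w):               # toy quadratic cost: dC/dw = 2w (assumption)
    return 2.0 * w

w = np.array([3.0, -1.5])
v = np.zeros_like(w)         # moving average of the squared gradient
beta, eps, delta = 0.9, 0.01, 1e-8

for _ in range(500):
    g = grad_C(w)
    v = beta * v + (1 - beta) * g**2      # per-component squared-gradient average
    w = w - eps / np.sqrt(v + delta) * g  # per-component adaptive step size

print(w)
```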

4. Adam

\begin{eqnarray}
q_n^i &=& \alpha q_{n-1}^i + (1-\alpha) \left.  \frac{\partial C}{\partial w^i}\right\rvert_{w=w_{n-1}} \\
\bar{q}^i_n &=& \frac{q_n^i}{1-\alpha^n}\\
v_n^i &=& \beta v_{n-1}^i + (1-\beta) \left.  \left( \frac{\partial C}{\partial w^i} \right)^2 \right\rvert_{w=w_{n-1}} \\
\bar{v}^i_n &=& \frac{v_n^i}{1-\beta^n}\\
w_n^i &=& w_{n-1}^i - \frac{\varepsilon}{\sqrt{\bar{v}_n^i + \delta}} \bar{q}^i_n
\end{eqnarray}

$\bar{q}$ and $\bar{v}$ are bias-corrected versions of $q$ and $v$: since $q$ and $v$ are initialised to zero, the early moving averages are biased towards zero, and the corrections address this cold-start problem.
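
A sketch combining both averages into the Adam update, with the bias corrections applied at each step $n$; the hyperparameter values and the toy cost are illustrative:

```python
import numpy as np

def grad_C(w):               # toy quadratic cost: dC/dw = 2w (assumption)
    return 2.0 * w

w = np.array([3.0, -1.5])
q = np.zeros_like(w)         # first moment (momentum average)
v = np.zeros_like(w)         # second moment (RMSProp average)
alpha, beta, eps, delta = 0.9, 0.999, 0.01, 1e-8

for n in range(1, 501):
    g = grad_C(w)
    q = alpha * q + (1 - alpha) * g      # filtered gradient
    v = beta * v + (1 - beta) * g**2     # filtered squared gradient
    q_bar = q / (1 - alpha**n)           # bias-corrected first moment
    v_bar = v / (1 - beta**n)            # bias-corrected second moment
    w = w - eps / np.sqrt(v_bar + delta) * q_bar

print(w)
```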