Optimization methods are widely used in deep learning algorithms and are essential to them.
The general task is the following: given a function $f : \mathbb{R}^n \to \mathbb{R}$, find $x^* := \underset{x\in \mathbb{R}^n}{ \text{argmin} } f(x)$, i.e. a point that minimizes $f$. How do we solve this problem?
We start at some initial value $x_{0}$ and iteratively move in the direction of steepest descent $u$ from there until convergence.
The update rule: $x_{i+1} \leftarrow x_{i} + \varepsilon u$
The following questions arise:
How to find the direction of steepest descent $u$?
How to find a good stepsize $\varepsilon$?
Gradient and Directional Derivative
Recap of some basics from multidimensional analysis:
Partial derivative $\frac{\partial}{\partial x_i}f(x)$: derivative of $f$ with respect to $x_{i}$
Gradient $\nabla_x f(x) := \left(\frac{\partial}{\partial x_1}f(x),\ldots,\frac{\partial}{\partial x_n}f(x)\right)^{\intercal}$: vector of all partial derivatives of $f$
Directional derivative in direction $u$: To obtain the directional derivative of $f$ in direction $u$ we compute $\frac{\partial}{\partial \alpha}f(x + \alpha u)\ \text{evaluated at } \alpha = 0$. This is equal to $u^\intercal \nabla_x f(x)$. We want to find the direction $u$ with minimal directional derivative in order to minimize $f$.
$\rightarrow$ Hence, our task is to find $\underset{u, |u| = 1}{\text{argmin}}\ u^\intercal \nabla_x f(x)$.
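We have $u^\intercal \nabla_x f(x) = |u| \, |\nabla_x f(x)| \cos(\theta)$, where $\theta$ denotes the angle between $u$ and $\nabla_x f(x)$. Since $|u| = 1$ and $|\nabla_x f(x)|$ does not depend on $u$, it suffices to minimize $\cos(\theta)$.
$\cos(\theta)$ is minimized when $u$ points in the opposite direction of the gradient.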
$\rightarrow$ Set $u := - \nabla_x f(x)$; the normalization to unit length is absorbed into the stepsize $\varepsilon$.
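The update rule with $u = -\nabla_x f(x)$ can be sketched in a few lines of plain Python. This is only a minimal illustration: the fixed stepsize, the stopping tolerance, and the test function `grad_f` are assumptions chosen for the example and do not come from the notes.

```python
def gradient_descent(grad, x0, step_size=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x <- x - step_size * grad(x) until the gradient norm drops below tol."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Move against the gradient, i.e. in the direction of steepest descent.
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test function f(x) = (x_1 - 1)^2 + 2 * (x_2 + 3)^2, minimized at (1, -3).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]


x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

With a fixed stepsize, convergence depends on choosing `step_size` small enough for the function at hand.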
Jacobian and Hessian Matrix
Consider a function $g: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$. This means that $g$ consists of $m$ functions $g_1, \ldots , g_m: \mathbb{R}^{n} \rightarrow \mathbb{R}$. The Jacobian matrix of $g$ is defined as: $J \in \mathbb{R}^{m\times n}$, $J_{ ij } := \frac{\partial g_{i}}{\partial x_{j}}$
Consider $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. Its Hessian is defined as $H \in \mathbb{R}^{n \times n}$, $H_{ ij } := \frac{ \partial^{2} }{ \partial x_{i} \partial x_{j} } f$. It contains information about the curvature of $f$.
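To make the two definitions concrete, the sketch below approximates both matrices with central finite differences. The helper names, the step sizes `h`, and the example functions are assumptions made for illustration; in practice one would usually rely on automatic differentiation instead.

```python
def jacobian(g, x, h=1e-6):
    """Central-difference approximation of the m x n Jacobian of g at x."""
    m, n = len(g(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        g_plus, g_minus = g(x_plus), g(x_minus)
        for i in range(m):
            J[i][j] = (g_plus[i] - g_minus[i]) / (2.0 * h)
    return J


def hessian(f, x, h=1e-4):
    """Central-difference approximation of the n x n Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            H[i][j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return H


# Hypothetical examples: g maps R^2 -> R^2, f maps R^2 -> R.
def g_example(x):
    return [x[0] * x[1], x[0] + x[1] ** 2]


def f_example(x):
    return x[0] ** 2 * x[1] + x[1] ** 3


J = jacobian(g_example, [1.0, 2.0])  # analytically [[2, 1], [1, 4]]
H = hessian(f_example, [1.0, 2.0])   # analytically [[4, 2], [2, 12]]
```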
The optimal stepsize $\varepsilon$
To find the optimal stepsize $\varepsilon$ we use a second-order Taylor approximation of $f$:
$f(x^{(i+1)}) \approx f(x^{(i)}) + (x^{(i+1)} - x^{(i)})^{\intercal} g + \frac{1}{2}(x^{(i+1)} - x^{(i)})^{\intercal}H(x^{(i+1)} - x^{(i)}) = f(x^{(i)}) - \varepsilon g^{\intercal} g + \frac{1}{2} \varepsilon^{2} g^{\intercal} H g$,
where $g := \nabla_{x}f(x^{(i)})$ and $H := H(f)(x^{(i)})$.
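Since we want to minimize $f$, we choose $\varepsilon$ to minimize this approximation. If $g^{\intercal} H g \leq 0$, the approximation decreases without bound as $\varepsilon$ grows, so we increase $\varepsilon$; otherwise, setting the derivative with respect to $\varepsilon$ to zero gives $\varepsilon := \frac{g^\intercal g}{g^\intercal H g}$.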
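The stepsize rule can be tried out on a small quadratic, for which the second-order approximation is exact. In this sketch the helper names and the matrix `A` are illustrative assumptions; returning `None` for non-positive curvature is just one way to signal that the formula does not apply.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mat_vec(A, v):
    return [dot(row, v) for row in A]


def optimal_step_size(g, H):
    """Return g^T g / (g^T H g), or None if the curvature g^T H g is not positive."""
    curvature = dot(g, mat_vec(H, g))
    if curvature <= 0.0:
        return None  # the quadratic model has no finite minimizer along -g
    return dot(g, g) / curvature


# Hypothetical example: f(x) = 0.5 * x^T A x, so the gradient is A x and the Hessian is A.
A = [[2.0, 0.0], [0.0, 10.0]]
x = [1.0, 1.0]
g = mat_vec(A, x)
eps = optimal_step_size(g, A)
x_next = [xi - eps * gi for xi, gi in zip(x, g)]
g_next = mat_vec(A, x_next)  # orthogonal to g after an exact line search
```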
Issues of Gradient Descent
An ill-conditioned Hessian matrix leads to poor performance of the gradient descent algorithm. We can resolve this problem using Newton's method.
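The effect of ill-conditioning can be observed on a diagonal quadratic. The sketch below counts gradient descent iterations for two condition numbers; the function $f(x) = \frac{1}{2}\sum_k c_k x_k^2$, the stepsize $1/\max_k c_k$, and the tolerance are assumptions picked for the illustration.

```python
def iterations_to_converge(curvatures, tol=1e-6, max_iter=100_000):
    """Run gradient descent on f(x) = 0.5 * sum_k c_k * x_k^2 starting at (1, ..., 1).

    The Hessian is diag(c_1, ..., c_n); the stepsize is fixed to 1 / max(c).
    Returns the number of updates needed until all gradient entries are below tol.
    """
    step = 1.0 / max(curvatures)
    x = [1.0] * len(curvatures)
    for it in range(max_iter):
        g = [c * xi for c, xi in zip(curvatures, x)]
        if max(abs(gi) for gi in g) < tol:
            return it
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return max_iter


well_conditioned = iterations_to_converge([1.0, 1.0])   # condition number 1
ill_conditioned = iterations_to_converge([1.0, 100.0])  # condition number 100
```

The stepsize has to respect the direction of largest curvature, which makes progress along the direction of smallest curvature slow.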
Newton's Method
We again use a second order Taylor approximation of $f$:
$f(x) \approx f(x^{(i)}) + (x - x^{(i)})^\intercal \nabla_{x}f(x^{(i)})
+ \frac{1}{2}(x - x^{(i)})^\intercal H(f)(x^{(i)})(x - x^{(i)})$
Hence, setting the gradient of this approximation with respect to $x$ to zero, the optimum is $x^{(i+1)} = x^{(i)} - H(f)(x^{(i)})^{-1}\nabla_{x}f(x^{(i)})$
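A single Newton update can be sketched as follows. Instead of forming the inverse Hessian, the code solves the linear system $H d = \nabla_x f$ with a small Gaussian elimination routine. The solver, the function names, and the quadratic test function are assumptions for illustration only.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def newton_step(grad, hess, x):
    """Return x - H(x)^{-1} grad(x)."""
    d = solve(hess(x), grad(x))
    return [xi - di for xi, di in zip(x, d)]


# Hypothetical quadratic f(x) = x_1^2 + x_1 x_2 + 2 x_2^2 - 4 x_1 - 9 x_2, minimized at (1, 2).
def grad_f(x):
    return [2.0 * x[0] + x[1] - 4.0, x[0] + 4.0 * x[1] - 9.0]


def hess_f(x):
    return [[2.0, 1.0], [1.0, 4.0]]


x_new = newton_step(grad_f, hess_f, [0.0, 0.0])
```

Because this test function is quadratic with a positive definite Hessian, one step from any starting point lands on the minimizer.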
Constrained Optimization (4.4)
We again want to minimize $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ but this time with additional conditions.
The constraints are given by:
$g_{i}(x) \leq 0$ for $i = 1,...,m$
$h_{j}(x) = 0$ for $j = 1,...,k$
with $g_{i}, h_{j}: \mathbb{R}^n \rightarrow \mathbb{R}$
We want to find $x$ that minimizes $f$ under the given conditions. We are going to do that by translating our initial constrained optimization problem into an unconstrained one.
The KKT-approach uses the general Lagrangian
$\mathcal{L}(x, \lambda, \mu) := f(x)+ \sum_{i}\lambda_{i}g_{i}(x) + \sum_{j}\mu_{j}h_{j}(x)$ where $\lambda \in \mathbb{R}^m_{\geq 0}, \ \mu\in\mathbb{R}^k$.
Our initial problem is then equivalent to solving $\underset{x}{\min} \ \underset{\mu}{\max} \ \underset{\lambda, \lambda \geq 0}{\max} \ \mathcal{L}(x, \lambda, \mu)$.
It can be shown that necessary conditions for a local optimum $(x, \lambda, \mu)$ are:
all constraints are satisfied by $(x , \lambda , \mu )$
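For reference, the KKT conditions as they are commonly stated in the optimization literature can be written out as follows. The starred symbols $x^*, \lambda^*, \mu^*$ are notation assumed here for the candidate optimum and do not appear in the notes:
Stationarity: $\nabla_x \mathcal{L}(x^*, \lambda^*, \mu^*) = 0$
Feasibility: $g_i(x^*) \leq 0$ and $\lambda^*_i \geq 0$ for $i = 1,...,m$, and $h_j(x^*) = 0$ for $j = 1,...,k$
Complementary slackness: $\lambda^*_i \, g_i(x^*) = 0$ for $i = 1,...,m$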
Newton's method: Converges in 1 step when the objective is a strictly convex quadratic, since the second-order Taylor approximation is then exact.
KKT-approach: Suppose we have an additional constraint $x^\intercal x \leq 1$.
Then $\mathcal{L}(x, \lambda) = f(x) + \lambda (x^\intercal x - 1)$.
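A small numerical check of such a Lagrangian is sketched below. Since the objective is not restated here, the code assumes a hypothetical $f(x) = \|x - c\|^2$ with $c = (2, 1)$. For this choice the constrained minimizer is the projection of $c$ onto the unit ball, with multiplier $\lambda = \|c\| - 1$. The function names and the residual dictionary are likewise illustrative.

```python
import math

c = [2.0, 1.0]  # hypothetical target point; f(x) = ||x - c||^2 is an assumption


def grad_f(x):
    return [2.0 * (xi - ci) for xi, ci in zip(x, c)]


def g(x):
    """Inequality constraint g(x) = x^T x - 1 <= 0."""
    return sum(xi * xi for xi in x) - 1.0


def grad_g(x):
    return [2.0 * xi for xi in x]


def kkt_residuals(x, lam):
    """Measure how strongly (x, lam) violates each KKT condition (0 means satisfied)."""
    stationarity = [df + lam * dg for df, dg in zip(grad_f(x), grad_g(x))]
    return {
        "stationarity": max(abs(s) for s in stationarity),
        "primal_violation": max(g(x), 0.0),
        "dual_violation": max(-lam, 0.0),
        "complementary_slackness": abs(lam * g(x)),
    }


norm_c = math.sqrt(sum(ci * ci for ci in c))
x_star = [ci / norm_c for ci in c]  # projection of c onto the unit ball
lam_star = norm_c - 1.0             # multiplier of the active constraint
residuals = kkt_residuals(x_star, lam_star)
```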
Gradient-Based Learning (6.2)
Backpropagation-Algorithms (6.5)
We now want to minimize the cost function $J(\theta)$ using gradient descent. It is thus necessary to compute the gradient $\nabla_{\theta}J(\theta)$. We will do this by letting information flow backwards through the layers of the network. This method is called backpropagation.
The Chain Rule of Calculus
Given two functions $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ and $g: \mathbb{R}^m \rightarrow \mathbb{R}$, we want to compute the partial derivatives of their composition.
Set $y = f(x)$, $z = g(y) = g(f(x))$.
The Chain Rule of Calculus states that $\frac{\partial z}{\partial x_i} = \sum_{j=1}^{m}\frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$.
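The sum in the chain rule can be checked numerically on a small example. The concrete functions `f` and `g` below are assumptions chosen so that the derivatives are easy to write by hand; the finite-difference gradient serves only as an independent comparison.

```python
def f(x):
    """y = f(x), mapping R^2 -> R^2."""
    return [x[0] * x[1], x[0] + x[1]]


def g(y):
    """z = g(y), mapping R^2 -> R."""
    return y[0] ** 2 + 3.0 * y[1]


def dz_dy(y):
    return [2.0 * y[0], 3.0]


def dy_dx(x):
    # Entry [j][i] holds the partial derivative of y_j with respect to x_i.
    return [[x[1], x[0]], [1.0, 1.0]]


def chain_rule_gradient(x):
    """dz/dx_i = sum_j dz/dy_j * dy_j/dx_i."""
    y = f(x)
    outer, inner = dz_dy(y), dy_dx(x)
    return [sum(outer[j] * inner[j][i] for j in range(len(y))) for i in range(len(x))]


def finite_difference_gradient(x, h=1e-6):
    grads = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grads.append((g(f(x_plus)) - g(f(x_minus))) / (2.0 * h))
    return grads


analytic = chain_rule_gradient([2.0, 3.0])
numeric = finite_difference_gradient([2.0, 3.0])
```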
Forward Propagation in Fully Connected MLPs
The information first flows forward through the network. For input $x$ the network outputs an estimate $\hat{y}$ of $y$ and the cost $J(\hat{y}, y, \theta)$ is computed.
Input: Network with depth $l$, $(x,y)$, $W^{(i)}, b^{(i)}$ for $i = 1,\ldots, l$
Set $h^{(0)} := x$
For $i = 1, \ldots, l$
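A forward pass for such a network might look like the sketch below. The per-layer rule $a^{(i)} = b^{(i)} + W^{(i)} h^{(i-1)}$, $h^{(i)} = \text{act}(a^{(i)})$, the tanh activation, and the squared-error cost are assumptions based on the usual fully connected setup; they are not spelled out in these notes.

```python
import math


def forward(x, weights, biases, act=math.tanh):
    """Propagate x through all layers and return the network output h^(l).

    Assumed layer rule: a = b + W h_prev, followed by h = act(a) applied elementwise.
    """
    h = list(x)  # h^(0) := x
    for W, b in zip(weights, biases):
        a = [bi + sum(wij * hj for wij, hj in zip(row, h)) for row, bi in zip(W, b)]
        h = [act(ai) for ai in a]
    return h


def squared_error(y_hat, y):
    """Assumed cost J = 0.5 * ||y_hat - y||^2 (no regularization term)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


# Hypothetical network with 2 inputs, 2 hidden units, and 1 output.
weights = [[[0.5, -0.5], [1.0, 0.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [0.0]]

y_hat = forward([1.0, 1.0], weights, biases)
cost = squared_error(y_hat, [1.0])
```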
We now backpropagate the information through the network and recursively compute the gradient of the cost function $J$ with respect to $b^{(i)}, W^{(i)}$ in the $i$-th layer. We start at the last layer and proceed backwards through the network until we reach the first layer.
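The backward recursion can be sketched for a small network as follows. As before, the tanh activation, the layer rule $a^{(i)} = b^{(i)} + W^{(i)} h^{(i-1)}$, and the squared-error cost are assumptions made for the example, and all function names are illustrative.

```python
import math


def forward(x, weights, biases):
    """Return pre-activations a^(1..l) and activations h^(0..l), assuming tanh units."""
    hs, pre = [list(x)], []
    for W, b in zip(weights, biases):
        a = [bi + sum(w * h for w, h in zip(row, hs[-1])) for row, bi in zip(W, b)]
        pre.append(a)
        hs.append([math.tanh(ai) for ai in a])
    return pre, hs


def cost(x, y, weights, biases):
    """Assumed cost J = 0.5 * ||h^(l) - y||^2."""
    _, hs = forward(x, weights, biases)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(hs[-1], y))


def backward(x, y, weights, biases):
    """Return (grad_W, grad_b) with the same nested shapes as weights and biases."""
    pre, hs = forward(x, weights, biases)
    g = [p - t for p, t in zip(hs[-1], y)]  # gradient of J w.r.t. the output h^(l)
    grad_W, grad_b = [], []
    for i in reversed(range(len(weights))):
        # Through the nonlinearity: dJ/da = dJ/dh * tanh'(a).
        g = [gi * (1.0 - math.tanh(ai) ** 2) for gi, ai in zip(g, pre[i])]
        grad_b.insert(0, list(g))
        grad_W.insert(0, [[gi * hj for hj in hs[i]] for gi in g])
        # To the previous layer: dJ/dh_prev = W^T (dJ/da).
        g = [sum(weights[i][r][c] * g[r] for r in range(len(g)))
             for c in range(len(hs[i]))]
    return grad_W, grad_b


# Hypothetical network with 2 inputs, 2 hidden units, and 1 output.
weights = [[[0.5, -0.5], [1.0, 0.3]], [[1.0, -1.0]]]
biases = [[0.1, -0.2], [0.05]]
x, y = [1.0, 2.0], [0.5]
grad_W, grad_b = backward(x, y, weights, biases)
```

A standard sanity check is to compare individual entries of `grad_W` and `grad_b` against central finite differences of `cost`.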