Back propagation with manual derivation
Let's consider a neural network
\(L\) is the total number of layers
\(s_l\) is the number of units (excluding the bias unit) in layer \(l\).
\((W,b)=(W^{(l)},b^{(l)})_{l=1}^{L-1}\) are the model parameters, where \(W_{ij}^{(l)}\) denotes the weight on the connection between unit \(j\) in layer \(l\) and unit \(i\) in layer \(l+1\), and \(b_i^{(l)}\) is the bias associated with unit \(i\) in layer \(l+1\). \(W^{(l)}\) has size \(s_{l+1}\times s_l\) and \(b^{(l)}\) has length \(s_{l+1}\).
Forward propagation
\(a^{(i)}=\sigma(z^{(i)})\)
\(z^{(i)}=W^{(i-1)}a^{(i-1)}+b^{(i-1)}\), for \(i=2,\dots,L\), with \(a^{(1)}=x\)
goal of training: learn parameters \((W,b)\) such that \(h_{W,b}(x^{(k)})\approx y^{(k)}\) for the training examples \((x^{(k)},y^{(k)})\)
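A minimal NumPy sketch of the forward pass under this notation, assuming a sigmoid activation at every layer; `W` and `b` are ordinary Python lists holding \(W^{(l)}\) and \(b^{(l)}\) in order:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Forward pass: W[l] has shape (s_{l+1}, s_l), b[l] has length s_{l+1}.
    Returns the activations a and pre-activations z of every layer."""
    a, z = [x], [None]                   # a[0] is the input layer; z[0] is unused
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)        # z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
        a.append(sigmoid(z[-1]))         # a^{(l+1)} = sigma(z^{(l+1)})
    return a, z
```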
Back propagation
Intuition: \(\delta_j^{(l)}\) is the "error" of unit \(j\) in layer \(l\).
\(\delta^{(L)}=(a^{(L)}-y^{(k)})\odot\sigma'(z^{(L)})\) for the output layer
\(\delta^{(l)}=(W^{(l)})^T\delta^{(l+1)}\odot\sigma'(z^{(l)})\) for the hidden layers \(l=L-1,\dots,2\)
\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}\)
\(\frac{\partial J(W,b)}{\partial b_i^{(l)}}=\delta_i^{(l+1)}\)
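These four equations translate directly into code. A minimal sketch for a single training example, reusing the `sigmoid` and `forward` helpers from the sketch above and assuming the squared-error cost used in the proof below:

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(x, y, W, b):
    """Gradients of the squared-error cost for a single example (x, y)."""
    a, z = forward(x, W, b)
    delta = (a[-1] - y) * sigmoid_prime(z[-1])           # error of the output layer
    grad_W = [None] * len(W)
    grad_b = [None] * len(b)
    for l in range(len(W) - 1, -1, -1):                  # walk backward over the weight layers
        grad_W[l] = np.outer(delta, a[l])                # dJ/dW^{(l)} = delta^{(l+1)} (a^{(l)})^T
        grad_b[l] = delta                                # dJ/db^{(l)} = delta^{(l+1)}
        if l > 0:
            delta = (W[l].T @ delta) * sigmoid_prime(z[l])   # propagate the error one layer back
    return grad_W, grad_b
```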
Proof
Last layer
using the chain rule!
\(\delta_j^{(L)}=(a_j^{(L)}-y_j)\sigma'(z_j^{(L)})\)
\(J(W,b)=\frac{1}{2}\sum_k(h_{W,b}(x)_k-y_k)^2=\frac{1}{2}\sum_k(a_k^{(L)}-y_k)^2\) (the cost for a single training example \((x,y)\))
derivative w.r.t. \(a_j^{(L)}\): \(\frac{\partial J(W,b)}{\partial a_j^{(L)}}=a_j^{(L)}-y_j\)
define the error of unit \(j\) in layer \(l\) as $$ \delta_j^{(l)}=\frac{\partial J(W,b)}{\partial z_j^{(l)}}. $$ For the last layer, $$ \delta_j^{(L)}=\frac{\partial J}{\partial a_j^{(L)}}\cdot\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}=(a_j^{(L)}-y_j)\,\sigma'(z_j^{(L)}) $$ (only the \(j\)-th term of the sum over output units survives, since \(a_k^{(L)}\) depends on \(z_j^{(L)}\) only when \(k=j\)). For an intermediate layer \(l\), the chain rule runs through the units of layer \(l+1\): $$ \delta_j^{(l)}=\sum_i\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}}=\sum_i\delta_i^{(l+1)}W_{ij}^{(l)}\,\sigma'(z_j^{(l)}), $$ which is exactly the recursion stated above.
Weight update
\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}\)
\(\frac{\partial J}{\partial W_{ij}^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l)}}=\delta_i^{(l+1)}a_j^{(l)}\), since \(z_i^{(l+1)}=\sum_j W_{ij}^{(l)}a_j^{(l)}+b_i^{(l)}\)
Bias update
\(\frac{\partial J(W,b)}{\partial b_{i}^{(l)}}=\delta_i^{(l+1)}\)
\(\frac{\partial J}{\partial b_i^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l)}}=\delta_i^{(l+1)}\cdot 1\)
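A standard way to verify the derived formulas is a numerical gradient check: compare the analytic gradient with a centered finite difference of the cost. A sketch reusing the `forward`/`backward` helpers from above:

```python
import numpy as np

def loss(x, y, W, b):
    a, _ = forward(x, W, b)
    return 0.5 * np.sum((a[-1] - y) ** 2)

def grad_check(x, y, W, b, l, i, j, eps=1e-5):
    """Compare dJ/dW_{ij}^{(l)} from backprop with a centered finite difference."""
    grad_W, _ = backward(x, y, W, b)
    W[l][i, j] += eps;     J_plus  = loss(x, y, W, b)
    W[l][i, j] -= 2 * eps; J_minus = loss(x, y, W, b)
    W[l][i, j] += eps                                  # restore the original weight
    numeric = (J_plus - J_minus) / (2 * eps)
    return grad_W[l][i, j], numeric                    # the two values should agree closely
```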
Training NNs in practice
Saturation
Take the sigmoid function as an example: when the magnitude of the input is large, the output of the activation barely changes, so the computed derivative \(\sigma'(z)\) is very small and the network updates slowly.
Choosing a non-sigmoid activation function mitigates this problem.
leaky ReLU, Maxout, ELU, ReLU, tanh, etc.
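A tiny numerical illustration of why saturation matters (the numbers are only for demonstration): the sigmoid derivative is nearly zero for large \(|z|\), while a leaky ReLU keeps a non-zero gradient everywhere.

```python
import numpy as np

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = sigmoid * (1.0 - sigmoid)      # ~4.5e-5 at |z| = 10: almost no gradient

leaky_relu_grad = np.where(z > 0, 1.0, 0.01)  # gradient never smaller than 0.01

print(sigmoid_grad)
print(leaky_relu_grad)
```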
Vanishing gradient problem
Saturation leads to vanishing gradients.
The gradient of the loss function can approach zero; as it is propagated backward it shrinks layer by layer, so the parameters of the earlier layers barely get updated.
possible solutions: batch normalization and residual connections
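For reference, a residual connection in its simplest form just adds a block's input to its output, giving gradients an identity path backward. A schematic sketch, not tied to any particular framework (the `activation` argument and the matching weight shapes are assumptions for illustration):

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2, activation=np.tanh):
    """y = x + F(x): the identity path gives gradients a direct route backward,
    so they do not have to pass through every (possibly saturating) nonlinearity."""
    h = activation(W1 @ x + b1)      # hidden layer of the residual branch F
    return x + (W2 @ h + b2)         # add the skip connection (shapes must match)
```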
Over-fitting
too many parameters, too little training data;
the model fails to generalize to new data.
L2 regularization
weight decay
\(J(W,b)=\frac{1}{2n}\sum_i\sum_k\big(h_{W,b}(x^{(i)})_k-y_k^{(i)}\big)^2+\frac{\lambda}{2n}\sum_l\sum_i\sum_j\big(W_{ij}^{(l)}\big)^2\)
the partial derivative gets an extra term from the regularizer:
\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=\frac{1}{n}\Delta W_{ij}^{(l)}+\frac{\lambda}{n}W_{ij}^{(l)}\),
where \(\Delta W_{ij}^{(l)}\) denotes the unregularized gradient \(a_j^{(l)}\delta_i^{(l+1)}\) summed over the \(n\) training examples.
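In the gradient-descent update this extra term shrinks every weight toward zero at each step, which is why L2 regularization is also called weight decay. A sketch of one update step, where `grad_W` is assumed to hold the unregularized gradients summed over the \(n\) training examples (as returned by accumulating the `backward` sketch above):

```python
def sgd_step_with_weight_decay(W, grad_W, n, alpha=0.1, lam=1e-3):
    """W^{(l)} <- W^{(l)} - alpha * ( (1/n) * Delta W^{(l)} + (lambda/n) * W^{(l)} )."""
    for l in range(len(W)):
        W[l] -= alpha * (grad_W[l] / n + (lam / n) * W[l])
    return W
```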
other types of regularization
dropout
early stopping
- up to a certain point, gradient descent improves the model's performance on data outside of the training set
- provides guidance as to how many iterations can be run before the model begins to overfit
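A minimal early-stopping loop might look as follows; `train_one_epoch`, `validation_loss`, and `model.copy()` are hypothetical helpers, and `patience` is the number of non-improving epochs tolerated before stopping:

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=1000, patience=10):
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, model.copy()   # remember the best model so far
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                        # validation loss stopped improving
    return best_state
```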
batch normalization
data augmentation
stochastic gradient descent
- for a large training set, computing the loss and gradient over the entire training set is very slow
- in practice each parameter update in SGD is computed w.r.t. a few training examples (a minibatch) as opposed to a single example
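A sketch of minibatch SGD built on the `backward` helper from above, assuming the training examples are stored in sequences `X` and `Y`:

```python
import numpy as np

def sgd(W, b, X, Y, alpha=0.1, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)                    # shuffle the training set
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # accumulate gradients over the minibatch
            grad_W = [np.zeros_like(Wl) for Wl in W]
            grad_b = [np.zeros_like(bl) for bl in b]
            for k in batch:
                gW, gb = backward(X[k], Y[k], W, b)
                for l in range(len(W)):
                    grad_W[l] += gW[l]
                    grad_b[l] += gb[l]
            # one parameter update per minibatch
            for l in range(len(W)):
                W[l] -= alpha * grad_W[l] / len(batch)
                b[l] -= alpha * grad_b[l] / len(batch)
    return W, b
```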
using other activation function
detecting over-fitting
if the loss on the training set keeps decreasing over time, showing no sign of converging,
while the loss on the validation set increases, the model is over-fitting.
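This criterion can be checked mechanically from the recorded loss curves; a small heuristic sketch, where `train_losses` and `val_losses` are assumed to be per-epoch loss histories:

```python
def looks_overfit(train_losses, val_losses, window=5):
    """Heuristic check: training loss still falling while validation loss rises."""
    if len(train_losses) <= window or len(val_losses) <= window:
        return False
    train_still_improving = train_losses[-1] < train_losses[-1 - window]
    val_getting_worse = val_losses[-1] > val_losses[-1 - window]
    return train_still_improving and val_getting_worse
```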