
Back propagation with manual derivation

Let's consider a neural network

\(L\) is the total number of layers.

\(s_l\) is the number of units (excluding the bias unit) in layer \(l\).

\((W,b)=(W_{ij}^{(l)},b_i^{(l)})_{l=1}^{L-1}\) are the model parameters, where \(W_{ij}^{(l)}\) denotes the weight associated with the connection between unit \(j\) in layer \(l\) and unit \(i\) in layer \(l+1\), and \(b_i^{(l)}\) is the bias associated with unit \(i\) in layer \(l+1\). \(W^{(l)}\) has size \(s_{l+1}\times s_l\) and \(b^{(l)}\) has length \(s_{l+1}\).

Forward propagation

\(a^{(i)}=\sigma(z^{(i)})\)

\(z^{(i)}=W^{(i-1)}a^{(i-1)}+b^{(i-1)}\)
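A minimal NumPy sketch of the forward pass under these definitions; the function names, the list-of-arrays layout for \(W\) and \(b\), and the sigmoid choice for \(\sigma\) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    """The activation sigma; the sigmoid choice is an assumption."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """One forward pass: z^{(i)} = W^{(i-1)} a^{(i-1)} + b^{(i-1)}, a^{(i)} = sigma(z^{(i)}).
    W and b are lists of per-layer weight matrices and bias vectors."""
    zs, activations = [], [x]          # a^{(1)} = x
    a = x
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl                # pre-activation of the next layer
        a = sigmoid(z)                 # activation of the next layer
        zs.append(z)
        activations.append(a)
    return zs, activations
```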

goal of training: learn parameters such that \(h_{W,b}(x^{(k)})\approx y^{(k)}\) for every training example \(k\)

Back propagation

intuition: \(\delta_j^{(l)}\) is the error of unit \(j\) in layer \(l\).

\(\delta^{(L)}=(a^{(L)}-y^{(k)})\odot\sigma'(z^{(L)})\)

\(\delta^{(i)}=(W^{(i)})^T\delta^{(i+1)}\odot\sigma'(z^{(i)})\)

\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}\)

\(\frac{\partial J(W,b)}{\partial b_i^{(l)}}=\delta_i^{(l+1)}\)
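A sketch of the backward pass implementing these four equations, reusing the hypothetical `forward`/`sigmoid` helpers and the `numpy` import from the sketch above (quadratic cost and sigmoid \(\sigma\) assumed):

```python
def sigmoid_prime(z):
    """sigma'(z) for the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, W, b):
    """Gradients of the quadratic cost for a single example (x, y)."""
    zs, activations = forward(x, W, b)
    grad_W = [np.zeros_like(Wl) for Wl in W]
    grad_b = [np.zeros_like(bl) for bl in b]

    # output layer: delta^{(L)} = (a^{(L)} - y) ⊙ sigma'(z^{(L)})
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_W[-1] = np.outer(delta, activations[-2])    # dJ/dW_ij = a_j delta_i
    grad_b[-1] = delta

    # hidden layers: delta^{(l)} = (W^{(l)})^T delta^{(l+1)} ⊙ sigma'(z^{(l)})
    for l in range(len(W) - 2, -1, -1):
        delta = (W[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_W[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
    return grad_W, grad_b
```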

Proof

last layer

using chain rule!

\(\delta_j^{(L)}=(a_j^{(L)}-y_j)\sigma'(z_j^{(L)})\)

\(J(W,b)=\frac{1}{2}\sum_k(h_{W,b}(x)_k-y_k)^2=\frac{1}{2}\sum_k(a_k^{(L)}-y_k)^2\)

derivative w.r.t. \(a_j^{(L)}\): \(\frac{\partial J(W,b)}{\partial a_j^{(L)}}=a_j^{(L)}-y_j\)

Define the error of unit \(j\) in layer \(l\) as \(\delta_j^{(l)}=\frac{\partial J(W,b)}{\partial z_j^{(l)}}\). For the last layer, $$ \delta_j^{(L)}=\frac{\partial J(W,b)}{\partial z_j^{(L)}}=\frac{\partial J}{\partial a_j^{(L)}}\cdot\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}=(a_j^{(L)}-y_j)\,\sigma'(z_j^{(L)}) $$ The proof for the intermediate layers is similar, except that the chain rule sums over all the units of layer \(l+1\).
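Spelled out, using \(z_k^{(l+1)}=\sum_j W_{kj}^{(l)}\sigma(z_j^{(l)})+b_k^{(l)}\):

$$ \delta_j^{(l)}=\frac{\partial J}{\partial z_j^{(l)}}=\sum_k\frac{\partial J}{\partial z_k^{(l+1)}}\cdot\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}W_{kj}^{(l)}\,\sigma'(z_j^{(l)})=\big((W^{(l)})^T\delta^{(l+1)}\big)_j\,\sigma'(z_j^{(l)}) $$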

Weight update

\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}\)

\(\frac{\partial J}{\partial W}=\frac{\partial J}{\partial z}\times\frac{\partial z}{\partial W}\)
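Written out with indices, using \(z_i^{(l+1)}=\sum_j W_{ij}^{(l)}a_j^{(l)}+b_i^{(l)}\):

$$ \frac{\partial J}{\partial W_{ij}^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l)}}=\delta_i^{(l+1)}\,a_j^{(l)} $$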

Bias update

\(\frac{\partial J(W,b)}{\partial b_{i}^{(l)}}=\delta_i^{(l+1)}\)

\(\frac{\partial J}{\partial b}=\frac{\partial J}{\partial z}\times\frac{\partial z}{\partial b}\)
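Here \(\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l)}}=1\), so

$$ \frac{\partial J}{\partial b_i^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l)}}=\delta_i^{(l+1)} $$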

Training NNs in practice

Saturation

Taking the sigmoid function as an example: when the magnitude of the input is large, the activation value barely changes, so the computed gradients are very small and the network updates slowly.

Choosing a non-sigmoid activation function mitigates this problem:

leaky ReLU, Maxout, ELU, ReLU, tanh, ...
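A quick numerical illustration: the sigmoid derivative \(\sigma'(z)=\sigma(z)(1-\sigma(z))\) collapses towards zero as \(|z|\) grows (the values in the comment are approximate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    # prints roughly 0.25, 0.10, 0.0066, 4.5e-05: the gradient shrinks rapidly
    print(z, s * (1.0 - s))
```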

Vanishing gradient problem

Saturation leads to vanishing gradients.

The gradient of the loss function can approach zero; as the error propagates backwards it becomes smaller and smaller, and the parameters of the earlier layers barely update.

possible solutions: batch normalization and residual connections
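A residual connection in its simplest form adds the block's input back to its output, giving the gradient an identity path around the block; a minimal sketch, with the two-layer block and the ReLU activation as assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """y = x + F(x): the identity path lets the gradient bypass the block,
    so it does not have to shrink through every saturating nonlinearity."""
    h = relu(W1 @ x + b1)              # inner transformation F(x)
    return x + (W2 @ h + b2)           # add the shortcut (skip) connection
```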

Over-fitting

too many parameters, too little training data

The model fails to generalize to new data.

L2 regularization

weight decay

\(J(W,b)=\frac{1}{2n}\sum_i\sum_k\big(h(x^{(i)})_k-y_k^{(i)}\big)^2+\frac{\lambda}{2n}\sum_l\sum_i\sum_j\big(W_{ij}^{(l)}\big)^2\)

the partial derivative gets an extra term:

\(\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=\frac{1}{n}\Delta W_{ij}^{(l)}+\frac{\lambda}{n}W_{ij}^{(l)}\), where \(\Delta W_{ij}^{(l)}\) sums the per-example gradients \(a_j^{(l)}\delta_i^{(l+1)}\) over the \(n\) training examples.
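In a gradient-descent step the extra term acts as weight decay; a minimal sketch, where `eta`, `lam`, and the averaged data-term gradients `grad_W`, `grad_b` are assumed inputs:

```python
def sgd_step_l2(W, b, grad_W, grad_b, eta, lam, n):
    """One gradient-descent step with L2 regularization (weight decay).
    grad_W, grad_b are the data-term gradients averaged over the n examples."""
    for l in range(len(W)):
        W[l] -= eta * (grad_W[l] + (lam / n) * W[l])   # extra (lambda/n) W term
        b[l] -= eta * grad_b[l]                        # biases are usually not regularized
    return W, b
```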

other types of regularization

dropout

early stopping

  • Up to a certain point, gradient descent improves the model's performance on data outside of the training set.
  • It provides guidance as to how many iterations can be run before the model begins to overfit (see the sketch after this list).
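A minimal early-stopping loop, where `model`, `train_step`, `val_loss`, and `patience` are hypothetical placeholders:

```python
def train_with_early_stopping(model, train_step, val_loss, max_epochs=200, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model)                    # one epoch of gradient descent
        loss = val_loss(model)               # evaluate on the validation set
        if loss < best:
            best, best_epoch = loss, epoch   # still improving: keep going
        elif epoch - best_epoch >= patience:
            break                            # validation loss stopped improving
    return best_epoch                        # a guide for how long to train
```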

batch normalization

data augmentation

stochastic gradient descent

  • For a large training set, computing the loss and gradient over the entire training set is very slow.
  • In practice each parameter update in SGD is computed w.r.t. a few training examples (a minibatch) rather than a single example (see the sketch after this list).
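A minimal minibatch-SGD loop; `params`, `grad_fn`, and the hyper-parameters are hypothetical placeholders:

```python
import numpy as np

def minibatch_sgd(params, grad_fn, X, Y, eta=0.1, batch_size=32, epochs=10):
    """Update `params` using gradients computed on small random minibatches
    instead of the full training set."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)             # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grads = grad_fn(params, X[batch], Y[batch])
            for p, g in zip(params, grads):
                p -= eta * g                          # in-place parameter update
    return params
```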

using other activation function

detecting over-fitting

The loss on the training set keeps decreasing over time, showing no tendency to converge.

Meanwhile, the loss on the validation set increases.