Gradient Descent and Backpropagation Algorithm
Lecture 2 · Yann LeCun · New York University · Spring 2020
If there is any error, please raise an issue.
Link: Prof. Canziani’s ipynb
A linear transformation (nn.Linear()) can rotate, reflect, stretch, and compress, but it cannot curve; hence, a non-linearity is required.
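A quick sanity check of this point (a minimal sketch; the layer sizes and module names below are illustrative assumptions): two stacked affine maps without a non-linearity collapse into a single linear map, whereas inserting a ReLU between them breaks that collapse.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked affine maps (no bias) collapse into one linear map:
# W2 (W1 x) = (W2 W1) x, so stacking alone adds no curvature.
f1 = nn.Linear(3, 4, bias=False)
f2 = nn.Linear(4, 2, bias=False)

x = torch.randn(3)
stacked = f2(f1(x))
collapsed = (f2.weight @ f1.weight) @ x
print(torch.allclose(stacked, collapsed))  # True

# A non-linearity between the layers breaks this equivalence,
# which is what lets the network bend its decision boundaries.
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
```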
Average loss \(\mathcal{L}(S, w)\): the loss averaged over all samples in the training set \(S\); this is the quantity actually minimized during training.
\[\mathcal{L}(S, w) = \frac{1}{P} \sum_{p = 0}^{P-1} L((x[p], y[p]), w)\]
ML is essentially about optimizing functions: usually minimizing a loss, and sometimes maximizing an objective (e.g., reward functions in RL).
For non-differentiable functions, zeroth-order (gradient-free) methods are used.
Instead of the full gradient (over all samples), stochastic gradient descent (SGD) updates the weights using the gradient of a single example:
\[w \leftarrow w - \eta \frac{\partial L((x[p], y[p]), w)}{\partial w}\]
Since SGD on a single example gives a noisy trajectory toward the minimum, in practice mini-batches are used, which also allows parallelization (improving computational efficiency).
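A minimal mini-batch SGD sketch in PyTorch (the synthetic data, learning rate, and batch size below are illustrative assumptions, not values from the lecture):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 5)                      # P = 100 samples, 5 features
y = X @ torch.randn(5) + 0.1 * torch.randn(100)

w = torch.zeros(5, requires_grad=True)
lr, batch_size = 0.1, 16

for epoch in range(20):
    perm = torch.randperm(len(X))            # reshuffle samples each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        loss = ((X[idx] @ w - y[idx]) ** 2).mean()  # average loss on the batch
        loss.backward()                              # dL/dw
        with torch.no_grad():
            w -= lr * w.grad                         # w <- w - eta * dL/dw
        w.grad.zero_()
```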
For a neural net with forward pass:
\[x \rightarrow f(x, w_f) \rightarrow z_f \rightarrow g(z_f, w_g) \rightarrow z_g \rightarrow C(z_g, y) \rightarrow C\]
the chain rule for vector-valued functions is
\[z_g:[d_g \times 1], \ z_f:[d_f \times 1]\]
\[\frac{\partial C}{\partial z_f} = \frac{\partial C}{\partial z_g} \frac{\partial z_g}{\partial z_f}\]
\[[1 \times d_f] = [1 \times d_g] \text{*} [d_g \times d_f]\]
What is \(\frac{\partial z_g}{\partial z_f}\)?
Ans: Jacobian Matrix (basic idea) (detailed)
Its (i, j) entry is the partial derivative of the i-th output w.r.t. the j-th input. Perturbing the j-th input affects every component of the output vector.
\[\Bigg(\frac{\partial z_g}{\partial z_f}\Bigg)_{ij} = \frac{(\partial z_g)_{i}}{(\partial z_f)_j}\]
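To make the shapes concrete, here is a small sketch using torch.autograd.functional.jacobian for a hypothetical module \(g\) mapping \(d_f = 3\) inputs to \(d_g = 2\) outputs (the module itself is an arbitrary illustration):

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical module g: R^3 -> R^2, just to check shapes.
W = torch.randn(2, 3)

def g(z_f):
    return torch.tanh(W @ z_f)

z_f = torch.randn(3)
J = jacobian(g, z_f)          # (dz_g / dz_f)_{ij} = d(z_g)_i / d(z_f)_j
print(J.shape)                # torch.Size([2, 3]) -> [d_g x d_f]

# Backprop multiplies the upstream row vector dC/dz_g [1 x d_g]
# by this Jacobian to get dC/dz_f [1 x d_f]:
dC_dzg = torch.randn(2)
dC_dzf = dC_dzg @ J           # shape [3]
```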
Linear
\[Y = WX; \ \ \ \frac{\text{d}C}{\text{d}X} = W^\top \frac{\text{d}C}{\text{d}Y}\]
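A quick autograd check of this rule (a sketch; the cost \(C = \sum_i V_i Y_i\) is chosen only so that \(\frac{\text{d}C}{\text{d}Y} = V\)):

```python
import torch

W = torch.randn(4, 3)
X = torch.randn(3, requires_grad=True)
V = torch.randn(4)

Y = W @ X                 # linear module
C = (Y * V).sum()         # scalar cost with dC/dY = V
C.backward()

manual = W.t() @ V        # W^T dC/dY
print(torch.allclose(X.grad, manual))  # True
```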
ReLU
\[y = \text{ReLU}(x); \ \ \ \frac{\text{d}C}{\text{d}x} = \left\{ \begin{array}{ll} 0, & \text{if } x < 0\\ \frac{\text{d}C}{\text{d}y}, & \text{otherwise} \end{array} \right.\]
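The same check for ReLU (a minimal sketch; the upstream gradient is taken to be all ones):

```python
import torch

# ReLU gates the upstream gradient: it passes dC/dy where x > 0
# and blocks it (gradient 0) where x < 0.
x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
y = torch.relu(x)
y.backward(torch.ones_like(y))   # pretend dC/dy = 1 everywhere
print(x.grad)                    # tensor([0., 1., 1.])
```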
Duplicate
\[Y_1 = X, \ Y_2 = X; \ \ \ \frac{\text{d}C}{\text{d}X} = \frac{\text{d}C}{\text{d}Y_1} + \frac{\text{d}C}{\text{d}Y_2}\]
Note that the gradients are summed. PyTorch handles this accumulation automatically, but it is worth keeping in mind.
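This accumulation happens whenever one tensor feeds two branches of the graph; a small illustration (the two branches below are arbitrary):

```python
import torch

# The same X feeds two branches; autograd sums the two upstream gradients,
# matching dC/dX = dC/dY1 + dC/dY2.
X = torch.randn(3, requires_grad=True)
Y1 = 2 * X        # first branch: contributes 2 to each gradient entry
Y2 = 5 * X        # second branch: contributes 5
C = Y1.sum() + Y2.sum()
C.backward()
print(X.grad)     # tensor([7., 7., 7.])
```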
Add
\[Y = X_1 + X_2; \ \ \ \frac{\text{d}C}{\text{d}X_1} = \frac{\text{d}C}{\text{d}Y}, \ \frac{\text{d}C}{\text{d}X_2} = \frac{\text{d}C}{\text{d}Y}\]
Max
\[y = \max(x_1, x_2); \ \ \ \frac{\text{d}C}{\text{d}x_1} = \left\{ \begin{array}{ll} \frac{\text{d}C}{\text{d}y}, & \text{if } x_1 > x_2\\ 0, & \text{otherwise} \end{array} \right.\]
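A small check that max routes the gradient only to the winning input (sketch; the values are arbitrary):

```python
import torch

# Only the larger input receives the upstream gradient.
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)
y = torch.maximum(x1, x2)
y.backward()
print(x1.grad, x2.grad)   # tensor(0.) tensor(1.)
```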
LogSoftmax
\[Y_i = X_i - \log \left[\sum_{j} \exp (X_j) \right]\]
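A quick check of this identity against the library routine (sketch; the input vector is arbitrary):

```python
import torch
import torch.nn.functional as F

# Y_i = X_i - log(sum_j exp(X_j)), computed with the stable logsumexp,
# matches the built-in log_softmax.
X = torch.randn(5)
manual = X - torch.logsumexp(X, dim=0)
print(torch.allclose(manual, F.log_softmax(X, dim=0)))  # True
```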