Introduction to Neural Networks
Logistic Regression
Goal: Find cats in images (binary classification).
Suppose we have a $64 \times 64$ pixel image of a cat. With three color channels (RGB), we flatten it into a feature vector of size $64 \times 64 \times 3 = 12{,}288$.
Training steps of logistic regression:
- Initialize $w$ and $b$.
- Find the optimal $w$ and $b$ by minimizing the cost function.
- Use $\hat{y} = \sigma(w^T x + b)$ to predict.
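For concreteness, here is a minimal NumPy sketch of these training steps (the learning rate, iteration count, and function names are illustrative assumptions, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.01, iters=1000):
    """X: (n, m) matrix of m examples as columns; y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w = np.zeros((n, 1))                      # initialize w and b
    b = 0.0
    for _ in range(iters):
        y_hat = sigmoid(w.T @ X + b)          # predictions, shape (1, m)
        dw = (X @ (y_hat - y).T) / m          # gradient of the cost w.r.t. w
        db = np.sum(y_hat - y) / m            # gradient of the cost w.r.t. b
        w -= lr * dw                          # gradient descent step
        b -= lr * db
    return w, b
```

A new image $x$ is then classified by thresholding $\hat{y} = \sigma(w^T x + b)$ at $0.5$.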
In deep learning, neurons are the building blocks, and models are built by composing them:
\[\begin{aligned} \text{neuron} &= \text{linear} + \text{activation} \\ \text{model} &= \text{architecture} + \text{parameters} \end{aligned}\]
Goal 2: Find cat/lion/iguana in images (multi-class classification).
As in logistic regression, we convert the image into a feature vector. In deep learning, we connect every element of the feature vector to each neuron in the first layer and compute each neuron's output, which is a logistic regression predictor. Each neuron indicates the presence of cat/lion/iguana in the image; the neurons are independent and do not share information.
The loss function, summing the binary cross-entropy over the three independent neurons, is:
\[L(\hat{y}, y) = - \sum_{k=1}^{3} \left[y_{k} \log(\hat{y}_{k}) + \bigl(1 - y_{k}\bigr) \log\bigl(1 - \hat{y}_{k}\bigr)\right]\]
Goal 3: Find cat/lion/iguana in images with the constraint that there is only one animal in each image.
As mentioned, a neuron is a combination of a linear part and an activation. Let $z_{i}^{[j]} = w_{i}^{[j]}x + b_{i}^{[j]}$ denote the linear part of neuron $i$ in layer $j$. We then use the softmax function, $\hat{y}_{i} = \frac{e^{z_{i}}}{\sum_{k} e^{z_{k}}}$, to normalize the outputs into a probability distribution. At prediction time, we set the neuron with the highest probability to 1 and the rest to 0 to satisfy the constraint.
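A minimal NumPy sketch of this softmax step (the variable names and example values are assumptions):

```python
import numpy as np

def softmax(z):
    """Normalize the linear outputs z (one per class) into probabilities."""
    z = z - np.max(z)              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])      # linear outputs for cat / lion / iguana
probs = softmax(z)                 # roughly [0.66, 0.24, 0.10]
prediction = np.argmax(probs)      # pick the single most likely animal
```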
Neural Network
Goal: images $\implies$ cat vs. no cat $\{1, 0\}$
A neural network has an input layer, one or more hidden layers, and an output layer. The input layer is the feature vector, the hidden layers contain the neurons, and the output layer produces the prediction.
An end-to-end neural network learns its own intermediate features of the image and predicts the output directly from the pixels, without hand-designed features or constraints.
Forward propagation equations
\[\begin{aligned} z^{[1]} &= w^{[1]} x + b^{[1]} \\ a^{[1]} &= \sigma\bigl(z^{[1]}\bigr) \\ z^{[2]} &= w^{[2]} a^{[1]} + b^{[2]} \\ a^{[2]} &= \sigma\bigl(z^{[2]}\bigr) \\ z^{[3]} &= w^{[3]} a^{[2]} + b^{[3]} \\ a^{[3]} &= \sigma\bigl(z^{[3]}\bigr) \end{aligned}\]
We optimize $w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, w^{[3]}, b^{[3]}$, where $[\cdot]$ indicates the layer index.
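A minimal NumPy sketch of this forward pass (the parameter container, helper names, and sigmoid definition are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward propagation through the 3-layer network above; a3 is y_hat."""
    w1, b1, w2, b2, w3, b3 = params
    z1 = w1 @ x + b1
    a1 = sigmoid(z1)
    z2 = w2 @ a1 + b2
    a2 = sigmoid(z2)
    z3 = w3 @ a2 + b3
    a3 = sigmoid(z3)
    return z1, a1, z2, a2, z3, a3      # cache for backpropagation
```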
Define the loss/cost function:
\[\begin{aligned} J(\hat{y}, y) &= \frac{1}{m} \sum_{i=1}^{m} L\bigl(\hat{y}^{(i)}, y^{(i)}\bigr) \\ L\bigl(\hat{y}^{(i)}, y^{(i)}\bigr) &= - \left[y^{(i)} \log\bigl(\hat{y}^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \hat{y}^{(i)}\bigr)\right] \end{aligned}\]
Here $L$ is the per-example loss and $J$ is the cost averaged over the $m$ training examples. One practical reason for this form is that the sum over examples vectorizes well and can be parallelized on GPUs.
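A minimal NumPy sketch of this cost, assuming $\hat{y}$ and $y$ are stored as $(1, m)$ arrays (the clipping constant is an added numerical safeguard, not part of the notes):

```python
import numpy as np

def cost(y_hat, y, eps=1e-12):
    """Average binary cross-entropy over the m training examples."""
    m = y.shape[1]
    y_hat = np.clip(y_hat, eps, 1 - eps)      # avoid log(0)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m
```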
Backpropagation
$\forall l \in \{1,2,3\}$:
\[\begin{aligned} w^{[l]} &:= w^{[l]} - \alpha \frac{\partial J}{\partial w^{[l]}} \\ b^{[l]} &:= b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}} \end{aligned}\]
where $\alpha$ is the learning rate.
Then, by the chain rule:
\[\begin{aligned} \frac{\partial J}{\partial w^{[3]}} &= \frac{\partial J}{\partial a^{[3]}} \cdot \frac{\partial a^{[3]}}{\partial z^{[3]}} \cdot \frac{\partial z^{[3]}}{\partial w^{[3]}} \\ \frac{\partial J}{\partial w^{[2]}} &= \frac{\partial J}{\partial a^{[3]}} \cdot \frac{\partial a^{[3]}}{\partial z^{[3]}} \cdot \frac{\partial z^{[3]}}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \end{aligned}\]
We also have
\[\begin{aligned} \frac{\partial J}{\partial w^{[3]}} &= -\left[y^{(i)} \frac{\partial}{\partial w^{[3]}} \log\bigl(\hat{y}^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \frac{\partial}{\partial w^{[3]}} \log\bigl(1 - \hat{y}^{(i)}\bigr)\right] \\ &= -\left[y^{(i)} \frac{1}{a^{[3]}} a^{[3]}\bigl(1 - a^{[3]}\bigr)a^{[2]T} + \bigl(1 - y^{(i)}\bigr) \frac{1}{1 - a^{[3]}} \bigl(-a^{[3]}\bigl(1 - a^{[3]}\bigr)\bigr)a^{[2]T}\right] \\ &= -\left[y^{(i)}\bigl(1 - a^{[3]}\bigr)a^{[2]T} - \bigl(1 - y^{(i)}\bigr)a^{[3]}a^{[2]T}\right] \\ &= \bigl(a^{[3]} - y^{(i)}\bigr)a^{[2]T} \end{aligned}\]
\[\begin{aligned} \frac{\partial J}{\partial w^{[2]}} &= \frac{\partial J}{\partial a^{[3]}} \cdot \frac{\partial a^{[3]}}{\partial z^{[3]}} \cdot \frac{\partial z^{[3]}}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \\ &= \bigl(a^{[3]} - y^{(i)}\bigr) \cdot \frac{\partial z^{[3]}}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \\ &= w^{[3]T}\bigl(a^{[3]} - y^{(i)}\bigr) \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \\ &= w^{[3]T}\bigl(a^{[3]} - y^{(i)}\bigr) \circ a^{[2]} \circ \bigl(1 - a^{[2]}\bigr) \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \\ &= \Bigl[w^{[3]T}\bigl(a^{[3]} - y^{(i)}\bigr) \circ a^{[2]} \circ \bigl(1 - a^{[2]}\bigr)\Bigr] a^{[1]T} \end{aligned}\]
Here $\circ$ denotes element-wise multiplication.
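Following this derivation, a minimal NumPy sketch of the gradients for a single example (stored as column vectors), reusing the cache from the forward pass sketched earlier (function and variable names are assumptions):

```python
import numpy as np

def backward(x, y, params, cache):
    """Gradients of the loss w.r.t. all parameters for one training example."""
    w1, b1, w2, b2, w3, b3 = params
    z1, a1, z2, a2, z3, a3 = cache             # from the forward pass
    dz3 = a3 - y                               # sigmoid + cross-entropy simplification
    dw3, db3 = dz3 @ a2.T, dz3                 # dJ/dw3 = (a3 - y) a2^T
    dz2 = (w3.T @ dz3) * a2 * (1 - a2)         # chain rule through layer 2
    dw2, db2 = dz2 @ a1.T, dz2
    dz1 = (w2.T @ dz2) * a1 * (1 - a1)         # same pattern for layer 1
    dw1, db1 = dz1 @ x.T, dz1
    return dw1, db1, dw2, db2, dw3, db3
```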
Improve NNs
Activation functions
Sigmoid, ReLU, tanh, etc.
\[\begin{aligned} \text{sigmoid}(z) &= \frac{1}{1 + e^{-z}} \\ \text{ReLU}(z) &= \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases} \\ \tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \end{aligned}\]
If we do not use activation functions, the network collapses into a single linear model: for example, $a^{[2]} = w^{[2]} \bigl(w^{[1]}x + b^{[1]}\bigr) + b^{[2]} = w^{[2]}w^{[1]}x + \bigl(w^{[2]}b^{[1]} + b^{[2]}\bigr)$, which is linear in $x$. In other words, the representational power of the NN comes from the non-linear activation functions.
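A quick numerical check of this collapse, with illustrative layer sizes (all names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
w2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two "layers" without activations...
a2 = w2 @ (w1 @ x + b1) + b2
# ...equal one linear layer with weights w2 w1 and bias w2 b1 + b2.
a2_single = (w2 @ w1) @ x + (w2 @ b1 + b2)
assert np.allclose(a2, a2_single)
```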
Initialization methods
Normalization
Standardize the inputs by subtracting the mean and dividing by the standard deviation (computed over the training set):
\[\tilde{x} = \frac{x - \mu}{\sigma}\]
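A minimal sketch, assuming the examples are stored as columns of X (the epsilon is an added safeguard against division by zero):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Subtract the per-feature mean and divide by the per-feature std."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    return (X - mu) / (sigma + eps), mu, sigma   # keep mu, sigma to normalize test data
```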
Vanishing/exploding gradients
In a deep network, the signal is multiplied by a weight matrix at every layer, so if the weights are too large the activations and gradients grow exponentially with depth (exploding gradients), and if the weights are too small they shrink exponentially (vanishing gradients). Careful initialization keeps the scale of the signal roughly constant across layers.
A heuristic that works well in practice is to scale each weight as $w_{i} \sim \frac{1}{n}$, where $n$ is the number of inputs to the neuron.
For layer weights, $w^{[l]} = \text{np.random.randn}(\text{shape}) \times \sqrt{\frac{1}{n^{[l-1]}}}$, where $n^{[l-1]}$ is the number of units in the previous layer.
If we use ReLU activations, we instead use $w^{[l]} = \text{np.random.randn}(\text{shape}) \times \sqrt{\frac{2}{n^{[l-1]}}}$.
Xavier initialization
Xavier initialization proposes $w^{[l]} \sim \sqrt{\frac{1}{n^{[l-1]}}}$, which is well suited to tanh activations.
He initialization
He initialization, designed for ReLU activations, proposes $w^{[l]} \sim \sqrt{\frac{2}{n^{[l-1]}}}$; a related variant averages the fan-in and fan-out, $w^{[l]} \sim \sqrt{\frac{2}{n^{[l]} + n^{[l-1]}}}$.
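A sketch of these schemes in NumPy, with assumed layer sizes:

```python
import numpy as np

n_prev, n_curr = 256, 64                      # n^{[l-1]} and n^{[l]} (assumed sizes)

# Xavier initialization (tanh): variance 1 / n^{[l-1]}
w_xavier = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# He initialization (ReLU): variance 2 / n^{[l-1]}
w_he = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

# Variant averaging fan-in and fan-out
w_avg = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / (n_curr + n_prev))

b = np.zeros((n_curr, 1))                     # biases are usually initialized to zero
```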
Optimization
Mini-batch gradient descent
Batch gradient descent processes all $m$ examples at once, which lets us exploit vectorization in the forward and backward passes, but each parameter update is expensive. Stochastic gradient descent updates after every single example, which gives faster but noisier updates and loses the benefit of vectorization. Mini-batch gradient descent is a compromise between the two.
Consider $X = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$ and $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})$.
We split the data into $T$ mini-batches, $X = (X^{(1)}, X^{(2)}, \ldots, X^{(T)})$ and $Y = (Y^{(1)}, Y^{(2)}, \ldots, Y^{(T)})$, where $X^{(i)}$ and $Y^{(i)}$ are the $i$-th mini-batch.
For iteration $t = 1, 2, \ldots$:
- Select the mini-batch $(X^{(t)}, Y^{(t)})$.
- Run forward propagation on the mini-batch and compute the cost.
- Run backpropagation on the mini-batch and update the parameters.
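A sketch of this loop in NumPy, assuming batch versions of the forward and backward helpers sketched earlier (batch size, epoch count, and learning rate are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, Y, params, batch_size=64, epochs=10, lr=0.01):
    """X: (n, m) examples as columns; Y: (1, m) labels; params: list of arrays."""
    m = X.shape[1]
    for _ in range(epochs):
        perm = np.random.permutation(m)                  # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            X_batch, Y_batch = X[:, idx], Y[:, idx]      # select (X^(t), Y^(t))
            cache = forward(X_batch, params)             # forward propagation
            grads = backward(X_batch, Y_batch, params, cache)   # backprop on the batch
            params = [p - lr * g for p, g in zip(params, grads)]  # gradient step
    return params
```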
GD with momentum
Gradient descent with momentum keeps an exponentially weighted moving average of past gradients and steps in that averaged direction, which damps oscillations and typically speeds up convergence.
\[\begin{aligned} v &= \beta v + (1-\beta) \frac{\partial J}{\partial w} \\ w &= w - \alpha v \end{aligned}\]
where $\beta$ controls how much of the past gradients is remembered (a typical value is $\beta = 0.9$) and $\alpha$ is the learning rate.
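A minimal sketch of this update for one parameter array (the default $\beta$ and $\alpha$ values are common choices, not prescribed by the notes):

```python
import numpy as np

def momentum_update(w, dw, v, beta=0.9, alpha=0.01):
    """One gradient descent step with momentum for parameter w and gradient dw."""
    v = beta * v + (1 - beta) * dw     # exponentially weighted average of gradients
    w = w - alpha * v                  # step in the smoothed gradient direction
    return w, v

# Usage: v starts at zero with the same shape as w, e.g.
#   v = np.zeros_like(w)
#   w, v = momentum_update(w, dJ_dw, v)
```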