Linear Regression and Gradient Descent
Linear Regression
Size | Price |
---|---|
2104 | 400 |
1416 | 232 |
1534 | 315 |
852 | 178 |
… | … |
Linear regression fits the data with a straight line. The process of supervised learning is: we have a training set, we feed it to a learning algorithm, and the learning algorithm outputs a function that maps the input to the output.
\[\text{Training Set} \\ \downarrow \\ \text{Learning Algorithm} \\ \downarrow \\ \text{input} \to \text{function(hypothesis)} \to \text{output}\]So when we design a learning algorithm, we need to decide how to represent the hypothesis. In the case of linear regression, the hypothesis is represented as:
\[h(x) = \theta_0 + \theta_1 x\]Suppose there is more than one feature, for example, the number of bedrooms:
Size | Bedrooms | Price |
---|---|---|
2104 | 3 | 400 |
1416 | 2 | 232 |
1534 | 3 | 315 |
852 | 2 | 178 |
… | … | … |
Given that $x_1$ is the size of the house, $x_2$ is the number of bedrooms, the hypothesis is represented as:
\[h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\]which we normally write as:
\[h(x) = \sum_{i=0}^{n} \theta_i x_i\]where $n$ is the number of features and $x_i$ is the $i^{th}$ feature value; in this case $n=2$, and we define $x_0 = 1$.
So here $\theta$ becomes a three-dimensional parameter vector:
\[\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}\]And the $x$ becomes a three-dimensional vector:
\[x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \end{bmatrix}\]$\theta$ is called the parameters of the learning algorithm, and the job of the learning algorithm is to choose parameters $\theta$ that allow the hypothesis to make good predictions.
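As a concrete illustration, here is a minimal NumPy sketch of evaluating the hypothesis $h_{\theta}(x) = \sum_i \theta_i x_i$; the particular values of $\theta$ are made up for the example, not learned from the table above.

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2]: intercept, size weight, bedroom weight.
theta = np.array([50.0, 0.1, 20.0])

# One input with the intercept feature x_0 = 1, size = 2104, bedrooms = 3.
x = np.array([1.0, 2104.0, 3.0])

# h_theta(x) = theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2, i.e. the dot product theta . x
h = theta @ x
print(h)  # 50 + 0.1 * 2104 + 20 * 3 = 320.4
```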
What we will do is to choose $\theta$ s.t. $h(x)$ is close to $y$ for our training examples $(x, y)$.
In linear regression, also called ordinary least squares, we want to minimize the squared difference between the hypothesis and the actual output:
\[\min_{\theta} J(\theta) = \min_{\theta} \frac{1}{2}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2\]Here, the $1/2$ is included by convention to simplify the derivative. $J(\theta)$ is called the cost function or squared error function.
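A minimal NumPy sketch of evaluating this cost function, assuming one training example per row with the intercept feature $x_0 = 1$ in the first column (the same layout as the design matrix introduced later in this note):

```python
import numpy as np

def cost(theta, X, y):
    """Squared error cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y      # h_theta(x^{(i)}) - y^{(i)} for every example
    return 0.5 * np.sum(residuals ** 2)

# Toy data from the table above: columns are [x_0 = 1, size, bedrooms].
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([400.0, 232.0, 315.0, 178.0])

print(cost(np.zeros(3), X, y))  # cost of the all-zero hypothesis
```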
Terms
- $\theta$ denotes the parameters of the learning algorithm
- $n$ is the number of features
- $m$ is the number of training examples
- $x$ denotes the inputs / features
- $y$ is the output / target variable
- $(x, y)$ is a training example
- $(x^{(i)}, y^{(i)})$ is the $i^{th}$ training example
Batch / Stochastic Gradient Descent
(Batch) Gradient Descent
Start with some initial $\theta$ (Say $\theta = \vec{0}$)
Keep changing $\theta$ to reduce $J(\theta)$ by repeating the following equation until convergence:
\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\]where $\alpha$ is the learning rate in $(0, 1)$, and $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function $J(\theta)$ with respect to the parameter $\theta_j$.
\[\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\,x_j^{(i)}\]For a single training example $(x, y)$, the contribution to this derivative is $(h_{\theta}(x) - y)x_j$; summing over all $m$ examples gives the full gradient. Because this method looks at every example in the entire training set on every step, it is called batch gradient descent.
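To see where the per-example term $(h_{\theta}(x) - y)x_j$ comes from, differentiate the single-example cost $\frac{1}{2}(h_{\theta}(x) - y)^2$ with the chain rule:
\[\begin{align*} \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_{\theta}(x) - y\right)^2 &= \left(h_{\theta}(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(h_{\theta}(x) - y\right) \\ &= \left(h_{\theta}(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(\sum_{k=0}^{n} \theta_k x_k - y\right) \\ &= \left(h_{\theta}(x) - y\right) x_j. \end{align*}\]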
TIPS: When choosing the learning rate, if it is too large, the algorithm may overshoot the minimum and fail to converge; if it is too small, the algorithm will take a long time to converge.
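A minimal NumPy sketch of batch gradient descent on the toy data, under the same data layout as above. The learning rate and iteration count are illustrative choices; the learning rate is very small only because the features are not rescaled.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-7, num_iters=1000):
    """Minimize J(theta) by repeatedly stepping against the full-batch gradient."""
    m, n = X.shape
    theta = np.zeros(n)                      # start with theta = 0
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y)     # sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
        theta -= alpha * gradient            # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

# Toy data: columns are [x_0 = 1, size, bedrooms].
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([400.0, 232.0, 315.0, 178.0])

theta = batch_gradient_descent(X, y)
print(theta, X @ theta)                      # parameters after num_iters steps and the resulting predictions
```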
Stochastic Gradient Descent
Repeat {
    for i = 1 to m {
        theta_j := theta_j - alpha * (h_theta(x^{(i)}) - y^{(i)}) * x_j^{(i)}
    }
}
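A minimal NumPy sketch of this loop, assuming the same `[x_0, size, bedrooms]` data layout as in the earlier snippets; the vectorized update changes every component of $\theta$ at once, and the epoch count is an arbitrary choice for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=1e-7, num_epochs=50):
    """Update theta one training example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):                 # Repeat { ... }
        for i in range(m):                      # for i = 1 to m
            error = X[i] @ theta - y[i]         # h_theta(x^{(i)}) - y^{(i)}
            theta -= alpha * error * X[i]       # update every theta_j
    return theta

# Usage: theta = stochastic_gradient_descent(X, y) with X, y as in the batch example.
```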
Stochastic gradient descent may never converge exactly; the parameters keep oscillating around the minimum, but they get close to it. On large datasets it is much faster than batch gradient descent, because each update uses only a single example. Written out, the update is:
\[\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}\]
Normal Equation
\[\nabla_{\theta} J(\theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} J(\theta) \\ \frac{\partial}{\partial \theta_1} J(\theta) \\ \frac{\partial}{\partial \theta_2} J(\theta) \\ \end{bmatrix}\]Here $\theta \in \mathbb{R}^{n+1}$ and $J(\theta)$ is the cost function.
\[X = \begin{bmatrix} (x^{(1)})^{T} \\ (x^{(2)})^{T} \\ \vdots \\ (x^{(m)})^{T} \\ \end{bmatrix}\]Here, $X$ is called the design matrix. Its $i^{th}$ row, $(x^{(i)})^{T}$, holds all the feature values of the $i^{th}$ training example, including the intercept feature $x_0^{(i)} = 1$.
\[X \theta = \begin{bmatrix} (x^{(1)})^{T} \\ (x^{(2)})^{T} \\ \vdots \\ (x^{(m)})^{T} \\ \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \end{bmatrix} = \begin{bmatrix} h_{\theta}(x^{(1)}) \\ h_{\theta}(x^{(2)}) \\ \vdots \\ h_{\theta}(x^{(m)}) \\ \end{bmatrix}\]Let $\vec{y} \in \mathbb{R}^{m}$ be the vector containing the outputs $y^{(i)}$ of the training set. Then the cost function can be written as:
\[\begin{align*} J(\theta) &= \frac{1}{2} (X \theta - \vec{y})^{T} (X \theta - \vec{y}) \\ &= \frac{1}{2} \left( \theta^{T} X^{T} X \theta - \theta^{T} X^{T} \vec{y} - \vec{y}^{T} X \theta + \vec{y}^{T} \vec{y} \right) \\ &= \frac{1}{2} \left( \theta^{T} X^{T} X \theta - 2 \theta^{T} X^{T} \vec{y} + \vec{y}^{T} \vec{y} \right). \end{align*}\]The last step uses the fact that $\theta^{T} X^{T} \vec{y}$ is a scalar, so it equals its own transpose $\vec{y}^{T} X \theta$. Therefore, the gradient of $J(\theta)$ is:
\[\begin{align*} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \left( \frac{1}{2} \left( \theta^{T} X^{T} X \theta - 2 \theta^{T} X^{T} \vec{y} + \vec{y}^{T} \vec{y} \right) \right) \\ &= \nabla_{\theta} \left( \frac{1}{2} \theta^{T} X^{T} X \theta - \theta^{T} X^{T} \vec{y} \right) \\ &= X^{T} X \theta - X^{T} \vec{y}. \end{align*}\]To minimize the cost function $J(\theta)$, we set the gradient to zero:
\[X^{T} X \theta = X^{T} \vec{y}\]This is the normal equation. Solving for $\theta$ (assuming $X^{T} X$ is invertible) gives:
\[\theta = (X^{T} X)^{-1} X^{T} \vec{y}\]TIPS: Given that $A \in \mathbb{R}^{n \times n}$
\[tr(A) = \sum_{i=1}^{n} A_{ii} = tr(A^{T})\] \[\nabla_A tr(AA^{T}) = 2A\]Given a matrix $B$ for which the products below are defined and square, then
\[tr(AB) = tr(BA)\] \[\nabla_A tr(AB) = B^{T}\] \[\nabla_A tr(AA^{T}B) = BA + B^{T}A\]Likewise, given another matrix $C$ such that the products below are defined and square, then
\[tr(ABC) = tr(CAB) = tr(BCA)\]Given that $f(A): \mathbb{R}^{n \times n} \to \mathbb{R}$, then
\[\nabla_A f(A) = \begin{bmatrix} \frac{\partial}{\partial A_{11}} f(A) & \frac{\partial}{\partial A_{12}} f(A) & \cdots & \frac{\partial}{\partial A_{1n}} f(A) \\ \frac{\partial}{\partial A_{21}} f(A) & \frac{\partial}{\partial A_{22}} f(A) & \cdots & \frac{\partial}{\partial A_{2n}} f(A) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial A_{n1}} f(A) & \frac{\partial}{\partial A_{n2}} f(A) & \cdots & \frac{\partial}{\partial A_{nn}} f(A) \\ \end{bmatrix}\]
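Coming back to the normal equation: a minimal NumPy sketch of solving $X^{T} X \theta = X^{T} \vec{y}$ on the toy data from the tables above. Solving the linear system with np.linalg.solve is numerically preferable to forming the explicit inverse $(X^{T}X)^{-1}$, but both give the same least-squares $\theta$.

```python
import numpy as np

# Toy data from the table: columns are [x_0 = 1, size, bedrooms].
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([400.0, 232.0, 315.0, 178.0])

# Solve the normal equation X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)        # closed-form least-squares parameters
print(X @ theta)    # fitted prices, close to y (3 parameters fit to 4 points)
```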