Locally Weighted & Logistic Regression
Locally Weighted Regression
- Parametric learning algorithm: Fit a fixed set of parameters ($\theta$) to the data.
- Non-parametric learning algorithm: The amount of data/parameters you need to keep grows (linearly) with the size of the data.
To evaluate $h$ at a certain point $x$, in linear regression, we need to fit $\theta$ to minimize the cost function: $\frac{1}{2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2$.
In locally weighted regression, we fit $\theta$ to minimize a modified cost function that gives training examples near $x$ larger weights and examples far away smaller weights:
\[\sum_{i=1}^{m} w^{(i)}(y^{(i)} - \theta^T x^{(i)})^2\]where $w^{(i)}$ is a weighting function. The default choice of $w^{(i)}$ is $exp(-\frac{(x^{(i)} - x)^2}{2\tau^2})$.
- If $\vert{x^{(i)} - x}\vert$ is small, then $w^{(i)}$ is close to 1.
- If $\vert{x^{(i)} - x}\vert$ is large, then $w^{(i)}$ is close to 0.
The bandwidth $\tau$ is a hyperparameter that controls how quickly the weights decrease as you move away from $x$.
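As a concrete illustration, here is a minimal NumPy sketch of locally weighted regression: for each query point it solves the weighted least-squares problem above with Gaussian weights. The function name `lwr_predict`, the default bandwidth, and the toy data are illustrative choices, not part of the original notes.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.8):
    """Predict y at x_query with locally weighted linear regression.

    X   : (m, n) design matrix (includes a column of ones for the intercept)
    y   : (m,)   targets
    tau : bandwidth controlling how fast the weights fall off
    """
    # Gaussian weights: nearby points get weights close to 1, distant ones close to 0.
    diffs = X - x_query
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))

    # Fit theta by minimizing sum_i w_i (y_i - theta^T x_i)^2 (weighted normal equations).
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Usage: noisy sine data; each prediction re-fits theta around the query point.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
print(lwr_predict(np.array([1.0, 5.0]), X, y))
```

Note that the whole training set is needed at prediction time, which is exactly what makes the algorithm non-parametric.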
Probabilistic Interpretation
Assume that $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is a random variable that represents the noise in the data.
- $\epsilon^{(i)} \sim N(0, \sigma^2)$
- $P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(\epsilon^{(i)})^2}{2\sigma^2})$
This implies that:
- $y^{(i)} \vert x^{(i)};\theta \sim N(\theta^T x^{(i)}, \sigma^2)$
- $P(y^{(i)} \vert x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2})$
The likelihood of $\theta$ given the data is
\[\begin{aligned} L(\theta) &= P(\vec{y}|X;\theta) \\ &= \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}) \\ &= (\frac{1}{\sqrt{2\pi}\sigma})^m exp(-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2) \\ \end{aligned}\]TIPS: We use the term likelihood because we view this quantity as a function of the parameters $\theta$ with the data held fixed.
The log-likelihood function is
\[\begin{aligned} l(\theta) &= \log(L(\theta)) \\ &= m\log(\frac{1}{\sqrt{2\pi}\sigma}) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2 \end{aligned}\]In statistics, we often use maximum likelihood estimation (MLE) to estimate the parameters; here, that means choosing $\theta$ to maximize $l(\theta)$. Since the first term does not depend on $\theta$, this is the same as choosing $\theta$ to minimize $\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2$, which (up to the constant factor $\frac{1}{2}$) is exactly the cost function of linear regression.
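As a quick sanity check of this equivalence (a sketch, not part of the original notes; it assumes NumPy and SciPy, and the data, $\sigma$, and variable names are made up for illustration), numerically maximizing the Gaussian log-likelihood recovers the same $\theta$ as ordinary least squares:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic linear data y = theta^T x + noise, noise ~ N(0, sigma^2).
rng = np.random.default_rng(0)
m, sigma = 50, 0.3
X = np.column_stack([np.ones(m), rng.standard_normal(m)])
y = X @ np.array([1.0, 2.0]) + sigma * rng.standard_normal(m)

def neg_log_likelihood(theta):
    # -l(theta) = m*log(sqrt(2*pi)*sigma) + (1 / (2*sigma^2)) * sum_i (y_i - theta^T x_i)^2
    resid = y - X @ theta
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + resid @ resid / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x   # maximize l(theta)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]              # minimize squared error

print(theta_mle)  # the two estimates agree (up to numerical tolerance)
print(theta_ls)
```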
Logistic Regression
For (binary) classification problems, i.e. $y \in \{0, 1\}$, linear regression is not a good choice. Therefore, we introduce logistic regression.
\[h_{\theta}(x) = g(\theta^T x)\]where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
For binary classification, we can interpret $h_{\theta}(x)$ as the probability that $y = 1$ given $x$ and $\theta$, and hence $1 - h_{\theta}(x)$ as the probability that $y = 0$ given $x$ and $\theta$.
- $P(y=1 \vert x;\theta) = h_{\theta}(x)$
- $P(y=0 \vert x;\theta) = 1 - h_{\theta}(x)$
We can combine these two functions into one:
\[P(y|x;\theta) = (h_{\theta}(x))^y(1 - h_{\theta}(x))^{1-y}\]The likelihood function can therefore be written as:
\[\begin{aligned} L(\theta) &= \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^{m} (h_{\theta}(x^{(i)}))^{y^{(i)}}(1 - h_{\theta}(x^{(i)}))^{1-y^{(i)}} \end{aligned}\]Also, the log-likelihood function is:
\[\begin{aligned} l(\theta) &= \log(L(\theta)) \\ &= \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] \end{aligned}\]Finally, to choose $\theta$ to maximize $l(\theta)$, we can use the gradient ascent algorithm.
\[\theta_j := \theta_j + \alpha \frac{\partial l(\theta)}{\partial \theta_j}\]Take the derivative of $l(\theta)$ with respect to $\theta_j$, using the fact that the sigmoid satisfies $g'(z) = g(z)(1 - g(z))$:
\[\begin{aligned} \frac{\partial l(\theta)}{\partial \theta_j} &= \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_{\theta}(x^{(i)})} \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_j} - (1 - y^{(i)}) \frac{1}{1 - h_{\theta}(x^{(i)})} \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_j} \right] \\ &= \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_{\theta}(x^{(i)})} h_{\theta}(x^{(i)})(1 - h_{\theta}(x^{(i)}))x_j^{(i)} - (1 - y^{(i)}) \frac{1}{1 - h_{\theta}(x^{(i)})} h_{\theta}(x^{(i)})(1 - h_{\theta}(x^{(i)}))x_j^{(i)} \right] \\ &= \sum_{i=1}^{m} \left[ y^{(i)}(1 - h_{\theta}(x^{(i)})) - (1 - y^{(i)}) h_{\theta}(x^{(i)}) \right]x_j^{(i)} \\ &= \sum_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)})) x_j^{(i)} \end{aligned}\]Therefore, the update rule is:
\[\theta_j := \theta_j + \alpha \sum_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)})) x_j^{(i)}\]The reason we have a $+$ instead of a $-$ is that we are maximizing the log-likelihood with gradient ascent (MLE), rather than minimizing a squared-error cost with gradient descent.
TIPS: There is no equivalent of the normal equations for logistic regression; in general, no closed-form solution is known. Thus, we have to use an iterative optimization algorithm.
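Below is a minimal sketch of batch gradient ascent for logistic regression, following the update rule above (it assumes NumPy; the learning rate, iteration count, and toy data are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gradient_ascent(X, y, alpha=0.001, n_iters=5000):
    """Fit theta by batch gradient ascent on the log-likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)   # h_theta(x^(i)) for every training example
        grad = X.T @ (y - h)     # dl/dtheta_j = sum_i (y^(i) - h_theta(x^(i))) x_j^(i)
        theta += alpha * grad    # theta_j := theta_j + alpha * dl/dtheta_j
    return theta

# Usage on a toy binary dataset (first column of X is the intercept feature).
rng = np.random.default_rng(0)
m = 200
x1 = rng.standard_normal(m)
X = np.column_stack([np.ones(m), x1])
y = (x1 + rng.standard_normal(m) > 0).astype(float)

theta = logistic_regression_gradient_ascent(X, y)
print(theta)  # P(y=1 | x) is then estimated by sigmoid(theta^T x)
```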
Newton’s Method
Newton’s method finds a root of a function $f$, i.e. a value of $\theta$ where $f(\theta) = 0$. Starting from an initial guess $\theta_{0}$, one step jumps to the point where the tangent line at $\theta_{0}$ crosses zero:
\[\theta_{1} := \theta_{0} - \Delta\]where $\Delta = \frac{f(\theta_{0})}{f'(\theta_{0})}$.
Repeating this step, at iteration $t$ we have
\[\theta_{t+1} := \theta_{t} - \Delta\]where $\Delta = \frac{f(\theta_{t})}{f'(\theta_{t})}$.
To maximize $l(\theta)$, we look for a stationary point where $l'(\theta) = 0$, so we apply Newton’s method with $f(\theta) = l'(\theta)$:
\[\theta_{t+1} := \theta_{t} - \frac{l'(\theta_{t})}{l''(\theta_{t})}\]
- $l'(\theta) = \sum_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}$
- $l''(\theta) = -\sum_{i=1}^{m} h_{\theta}(x^{(i)})(1 - h_{\theta}(x^{(i)}))(x^{(i)})^2$
Newton’s method has a quadratic convergence rate, which is much faster than gradient ascent. Informally, if one iteration of Newton’s method has $0.01$ error, then the next iteration will have $0.0001$ error.
When $\theta$ is a vector, the update rule is:
\[\theta := \theta- H^{-1}\nabla_{\theta}l(\theta)\]where $H$ is the Hessian matrix of $l(\theta)$.
\[H_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j} l(\theta)\]As we can see, in higher dimensions Newton’s method becomes computationally expensive, since each step requires forming and solving a linear system with the $n \times n$ Hessian.
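For comparison with the gradient-ascent sketch above, here is a minimal NumPy sketch of the vector form of Newton’s method for the logistic-regression log-likelihood (the iteration count and toy data are illustrative; in practice it usually converges in a handful of steps):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, n_iters=10):
    """Fit theta with Newton's method: theta := theta - H^{-1} grad l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                # gradient of l(theta)
        # Hessian of l(theta): H = -X^T diag(h * (1 - h)) X  (an n x n matrix)
        H = -(X * (h * (1 - h))[:, None]).T @ X
        theta -= np.linalg.solve(H, grad)   # theta := theta - H^{-1} grad
    return theta

# Usage: same toy data as the gradient-ascent example.
rng = np.random.default_rng(0)
m = 200
x1 = rng.standard_normal(m)
X = np.column_stack([np.ones(m), x1])
y = (x1 + rng.standard_normal(m) > 0).astype(float)

print(logistic_regression_newton(X, y))
```

Each step solves an $n \times n$ linear system, which is what makes Newton’s method expensive when the number of features is large.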