Support Vector Machines & Kernels
Support Vector Machines
The Support Vector Machine (SVM) is an algorithm for finding potentially non-linear decision boundaries. It maps the feature vector into a higher-dimensional space and then applies a linear classifier in that space.
Consider the Logistic Regression model. To obtain a confident linear classifier, we hope that:
- $\text{If } y=1, \text{then } \theta^{T}x^{(i)} \gg 0$.
- $\text{If } y=0, \text{then } \theta^{T}x^{(i)} \ll 0$.
In SVM, we instead set $y \in \{-1, +1\}$ and have $h_{\theta}(x) = g(\theta^{T}x)$ output values in $\{-1, +1\}$, where
\[g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}\]

TIPS: $h_{\theta}(x) = h_{w,b}(x)$, where $\theta = [b, w]^{T}$, $b=\theta_{0}$, $w = [\theta_{1}, \cdots ,\theta_{n}]^{T}$.
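As a minimal sketch (Python/NumPy; the argument names `w`, `b`, `x` are illustrative), the hypothesis simply thresholds the affine score $w^{T}x + b$ at zero:

```python
import numpy as np

def g(z):
    """Output +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """SVM hypothesis h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

print(h(np.array([1.0, 2.0]), -1.0, np.array([0.5, 0.5])))  # prints 1
```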
The functional margin of the hyperplane defined by $(w, b)$ with respect to $(x^{(i)}, y^{(i)})$ is:
\[\hat{\gamma}^{(i)} = y^{(i)}(w^{T}x^{(i)} + b)\]

- $\text{If } y^{(i)} = 1, \text{want } w^{T}x^{(i)} + b \gg 1$.
- $\text{If } y^{(i)} = -1, \text{want } w^{T}x^{(i)} + b \ll -1$.
The functional margin w.r.t. the training set is:
\[\hat{\gamma} = \min_{i=1,\cdots, m} \hat{\gamma}^{(i)}\]

The geometric margin of the hyperplane is:
\[\begin{aligned} \gamma^{(i)} &= y^{(i)}\left(\frac{w^{T}x^{(i)} + b}{\Vert w \Vert}\right) \\ \gamma &= \min_{i=1,\cdots, m} \gamma^{(i)} \end{aligned}\]
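Both margins are one-liners in NumPy; a small sketch, assuming `X` is the $m \times n$ matrix of training inputs and `y` the vector of $\pm 1$ labels (names are illustrative):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margins: gamma_hat_i = y_i (w^T x_i + b)."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Geometric margin of the training set: min_i y_i (w^T x_i + b) / ||w||."""
    return np.min(functional_margins(w, b, X, y)) / np.linalg.norm(w)
```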
Optimization Problem

Based on the definitions above, the optimal margin classifier chooses $(w, b)$ to maximize the geometric margin:
\[\max_{\gamma, w, b} {\gamma} \\ \text{s.t. } \frac{y^{(i)}({w^{T}x^{(i)} + b})}{\Vert w \Vert} \geq \gamma, \quad i=1,\cdots, m\]

Since scaling $(w, b)$ does not change the hyperplane, we may fix the functional margin to $\hat{\gamma} = 1$. Maximizing $\frac{\hat{\gamma}}{\Vert w \Vert} = \frac{1}{\Vert w \Vert}$ is then equivalent to maximizing $\frac{1}{\Vert w \Vert^{2}}$, so the optimization problem can be written as:
\[\max_{w, b} \frac{1} {\Vert w \Vert ^ {2}} \\ \text{s.t. } y^{(i)}({w^{T}x^{(i)} + b}) \geq 1, \quad i=1,\cdots, m\]

which is equivalent to:
\[\min_{w, b} \frac{1}{2} \Vert w \Vert ^ {2} \\ \text{s.t. } y^{(i)}({w^{T}x^{(i)} + b}) \geq 1, \quad i=1,\cdots, m\]
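This is a quadratic program and can be handed to a generic convex solver. A minimal sketch using the `cvxpy` library on hypothetical toy data (the data and variable names are illustrative, not part of the notes):

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data: m = 4 examples in R^2.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```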
Regardless of the dimension of the input (e.g. $x^{(i)} \in \mathbb{R}^{100}$), suppose that $w$ can be written as a linear combination of the training examples, $w = \sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}$. Substituting this into the optimization problem, we have:
\[\begin{aligned} \min_{\alpha, b}\ & \frac{1}{2} \Big(\sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}\Big)^{T}\Big(\sum_{j=1}^{m} \alpha_{j}y^{(j)}x^{(j)}\Big) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{s.t. }\ & y^{(i)}\Big(\sum_{j=1}^{m} \alpha_{j}y^{(j)} \langle x^{(j)}, x^{(i)} \rangle + b\Big) \geq 1, \quad i=1,\cdots, m \end{aligned}\]

Taking the Lagrangian dual of this problem gives:

\[\begin{aligned} \max_{\alpha}\ & \sum_{i=1}^{m} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{s.t. }\ & \alpha_{i} \geq 0, \quad \sum_{i=1}^{m} \alpha_{i}y^{(i)} = 0 \end{aligned}\]

Finally, the way we make predictions is:
- Solve for the $\alpha_{i}$'s and $b$.
- To make a prediction, compute $h_{w,b}(x) = g(\sum_{i=1}^{m} \alpha_{i}y^{(i)} \langle x^{(i)}, x \rangle + b)$, as sketched below.
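A sketch of this recipe, again using `cvxpy` on the same hypothetical toy data; note that $\frac{1}{2}\sum_{i,j}\alpha_{i}\alpha_{j}y^{(i)}y^{(j)}\langle x^{(i)}, x^{(j)}\rangle = \frac{1}{2}\Vert\sum_{i}\alpha_{i}y^{(i)}x^{(i)}\Vert^{2}$, which is the form given to the solver:

```python
import numpy as np
import cvxpy as cp

# Same hypothetical toy data as above.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# (1/2) sum_ij alpha_i alpha_j y_i y_j <x_i, x_j> = (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)        # w = sum_i alpha_i y_i x_i
i = int(np.argmax(a))    # index of a support vector (alpha_i > 0)
b = y[i] - w @ X[i]      # at a support vector, y_i (w^T x_i + b) = 1

def h(x):
    """h_{w,b}(x) = g(sum_i alpha_i y_i <x_i, x> + b)."""
    return np.sign((a * y) @ (X @ x) + b)

print(h(np.array([1.5, 1.0])), h(np.array([-1.0, -1.0])))
```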
Kernels
The kernel trick works as follows:

- Write the algorithm in terms of $\langle x^{(i)}, x^{(j)} \rangle$ or $\langle x, z \rangle$.
- Let there be a feature mapping $x \mapsto \phi(x)$.
- Find a way to compute $K(x, z) = \phi(x)^{T}\phi(z)$ efficiently.
- Replace $\langle x, z \rangle$ in the algorithm with $K(x, z)$, as in the sketch below.
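A small sketch of the last step, assuming the $\alpha_{i}$'s and $b$ have already been found; the quadratic kernel used here is just a placeholder for any valid $K$:

```python
import numpy as np

def kernel(x, z):
    """A placeholder kernel; here K(x, z) = (x^T z)^2."""
    return float(np.dot(x, z)) ** 2

def h_kernelized(alpha, y, X, b, x):
    """Prediction with inner products replaced by K: g(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y))) + b
    return 1 if score >= 0 else -1
```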
For example, suppose $x \in \mathbb{R}^{n}$, let $\phi(x) \in \mathbb{R}^{n^{2}}$ consist of all products $x_{i}x_{j}$, and take $K(x,z) = (x^{T}z)^{2}$. Then:
\[\begin{aligned} K(x,z) = (x^{T}z)^{2} &= \Big(\sum_{i=1}^{n}x_{i}z_{i}\Big)\Big(\sum_{j=1}^{n}x_{j}z_{j}\Big) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{i}x_{j})(z_{i}z_{j}) \\ &= \phi(x)^{T}\phi(z) \end{aligned}\]
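A quick numeric check of this identity (random vectors are purely illustrative): computing $(x^{T}z)^{2}$ takes $O(n)$ work, while forming $\phi(x)$ and $\phi(z)$ explicitly takes $O(n^{2})$, yet both give the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

phi = lambda v: np.outer(v, v).ravel()   # phi(v) in R^{n^2}: all products v_i v_j

lhs = float(np.dot(x, z)) ** 2           # O(n): square the inner product
rhs = float(np.dot(phi(x), phi(z)))      # O(n^2): explicit feature map
print(np.isclose(lhs, rhs))              # True
```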
Mercer's Theorem

Let $\{x^{(1)}, \cdots, x^{(d)}\}$ be a set of points and let $K \in \mathbb{R}^{d \times d}$ be the corresponding kernel matrix, with $K_{ij} = K(x^{(i)}, x^{(j)})$.
Given any vector $z$, if $K$ is a valid kernel function then we have:
\[\begin{aligned} z^{T}Kz &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}K_{ij}z_{j} \\ &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\phi(x^{(i)})^{T}\phi(x^{(j)})z_{j} \\ &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\Big(\sum_{k}\phi(x^{(i)})_{k}\phi(x^{(j)})_{k}\Big)z_{j} \\ &= \sum_{k} \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\phi(x^{(i)})_{k}\phi(x^{(j)})_{k}z_{j} \\ &= \sum_{k}\Big(\sum_{i}z_{i}\phi(x^{(i)})_{k}\Big)^{2} \\ &\geq 0 \end{aligned}\]

which shows that $K$ is a positive semi-definite matrix.
Theorem (Mercer): $K$ is a valid kernel function if and only if, for any $d$ points $\{x^{(1)}, \cdots, x^{(d)}\}$, the corresponding kernel matrix $K$ is positive semi-definite ($K \succeq 0$).
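A numeric illustration of the "only if" direction for the quadratic kernel from the previous section, on hypothetical random points: the resulting kernel matrix has no negative eigenvalues (up to floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # d = 5 hypothetical points in R^3

K = (X @ X.T) ** 2                   # K_ij = (x_i^T x_j)^2 for the quadratic kernel
eigenvalues = np.linalg.eigvalsh(K)  # K is symmetric, so eigenvalues are real
print(np.all(eigenvalues >= -1e-9))  # True: K is positive semi-definite
```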
It turns out that the Gaussian kernel is a valid kernel function:
\[K(x, z) = \exp\left(-\frac{\Vert x - z \Vert^{2}}{2\sigma^{2}}\right)\]
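A direct translation of the formula into NumPy; the bandwidth $\sigma$ is a hyperparameter and the default value here is arbitrary:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

print(gaussian_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # exp(-1) ~ 0.368
```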
Representer Theorem

The representer theorem states that the optimal $w$ can be written as a linear combination of the training examples, $w = \sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}$, which is exactly the form assumed when deriving the dual above.

L1-norm soft margin SVM
To allow some training examples to violate the margin (and thereby tolerate misclassification or non-separable data), we introduce slack variables $\xi_{i}$ and penalize their L1 norm:
\[\begin{aligned} &\min_{w, b, \xi} \frac{1}{2} \Vert w \Vert^{2} + C \sum_{i=1}^{m} \xi_{i} \\ &\text{s.t. } y^{(i)}(w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \\ &\text{where }\xi_{i} \geq 0, i=1,\cdots, m \end{aligned}\]
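A sketch of this soft-margin problem with `cvxpy`, on hypothetical data where one label is flipped so the hard-margin constraints cannot all hold; the choice $C = 1$ is arbitrary:

```python
import numpy as np
import cvxpy as cp

# Hypothetical data that is not linearly separable (the last label is "flipped").
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m, n = X.shape
C = 1.0                                   # weight of the slack penalty

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(xi.value)                           # xi_i > 0 marks a violated margin
```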