Support Vector Machines & Kernels
Support Vector Machines
The Support Vector Machine (SVM) is an algorithm for finding potentially non-linear decision boundaries. It maps the feature vector into a higher-dimensional space and then applies a linear classifier in that space.
Consider the Logistic Regression model. To obtain a confident linear classifier, we hope that:
- $\text{If } y=1, \text{then } \theta^{T}x^{(i)} \gg 0$.
- $\text{If } y=0, \text{then } \theta^{T}x^{(i)} \ll 0$.
In SVM, we instead set $y \in \{-1, +1\}$ and have $h_{\theta}(x) = g(\theta^{T}x)$ output values in $\{-1, +1\}$, where
\[g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}\]

TIPS: $h_{\theta}(x) = h_{w,b}(x)$, where $\theta = [b, w]^{T}$, $b=\theta_{0}$, $w = [\theta_{1}, \cdots ,\theta_{n}]^{T}$.
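As a minimal sketch (Python/NumPy; the argument names `w`, `b`, `x` are illustrative), the hypothesis simply thresholds the affine score $w^{T}x + b$ at zero:

```python
import numpy as np

def g(z):
    """Output +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """SVM hypothesis h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

print(h(np.array([1.0, 2.0]), -1.0, np.array([0.5, 0.5])))  # prints 1
```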
The functional margin of the hyperplane defined by $(w, b)$ with respect to $(x^{(i)}, y^{(i)})$ is:
\[\hat{\gamma}^{(i)} = y^{(i)}(w^{T}x^{(i)} + b)\]

- $\text{If } y^{(i)} = 1, \text{want } w^{T}x^{(i)} + b \gg 1$.
- $\text{If } y^{(i)} = -1, \text{want } w^{T}x^{(i)} + b \ll -1$.
The functional margin w.r.t. the training set is:
\[\hat{\gamma} = \min_{i=1,\cdots, m} \hat{\gamma}^{(i)}\]

The geometric margin of the hyperplane is:
\[\begin{aligned} \gamma^{(i)} &= y^{(i)}\left(\frac{w^{T}x^{(i)} + b}{\Vert w \Vert}\right) \\ \gamma &= \min_{i=1,\cdots, m} \gamma^{(i)} \end{aligned}\]
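Both margins are one-liners in NumPy; a small sketch, assuming `X` is the $m \times n$ matrix of training inputs and `y` the vector of $\pm 1$ labels (names are illustrative):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margins: gamma_hat_i = y_i (w^T x_i + b)."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Geometric margin of the training set: min_i y_i (w^T x_i + b) / ||w||."""
    return np.min(functional_margins(w, b, X, y)) / np.linalg.norm(w)
```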
Optimization Problem

Based on the definitions above, the optimal margin classifier chooses $(w, b)$ to maximize the geometric margin:
\[\max_{\gamma, w, b} {\gamma} \\ \text{s.t. } \frac{y^{(i)}({w^{T}x^{(i)} + b})}{\Vert w \Vert} \geq \gamma, \quad i=1,\cdots, m\]

Since scaling $(w, b)$ does not change the hyperplane, we may fix the functional margin to $\hat{\gamma} = 1$. Maximizing $\frac{\hat{\gamma}}{\Vert w \Vert} = \frac{1}{\Vert w \Vert}$ is then equivalent to maximizing $\frac{1}{\Vert w \Vert^{2}}$, so the optimization problem can be written as:
\[\max_{w, b} \frac{1} {\Vert w \Vert ^ {2}} \\ \text{s.t. } y^{(i)}({w^{T}x^{(i)} + b}) \geq 1, \quad i=1,\cdots, m\]

which is equivalent to:
\[\min_{w, b} \frac{1}{2} \Vert w \Vert ^ {2} \\ \text{s.t. } y^{(i)}({w^{T}x^{(i)} + b}) \geq 1, \quad i=1,\cdots, m\]
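This is a quadratic program and can be handed to a generic convex solver. A minimal sketch using the `cvxpy` library on hypothetical toy data (the data and variable names are illustrative, not part of the notes):

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data: m = 4 examples in R^2.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```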
Regardless of the dimension of the input (e.g. $x^{(i)} \in \mathbb{R}^{100}$), suppose that $w$ can be written as a linear combination of the training examples, $w = \sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}$. Substituting this into the optimization problem, we have:
\[\begin{aligned} \min_{\alpha, b}\ & \frac{1}{2} \Big(\sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}\Big)^{T}\Big(\sum_{j=1}^{m} \alpha_{j}y^{(j)}x^{(j)}\Big) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{s.t. }\ & y^{(i)}\Big(\sum_{j=1}^{m} \alpha_{j}y^{(j)} \langle x^{(j)}, x^{(i)} \rangle + b\Big) \geq 1, \quad i=1,\cdots, m \end{aligned}\]

Taking the Lagrangian dual of this problem gives:

\[\begin{aligned} \max_{\alpha}\ & \sum_{i=1}^{m} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{s.t. }\ & \alpha_{i} \geq 0, \quad \sum_{i=1}^{m} \alpha_{i}y^{(i)} = 0 \end{aligned}\]

Finally, the way we make predictions is:
- Solve for the $\alpha_{i}$'s and $b$.
- To make a prediction, compute $h_{w,b}(x) = g(\sum_{i=1}^{m} \alpha_{i}y^{(i)} \langle x^{(i)}, x \rangle + b)$, as sketched below.
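A sketch of this recipe, again using `cvxpy` on the same hypothetical toy data; note that $\frac{1}{2}\sum_{i,j}\alpha_{i}\alpha_{j}y^{(i)}y^{(j)}\langle x^{(i)}, x^{(j)}\rangle = \frac{1}{2}\Vert\sum_{i}\alpha_{i}y^{(i)}x^{(i)}\Vert^{2}$, which is the form given to the solver:

```python
import numpy as np
import cvxpy as cp

# Same hypothetical toy data as above.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# (1/2) sum_ij alpha_i alpha_j y_i y_j <x_i, x_j> = (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)        # w = sum_i alpha_i y_i x_i
i = int(np.argmax(a))    # index of a support vector (alpha_i > 0)
b = y[i] - w @ X[i]      # at a support vector, y_i (w^T x_i + b) = 1

def h(x):
    """h_{w,b}(x) = g(sum_i alpha_i y_i <x_i, x> + b)."""
    return np.sign((a * y) @ (X @ x) + b)

print(h(np.array([1.5, 1.0])), h(np.array([-1.0, -1.0])))
```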
Kernels
The kernel trick works as follows:

- Write the algorithm in terms of $\langle x^{(i)}, x^{(j)} \rangle$ or $\langle x, z \rangle$.
- Let there be a feature mapping $x \mapsto \phi(x)$.
- Find a way to compute $K(x, z) = \phi(x)^{T}\phi(z)$ efficiently.
- Replace $\langle x, z \rangle$ in the algorithm with $K(x, z)$, as in the sketch below.
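A small sketch of the last step, assuming the $\alpha_{i}$'s and $b$ have already been found; the quadratic kernel used here is just a placeholder for any valid $K$:

```python
import numpy as np

def kernel(x, z):
    """A placeholder kernel; here K(x, z) = (x^T z)^2."""
    return float(np.dot(x, z)) ** 2

def h_kernelized(alpha, y, X, b, x):
    """Prediction with inner products replaced by K: g(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y))) + b
    return 1 if score >= 0 else -1
```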
For example, suppose $x \in \mathbb{R}^{n}$, let $\phi(x) \in \mathbb{R}^{n^{2}}$ consist of all products $x_{i}x_{j}$, and take $K(x,z) = (x^{T}z)^{2}$. Then:
\[\begin{aligned} K(x,z) = (x^{T}z)^{2} &= \Big(\sum_{i=1}^{n}x_{i}z_{i}\Big)\Big(\sum_{j=1}^{n}x_{j}z_{j}\Big) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{i}x_{j})(z_{i}z_{j}) \\ &= \phi(x)^{T}\phi(z) \end{aligned}\]
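A quick numeric check of this identity (random vectors are purely illustrative): computing $(x^{T}z)^{2}$ takes $O(n)$ work, while forming $\phi(x)$ and $\phi(z)$ explicitly takes $O(n^{2})$, yet both give the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

phi = lambda v: np.outer(v, v).ravel()   # phi(v) in R^{n^2}: all products v_i v_j

lhs = float(np.dot(x, z)) ** 2           # O(n): square the inner product
rhs = float(np.dot(phi(x), phi(z)))      # O(n^2): explicit feature map
print(np.isclose(lhs, rhs))              # True
```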
Mercer's Theorem

Let $\{x^{(1)}, \cdots, x^{(d)}\}$ be a set of points and let $K \in \mathbb{R}^{d \times d}$ be the corresponding kernel matrix, with $K_{ij} = K(x^{(i)}, x^{(j)})$.
Given any vector $z$, if $K$ is a valid kernel function then we have:
\[\begin{aligned} z^{T}Kz &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}K_{ij}z_{j} \\ &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\phi(x^{(i)})^{T}\phi(x^{(j)})z_{j} \\ &= \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\Big(\sum_{k}\phi(x^{(i)})_{k}\phi(x^{(j)})_{k}\Big)z_{j} \\ &= \sum_{k} \sum_{i=1}^{d} \sum_{j=1}^{d} z_{i}\phi(x^{(i)})_{k}\phi(x^{(j)})_{k}z_{j} \\ &= \sum_{k}\Big(\sum_{i}z_{i}\phi(x^{(i)})_{k}\Big)^{2} \\ &\geq 0 \end{aligned}\]

which shows that $K$ is a positive semi-definite matrix.
Theorem (Mercer): $K$ is a valid kernel function if and only if, for any $d$ points $\{x^{(1)}, \cdots, x^{(d)}\}$, the corresponding kernel matrix $K$ is positive semi-definite ($K \succeq 0$).
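A numeric illustration of the "only if" direction for the quadratic kernel from the previous section, on hypothetical random points: the resulting kernel matrix has no negative eigenvalues (up to floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # d = 5 hypothetical points in R^3

K = (X @ X.T) ** 2                   # K_ij = (x_i^T x_j)^2 for the quadratic kernel
eigenvalues = np.linalg.eigvalsh(K)  # K is symmetric, so eigenvalues are real
print(np.all(eigenvalues >= -1e-9))  # True: K is positive semi-definite
```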
It turns out that the Gaussian kernel is a valid kernel function:
\[K(x, z) = \exp\left(-\frac{\Vert x - z \Vert^{2}}{2\sigma^{2}}\right)\]
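A direct translation of the formula into NumPy; the bandwidth $\sigma$ is a hyperparameter and the default value here is arbitrary:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

print(gaussian_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # exp(-1) ~ 0.368
```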
Representer Theorem

The representer theorem states that the optimal $w$ can be written as a linear combination of the training examples, $w = \sum_{i=1}^{m} \alpha_{i}y^{(i)}x^{(i)}$, which is exactly the form assumed when deriving the dual above.

L1-norm soft margin SVM
To allow some training examples to violate the margin (and thereby tolerate misclassification or non-separable data), we introduce slack variables $\xi_{i}$ and penalize their L1 norm:
\[\begin{aligned} &\min_{w, b, \xi} \frac{1}{2} \Vert w \Vert^{2} + C \sum_{i=1}^{m} \xi_{i} \\ &\text{s.t. } y^{(i)}(w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \\ &\text{where }\xi_{i} \geq 0, i=1,\cdots, m \end{aligned}\]
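A sketch of this soft-margin problem with `cvxpy`, on hypothetical data where one label is flipped so the hard-margin constraints cannot all hold; the choice $C = 1$ is arbitrary:

```python
import numpy as np
import cvxpy as cp

# Hypothetical data that is not linearly separable (the last label is "flipped").
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m, n = X.shape
C = 1.0                                   # weight of the slack penalty

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(xi.value)                           # xi_i > 0 marks a violated margin
```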