Generative Learning Algorithms & Naive Bayes

02 Jan 2025

Generative & Discriminative Comparison

Discriminative algorithms learn $P(y \vert x)$ directly or learn a direct map from inputs $x$ to outputs $y$.

  • Choose parameters $\theta$ to maximize $P(y \vert x; \theta)$

  • Weaker assumptions (a logistic-form $P(y \vert x)$ does not imply the generative model's assumptions)

Generative algorithms learn $P(x \vert y)$ and $P(y)$, and then use Bayes rule to compute $P(y \vert x)$, where $x$ is the feature, and $y$ is the class.

  • Choose parameters $\theta$ to maximize $P(x, y)$

  • Stronger assumptions (e.g., the GDA assumptions imply a logistic-regression form for $P(y \vert x)$, but not vice versa)

TIPS:

From the definition of conditional probability, the joint probability factors as:

\[P(x, y) = P(x \vert y)P(y) = P(y \vert x)P(x)\]

Bayes Rule is defined as:

\[\begin{aligned} P(y \vert x) &= \frac{P(x \vert y)P(y)}{P(x)} \\ P(x) &= \sum_{y} P(x \vert y)P(y) \end{aligned}\]
  • $P(y)$ is the prior probability of class y.

  • $P(x \vert y)$ is the likelihood of the data given the class.

  • $P(x)$ is the evidence.
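
As a quick numeric check of Bayes rule, here is a minimal Python sketch; the prior and likelihood values are made up purely for illustration:

```python
# Hypothetical numbers: prior P(y) and likelihood P(x | y) for a binary class y.
p_y = {0: 0.7, 1: 0.3}            # prior P(y)
p_x_given_y = {0: 0.05, 1: 0.60}  # likelihood of this particular x under each class

# Evidence: P(x) = sum_y P(x | y) P(y)
p_x = sum(p_x_given_y[y] * p_y[y] for y in (0, 1))

# Posterior: P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: p_x_given_y[y] * p_y[y] / p_x for y in (0, 1)}
print(posterior)  # {0: ~0.163, 1: ~0.837}
```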

Gaussian Discriminant Analysis (GDA)

Suppose $x \in \mathbb{R}^n$ (by convention we drop the intercept term $x_0 = 1$).

Assume $P(x \vert y)$ is Gaussian:

\[z \sim \mathcal{N}(\vec{\mu}, \Sigma)\]

where $z \in \mathbb{R}^n, \vec{\mu} \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n \times n}$

Then we have:

  • $E[z] = \vec{\mu}$

  • $Cov[z] = E[(z - \vec{\mu})(z - \vec{\mu})^T] = E[zz^T] - \vec{\mu}\vec{\mu}^T$

  • $P(z) = \frac{1}{(2\pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp\left(-\frac{1}{2}(z - \vec{\mu})^T \Sigma^{-1} (z - \vec{\mu})\right)$

Applying this to each class, with class-specific means $\vec{\mu}_0, \vec{\mu}_1$ and a shared covariance $\Sigma$, gives:

\[\begin{aligned} P(x \vert y = 0) &= \frac{1}{(2\pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp\left(-\frac{1}{2}(x - \vec{\mu}_0)^T \Sigma^{-1} (x - \vec{\mu}_0)\right) \\ P(x \vert y = 1) &= \frac{1}{(2\pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp\left(-\frac{1}{2}(x - \vec{\mu}_1)^T \Sigma^{-1} (x - \vec{\mu}_1)\right) \end{aligned}\]
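
The densities above can be evaluated directly from the formula. Below is a minimal numpy sketch, assuming a shared covariance $\Sigma$ and class-specific means; all numbers are illustrative:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density N(mu, Sigma) evaluated at x."""
    n = x.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    # diff @ solve(Sigma, diff) computes diff^T Sigma^{-1} diff
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

# Illustrative parameters: two class means sharing one covariance matrix.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
x = np.array([0.8, 0.9])
print(gaussian_density(x, mu0, Sigma), gaussian_density(x, mu1, Sigma))
```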

We know that $y \in \{0,1\} \implies y \sim \mathrm{Bernoulli}(\phi)$, where $\phi = P(y = 1)$.

\[P(y) = \phi^y(1 - \phi)^{1 - y}\]

Given the training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, in order to fit the parameters $\vec{\mu}_0, \vec{\mu}_1, \Sigma, \phi$, we need to maximize the joint likelihood:

\[\begin{aligned} L(\vec{\mu}_0, \vec{\mu}_1, \Sigma, \phi) &= \prod_{i=1}^m P(x^{(i)}, y^{(i)}) \\ &= \prod_{i=1}^m P(x^{(i)} \vert y^{(i)})P(y^{(i)}) \\ \end{aligned}\]

By MLE, we have:

\[\begin{aligned} \vec{\mu}_0 &= \frac{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 0\}x^{(i)}}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 0\}} \\ \vec{\mu}_1 &= \frac{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\}x^{(i)}}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\}} \\ \phi &= \frac{1}{m} \sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\} \\ \Sigma &= \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \vec{\mu}_{y^{(i)}})(x^{(i)} - \vec{\mu}_{y^{(i)}})^T \end{aligned}\]

The prediction is:

\[\arg \max_y P(y \vert x) = \arg \max_y P(x \vert y)P(y)\]
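
Putting the MLE formulas and the prediction rule together, here is a minimal numpy sketch of GDA on synthetic data; the function names and the data are illustrative, not part of the notes:

```python
import numpy as np

def fit_gda(X, y):
    """MLE for GDA: X is (m, n), y is (m,) with values in {0, 1}."""
    phi = y.mean()                               # P(y = 1)
    mu0 = X[y == 0].mean(axis=0)                 # class-0 mean
    mu1 = X[y == 1].mean(axis=0)                 # class-1 mean
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(y)       # shared covariance
    return phi, mu0, mu1, Sigma

def log_gaussian(x, mu, Sigma):
    """Log of the multivariate normal density N(mu, Sigma) at x."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet + len(x) * np.log(2 * np.pi))

def predict_gda(x, phi, mu0, mu1, Sigma):
    """arg max_y P(x | y) P(y), computed in log space."""
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)

# Tiny synthetic example: two Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gda(X, y)
print(predict_gda(np.array([1.8, 2.1]), *params))  # expected: 1
```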

Naive Bayes

Text classification in Natural Language Processing (NLP) is a typical application of Naive Bayes. The first step is to map the text to a feature vector $x$.

Consider building an email spam filter using Naive Bayes. We wish to classify messages according to whether they are spam or not.

Suppose we have a dictionary of $n$ words. We represent an email via a feature vector $x \in \{0,1\}^{n}$: if word $i$ of the dictionary appears in the email, then $x_i = 1$; otherwise $x_i = 0$.

Assume that the $x_{i}$ are conditionally independent given $y$ (the Naive Bayes assumption):

\[P(x \vert y) = \prod_{i=1}^n P(x_i \vert y)\]

The parameters are:

\[\begin{aligned} \phi_{j \vert y = 1} &= P(x_j = 1 \vert y = 1) \\ \phi_{j \vert y = 0} &= P(x_j = 1 \vert y = 0) \\ \phi_y &= P(y = 1) \end{aligned}\]

The joint likelihood is:

\[L(\phi_{y}, \phi_{j \vert y}) = \prod_{i=1}^m P(x^{(i)}, y^{(i)}; \phi_y, \phi_{j \vert y})\]

Maximizing this likelihood (MLE) gives:

\[\begin{aligned} \phi_{y} &= \frac{1}{m} \sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\} \\ \phi_{j \vert y = 1} &= \frac{\sum_{i=1}^m \mathbb{I}\{x_j^{(i)} = 1, y^{(i)} = 1\}}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\}} \\ \phi_{j \vert y = 0} &= \frac{\sum_{i=1}^m \mathbb{I}\{x_j^{(i)} = 1, y^{(i)} = 0\}}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 0\}} \\ \end{aligned}\]
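
These estimates are just per-class word-frequency counts. A minimal numpy sketch, using a tiny made-up binary matrix for illustration:

```python
import numpy as np

# X: (m, n) binary matrix, X[i, j] = 1 iff dictionary word j appears in email i.
# y: (m,) labels, 1 = spam. Toy data for illustration only.
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])

phi_y = y.mean()                                          # P(y = 1)
phi_j_given_1 = X[y == 1].sum(axis=0) / (y == 1).sum()    # P(x_j = 1 | y = 1)
phi_j_given_0 = X[y == 0].sum(axis=0) / (y == 0).sum()    # P(x_j = 1 | y = 0)
print(phi_y, phi_j_given_1, phi_j_given_0)
```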

Laplace Smoothing

If a word never appears in the training emails of one class, its estimated probability for that class is zero, which zeroes out the product $P(x \vert y)$ and makes the posterior ill-defined ($0/0$). To avoid this, we can use Laplace Smoothing:

\[\begin{aligned} \phi_{j \vert y = 1} &= \frac{\sum_{i=1}^m \mathbb{I}\{x_j^{(i)} = 1, y^{(i)} = 1\} + \alpha}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\} + 2\alpha} \\ \phi_{j \vert y = 0} &= \frac{\sum_{i=1}^m \mathbb{I}\{x_j^{(i)} = 1, y^{(i)} = 0\} + \alpha}{\sum_{i=1}^m \mathbb{I}\{y^{(i)} = 0\} + 2\alpha} \\ \end{aligned}\]

where $\alpha$ is the smoothing parameter.
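
A minimal sketch of the smoothed estimate, assuming the same toy data as above and the common default $\alpha = 1$:

```python
import numpy as np

def smoothed_phi(X, y, label, alpha=1.0):
    """P(x_j = 1 | y = label) with add-alpha smoothing over the 2 values of x_j."""
    mask = (y == label)
    return (X[mask].sum(axis=0) + alpha) / (mask.sum() + 2 * alpha)

X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
print(smoothed_phi(X, y, 1), smoothed_phi(X, y, 0))  # no estimate is exactly 0 or 1
```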

Event Models

Multivariate Bernoulli Event Model

In this model, we assume that the way an email is generated is that first it is randomly determined (according to $P(y)$) whether the email is spam or not, and then the person sending the email runs through the dictionary, deciding whether to include each word $j$ in that email independently.

\[\begin{aligned} P(x \vert y) &= \prod_{j=1}^n P(x_j \vert y) \\ P(x_j \vert y) &= \phi_{j \vert y}^{x_j} (1 - \phi_{j \vert y})^{(1 - x_j)} \\ P(y \vert x) &= \frac{P(x \vert y)P(y)}{P(x)} \\ &\propto P(y) \prod_{j=1}^n P(x_j \vert y) \\ \end{aligned}\]
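
A minimal sketch of this posterior computation, done in log space to avoid underflow; the parameter values below are illustrative placeholders, not fitted from real data:

```python
import numpy as np

def predict_bernoulli_nb(x, phi_y, phi_1, phi_0):
    """arg max_y P(y) * prod_j phi_{j|y}^{x_j} (1 - phi_{j|y})^{1 - x_j}, in log space."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_1) + (1 - x) * np.log(1 - phi_1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_0) + (1 - x) * np.log(1 - phi_0))
    return int(log_p1 > log_p0)

# Illustrative smoothed parameters for a 3-word dictionary.
phi_y = 0.5
phi_1 = np.array([0.75, 0.5, 0.5])   # P(x_j = 1 | y = 1)
phi_0 = np.array([0.25, 0.5, 0.5])   # P(x_j = 1 | y = 0)
print(predict_bernoulli_nb(np.array([1, 0, 1]), phi_y, phi_1, phi_0))  # expected: 1
```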

Multinomial Event Model

In the multinomial event model, we assume that the way an email is generated is via a random process in which spam/non-spam is first determined as before. Then, the sender writes the email by first generating $x_1$ from some multinomial distribution over the words of the dictionary, then generating $x_2$ independently from the same distribution, and so on.

Here $x_j \in \{1, \ldots, \vert V \vert\}$ denotes the identity of the $j$-th word in the email, and the posterior is again

\[P(y \vert x) \propto P(y) \prod_{j=1}^n P(x_j \vert y)\]

where $n$ is the number of words in the email.

The parameters (assumed to be the same for every word position $j$) are:

\[\begin{aligned} \phi_{y} &= P(y) \\ \phi_{k \vert y = 1} &= P(x_{j} = k \vert y = 1) \\ \phi_{k \vert y = 0} &= P(x_{j} = k \vert y = 0) \\ \end{aligned}\]

Suppose we are given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \cdots, x_{n_{i}}^{(i)})$ and $n_{i}$ is the number of words in the $i$-th training example.

The likelihood of the data is given by

\[\begin{aligned} L(\phi_{y}, \phi_{k \vert y = 0}, \phi_{k \vert y = 1}) &= \prod_{i=1}^m P(x^{(i)}, y^{(i)}) \\ &= \prod_{i=1}^m P(y^{(i)}; \phi_{y}) \prod_{j=1}^{n_{i}} P(x_{j}^{(i)} \vert y^{(i)}; \phi_{k \vert y = 0}, \phi_{k \vert y = 1}) \end{aligned}\]

Maximizing the likelihood yields:

\[\begin{aligned} \phi_{y} &= \frac{1}{m} \sum_{i=1}^m \mathbb{I}\{y^{(i)} = 1\} \\ \phi_{k \vert y = 1} &= \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_{i}} \mathbb{I}\{x_{j}^{(i)} = k, \, y^{(i)} = 1\}}{\sum_{i=1}^{m} \mathbb{I}\{y^{(i)} = 1\} \, n_{i}} \\ \end{aligned}\]

By applying Laplace Smoothing over the $\vert V \vert$ possible word values (a small sketch of this model follows the symbol list below), we have:

\[\begin{aligned} \phi_{k \vert y = 1} &= \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_{i}} \mathbb{I}\{x_{j}^{(i)} = k, \, y^{(i)} = 1\} + \alpha}{\sum_{i=1}^{m} \mathbb{I}\{y^{(i)} = 1\} \, n_{i} + \vert V \vert \alpha} \\ \end{aligned}\]
  • $x_{j}^{(i)}$ is the $j$-th word in the $i$-th training example (email).
  • $m$ is the number of training examples.
  • $n_{i}$ is the number of words in the $i$-th training example (email).
  • $k$ is the index of the word in the vocabulary.
  • $\vert V \vert$ is the size of the vocabulary.
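
Finally, a minimal sketch of the multinomial event model with Laplace smoothing, where each email is a variable-length list of word indices into a vocabulary of size $\vert V \vert$; the toy corpus and vocabulary size are made up for illustration:

```python
import numpy as np

def fit_multinomial_nb(emails, labels, vocab_size, alpha=1.0):
    """Estimate phi_y and the smoothed word distributions phi_{k|y} from lists of word indices."""
    labels = np.asarray(labels)
    phi_y = labels.mean()
    phi_k = np.zeros((2, vocab_size))
    for c in (0, 1):
        counts = np.zeros(vocab_size)
        total_words = 0
        for email, label in zip(emails, labels):
            if label == c:
                counts += np.bincount(email, minlength=vocab_size)
                total_words += len(email)
        # Add-alpha smoothing over the |V| possible word values.
        phi_k[c] = (counts + alpha) / (total_words + vocab_size * alpha)
    return phi_y, phi_k

def predict_multinomial_nb(email, phi_y, phi_k):
    """arg max_y P(y) * prod_j P(x_j | y), in log space."""
    log_p1 = np.log(phi_y) + np.log(phi_k[1][email]).sum()
    log_p0 = np.log(1 - phi_y) + np.log(phi_k[0][email]).sum()
    return int(log_p1 > log_p0)

# Toy corpus: word indices into a 5-word vocabulary, labels 1 = spam.
emails = [np.array([0, 1, 1, 2]), np.array([0, 1]), np.array([3, 4, 4]), np.array([2, 3, 4])]
labels = [1, 1, 0, 0]
phi_y, phi_k = fit_multinomial_nb(emails, labels, vocab_size=5)
print(predict_multinomial_nb(np.array([0, 1, 1]), phi_y, phi_k))  # expected: 1
```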