Word Vectors
One Hot Vectors
One hot vectors are a simple way to represent words as vectors. Each word in the vocabulary is represented by a vector of length equal to the size of the vocabulary, with a 1 in the position corresponding to the word and 0 elsewhere.
For example, if our vocabulary consists of the words [“cat”, “dog”, “people”], the one hot vectors would be:
\[\begin{align*} \text{cat} &= [1, 0, 0] \\ \text{dog} &= [0, 1, 0] \\ \text{people} &= [0, 0, 1] \end{align*}\]In this way, the vector dimension is equal to the size of the vocabulary. However, any two distinct vectors are orthogonal, which means one hot vectors do not capture any natural, inherent sense of the meaning of words.
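A minimal sketch of this encoding, assuming the three-word toy vocabulary above; the zero dot product between any two distinct vectors is exactly the orthogonality problem just described.

```python
import numpy as np

# A toy vocabulary matching the example above (names are illustrative).
vocab = ["cat", "dog", "people"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)               # [1. 0. 0.]
print(np.dot(cat, dog))  # 0.0 -- distinct one-hot vectors are always orthogonal
```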
WordNet
There are substantial existing resources in English and a few other languages for various kinds of annotated information about words. WordNet is a lexical database that annotates words with synonyms, hyponyms, and other semantic relationships; it represents word semantics not as one-hot vectors, but as a collection of features and relationships to linguistic categories and other words.
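As a rough illustration (not part of the original text), WordNet can be queried through the NLTK interface; the snippet below assumes `nltk` is installed and the WordNet corpus has been downloaded.

```python
# Requires: pip install nltk, then a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# A synset groups words that share one sense; "dog" participates in several.
for synset in wn.synsets("dog")[:3]:
    print(synset.name(), "-", synset.definition())

# Relationships such as hypernyms ("is-a" parents) and synonyms (lemmas)
# are stored explicitly in the database rather than learned from data.
dog = wn.synset("dog.n.01")
print([h.name() for h in dog.hypernyms()])
print([lemma.name() for lemma in dog.lemmas()])
```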
Word2Vec (Skip-gram Model)
A promise of deep learning is to learn rich representations of complex objects from data. Increasingly relevant in NLP is the idea that such representations can be learned without supervision. Unsupervised (or lately, “self-supervised”) learning takes data and attempts to learn properties of the elements of that data, often by taking part of the data (maybe a word in a sentence) and attempting to predict other parts of the data (other words) with it.
“You shall know a word by the company it keeps.” - J.R. Firth
The Word2vec framework operates on the supposition that we have a large corpus of text (a long list of words) where every word in a fixed vocabulary is represented by a vector. The process works by going through each position $t$ in the text, identifying a centre word $c$ and its corresponding context words $o$. Using the similarity of the word vectors for $c$ and $o$, the model calculates the probability of $o$ given $c$. To refine the model, we keep adjusting the word vectors in an iterative process to maximize the probability of the actual context words given the centre word.
Here, the centre word is “playing” and the context words are “Connor”, “likes”, “computer”, and “games”.
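As a sketch (assuming a window size of $m = 2$, which matches the four context words in this example), the (centre, context) training pairs can be enumerated as follows:

```python
# Enumerate (centre, context) pairs with a symmetric window of size m.
def skipgram_pairs(tokens, m=2):
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-m, m + 1):                    # positions t-m .. t+m
            if j != 0 and 0 <= t + j < len(tokens):   # skip the centre itself
                pairs.append((centre, tokens[t + j]))
    return pairs

sentence = ["Connor", "likes", "playing", "computer", "games"]
print([p for p in skipgram_pairs(sentence) if p[0] == "playing"])
# [('playing', 'Connor'), ('playing', 'likes'), ('playing', 'computer'), ('playing', 'games')]
```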
For each position $t=1, 2, \dots, T$, we predict context words within a window of fixed size $m$, given centre word $w_t$; the data likelihood function is defined as:
\[L(\theta) = \prod_{t=1}^{T} \prod_{ {-m \leq j \leq m} \atop {j \neq 0}} P(w_{t+j} | w_t; \theta)\]The objective function $J(\theta)$ is the average negative log likelihood of the data:
\[J(\theta) = - \frac{1}{T} \log L(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{ {-m \leq j \leq m} \atop {j \neq 0}} \log P(w_{t+j} | w_t; \theta)\]To calculate the conditional probability $P(w_{t+j} \mid w_t; \theta)$, we will use two vectors to represent each word in the vocabulary: $v_w$ when $w$ is the centre word, and $u_w$ when $w$ is a context word. The conditional probability is defined using the softmax function:
\[P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}\]Context vectors implicitly encode a word’s predictive role rather than its explicit position. Using two separate (asymmetric) embeddings keeps this role information distinct from the word’s own semantics, while a single symmetric embedding collapses the two and reduces directional expressiveness.
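A minimal numerical sketch of this softmax, using small random embeddings (the matrices `U` and `V` and their dimensions are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 8
V = rng.normal(size=(vocab_size, d))   # v_w: vectors used when w is the centre word
U = rng.normal(size=(vocab_size, d))   # u_w: vectors used when w is a context word

def p_context_given_centre(o, c):
    scores = U @ V[c]                  # u_w^T v_c for every w in the vocabulary
    scores -= scores.max()             # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Each (centre, context) pair contributes -log P(o | c) to the objective J(theta).
print(-np.log(p_context_given_centre(o=3, c=1)))
```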
In common practice, the two vectors are averaged to obtain the final representation of a word; in fact, the two learned vectors tend to be closely related to each other.
In real corpus training, the vocabulary size $|V|$ can be very large, making the computation of the softmax function expensive. To address this, techniques such as Negative Sampling are often used to approximate the softmax function and reduce computational complexity.
Specifically, for each observed (positive) pair $(o, c)$, the model is trained to distinguish it from noise. To do so, we randomly sample $K$ words from a noise distribution $P(w)$ as negative samples. The objective function for negative sampling is defined as:
\[\begin{align*} J_{neg-sample}(\theta) &= - \log{\sigma(u_o^T v_c)} - \sum_{k=1}^{K} \log{\sigma(-u_k^T v_c)} \\ P(w) &= \frac{U(w)^{3/4}}{\sum_{w' \in V} U(w')^{3/4}} \end{align*}\]Here, $\sigma$ is the sigmoid function and $U(w)$ is the unigram count of $w$; raising the counts to the power $3/4$ increases the chance of sampling rarer words.
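A toy sketch of this objective (the unigram counts, dimensions, and sampling below are illustrative; real implementations also ensure the sampled negatives are not the true context word):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, K = 5, 8, 3
V = rng.normal(size=(vocab_size, d))           # centre vectors v_w
U = rng.normal(size=(vocab_size, d))           # context vectors u_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Noise distribution P(w) proportional to U(w)^{3/4}, built from raw unigram counts.
unigram_counts = np.array([10.0, 5.0, 3.0, 2.0, 1.0])
p_noise = unigram_counts ** 0.75
p_noise /= p_noise.sum()

def neg_sample_loss(o, c):
    negatives = rng.choice(vocab_size, size=K, p=p_noise)  # K sampled noise words
    loss = -np.log(sigmoid(U[o] @ V[c]))                   # pull the real pair together
    loss -= np.log(sigmoid(-U[negatives] @ V[c])).sum()    # push noise pairs apart
    return loss

print(neg_sample_loss(o=3, c=1))
```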
GloVe

It is worth noting that it is computationally expensive to iterate through the whole corpus (perhaps many times) to train the word vectors. GloVe is a method that captures the co-occurrence statistics of words in a corpus and uses matrix factorization techniques to learn word vectors from the co-occurrence matrix.
Firstly, it constructs a word-word co-occurrence matrix $X$, where each entry $X_{ij}$ represents the number of times word $j$ appears in the context of word $i$. The context is defined as a fixed-size window around the centre word. The conditional probability of word $j$ appearing in the context of word $i$ is defined as:
\[P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}\]It is worth noting that ratios of these probabilities are better able to distinguish relevant words from irrelevant ones, and also better able to discriminate between two relevant words.
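A sketch of how $X$ and $P_{ij}$ might be built from a toy corpus (the corpus and window size here are illustrative):

```python
import numpy as np

corpus = [["connor", "likes", "playing", "computer", "games"],
          ["people", "like", "playing", "games"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within a window of size m around word i.
m = 2
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for t, w in enumerate(sent):
        for j in range(max(0, t - m), min(len(sent), t + m + 1)):
            if j != t:
                X[idx[w], idx[sent[j]]] += 1

# P(j | i) = X_ij / X_i, where X_i is the total count of row i.
row_totals = X.sum(axis=1, keepdims=True)
P = X / np.where(row_totals == 0, 1.0, row_totals)
print(P[idx["playing"], idx["games"]])
```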
Therefore, given three words $i$, $j$, and $k$, the most general model takes the form:
\[F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}\]where $F$ depends only on the difference of the two target words, $w_i - w_j$.
To move from multiplicative to additive relations, we take the logarithm of the probability ratio:
\[w_i^T \tilde{w}_k = \log{P_{ik}} = \log{X_{ik}} - \log{X_i} \\ w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log{X_{ik}}\]Here, $\log{X_i}$ is independent of $k$ and is absorbed into the bias term $b_i$; a second bias $\tilde{b}_k$ is added to restore symmetry.
Two problems remain: $\log{X_{ik}}$ diverges when $X_{ik}=0$, and since $X$ is usually very sparse, the model would be dominated by the large number of rare co-occurrences if all non-zero $X_{ik}$ were treated equally. Both are handled by casting the model as a weighted least-squares problem.
Finally, the cost function is defined as:
\[J = \sum_{i}^{V} \sum_{j}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log{X_{ij}})^2\]where the weighting function $f(x)$ should satisfy the following conditions: $f(x) \rightarrow 0$ as $x \rightarrow 0$, to avoid dominance of rare and zero co-occurrences; $f(x)$ should be non-decreasing, so rare co-occurrences are not overweighted relative to frequent ones; and $f(x)$ should grow slowly (saturate) for large $x$, so very frequent co-occurrences are not overweighted either.
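A sketch of the weighting function and cost; the cutoff $x_{max} = 100$ and exponent $3/4$ follow the choices reported in the original GloVe paper, while the remaining names and toy parameters are illustrative:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # f(0) = 0, f grows with x, and is capped at 1 for very frequent pairs.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares cost summed over the non-zero entries of X."""
    cost = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        cost += f(X[i, j]) * diff ** 2
    return cost

# Example usage with random toy parameters.
rng = np.random.default_rng(0)
V_size, d = 6, 8
X = rng.integers(0, 5, size=(V_size, V_size)).astype(float)
W, W_tilde = rng.normal(size=(V_size, d)), rng.normal(size=(V_size, d))
b, b_tilde = rng.normal(size=V_size), rng.normal(size=V_size)
print(glove_cost(X, W, W_tilde, b, b_tilde))
```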