In the last post, we looked at how a text is pre-processed to make it accessible to a neural network and saw that the first step is to convert the text into a sequence of numbers, where each number is the index of the corresponding word in a vocabulary. Let us now discuss how we can convert each of these numbers into a vector.
Most machine learning models which are used for natural language processing have a property called the model dimension which we will abbreviate by D. A model dimension of, say, 768, simply means that internally, all words are represented by vectors in a vector space of dimension 768. Thus, a single word is a one-dimensional tensor of length D = 768. Our task is therefore to assign to each word in a vocabulary of size V a vector in a D-dimensional space. This assignment, called the embedding, can be nicely represented as a matrix of dimension D x V, so that the column at position i represents the word with index i in the vocabulary.

Of course there are endless possibilities to construct such an embedding. In most cases, the embedding is a learned parameter, i.e. we start training with a randomly initialized embedding and then apply gradient descent to the embedding matrix during training, just as to any other parameter. However, it has become increasingly popular to use an embedding that has already been pre-trained, so that training does not start from zero and the model hopefully converges faster. One method that facilitates such a pre-training is the word2vec algorithm.
The idea of the word2vec algorithm (and of many other approaches to constructing embeddings) is to start with a larger model that contains the embedding we wish to train and a second model part that is adapted to a certain downstream task. We then train the entire model on this downstream task, hoping that the embedding layer will capture not only the specific information required for the downstream task, but more general patterns that are useful for other tasks as well. We then throw away the upper part of the model and reuse the embedding layer for other tasks.

The diagram above shows this architecture. The model consists of an embedding layer which translates a word represented by an index between 0 and V – 1 into a vector of dimension D, the internal model dimension. You can think of this as the combination of a one-hot encoding that turns the index into a vector of dimension V and a linear layer without bias that projects onto D dimensions. This part of the model is, once trained, the actual artefact that we might reuse in other, more complex models.
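As a quick sanity check of this equivalence, here is a small sketch of my own with made-up dimensions (not taken from the notebook): an embedding lookup gives the same result as a one-hot encoding followed by a linear layer without bias that shares the same weights.

import torch

V, D = 10, 4                                  # toy vocabulary size and model dimension
embedding = torch.nn.Embedding(V, D)          # weight matrix of shape (V, D)
linear = torch.nn.Linear(V, D, bias=False)    # weight matrix of shape (D, V)
linear.weight.data = embedding.weight.data.T  # use the same weights in both layers

idx = torch.tensor(3)                                    # word with index 3
one_hot = torch.nn.functional.one_hot(idx, V).float()    # one-hot vector of dimension V
print(torch.allclose(embedding(idx), linear(one_hot)))   # True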
The upper part of the model is adapted to the specific downstream task on which word2vec has been trained. The original paper actually describes two downstream tasks, called CBOW and Skipgram; in this post, we will focus on CBOW.
Before describing CBOW, let us first try to explain the underlying objective of the training. We want to construct embeddings that capture not only a word itself, but the meaning of a word. Put differently, we want words that have a similar meaning to end up as nearby vectors. To make this precise, we have to define a notion of similarity for our embeddings, i.e. for D-dimensional vectors, and for words.
For vectors, this is easy. In two-dimensional linear algebra, we would call two vectors similar if they point roughly in the same direction, i.e. if the angle between them is small, or in other words if the cosine of the angle is close to one. In D-dimensional space, we cannot easily visualize angles any more, but the dot product gives us a good replacement for the cosine. So to measure the similarity of two vectors x and y, we can take the normed dot product and simply define this to be the cosine

$$\cos(x, y) = \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert}$$
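In PyTorch, this could look as follows (a small sketch with random vectors, just to illustrate the definition; the built-in torch.nn.functional.cosine_similarity computes the same quantity):

import torch

x = torch.randn(768)
y = torch.randn(768)

# Normed dot product
cosine = torch.dot(x, y) / (torch.norm(x) * torch.norm(y))

# The built-in helper gives the same result
print(torch.allclose(cosine, torch.nn.functional.cosine_similarity(x, y, dim=0)))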
Defining similarity between words is a bit more complicated. The approach that word2vec takes is to assume that two words have a similar meaning if they tend to appear in the same context, i.e. surrounded by similar sets of words.
To explain this, let us consider a simple sentence (I did not make this up; this sentence actually appears almost verbatim in the training data that we will use later).
“A team of 24 players was selected from an initial pool of 49 candidates”
Let us pick a word in this sentence, say “from”. We can then define the context of this center word to be the set of all words in the sentence that appear within a certain range around the center word. For example, if we choose a window size of four, the region that makes up the context extends by two words to the left and two words to the right of the center word. Thus, the context words for the center word “from” are
“was”, “selected”, “an”, “initial”
So the context of a center word is simply the set of all words in the window around the center without the center word itself. The idea of word2vec is that the meaning of a word can be captured by the context words that appear in combination with it. If two center words appear most of the time surrounded by the same context words, then they are considered to have a similar meaning.
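To make this concrete, here is a quick check of the example above (a small sketch of my own; the sentence is naively split on whitespace rather than run through a real tokenizer):

sentence = "A team of 24 players was selected from an initial pool of 49 candidates".split()

window_size = 4
half = window_size // 2
index = sentence.index("from")          # position of the center word
context = sentence[index - half:index] + sentence[index + 1:index + half + 1]
print(context)                          # ['was', 'selected', 'an', 'initial']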
To see that this makes sense, consider another example – the sentence “The mighty king is sitting on a golden throne”. If we replace “king” by “ruler”, the resulting sentence would still be likely to appear in a large corpus of text. As the words “king” and “ruler” can replace each other in their respective context while still making sense, we would consider them to have a similar meaning.
To turn this idea into a training objective, the CBOW algorithm proceeds as follows. First, we go through our data and for each center word, we determine the context as above. Each pair of center word and context constitutes one training sample. We now train the model to predict the center word from the given context. More specifically, we first turn the context into a single vector by feeding each context word into the embedding layer and taking the average of the resulting vectors (this is why the model is called CBOW, an abbreviation for “continuous bag of words”, as taking the average ignores the position of a word within the context). We now have a single vector of dimension D which we can use as input for a second linear layer that turns it back into a vector of dimension V. This vector is then fed into a softmax, which yields a probability distribution over the indices in the range between 0 and V – 1. The target is our center word, and we can apply the usual cross entropy loss function. So CBOW is essentially a classification problem in which the label is the center word and the input is the averaged context.

Note that neither the embedding layer nor the linear layer has a bias, so that both are fully determined by their weight matrices, say U for the embedding layer and W for the linear layer. The function to which we apply the softmax is then essentially the matrix product of U with the transpose of W, whose entries are the dot products of the rows of U with the rows of W, both of which we can interpret as embeddings. If we write this out in terms of scalar products, we see the cosines emerging and develop an intuition why this training objective does indeed foster the learning of similarities.
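Spelled out in formulas (the notation is mine: $u_w$ denotes the row of U belonging to word $w$, $w_c$ the row of W belonging to word $c$, and $C$ the set of context words), the model forms the averaged context vector and applies a softmax to its dot products with the rows of W:

$$\bar{u} = \frac{1}{|C|} \sum_{w \in C} u_w, \qquad p(c \mid C) = \frac{\exp(w_c \cdot \bar{u})}{\sum_{j=1}^{V} \exp(w_j \cdot \bar{u})}$$

Training with the cross entropy loss therefore increases the dot product (and, for vectors of bounded norm, the cosine) between the vector of the true center word and the averaged vectors of its typical contexts.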
But instead of diving deeper into this, let us go ahead and discuss the implementation in PyTorch. To implement the embedding layer without having to make the one-hot encoding explicit, we can use the torch.nn.Embedding class provided by PyTorch. The forward method of this module accepts an index or a sequence of indices and returns a vector or a sequence of vectors, which is exactly what we need. The output layer is an ordinary linear layer. Following the usual practice, we do not add the final softmax layer to our model but use the cross entropy loss function, which includes the calculation of the softmax. With this, our model is rather simple:
import torch

EMBED_MAX_NORM = 1

class CBOW(torch.nn.Module):

    def __init__(self, model_dim, V, bias=False):
        super().__init__()
        #
        # Embedding layer, mapping indices in the range 0 ... V - 1
        # to vectors of dimension model_dim
        #
        self.embedding = torch.nn.Embedding(
            num_embeddings=V,
            embedding_dim=model_dim,
            max_norm=EMBED_MAX_NORM,
        )
        #
        # Output layer, projecting back into the vocabulary
        #
        self.linear = torch.nn.Linear(
            in_features=model_dim,
            out_features=V,
            bias=bias
        )

    def forward(self, Y):
        E = self.embedding(Y).mean(dim=1)
        U = self.linear(E)
        return U
Note the max_norm parameter, which re-normalizes an embedding vector whenever its norm exceeds the given threshold. Setting this parameter turns out to be helpful during training (a fact that I discovered after reading this excellent blog post by O. Chernytska, which turned out to be a very valuable resource when training my model).
We see that the forward method does what we have sketched earlier – we first apply the embedding, which gives us a tensor of shape (B, W, D), where B is the batch size, W is the window size and D is the model dimension. We then take the mean along the middle dimension, i.e. the mean of the embeddings of all words in the context, which gives us a tensor of shape (B, D). Finally, we apply the output layer to obtain a batch of shape (B, V) which we use as input to our loss function.
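As a quick sanity check of these shapes, we can feed a batch of random indices through the model (the numbers here are made up, of course):

model = CBOW(model_dim=300, V=5000)
batch = torch.randint(0, 5000, (32, 8))    # batch of 32 contexts with 8 token indices each
out = model(batch)
print(out.shape)                           # torch.Size([32, 5000])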
Let us now discuss our data and the preprocessing. To train our model, we will use the WikiText2 dataset, which consists of roughly 2 million tokens taken from Wikipedia and is available via the Torchtext library. Each item in the dataset is a paragraph. Some paragraphs consist of a title only, which we remove. We then apply a tokenizer to the remaining paragraphs and collect them in one large list, in which each item is again a list of tokens.
import re
import torchtext

ds = torchtext.datasets.WikiText2(split="train")
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")
paragraphs = []
for item in ds:
    #
    # Remove leading whitespace and special characters
    #
    item = re.sub(r"^\s+", "", item)
    item = re.sub("@", "", item)
    #
    # Skip titles (lines starting with '=') and tokenize the rest
    #
    if not re.match("^=", item):
        p = tokenizer(item)
        if len(p):
            paragraphs.append(p)
Next, we build a vocabulary, again using the torchtext library. We add a special token “<unk>” to the vocabulary that stands for an unknown word (and is in fact already present in the input data). We also only add tokens to the vocabulary that appear at least a given number of times, i.e. have a certain minimum frequency.
vocab = torchtext.vocab.build_vocab_from_iterator(
    paragraphs,
    min_freq=min_freq,
    specials=["<unk>"]
)
vocab.set_default_index(vocab["<unk>"])
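The vocabulary maps tokens to indices and falls back to the index of “<unk>” for anything it does not know. Encoding the tokenized paragraphs could then look like this (a small sketch; the name encoded_paragraphs is mine, the notebook might organize this step differently):

encoded_paragraphs = [vocab(p) for p in paragraphs]
print(paragraphs[0][:5])
print(encoded_paragraphs[0][:5])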
Having encoded the paragraphs using our vocabulary, we next need to create pairs of center word and context from our training data. Here, it pays off that we have maintained the paragraph structure, as a context that spans two paragraphs is probably less useful, and we want to avoid this. One way to generate the center/context pairs is a Python generator function like this one.
def yield_context(paragraphs, window_size=8):
    for p in paragraphs:
        half = window_size // 2
        #
        # Visit each position in the paragraph, use it as the center
        # and yield it together with the surrounding window
        #
        for index, center in enumerate(p):
            context = p[max(0, index - half):index]
            context.extend(p[index + 1:min(len(p), index + half + 1)])
            yield center, context
Here we visit each position in each paragraph and use it as the center position. We then carve out half of the window size to the left of the center token and half of the window size to the right and concatenate the resulting lists, which we yield along with the center.
To train our model, we can put all this code into a PyTorch dataset so that we can use it along with a PyTorch data loader. Training is then straightforward. I have used an initial learning rate of 0.1 for a model dimension of 300 and a batch size of 20000, together with a linear rate scheduler and an Adam optimizer.
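For orientation, here is a condensed sketch of how dataset and training loop could be wired up. This is my own paraphrase, not the notebook code: the class name CBOWDataset is made up, the scheduler settings are only indicative, and contexts shorter than the full window are simply dropped to keep batching trivial.

class CBOWDataset(torch.utils.data.Dataset):

    def __init__(self, paragraphs, window_size=8):
        #
        # Precompute all (center, context) pairs; we only keep contexts of
        # full size so that all samples can be stacked into batches
        #
        self.samples = [
            (center, context)
            for center, context in yield_context(paragraphs, window_size)
            if len(context) == window_size
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        center, context = self.samples[i]
        return torch.tensor(context), torch.tensor(center)

#
# Hyperparameters as quoted above: dimension 300, batch size 20000,
# initial learning rate 0.1
#
epochs = 7
ds_train = CBOWDataset(encoded_paragraphs)
loader = torch.utils.data.DataLoader(ds_train, batch_size=20000, shuffle=True)
model = CBOW(model_dim=300, V=len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0,
                                              end_factor=0.1, total_iters=epochs)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    for X, target in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(X), target)
        loss.backward()
        optimizer.step()
    scheduler.step()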
To make it a bit easier for you to try this out, I put together all the code in a notebook that is available in the GitHub repository for this series. To run this, you have several options. First, you can run it locally (or in your favorite cloud environment, of course) by cloning the repository and following the instructions in the README, i.e.
git clone https://github.com/christianb93/MLLM
cd MLLM
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 install ipykernel
python3 -m ipykernel install --user --name=MLLM
jupyter-lab
Then navigate to the notebook in the word2vec directory and run it. Alternatively, you can run it in Google Colab by simply clicking on this link. Note that the second cell will install the portalocker package if not yet present, so you might have to restart the runtime afterwards to make sure that the package can be used (use Runtime – restart and run all if you get an error in the third cell saying that portalocker cannot be found).
With the chosen parameters, training is rather smooth, and less than 10 epochs are sufficient for reasonable results. In the notebook, I train for 7 epochs and end up with a mean training loss of a bit more than 4.5 in the last epoch (to keep things simple, I do not measure the validation loss).

Once we have trained the model, we can check that it actually does what we set out to do – namely, make sure that similar or semantically related words receive similar embeddings. To verify this, let us pick a word that appears in the vocabulary and try to find those words in the vocabulary that are closest to it, i.e. have the largest cosines. To do this, we can extract the weights (which PyTorch stores internally with shape (V, D), so that the rows are the embeddings) via the weight attribute of torch.nn.Embedding. Given the embedding of a fixed token, we then need to take the dot product of each row of the weight matrix with the vector representing that token, which can conveniently be organized as a matrix multiplication. We can then sort the resulting vector and extract the five largest entries. Here is a short piece of code doing this.
def print_most_similar(token, embeddings, vocab):
    #
    # Normalize embeddings
    #
    _embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    #
    # get u, the embedding of our token
    #
    u = _embeddings[vocab[token], :]
    #
    # do dot products as one large matrix multiplication
    #
    v = torch.matmul(_embeddings, u)
    #
    # Sort this
    #
    values, indices = torch.sort(v, descending=True)
    print(f"Most similar token for {token}")
    for i in range(5):
        print(f"    {vocab.lookup_token(indices[i])} -- {values[i]}")
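A typical call, assuming that the trained model is stored in a variable called model, would then be

embeddings = model.embedding.weight.data
print_most_similar("king", embeddings, vocab)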
If we run this after training for the word “king”, the results we get (which are to some extent random, so you might get slightly different results if you run this yourself) are
king -- 0.9999999403953552
son -- 0.5954159498214722
earl -- 0.59091717004776
archbishop -- 0.57264244556427
pope -- 0.5617164969444275
This is not bad for two minutes of training! Apart from the word itself and the second entry, all hits clearly denote some sort of ruler and will in that sense probably appear in semantic roles similar to those of the word “king”.
There is one more experiment that we can make. Remember that the second layer of our model converts the vectors from the internal dimension back into the vocabulary. This is a linear layer with a weight matrix that has the same shape as that of the embedding! Thus we actually learn two embeddings – the embedding that we have modelled as a torch.nn.Embedding layer and apply to the context words (the context embedding), and the embedding that is implicit in the linear layer of type torch.nn.Linear, which one might call the center embedding. We can repeat the test above with the center embedding (again, look at the notebook for the details) and get a very similar output.
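Since the weight of a torch.nn.Linear layer has shape (V, D) as well, its rows can be fed into the same helper function; something along these lines (again assuming the trained model is called model) reproduces the experiment:

print_most_similar("king", model.linear.weight.data, vocab)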
king -- 1.0
pope -- 0.6784377694129944
lord -- 0.6100903749465942
henry -- 0.5989689230918884
queen -- 0.5779016017913818
With this, let us close this post for today. If you want to read more about word2vec, the Skipgram task that we have not presented, and more advanced versions of the algorithm, I have listed a few valuable reads below. In this series, we will continue with an introduction to RNNs, a class of network architectures that is important to understand, as many training methods and terms used for transformers go back to them.
[1] Chapter 6 of “Speech and Language Processing” by Jurafsky and Martin
[2] Section 15.1 of “Dive into Deep Learning”
[3] The original paper introducing word2vec available here
[4] A PyTorch implementation by O. Chernytska
[5] The illustrated word2vec by J. Alammar