Mastering large language models – Part XIII: putting it all together

Today, we will apply what we have learned in the previous posts on BPE, attention and transformers – we will implement and train a simple decoder-only transformer model from scratch.

First, let us discuss our model. The overall structure of the model is in fact quite similar to the model we have used when training an LSTM on “War and Peace” – we will have an embedding layer turning words into vectors, the actual network and an output layer of the dimension of the vocabulary from which we will sample our next word. However, there are a few notable differences.

The most obvious and most important one is that we use a transformer instead of an LSTM. More specifically, we will use a decoder-only model, i.e. there is no input from an encoder. Instead, the model applies self-attention to its input in each layer. Recall that technically, the transformer blocks in this model will be instances of torch.nn.TransformerEncoderLayer, even though we will use them as decoder building blocks. We will also have to use masking to prevent our model from peeking ahead in the self-attention layers.

The second difference is that instead of character-level encoding, we will apply a BPE tokenizer trained with a fixed number of merges. We will use sinusoidal positional embeddings as in the original transformer paper. With these modifications, our model looks as follows.

Using the sinusoidal embeddings discussed in one of our previous posts, we can now easily code our model in Python.

class Model(torch.nn.Module):
    
    def __init__(self, vocab_size, model_dim = MODEL_DIM, context_size = CONTEXT_SIZE, ff_dim = FF_DIM, heads = HEADS, layers = LAYERS, dropout = DROPOUT):
        super().__init__()
        self._word_embedding = torch.nn.Embedding(vocab_size, model_dim)
        self._pe_embedding = PosEmbeddingLayer(context_size, model_dim)
        layer = torch.nn.TransformerEncoderLayer(d_model = model_dim, nhead = heads, dim_feedforward = ff_dim, dropout = dropout)
        self._transformer = torch.nn.TransformerEncoder(layer, num_layers = layers)
        self._linear = torch.nn.Linear(in_features = model_dim, out_features = vocab_size)
        self._model_dim = model_dim
        self._context_size = context_size
        self._vocab_size = vocab_size
        #
        # Build the causal mask once: -inf strictly above the diagonal, 0 elsewhere
        #
        cached_mask = torch.tril(torch.ones(context_size, context_size)*(-1)*float('inf'), diagonal = -1).t()
        self.register_buffer("_cached_mask", cached_mask)
    
    #
    # Create a causal self-attention mask
    #
    def get_self_attention_mask(self):
        return self._cached_mask
        
    #
    # Shape of input: (L, B)
    # 
    def forward(self, x):
        assert len(x.shape) == 2, "Expecting two-dimensional input"
        (L, B) = x.shape
        x = self._word_embedding(x) # shape (L, B, model_dim)
        x = self._pe_embedding(x) 
        #
        # Mask the input. As we have registered the mask as a buffer, it
        # should already be on the same device as the model
        #
        mask = self.get_self_attention_mask()[:L, :L]
        x = self._transformer(x, mask = mask)
        return self._linear(x)        
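
In case you do not have the earlier post on positional encodings at hand: here is a minimal sketch of what the PosEmbeddingLayer used above could look like, assuming the sinusoidal scheme from the original transformer paper – the actual implementation from that post (and in the repository) may differ in details like scaling or dropout.

import math
import torch

class PosEmbeddingLayer(torch.nn.Module):

    #
    # Sketch of a sinusoidal positional embedding layer (assumes model_dim is even)
    #
    def __init__(self, context_size, model_dim):
        super().__init__()
        position = torch.arange(context_size).unsqueeze(1)      # shape (context_size, 1)
        div_term = torch.exp(torch.arange(0, model_dim, 2) * (-math.log(10000.0) / model_dim))
        pe = torch.zeros(context_size, model_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        #
        # Shape (context_size, 1, model_dim) so that the encodings broadcast over the batch dimension
        #
        self.register_buffer("_pe", pe.unsqueeze(1))

    def forward(self, x):
        # x has shape (L, B, model_dim) - add the encodings for the first L positions
        return x + self._pe[:x.shape[0], :]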


Note the causal self-attention mask that we use. To avoid recalculating this mask for every input, we create it once in the constructor and register it as a buffer of the model, which ensures that the mask is transferred along with the model if we move it to the GPU.
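To see what this mask looks like, we can run the same construction for a small context size (this snippet is purely illustrative and not part of the model code).

import torch

context_size = 4
mask = torch.tril(torch.ones(context_size, context_size)*(-1)*float('inf'), diagonal = -1).t()
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

Position i can therefore only attend to positions j less than or equal to i – all other entries are minus infinity and will be mapped to zero by the softmax inside the attention layers.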

We also follow the convention of using the second dimension as the batch dimension. Our initial input therefore has shape (L, B), where each column of length L holds the indices of the tokens of one sequence in the vocabulary. We pass this through the embedding layers to obtain a tensor of shape (L, B, D), then through all transformer layers, and finally through the output layer, which projects back onto the vocabulary, giving a tensor of shape (L, B, V). As usual, we do not include the final softmax in the model, so we have to apply the softmax ourselves before we sample.
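As a quick sanity check, here is how the shapes work out for a small configuration, and how we could then sample a next token for every sequence in the batch (the parameter values below are arbitrary and not the ones used for the actual training runs).

import torch

model = Model(vocab_size = 100, model_dim = 32, context_size = 16, ff_dim = 64, heads = 4, layers = 2, dropout = 0.1)
x = torch.randint(0, 100, (8, 4))                     # L = 8 token indices for each of B = 4 sequences
out = model(x)                                        # shape (8, 4, 100), i.e. (L, B, V)
p = torch.softmax(out[-1], dim = -1)                  # distribution over the vocabulary for the next token, shape (B, V)
next_token = torch.multinomial(p, num_samples = 1)    # one sampled token index per sequence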

Let us now discuss our dataset and the preprocessing. As in our word2vec example, we will use the WikiText dataset. This dataset comes in two flavors: WikiText2 with a bit more than 2 million tokens and WikiText103 with a bit more than 100 million tokens. During preprocessing, we need to pre-tokenize the dataset, build a BPE vocabulary and a rule set from it, encode the data and finally split it into a training dataset and a validation dataset. Training then proceeds as usual using teacher forcing. As in the GPT-3 paper, we use a cosine learning rate schedule that decays the learning rate over time to 10% of its initial value.
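To illustrate what a training step with teacher forcing and this schedule could look like in PyTorch, here is a short sketch – the names model, train_loader, LR and EPOCHS are assumptions made for this example, and the actual training script in the repository is organized differently (checkpointing, validation, mixed precision and so forth).

import torch

LR = 5e-4                                  # assumed initial learning rate
EPOCHS = 2

optimizer = torch.optim.Adam(model.parameters(), lr = LR)
#
# Cosine decay from LR down to 10% of its initial value over the full training run
#
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = EPOCHS * len(train_loader), eta_min = 0.1 * LR)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for x, y in train_loader:              # x, y of shape (L, B), y is x shifted by one position
        optimizer.zero_grad()
        out = model(x)                     # shape (L, B, V)
        #
        # Teacher forcing: at every position, predict the next token of the ground truth sequence
        #
        loss = loss_fn(out.reshape(-1, out.shape[-1]), y.reshape(-1))
        loss.backward()
        optimizer.step()
        scheduler.step()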

If you want to train the model from scratch but do not have access to a CUDA-enabled machine, you can play with the notebook that I have created for this post on Google CoLab. For this notebook, I use the smaller WikiText2 dataset and train for only three epochs. The execution time does of course depend on the GPU (and the CPU) that Google assigns to your runtime. When I tried this, running the BPE merge and encoding the entire dataset took roughly 5 minutes in total, and the training took less than 30 minutes.

Here is a short sample created using the model trained on Google CoLab (which you can also find in the notebook itself).

Federer also received the first title of the Australian Open in the finals, including the Madonna Open, the Duke of Florence, and his first tournament. Federer won the final, losing to the finals, having scored a record nine times to five consecutive weeks before losing 19, in the Bir ‘Shoot , an injury

Of course this does not really make sense, but it is still a bit impressive for less than 30 minutes of training. Most parts of this sample are grammatically well-formed, and some snippets even sound reasonable.

To get better results, we can try to train the model on a larger dataset, for instance the full WikiText103 dataset instead of the small WikiText2 dataset. This goes beyond what we can reasonably do with a free GPU on Google CoLab. If, however, you have access to a decent GPU, you can do the training yourself. To create the training dataset, run the following commands (after activating a matching Python environment as described in the README).

#
# Clone repository if not yet done
#
git clone https://github.com/christianb93/MLLM
cd MLLM/wiki
#
# Remove existing data and run preprocessing
#
rm -f vocab.dat rules.dat data.json
python3 preprocess.py

This takes a few minutes as we will have to build a BPE vocabulary and encode the entire dataset (do not trust the initial time estimates during encoding – as we implement a cache, the speed goes up with every word that has already been encoded, and the entire process should take around 15 – 20 minutes, depending of course on your environment). The results of the preprocessing are five files – the vocabulary vocab.dat, the merge rules rules.dat, the full encoded dataset data.json, as well as the files train.json and val.json for training and validation (if the vocabulary, rules and data files exist, they will be reused, so make sure to remove existing copies if you do not want this).
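The caching mentioned above is based on the observation that BPE encoding is done word by word, and most words appear many times in the corpus. Here is a sketch of the idea – the helper apply_bpe_merges is a hypothetical placeholder for the actual merge logic, and the real encoder in the repository may look different.

#
# Sketch of caching BPE encodings per word - apply_bpe_merges is a hypothetical
# placeholder for the real merge logic
#
_cache = {}

def encode_word(word, rules):
    if word not in _cache:
        _cache[word] = apply_bpe_merges(word, rules)
    return _cache[word]

def encode(words, rules):
    # words is the pre-tokenized text, i.e. a list of words
    return [token for word in words for token in encode_word(word, rules)]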

Once the preprocessing is complete, we can start the training. I trained the model for two epochs on an A10 machine on Lambdalabs, using the command

python3 train.py --epochs=2 --compile --autocast

On this machine, one batch took 0.05 seconds, resulting in a runtime of 7.5 hours for two epochs. I have added the resulting model checkpoint model.pt to the repository (make sure to delete this file if you want to train from scratch, as otherwise training will restart from this file).
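The --compile and --autocast switches presumably turn on torch.compile and automatic mixed precision (an assumption based on the flag names – check train.py for the details). In plain PyTorch, these two features are typically used roughly as follows, reusing the illustrative names from the training sketch above.

import torch

model = torch.compile(model)                       # JIT-compile the model (PyTorch 2.x)
scaler = torch.cuda.amp.GradScaler()

for x, y in train_loader:
    optimizer.zero_grad()
    #
    # Run forward pass and loss computation in reduced precision
    #
    with torch.autocast(device_type = "cuda", dtype = torch.float16):
        out = model(x)
        loss = loss_fn(out.reshape(-1, out.shape[-1]), y.reshape(-1))
    #
    # Scale the loss to avoid underflow of float16 gradients
    #
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()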

We can now sample from the model (either from the pretrained model or from a model from a custom training run) by simply running

python3 predict.py

By default, this will apply nucleus sampling with p=0.95 and a temperature of 0.7. Here is an output generated using the pretrained model that is part of the repository.

His wedding did not hear the explanation of the title, as his term ” Alan Keegan ” was in a press release. The first episode was broadcast on July 13, 2007. The episode was written by Matt Groening, and was written by Stella Packard. The episode was broadcast on BBC One in the United States on July 11, 2009, and was watched by 2. 93 million viewers, and it was watched by 7. 50 million viewers in the United States. It was the first time that the episode was viewed by 4. 21 million viewers and was viewed by 4. 3 million households. The episode received a 3. 1 rating / 9 share among viewers between the ages of 18 and 49. The episode was watched by approximately 7. 8 million viewers. It received generally positive reviews from television critics. It received generally positive reviews from television critics, according to Nielsen Media Research…

The first sentence does not make too much sense, but the next few sentences do at least make partial sense. We also see, however, that the model tends to repeat itself, which is probably due to the comparatively low temperature. Here is another sample, generated with temperature 0.9.

The series was a single nine – part series against Ivor on its official website. Two seasons earlier, the original series was released on May 2, 2014, was released as a partnership with Clifton Notes in the United States. These were a number of reprint edition hardcover footage. Many reviews were positive and negative and positive. Loves made Yerne the first film in the series by Dirty, featuring BeyoncĂ© ‘s voice acting. The lead character Ron Fisk of Hope called the film ” a rogue look inside “. Lyrically, the crew got off in a course at the end of the scene. Sacred Kimball, who played by the actors of the film, left England at the end of the episode, was shot in a short story of a movie about that film. The film was influenced by David Duchovny
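As mentioned above, predict.py uses nucleus (top-p) sampling combined with a temperature. For reference, here is a minimal sketch of how this sampling strategy could be implemented for the logits of the next token – this is an illustration of the technique, not the actual code from the repository.

import torch

def nucleus_sample(logits, p = 0.95, temperature = 0.7):
    #
    # Convert the logits for the next token (shape (V,)) into probabilities
    #
    probs = torch.softmax(logits / temperature, dim = -1)
    sorted_probs, sorted_idx = torch.sort(probs, descending = True)
    cumulative = torch.cumsum(sorted_probs, dim = -1)
    #
    # Keep the smallest set of most likely tokens whose cumulative probability
    # reaches p (and always keep at least the most likely token)
    #
    keep = cumulative - sorted_probs < p
    keep[0] = True
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    #
    # Sample from the renormalized distribution and map back to the original index
    #
    return sorted_idx[torch.multinomial(filtered, num_samples = 1)].item()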

Clearly, this is still not GPT – on the other hand, our model is tiny (it has only roughly 5 million parameters, compared to the 175 billion parameters used for GPT-3) and so is our dataset (a bit more than 100 million tokens, whereas GPT-3 was trained on more than 300 billion tokens). Theoretically, we could now proceed to train larger models on more data, but the cost associated with the required infrastructure would soon become prohibitive.

Fortunately, a few models and weights are available to the public. Probably the easiest way to download and run these models is the Huggingface Transformers library. In our next post, we will start to work with this library to download popular models like GPT-2 and play with them.
