Mastering large language models – Part XIV: Huggingface transformers

In the previous post, we completed our journey towards being able to code and train a transformer-based language model from scratch. However, we have also seen that obtaining professional results requires an enormous amount of resources, i.e. training data and compute power. Luckily, there are some pre-trained transformer models that we can use, be it directly for inference or as a basis for further fine-tuning. Today, we will take a look at probably the most prominent platform that exists for that purpose – the Huggingface platform.

Huggingface is a company that offers a variety of services around machine learning. In our context, we will mainly be interested in two of their products – the transformers library and the Huggingface hub, a platform to which scientists and enthusiasts can upload weights and trained models to make them available to everyone using the Huggingface transformers library.

As a starting point, let us first make sure that we have the library installed. For this post (and the entire series) I have used version 4.27.4, which you can install as usual.

pip3 install transformers==4.27.4

The recommended starting point for working with the library is a pipeline, which is essentially a container that includes everything we typically need, most notably a tokenizer and a model. As the intention of this post, however, is to make contact with the models that we have implemented so far, we will dig a bit deeper and start directly with a model instead.

In the transformers library, pretrained models are usually downloaded from the hub and identified by name, which – by convention – is the name of the provider followed by a slash and the name of the model. Here is a short code snippet that you can execute in a notebook or in an interactive Python session which will download the weights for the 1.3 billion parameter version of the GPT-Neo model. GPT-Neo is a series of models developed and trained by EleutherAI and made available as open source, including the weights.

import transformers
import torch
model_name="EleutherAI/gpt-neo-1.3B"
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(isinstance(model, torch.nn.Module))

Executing this for the first time will trigger the download of the model to ~/.cache/huggingface/hub. The entire download is roughly 5 GB, so this might take a while. If you want to avoid the large download, you can also use the much smaller 125m version – just change the last part of the model name accordingly. From the output, we can see that this model is a torch.nn.Module, so we can use it as any other PyTorch model.
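
Since the model is an ordinary PyTorch module, we can, for instance, count its parameters and move it to a GPU just as we would do with our own models. Here is a quick sanity check along these lines (nothing here is specific to Huggingface):

# Count the parameters - this should come out at roughly 1.3 billion for gpt-neo-1.3B
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f} billion parameters")
# Move the model to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)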

We can also see that the model actually consists of two components, namely an instance of the class GPTNeoModel defined here and a linear layer – the so-called head – that is sitting on top of the actual transformer and, as usual, transforms between the model inner dimension which is 2048 and the one-hot encoded vocabulary. This decomposition of the full model into a core model and a task-specific head is common to most Huggingface models, as it allows us to use the same set of weights for different tasks.
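
If you want to look at the two parts in isolation, you can access them directly. Note that the attribute names below (transformer and lm_head) are those used by the GPT-Neo implementation in the library version used here – treat them as an assumption if you work with a different model or version.

core = model.transformer   # the actual transformer, an instance of GPTNeoModel
head = model.lm_head       # the head - a linear layer mapping the inner dimension 2048 to the vocabulary
print(type(core).__name__)
print(head)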

Even though the model does not use the transformer blocks that are part of PyTorch, it is a decoder-only transformer model as we would expect – there are two learned embeddings for positions and words and 24 transformer blocks consisting of self-attention and linear layers. There is a minor difference, though – every second layer in the GPT-Neo model uses local self-attention, which basically means that to calculate the attention at a given position, only keys and values within a sliding window are used, not all keys and values in the context – see the comment here.
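
Most of these architectural parameters can be read off directly from the model configuration. Here is a short snippet that prints the most relevant ones – the attribute names are those of the GPT-Neo configuration class, other model families use slightly different names.

config = model.config
print(config.num_layers)               # 24 transformer blocks
print(config.hidden_size)              # 2048 - the inner dimension of the model
print(config.max_position_embeddings)  # 2048 - the context size
print(config.attention_layers)         # alternating 'global' and 'local' self-attention
print(config.window_size)              # size of the sliding window used by local attention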

Apart from this, our model is very close to the models that we have implemented and trained previously. We know how to sample from such a model. But before we can do this, we need to turn our attention to the second part of a pipeline, namely the tokenizer.

Obviously, a pretrained transformer model will only produce reasonable results if we use the same vocabulary and the same tokenizer for inference that we have used for training. Therefore, the Huggingface hub contains a matching tokenizer for every pretrained model. This tokenizer can be downloaded and initialized similar to what we have done for the model.

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

Our tokenizer contains everything that we need to convert back and forth between text and sequences of tokens. Specifically, the main components of a tokenizer are as follows.

First, there is a method encode which accepts a text and returns a sequence of token IDs, and a method decode which conversely translates a sequence of token IDs back into text.

To convert between the string representation of a token and its ID, there are the methods convert_ids_to_tokens and convert_tokens_to_ids. In addition, the tokenizer obviously contains a copy of the vocabulary and some special tokens, like a token that marks the end of a sentence (EOS token), the beginning of a sentence (BOS token), an unknown token and so forth – these should in general match the tokens defined in the model configuration model.config.
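
Here is a short example that exercises these methods – the exact token IDs that you will see depend on the vocabulary, of course.

ids = tokenizer.encode("The sun rises in the east")
print(ids)                                   # a list of token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # string representation of each token
print(tokenizer.decode(ids))                 # back to the original text
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(model.config.eos_token_id)             # should match the ID used by the tokenizer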

In our case, the tokenizer is an instance of the GPT2 tokenizer, which in turn is derived from the PreTrainedTokenizer class.
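
You can easily verify this by printing the class hierarchy of the tokenizer – note that depending on your installation, AutoTokenizer might hand you the fast, Rust-backed variant GPT2TokenizerFast instead, which is derived from PreTrainedTokenizerFast.

for cls in type(tokenizer).__mro__:
    print(cls.__name__)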

Having a tokenizer at our disposal, we can now sample from the model in exactly the same way as from our own models. There are a few little details that we need to keep in mind. First, in the Huggingface world, the batch dimension is the first dimension. Second, the forward method of our model returns a dictionary-like output in which the logits are stored under the key logits. Also, the context size is part of the model config and is called max_position_embeddings. Thus our sampling method looks as follows, assuming that we have a method do_p_sampling that draws from a distribution p.
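
The helper do_p_sampling is assumed to exist – it draws a token index from the distribution p using top-p (nucleus) sampling. In case you do not have such a function at hand, here is a minimal sketch of one possible implementation, assuming that p is a one-dimensional tensor of probabilities.

def do_p_sampling(p, p_val):
    #
    # Sort probabilities in descending order and keep the head of the
    # distribution up to a cumulative probability of p_val
    #
    probs, idx = torch.sort(p, descending = True)
    keep = torch.cumsum(probs, dim = -1) <= p_val
    keep[0] = True   # always keep at least the most likely token
    probs = probs[keep]
    idx = idx[keep]
    #
    # Renormalize and draw one token from the truncated distribution
    #
    probs = probs / probs.sum()
    return idx[torch.multinomial(probs, num_samples = 1)].item()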

def predict(model, prompt, length, tokenizer, temperature = 0.7,  p_val = 0.95):
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        #
        # Turn prompt into sequence of token IDs
        #
        encoded_sample = tokenizer.encode(prompt)
        encoded_prompt = torch.tensor(encoded_sample, dtype = torch.long).unsqueeze(dim = 0)
        out = model(encoded_prompt.to(device)).logits # shape B x L x V
        while (len(encoded_sample) < length):
            #
            # Sample next token from last output. Note that we need to remove the
            # batch dimension to obtain shape (L, V) and take the last element only
            #
            p = torch.nn.functional.softmax(out[0, -1, :] / temperature, dim = -1)
            #
            # Sample new index and append to encoded sample
            #
            encoded_sample.append(do_p_sampling(p, p_val))
            #
            # Feed new sequence, truncated to the context size of the model
            #
            input = torch.tensor(encoded_sample[-model.config.max_position_embeddings:], dtype = torch.long)
            input = torch.unsqueeze(input, dim = 0)
            out = model(input.to(device)).logits
            print(tokenizer.decode(encoded_sample))

    return tokenizer.decode(encoded_sample)

This works, but it is awfully slow, even on a GPU. There is a reason for that, which is similar to what we have observed when sampling from an LSTM.

Suppose that our prompt has length L, and that we have already sampled the first token after the prompt, so that our full sample now has length L + 1. We now want to sample the next token. For that purpose, we only need the logits at position L + 1, so it might be tempting to feed only token L + 1 into our model. Let us quickly recall why this does not work.

To derive the output at position L + 1, each attention block will run the attention value at position L + 1 (which is a tensor of shape B x D, where D is the inner dimension of the model) through the feed-forward network. If q denotes the query at position L + 1, this attention value is given (up to the softmax normalization of the attention weights) by

\frac{1}{\sqrt{d}} \sum_{i=0}^{L} (q \cdot k_i) v_i

So we see that we need all keys and values to calculate the attention – which is not surprising, as this is how the model takes the context into account – but the query is only needed for position L + 1 (which is index L, as we use zero-based indexing). Put differently, the only reason why we need to feed the previously generated tokens again and again into the model is that we need the key and value pairs from the past.

This hints at an opportunity to speed up inference – we could try to cache these values. To support this, the Huggingface model returns the keys and values of all layers as an item past_key_values in the output and allows us to provide this as an additional argument to the next call. Thus instead of passing the full tensor of input IDs of length L + 1, we can just as well pass only the last value, i.e. the ID of the most recently generated token, along with the key and value pairs from the previous call. Here is the updated sampling function.

def predict(model, prompt, length, tokenizer, temperature = 0.7,  p_val = 0.95):
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        #
        # Turn prompt into sequence of token IDs
        #
        encoded_sample = tokenizer.encode(prompt)
        input_ids = torch.tensor(encoded_sample, dtype = torch.long).unsqueeze(dim = 0)
        #
        # First forward pass - use full prompt and keep the key-value cache
        #
        out = model(input_ids = input_ids.to(device))
        logits = out.logits[:, -1, :] # shape B x V
        past_key_values = out.past_key_values
        while (len(encoded_sample) < length):
            #
            # Sample next token from last output
            #
            p = torch.nn.functional.softmax(logits[0, :] / temperature, dim = -1)
            #
            # Sample new index and append to encoded sample
            #
            idx = do_p_sampling(p, p_val)
            encoded_sample.append(idx)
            #
            # Feed only the new token plus the cached keys and values
            #
            input_ids = torch.tensor([idx], dtype = torch.long).unsqueeze(dim = 0)
            out = model(input_ids = input_ids.to(device), past_key_values = past_key_values)
            logits = out.logits[:, -1, :]
            past_key_values = out.past_key_values
            print(tokenizer.decode(encoded_sample))

    return tokenizer.decode(encoded_sample)

This should be significantly faster and provide decent performance even on a CPU – when running this, you will also notice that the first pass that feeds the entire prompt takes some time, but all subsequent passes are much faster.
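
If you want to see the difference yourself, a crude timing comparison is enough. In the snippet below, predict_v1 and predict_v2 are hypothetical names for the two versions of the sampling function above (in the post, both carry the name predict, so rename them accordingly before running this).

import time

for name, fn in [("without KV cache", predict_v1), ("with KV cache", predict_v2)]:
    start = time.time()
    fn(model, "My name is", 100, tokenizer)
    print("%s: %.1f seconds" % (name, time.time() - start))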

That closes our post for today. We have bridged the gap between the models that we have implemented and trained so far and the – much more advanced – models on the Huggingface hub and have convinced ourselves that even though these models are of course much larger and more powerful, their architecture is the same as for our models, and we can use them in the same way. I encourage you to play with a few other models which can be sampled from in exactly the same way, like the Pythia models from EleutherAI or the gpt2-medium and gpt2-large versions of the GPT model that can all be found on the Huggingface hub – you can use the code snippets from this post or my notebook as a starting point.

In the next post, we will use the popular Streamlit platform to implement a simple ChatBot backed by a Huggingface model.
