Mastering large language models – Part XIII: putting it all together

Today, we will apply what we have learned in the previous posts on BPE, attention and transformers – we will implement and train a simple decoder-only transformer model from scratch.

First, let us discuss our model. The overall structure of the model is in fact quite similar to the model we have used when training an LSTM on “War and Peace” – we will have an embedding layer turning words into vectors, the actual network and an output layer of the dimension of the vocabulary from which we will sample our next word. However, there are a few notable differences.

The most obvious and most important one is that we use a transformer instead of an LSTM. More specifically, we will use a decoder-only model, i.e. there is no input from an encoder. Instead, the model applies self-attention to the input in each layer. Recall that technically, the transformer blocks in this model will be instances of torch.nn.TransformerEncoderLayer, even though we will use them as decoder building blocks. We will also have to use masking to prevent our model from peeking ahead in the self-attention layers.

The second difference is that instead of character-level encoding, we will apply a BPE tokenizer trained with a fixed number of merges. We will use sinusoidal positional embeddings as in the original transformer paper. With these modifications, our model looks as follows.

Using the sinusoidal embeddings discussed in one of our previous posts, we can now easily code our model in Python.

class Model(torch.nn.Module):
    
    def __init__(self, vocab_size, model_dim = MODEL_DIM, context_size = CONTEXT_SIZE, ff_dim = FF_DIM, heads = HEADS, layers = LAYERS, dropout = DROPOUT):
        super().__init__()
        self._word_embedding = torch.nn.Embedding(vocab_size, model_dim)
        self._pe_embedding = PosEmbeddingLayer(context_size, model_dim)
        layer = torch.nn.TransformerEncoderLayer(d_model = model_dim, nhead = heads, dim_feedforward = ff_dim, dropout = dropout)
        self._transformer = torch.nn.TransformerEncoder(layer, num_layers = layers)
        self._linear = torch.nn.Linear(in_features = model_dim, out_features = vocab_size)
        self._model_dim = model_dim
        self._context_size = context_size
        self._vocab_size = vocab_size
        cached_mask = torch.tril(torch.ones(context_size, context_size)*(-1)*float('inf'), diagonal = -1).t()
        self.register_buffer("_cached_mask", cached_mask)
    
    #
    # Create a causal self-attention mask
    #
    def get_self_attention_mask(self):
        return self._cached_mask
        
    #
    # Shape of input: (L, B)
    # 
    def forward(self, x):
        assert len(x.shape) == 2, "Expecting two-dimensional input"
        (L, B) = x.shape
        x = self._word_embedding(x) # shape (L, B, model_dim)
        x = self._pe_embedding(x) 
        #
        # Mask input. As we have registered the mask as a buffer, it
        # should already be on the same device as the model
        #
        mask = self.get_self_attention_mask()[:L, :L]
        x = self._transformer(x, mask = mask)
        return self._linear(x)        


Note the causal self-attention mask that we use. In order to avoid having to recalculate this mask for every input, we create the mask once and add it as a buffer to the model, which will ensure that the mask will also be transferred if we move our model to the GPU.
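
To see what this mask looks like, here is a small sketch that builds the same additive mask for a tiny context size of four – positions that a token is not allowed to attend to receive a value of minus infinity, which the softmax then turns into a weight of zero.

import torch

context_size = 4
cached_mask = torch.tril(torch.ones(context_size, context_size)*(-1)*float('inf'), diagonal = -1).t()
print(cached_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])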

We also follow the convention of using the second dimension as the batch dimension. So our initial input has shape (L, B), where each column of length L is a tensor whose elements are the indices of tokens in the vocabulary. We pass this through the embedding layers to obtain a tensor of shape (L, B, D), then through all transformer layers and finally through the output layer, which projects back into the vocabulary, giving a tensor of shape (L, B, V). As usual, we do not include the final softmax in the model, so we have to apply the softmax ourselves before we sample.
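
To make the shapes concrete, here is a minimal sketch of a single forward pass and of sampling a next token from the output distribution. The hyperparameter values are arbitrary examples, and we assume that the Model class above and the PosEmbeddingLayer from the previous post are available.

import torch

V = 1000   # hypothetical vocabulary size
model = Model(vocab_size = V, model_dim = 64, context_size = 128, ff_dim = 256, heads = 4, layers = 2, dropout = 0.1)
model.eval()

L, B = 16, 4
x = torch.randint(0, V, (L, B))            # token indices, shape (L, B)
with torch.no_grad():
    out = model(x)                         # logits, shape (L, B, V)
#
# Sample the next token for each sequence in the batch: apply the softmax
# to the logits of the last position and draw from the resulting distribution
#
p = torch.softmax(out[-1], dim = -1)                  # shape (B, V)
next_token = torch.multinomial(p, num_samples = 1)    # shape (B, 1)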

Let us now discuss our dataset and the preprocessing. As for our word2vec example, we will use the WikiText dataset. This dataset comes in two flavors: WikiText2 with a bit more than 2 million tokens and WikiText103 with a bit more than 100 million tokens. During preprocessing, we need to pre-tokenize the dataset, build a BPE vocabulary and a rule set from it, encode the data and finally split it into a training dataset and a validation dataset. Training then proceeds as usual using teacher forcing. As in the GPT-3 paper, we use a cosine learning rate schedule that decays the learning rate over time to 10% of its initial value.
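
As an illustration, here is a minimal sketch of how such a schedule could be set up with PyTorch's built-in CosineAnnealingLR – the learning rate, the number of steps and the optimizer are just placeholder values, and the actual training script might implement the schedule differently.

import torch

base_lr = 2.5e-4
total_steps = 10000
params = [torch.nn.Parameter(torch.randn(10))]   # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr = base_lr)
#
# Cosine decay from base_lr down to 10% of base_lr over total_steps steps;
# scheduler.step() is then called once after every optimizer step
#
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = total_steps, eta_min = 0.1 * base_lr)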

If you want to train the model from scratch but do not have access to a CUDA-enabled machine, you can play with the notebook that I have created for this post on Google CoLab. For this notebook, I use the smaller WikiText2 dataset and train for only three epochs. The runtime does of course depend on the GPU (and the CPU) that Google assigns to you. When I tried this, running the BPE merge and encoding the entire dataset took roughly 5 minutes in total, and the training took less than 30 minutes.

Here is a short sample created using the model trained on Google CoLab (which you can also find in the notebook itself).

Federer also received the first title of the Australian Open in the finals, including the Madonna Open, the Duke of Florence, and his first tournament. Federer won the final, losing to the finals, having scored a record nine times to five consecutive weeks before losing 19, in the Bir ‘Shoot , an injury

Of course this does not really make sense, but it is still a bit impressive for less than 30 minutes of training. Most parts of this sample are grammatically well-formed, and some snippets even sound reasonable.

To get better results, we can try to train the model on a larger dataset, for instance the full WikiText103 data set instead of the small WikiText2 dataset. This goes beyond what we can reasonably do with a free GPU on Google CoLab. If, however, you have access to a decent GPU, you can do the training yourself. To create the training dataset, run the following commands (after activating a matching Python environment as described in the README)

#
# Clone repository if not yet done
#
git clone https://github.com/christianb93/MLLM
cd MLLM/wiki
#
# Remove existing data and run preprocessing
#
rm -f vocab.dat rules.dat data.json
python3 preprocess.py

This takes a few minutes, as we will have to build a BPE vocabulary and encode the entire dataset (do not trust the initial time estimates displayed during encoding – as we implement a cache, the speed will go up with every word already encoded, and the entire process should take around 15 – 20 minutes, depending of course on your environment). The results of the preprocessing are five files – the vocabulary vocab.dat, the merge rules rules.dat, the full encoded dataset data.json as well as the files train.json and val.json for training and validation (if the vocabulary, rules and data files already exist, they will be reused, so make sure to remove existing copies if you do not want this).

Once the preprocessing is complete, we can start the training. I trained the model for two epochs on an A10 machine on Lambdalabs, using the command

python3 train.py --epochs=2 --compile --autocast

On this machine, one batch took 0.05 seconds, resulting in a runtime of 7.5 hours for two epochs. I have added the resulting model checkpoint model.pt to the repository (make sure to delete this file if you want to train from scratch, as otherwise training will restart from this file).

We can now sample from the model (either from the pretrained model or from a model from a custom training run) by simply running

python3 predict.py

By default, this will apply nucleus sampling with p=0.95 and a temperature of 0.7.
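
To illustrate what nucleus sampling does, here is a minimal sketch of top-p sampling with temperature for a single vector of logits – this is not necessarily the exact logic used in predict.py, but it captures the idea.

import torch

def sample_next_token(logits, p = 0.95, temperature = 0.7):
    #
    # Apply the temperature and turn the logits (shape (V,)) into probabilities
    #
    probs = torch.softmax(logits / temperature, dim = -1)
    #
    # Sort in descending order and keep the smallest set of tokens whose
    # cumulative probability exceeds p (the most likely token is always kept)
    #
    sorted_probs, sorted_idx = torch.sort(probs, descending = True)
    cumulative = torch.cumsum(sorted_probs, dim = -1)
    keep = cumulative - sorted_probs < p
    keep[0] = True
    #
    # Renormalize the remaining probabilities and sample
    #
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, num_samples = 1)
    return sorted_idx[choice]

Here is an output generated using the pretrained model that is part of the repository.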

His wedding did not hear the explanation of the title, as his term ” Alan Keegan ” was in a press release. The first episode was broadcast on July 13, 2007. The episode was written by Matt Groening, and was written by Stella Packard. The episode was broadcast on BBC One in the United States on July 11, 2009, and was watched by 2. 93 million viewers, and it was watched by 7. 50 million viewers in the United States. It was the first time that the episode was viewed by 4. 21 million viewers and was viewed by 4. 3 million households. The episode received a 3. 1 rating / 9 share among viewers between the ages of 18 and 49. The episode was watched by approximately 7. 8 million viewers. It received generally positive reviews from television critics. It received generally positive reviews from television critics, according to Nielsen Media Research…

The first sentence does not make too much sense, but the second and the next few sentences do at least make partial sense. We also see, however, that the model tends to repeat itself, which is probably due to the comparatively low temperature. Here is another sample generated with temperature 0.9.

The series was a single nine – part series against Ivor on its official website. Two seasons earlier, the original series was released on May 2, 2014, was released as a partnership with Clifton Notes in the United States. These were a number of reprint edition hardcover footage. Many reviews were positive and negative and positive. Loves made Yerne the first film in the series by Dirty, featuring Beyoncé ‘s voice acting. The lead character Ron Fisk of Hope called the film ” a rogue look inside “. Lyrically, the crew got off in a course at the end of the scene. Sacred Kimball, who played by the actors of the film, left England at the end of the episode, was shot in a short story of a movie about that film. The film was influenced by David Duchovny

Clearly, this is still not GPT – on the other hand, our model is tiny (it has only roughly 5 million parameters, compared to the 175 billion parameters used for GPT-3) and so is our data (a bit more than 100 million token, whereas GPT-3 was trained on more than 300 billion token). Theoretically, we could now proceed to train larger models on more data, but the cost associated with the required infrastructure will soon become prohibitive.

Fortunately, a few models and weights are available to the public. Probably the easiest way to download and run these models is the Huggingface Transformers library. In our next post, we will start to work with this library to download popular models like GPT-2 and play with them.

Mastering large language models – Part XII: byte-pair encoding

On our way to actually coding and training a transformer-based model, there is one last challenge that we have to master – encoding our input by splitting it in a meaningful way into subwords. Today, we will learn how to do this using an algorithm known as byte-pair encoding (BPE).

Let us first quickly recall the problem that subword encoding tries to solve. To a large extent, the recent success of large language models is due to the tremendous amount of data used to train them. GPT-3, for instance, has been trained on roughly 500 billion tokens (which, as we will see soon, are not exactly words) [3]. But even much smaller datasets tend to contain a large number of unique words. The WikiText103 dataset [4], for instance, contains a bit more than 100 million words, which represent a vocabulary of 267,735 unique words. With standard word-level encoding, this would be the dimension of the input and output layers of our neural network. It is not hard to imagine that with hundreds of billions of tokens as input, this number would be much larger still – we clearly have a problem.

The traditional way to control the growth of the vocabulary is to introduce a minimum frequency. In other words, we only include tokens in the vocabulary that occur more than a given number of times – the minimum frequency – in the training data, and tune this number to cap the vocabulary size. This, however, has the disadvantage that during training and later inference, we will face unknown words. Of course we can reserve a special token for these unknown words, but the model still has to handle them somehow.

Character-level encoding solves the problem of unknown words, but introduces a new problem – the model first has to learn words before it can start to learn relations between words. In addition, if our model is capable of learning relations within a certain range L, it does of course make a difference whether the unit in which we measure this range is a character or a word.

This is the point where subword tokenization comes into play. Here, the general idea is to build up a vocabulary that consists of subword units, not full words – but also more than just individual characters. Of course, to make this efficient, we want to include those subwords in our vocabulary that occur often, and we want to include all individual characters. If we need to encode a word that consists of known subwords, we can use those, and in the worst case, we can still go down to the character level during encoding so that we will not face unknown words.

Byte-pair encoding, which was first applied to neural machine translation in [1], is one way to do this. Here, we build our vocabulary in two phases. In the first phase, we go through our corpus and include all characters that we find in our vocabulary. We then encode our data using this preliminary vocabulary.

In the second phase, we iterate through a process known as a merge. During each merge, we go through our data and identify the pair of tokens in the vocabulary that occurs most frequently. We then include a new token in our vocabulary that is simply the concatenation of these two existing tokens, and update our existing encoding. Thus every merge adds one more token to the vocabulary. We continue this process until the vocabulary has reached the desired size.

Note that in a real implementation, the process of counting token pairs to determine their frequency is usually done in two steps. First, we count how often each individual word occurs in the text and save this for later reference. Next, given a pair of tokens, we go through all words that contain this pair and add up the frequencies of these words to determine the frequency of the token pair.

Let us look at an example to see how this works in practice. Suppose that the text we want to encode is

low, lower, newest, widest

In the first phase, we would build an initial vocabulary that consists of all characters that occur in this text. However, we want to make sure that we respect word boundaries, so we need to be able to distinguish between characters that appear within a word and those that appear at the end of the word. We do this by adding a special token </w> to a token that is located at the end of a word. In our case, this applies for instance to the character w that therefore gives rise to two entries in our vocabulary – w and w</w>. With this modification, our initial vocabulary is

'o', 'r</w>', 's', 'w</w>', 'n', 'i', 'e', 'l', 'd', 'w', 't</w>'

We now go through all possible combinations of two tokens and see how often they appear in the text. Here is what we would find in our example.

Token pair      Frequency
l + o           2
o + w</w>       1
o + w           1
w + e           2
e + r</w>       1
n + e           1
e + w           1
e + s           2
s + t</w>       2
w + i           1
i + d           1
d + e           1

We now pick the pair of tokens (usually called a byte pair, even though this is strictly speaking not correct when we work with Unicode code points) that occurs most frequently. In our case, several byte pairs occur twice, and we pick one of them, say w and e. We now add an additional token “we” to our vocabulary and re-encode our text. With this vocabulary, our text, which was previously represented by the sequence of tokens

l, o, w</w>, l, o, w, e, r</w>, n, e, w, e, s, t</w>, w, i, d, e, s, t</w>

would now be encoded as

l, o, w</w>, l, o, we, r</w>, n, e, we, s, t</w>, w, i, d, e, s, t</w>

Note the new token we, which appears at the two positions where we previously had the combination of the tokens w and e. This concludes our first merge.

We can now continue and conduct the next merge. Each merge will add one token to our vocabulary, so that controlling the number of merges allows us to create a vocabulary of the desired size. Note that each merge results in two outputs – an updated vocabulary and a rule. In our case, the first merge resulted in the rule w, e ==> we.

To encode a piece of text that was not part of our training data when running the merges, we now simply replay these rules, i.e. we start with our text, break it down into characters corresponding to the tokens in the initial vocabulary, and apply the rules that we have derived during the merges, in the order in which the merges took place. Thus to encode text, we need access to both the vocabulary and the rules.
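
To illustrate how this replay works, here is a minimal sketch of an encoding function for a single word – the function name is made up for this example, and the actual implementation in the repository (and in the reference implementation) is more efficient.

def encode_word(word, rules):
    #
    # Break the word down into characters, marking the last one
    # with the end-of-word token
    #
    tokens = list(word[:-1]) + [word[-1] + "</w>"]
    #
    # Replay the merge rules in the order in which they were learned
    #
    for first, second in rules:
        i = 0
        merged = []
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == first and tokens[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# With the single rule from our example, "lowest" would be encoded as
# ['l', 'o', 'we', 's', 't</w>']
print(encode_word("lowest", [("w", "e")]))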

Let us now turn to implementing this in Python. The original paper [1] does already contain a few code snippets that make an implementation based on them rather easy. As usual, this blog post comes with a notebook that, essentially based on the code in [1], guides you through a simple implementation using the example discussed above. Here are a few code snippets to illustrate the process.

The first step is to build a dictionary that contains the frequencies of the individual words, which we will later use to easily calculate the frequency of a byte pair. Note that the keys in this dictionary are the words in the input text, but already broken down into a sequence of tokens separated by spaces, so that we need to update them as we merge and add new tokens.

import collections

def get_word_frequencies(pre_tokenized_text):
    counter = collections.Counter(pre_tokenized_text)
    word_frequencies = {" ".join(word) + "</w>" : frequency for word, frequency in counter.items() if len(word) > 0}
    return word_frequencies
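
Applied to our example text (assuming that the pre-tokenization step has already split the text into the four words and stripped the punctuation), this yields the following dictionary.

pre_tokenized_text = ["low", "lower", "newest", "widest"]
word_frequencies = get_word_frequencies(pre_tokenized_text)
print(word_frequencies)
# {'l o w</w>': 1, 'l o w e r</w>': 1, 'n e w e s t</w>': 1, 'w i d e s t</w>': 1}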

As the keys in the dictionary are already pretokenized, we can now build our initial vocabulary based on these keys.

def build_vocabulary(word_frequencies):
    vocab = set()
    for word in word_frequencies.keys():
        for c in word.split():
            vocab.add(c)
    return vocab

Next, we need to be able to identify the pair of bytes that occurs most frequently. Again, Python collections can be utilized to do this – we simply go through all words, split them into symbols, iterate through all pairs and increase the count for this pair that we store in a dictionary.

def get_stats(word_frequencies):
  pairs = collections.defaultdict(int)
  for word, freq in word_frequencies.items():
    symbols = word.split()
    for i in range(len(symbols)-1):
      pairs[symbols[i],symbols[i+1]] += freq
  return pairs

Finally, we need a function that executes an actual merge. During each merge, we use regular expressions (more on the exact expression that we use here can be found in my notebook) to replace each occurrence of the pair by the new token.

import re

def do_merge(best_pair, word_frequencies, vocab):
    new_frequencies = dict()
    new_token = "".join(best_pair)
    pattern = r"(?<!\S)" + re.escape(" ".join(best_pair)) + r"(?!\S)"
    vocab.add(new_token)
    for word, freq in word_frequencies.items():
        new_word = re.sub(pattern, new_token, word)
        new_frequencies[new_word] = freq
    return new_frequencies, vocab

We can now combine the functions above to run a merge. Essentially, during a merge, we need to collect the statistics to identify the most frequent pair, call our merge function to update the dictionary containing the word frequencies and append the rule that we have found to a list of rules that we will save later.

stats = get_stats(word_frequencies)
best_pair = max(stats, key=lambda x: (stats[x], x)) 
print(f"Best pair: {best_pair}")
word_frequencies, vocab = do_merge(best_pair, word_frequencies, vocab)
rules.append(best_pair)
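
Putting these pieces together, a complete (but naive) training loop that runs a fixed number of merges could look like the following sketch – the function learn_bpe is made up for this example and is not part of the repository.

def learn_bpe(pre_tokenized_text, num_merges):
    word_frequencies = get_word_frequencies(pre_tokenized_text)
    vocab = build_vocabulary(word_frequencies)
    rules = []
    for _ in range(num_merges):
        stats = get_stats(word_frequencies)
        if len(stats) == 0:
            break
        best_pair = max(stats, key=lambda x: (stats[x], x))
        word_frequencies, vocab = do_merge(best_pair, word_frequencies, vocab)
        rules.append(best_pair)
    return vocab, rules

vocab, rules = learn_bpe(["low", "lower", "newest", "widest"], num_merges = 5)
print(rules)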

This code, however, is awfully slow, as it basically repeats the process of counting once with every merge. The authors of the original paper also provide a reference implementation [2] that applies some tricks like caching and the use of indices to speed up the process significantly. In my repository for this series, I have assembled an implementation that follows this reference implementation to a large extent, but simplifies a few steps (in particular the incremental updates of the frequency counts) and is therefore hopefully a bit easier to read. The code consists of the main file BPE.py and a test script test_bpe.py.

The test script can also be used to compare the output of our implementation with the reference implementation [2]. Let us quickly do this using a short snippet from “War and Peace” that I have added to my repository.

#
# Clone the reference implementation
#
git clone https://github.com/rsennrich/subword-nmt.git
cd subword-nmt
#
# Get files from my repository
#
wget https://raw.githubusercontent.com/christianb93/MLLM/main/bpe/BPE.py
wget https://raw.githubusercontent.com/christianb93/MLLM/main/bpe/test_bpe.py
wget https://raw.githubusercontent.com/christianb93/MLLM/main/bpe/test.in
wget https://raw.githubusercontent.com/christianb93/MLLM/main/bpe/test.rules
#
# Run unit tests
# 
python3 test_bpe.py
#
# Run reference implementation
#
cat test.in | python3 subword_nmt/learn_bpe.py -s=50 > rules.ref
#
# Run our implementation
#
python3 test_bpe.py --infile=test.in --outfile=rules.dat
#
# Diff outputs
#
diff rules.dat rules.ref
diff rules.ref test.rules

The only difference that the diff should show you is a header line (containing the version number) that the reference implementation adds and which we do not include in our output; the remaining parts of the output files, which contain the actual rules, should be identical. The second diff verifies the reference file that is part of the repository and is used for the unit tests.

Over time, several other subword tokenization methods have been developed. Two popular methods are Google's WordPiece model ([6], [7]) and the unigram tokenizer [8]. While WordPiece is very similar to BPE and mainly differs in how the next pair to merge is selected, the unigram tokenizer starts with a very large vocabulary, containing all characters and the most commonly found substrings, and then iteratively removes items from the vocabulary based on a statistical model.

With BPE, we now have a tokenization method at our disposal which will allow us to decouple vocabulary size from the size of the training data, so that we can train a model even on large datasets without having to increase the model size. In the next post, we will put everything together and train a transformer-based model on the WikiText dataset.

References:

[1] R. Sennrich et al., Neural Machine Translation of Rare Words with Subword Units
[2] https://github.com/rsennrich/subword-nmt
[3] T. Brown et al., Language Models are Few-Shot Learners
[4] https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
[5] K. Bostrom, G. Durrett, Byte Pair Encoding is Suboptimal for Language Model Pretraining
[6] Y. Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
[7] M. Schuster, K. Nakajima, Japanese and Korean voice search
[8] T. Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Autonomous agents and LLMs: AutoGPT, Langchain and all that

Over the last couple of months, a usage pattern for large language models that leverages the model for decision making has become popular in the LLM community. Today, we will take a closer look at how this approach is implemented in two frameworks – Langchain and AutoGPT.

One of the most common traditional use cases in the field of artificial intelligence is to implement an agent that operates autonomously in an environment to achieve a goal. Suppose, for instance, that you want to implement an autonomous agent for banking that is integrated into an online banking platform. This agent would communicate with an environment, consisting of the end user, but maybe also backend systems of the bank that would allow it to gather information, for instance on account balances, and to trigger transactions. The goal of the agent would be to complete a customer request.

In a more abstract setting, an agent operates in discrete time steps. At each time step, it receives an observation – which, in the example above, could be a user input or the result of a call into the backend system – and can decide to take one out of a list of available actions, like making a new request to a backend system or completing the conversation with the end user – a formalism that you will have seen before if you have ever studied reinforcement learning. So the task of the agent is to select the next action, depending on the current and potentially on previous observations.

Obviously, there are many different possibilities to implement an agent – and, maybe not surprisingly, one approach that has been applied successfully in the recent past is to use a large language model as the brain of our agent. The idea behind this approach is to feed the observations, encoded as plain text, into the model at each step and to ask the model for the next action to take.

In practice, this can be done using few-shot learning, i.e. we generate a prompt for our model that consists of a few examples that inform the model about the format that we expect and the possible actions, and then add the latest actual observation as the final part of the prompt. To allow our model to take all previous observations into account, we always include the full trajectory of observations and actions in the prompt.

In [1], an approach termed ReACT (Reasoning and Acting) has been proposed to orchestrate a large language model for decision making. In this framework, the environment is modelled as a set of tools that the agent can utilize. In the original paper, there are actually three tools, corresponding to three different actions that the agent can take. First, it can retrieve a summary from Wikipedia. Second, it can look up details on the Wikipedia page to which it has navigated with the first action, and finally it can complete the task and return the final result.

In each step, the agent now asks the model what action it should take next. It then runs one of the available tools to execute the action and adds the result as new observation to the next prompt that it sends to the model. To improve the quality of the results, a prompt engineering technique known as chain-of-thought prompting (CoT) is used, which basically means that the prompt contains an instruction for the model to spell out its thoughts, i.e. the rationale for the proposed action. Thus each prompt consists of the original question to be answered and the full history of thoughts, actions and observations.
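
Schematically, the loop that the agent runs looks like the following sketch – this is not the actual Langchain code, and call_llm, parse_response and the tools dictionary are hypothetical placeholders.

def run_react_agent(question, examples, tools, call_llm, parse_response, max_steps = 10):
    history = ""
    for _ in range(max_steps):
        #
        # Ask the model for the next thought and action, given the few-shot
        # examples, the question and the full history of the conversation so far
        #
        response = call_llm(examples + "\nQuestion: " + question + history)
        thought, action, argument = parse_response(response)
        history += f"\nThought: {thought}\nAction: {action}[{argument}]"
        #
        # If the model has decided to finish, return the final answer
        #
        if action == "Finish":
            return argument
        #
        # Otherwise execute the corresponding tool and append its output
        # as a new observation to the history
        #
        observation = tools[action](argument)
        history += f"\nObservation: {observation}"
    return None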

Here is an example generated by the Langchain implementation of an agent following the ReACT pattern. For a better understanding, I have marked the output of the model in italic.

Question: Out of the following three countries, only two are direct neighbors of France: Germany, Spain, Poland. Please name the country in the list which is not a neighbor of France
Thought: I need to search France, find its neighbors, then find the country in the list which is not a neighbor of France
Observation: France, officially the French Republic, is a country located primarily … (long summary of the Wikipedia article on France)
Thought: France has neighbors such as Germany, Spain, Belgium, Luxemburg, Switzerland, Italy, Monaco, Andorra and the United Kingdom. So the country in the list which is not a neighbor of France is Poland.
Action: Finish[Poland]

To encourage the model to follow this pattern and to use only the allowed actions, the Langchain implementation uses few-shot learning, i.e. the prompt contains a few examples that follow this scheme (you can find these examples in the Langchain repository – they are actually the examples cited in the original paper). So the full prompt consists of the few-shot examples, followed by the question at hand and the full sequence of observations, thoughts and actions generated so far.

Let us take a short look at the implementation in Langchain. These notes have been compiled using version 0.202. As the toolset seems to be evolving quickly, some of this might be outdated when you read this post, but I am confident that the basic ideas will remain.

First, Langchain distinguishes between an agent which is the entity talking to the language model, and the agent executor which talks to the agent and runs the tools.

The AgentExecutor class implements the main loop indicated in the diagram above. In each iteration, the method _take_next_step first invokes the agent, which will in turn assemble a prompt as explained above and generate a response from the language model. The output is then parsed and returned as an AgentAction which also contains the thought that the model has emitted. Then, the tool which corresponds to the action is executed and its output is returned along with the selected action.

Back in the main loop, we first test whether the action is an instance of the special type AgentFinish, which indicates that the task is completed and we should exit the main loop. If not, we append the new tuple consisting of thought, action and observation to the history and run the next iteration of the loop. The code is actually quite readable, and I highly recommend cloning the repository and spending a few minutes reading through it.

Let us now compare this to what AutoGPT is doing. In its essence, AutoGPT uses the same pattern, but there are a few notable differences (again, AutoGPT seems to be under heavy development, at the time of writing, version 0.4.0 is the stable version and this is the version that I have used).

Similar to the ReACT agent in Langchain, AutoGPT implements an agent that interacts with an LLM and a set of tools – called commands in AutoGPT – inside a main loop. In each iteration, the agent communicates with the AI by invoking the function chat_with_ai. This function assembles a prompt and generates a reply, using the OpenAI API in the background to talk to the selected model. Then, the next action is determined based on the output of the model and executed (in the default settings, the user is asked to confirm this action before running it). Finally the results are added to the history and the next iteration starts.

The first difference compared to the ReACT agent in Langchain is how the prompt is structured. The prompt used by AutoGPT consists of three sections. First, there is a system prompt which defines a persona, i.e. it instructs the model to embody an agent aiming to achieve a certain set of goals (which are actually generated via a separate invocation of the model initially and saved in a file for later runs) and contains further instructions for the model. In particular, the system prompt instructs the model to produce output in JSON format as indicated below.

{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted- list that conveys- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
} 

This is not only much more structured than the simple scheme used in the ReACT agent, but also asks the model to express its thoughts in a specific way. The model is asked to provide additional reasoning, to maintain a long-term plan as opposed to simply deciding on the next action to take and to perform self-criticism. Note that in contrast to ReACT, this is pure zero-shot learning, i.e. no examples are provided.

After the system prompt, the AutoGPT prompt contains the conversation history, and finally there is a trigger prompt which is simply hardcoded to be “Determine exactly one command to use, and respond using the JSON schema specified previously:”.

Apart from the much more complex prompt using personas and a more sophisticated chain-of-thought prompting, a second difference is the implementation of an additional memory. The idea behind this is that for longer conversations, we will soon reach a point where the full history no longer fits into the context (or it does, but we do not want to include it all, as the cost charged for the OpenAI API depends on the number of tokens). To solve this, AutoGPT adds a memory implemented by some external data store (at the time of writing, the memory system of AutoGPT is undergoing a major rewrite, so what follows is based on an analysis of earlier versions and a few assumptions).

The memory is used in two different places. First, when an iteration has been completed, an embedding vector is calculated for the results of the step, i.e. the reply of the model and the result of the executed command. The result is then stored in an external datastore, which can be as simple as a file on the local file system, but can also be a cache like Redis, and indexed using the embedding.

When an iteration is started, the agent uses an embedding calculated from the last few messages in the conversation to look up the most relevant data in memory, i.e. the results of previous iterations that have the most similar embeddings. These results are then added to the prompt, which gives the model the ability to remember previous outcomes and reuse them in a way not limited by the length of the prompt. This approach of enriching the prompt with relevant information obtained via a semantic search is sometimes called retrieval augmentation – note however that in this case, the information in the data store has been generated by model and agent, while retrieval augmentation usually leverages pre-collected data.

Finally, the AutoGPT agent can choose from a much larger variety of tools (i.e. commands in the AutoGPT terminology) than the simple ReACT agent, which only had access to Wikipedia. On the master branch, there are currently commands to run arbitrary Python code, to read and write files, to clone GitHub repositories, to invoke other AIs like DALL-E to generate images, to search Google or even to browse the web using Selenium. AutoGPT can also use the ElevenLabs API to convert model output to speech.

AutoGPT and ReACT are not the only shows in town – there is for instance BabyAGI that lets several agents work on a list of tasks which are constantly processed, re-prioritized and amended. Similarly, HuggingGPT [7] initially creates a task list that is then processed, instead of deciding on the next action step by step (AutoGPT is an interesting mix between these approaches, as it has a concrete action for the next step, but maintains a global plan as part of its thoughts). It is more than likely that we will see a rapid development of similar approaches over the next couple of months. However, when you spend some time with these tools, you might have realized that this approach comes with a few challenges (see also the HuggingGPT paper [7] for a discussion).

First, there is the matter of cost. Currently, OpenAI charges 6 cents per 1K tokens for its largest GPT-4 model (and this is only for the input tokens). If you consume the full 32K context size, this already amounts to roughly $2 per turn, so if you let AutoGPT run fully autonomously, it could easily spend an amount that hurts in a comparatively short time.

Second, if the model starts to hallucinate or to repeat itself, we can easily follow a trajectory of actions that takes us nowhere or end up in a loop. This has already been observed in the original ReACT paper, and I had a similar experience with AutoGPT – asking for a list of places in France to visit, the agent started to gather more and more detailed information and completely missed the point where I would have been happy to get an answer compiled out of the information that it already had.

Last but not least, the use of autonomous agents that execute Python code, access files on your local system, browse the web using your browser and your IP address or even send mails and trigger transactions on your behalf can soon turn into a nightmare in terms of security, especially given that the output of commands ends up in the next prompt (yes, there is something like prompt injection, see [8] and [9]). Clearly, a set of precautions is needed to make sure that the model does not perform harmful or dangerous actions ([6] provides an example where AutoGPT started to install updated SSL certificates to solve a task that straightforward web scraping could have solved as well).

Still, despite all of this, autonomous agents built on top of large language models are clearly a very promising approach, and I expect to see more powerful, but also more stable and reliable solutions being pushed into the community very soon. Until then, you might want to read some of the references that I have assembled, which discuss advanced prompting techniques and autonomous agents in more detail.

References:

[1] S. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629
[2] Prompt Engineering guide at https://www.promptingguide.ai/
[3] N. Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366
[4] L. Wang et al., Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, arXiv:2305.04091
[5] Ask Marvin – integrate AI into your codebase
[6] https://www.linkedin.com/pulse/auto-gpt-promise-versus-reality-rick-molony – an interesting experience with AutoGPT
[7] Y. Shen, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, arXiv:2303.17580 – also known as JARVIS
[8] C. Anderson, https://www.linkedin.com/pulse/newly-discovered-prompt-injection-tactic-threatens-large-anderson/
[9] OWASP Top 10 for LLM Security

Mastering large language models – Part XI: encoding positions

In our last post, we have seen that the attention mechanism is invariant to position, i.e. that reordering of the words in a sentence yields the same output, which implies that position information is lost. Clearly position information does matter in NLP, and therefore transformer based models apply positional embeddings in addition to the pure word embeddings.

Before explaining positional embeddings, let us first recall word embeddings in general. Suppose we are working with sentences consisting of L tokens taken out of a fixed vocabulary. Then creating vectors out of these tokens amounts to mapping each token

token_1, token_2, token_3, \dots, token_L

to a vector v_i. This encoding, however, is usually position independent, meaning that if the same token appears at two different positions in the sentence, the resulting vector will be the same for both instances of the token. Technically, the embedding can be described by a matrix of dimension (D, V), where D is the model dimension and V is the size of the vocabulary, and each column in this matrix represents the embedding for one token.

In the original transformer paper [1], these word embeddings are augmented by an additional positional embedding. Similar to word embeddings, this again attaches a vector of dimension D to every token in the input. However, this embedding now solely depends on the position of the token, and not on the actual token itself, so that different tokens at the same position yield the same embedding. This additional embedding is then added to the word embedding to obtain one vector per token, which is then fed into the actual model.

But what is the vector that we want to add to the word embedding at position i? Obviously, there are many choices (reference [3] is a good survey of existing approaches). The approach taken in the original transformer is to use fixed values which are not learned and not modified during training and are given by the equations

\textrm{PE}(p, i) = \begin{cases}     \sin(C^{i / D} p) & i \, \textrm{even} \\     \cos(C^{(i-1) / D} p) & i \, \textrm{odd} \end{cases}

where C is a fixed constant (actually 1 / 10000), p is the position of the token in the sentence and i is the vector index, so that PE_p, or PE(p, :) in Python notation, is the vector added to the word embedding of a token at position p. Here we assume that the model dimension D is an even number.

The motivation provided for this specific choice in [1] is that for a fixed offset k, the positional encoding vectors PE_{p+k} and PE_p are related by a linear transformation, so that, as the authors phrase it in section 3.5 of [1], “it would allow the model to easily learn to attend by relative positions”. Let us quickly verify this. For any i, component 2i of the vector PE_{p+k} is given by

\begin{aligned}PE_{p+k, 2i} &= \sin(C^{2i / D} (p+k)) \\ &= \sin(C^{2i/D}p) \cos(C^{2i/D}k) + \cos(C^{2i/D}p) \sin(C^{2i/D}k) \\ &= \cos(C^{2i/D}k) PE_{p,2i} + \sin(C^{2i/D}k) PE_{p,2i+1} \end{aligned}

and similarly, the component 2i+1 is given by

\begin{aligned} PE_{p+k,2i+1} &= \cos(C^{2i / D} (p+k)) \\ &= \cos(C^{2i / D}p)\cos(C^{2i / D}k) -  \sin(C^{2i / D}p)\sin(C^{2i / D}k) \\ &= \cos(C^{2i / D}k)PE_{p,2i+1} - \sin(C^{2i / D}k)PE_{p,2i} \end{aligned}

proving our claim. Note that these two relations can nicely be organized into one equation as

\begin{pmatrix} PE_{p+k, 2i} \\ PE_{p+k, 2i+1}  \end{pmatrix} = \begin{pmatrix} \cos(C^{2i / D}k) & \sin(C^{2i / D}k) \\ - \sin(C^{2i / D}k) & \cos(C^{2i / D}k) \end{pmatrix} \begin{pmatrix} PE_{p, 2i} \\ PE_{p, 2i+1}  \end{pmatrix}

showing that the two vectors are related by a rotation (we will get back to this point later when we talk about relative embeddings).

Let us quickly see how we could code this in Python. As we need one embedding vector of dimension D for each position, the embeddings form a matrix of shape (L, D), where each row represents one position. After applying the word embeddings to an input sequence of length L, this will as well result in a matrix of shape (L, D), and we can simply add these two matrices together. Of course, we would not calculate the sine and cosine values above again for every input, but store them in a global variable, an attribute or (better) in a PyTorch buffer. Here is a piece of code to calculate the embeddings, taken from the corresponding section in d2l.ai (note that this only works if D is even, which, however, is usually the case – in fact, dimensions are often multiples of 8, 16 or 32 to better support the inner workings of modern GPUs).

#
# Values for p and 2*i / D
#
_p = torch.arange(L , dtype=torch.float32).unsqueeze(dim = 1)
_i = torch.arange(0, D , step = 2) / D
#
# x = C**(2i/D)*p
#
x = torch.pow(C, _i) * _p
#
# Apply sine and cosine and organize into a matrix pe
#
pe = torch.zeros(L, D)
pe[:, 0::2] = torch.sin(x)
pe[:, 1::2] = torch.cos(x)
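
To actually use these values, we simply add them to the word embeddings – here is a short sketch. The tensors x and x_batched are just placeholders for the output of the word embedding layer.

#
# Word embeddings x of shape (L, D): simply add the positional embeddings
#
x = x + pe
#
# For a batched input x_batched of shape (L, B, D), insert a batch dimension
# into pe first so that it broadcasts across the batch
#
x_batched = x_batched + pe.unsqueeze(dim = 1)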

For more on this and a few visualisations, you might want to take a look at the notebook for this post.

Note that at this point, we need to make an assumption on the maximum length of the input sequence that we will be using, called the context size. The context size is also an important parameter for training and inference, as the attention matrices have shape (L, L), so that a large context size increases memory usage and computational complexity.

As already mentioned, this way to construct embeddings is not the only one. A second method that has gained some popularity and is for instance applied in GPT-models ([4]) is to use learned embeddings. With this approach, the positional embeddings are treated similarly to the word embeddings, i.e. they are simply parameters that are initialized with some random values and then updated during backward passes as any other parameter of the network.

Both the sinusoidal embeddings used in the original transformer paper as well as the learned embeddings used by GPT are absolute embeddings, i.e. the value of an embedding depends on the absolute position of a token in a sentence. In [2], a different approach is taken which is known as relative embeddings and relies on a modification of the self-attention mechanism to incorporate positional information rather than on absolute positional embeddings added to the word embeddings before processing them in the attention layer.

More specifically, when calculating attention scores and before applying the softmax, the usual dot product between query at position i and key at position j is modified by adding a bias vector w to the key that depends on the relative position, i.e. the difference j – i.

q \cdot \left( k + w_{j-i} \right)^t

These bias vectors are then learned, similar to the word embeddings. This modification is applied not only in the first layer, but in all layers. In addition, to avoid having to learn 2*L embeddings for each layer and each attention head, the authors clip the relative position, i.e. instead of using the index j – i, which can range from -L to L, they use the modified index

\max(-k, \min(k, j-i))

so that the relative position is clipped to values between -k and k, where k is a hyperparameter. For an efficient implementation, see [5].
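
Here is a small sketch of how the matrix of clipped relative positions for a sequence of length L could be computed – the clipped values, shifted by k, can then be used to index into a table of 2k+1 learned embeddings.

import torch

L, k = 6, 2
positions = torch.arange(L)
#
# Entry (i, j) of this matrix is the relative position j - i
#
relative_positions = positions.unsqueeze(dim = 0) - positions.unsqueeze(dim = 1)
#
# Clip to the range [-k, k]
#
clipped = torch.clamp(relative_positions, -k, k)
print(clipped)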

Many variations of this approach are possible. An interesting example is the T5 model described in [6] which we will look at in more detail in a later post. In this model, the position dependent vector added to the key before taking the dot product with the query is replaced by a scalar added after taking the dot product, so that the dot product is replaced by

q \cdot k^t + a_{j-i}

where a_{j-i} is now a scalar. These scalars are again learned parameters, shared between different layers but unique for each head. Instead of clipping, buckets are formed, each corresponding to a range of different values of j – i, so that the number of parameters that need to be learned is simply the number of buckets times the number of heads. It is instructive to take a look at the T5 implementation of the Huggingface Transformers library to see how this works in practice – the notebook for this post contains a simplified and commented version of this code.

Before moving on, let us quickly discuss another type of relative positional embeddings known as rotary embeddings, originally proposed in [7]. In contrast to the approaches that we have seen so far, this method does not modify the self-attention score by adding a vector or scalar, but by rotating queries and keys before forming the dot product. Specifically, the authors consider a set of orthogonal matrices

R_{\Theta, i}

depending on a fixed parameter Θ and the position i and replace the usual attention dot product between query and key by

q_i R_{\Theta, i} (k_j R_{\Theta, j})^t = q_i R_{\Theta, i} R_{\Theta, j}^t k_j^t

At the first glance, this does not seem to be a relative embedding, as it seems to depend on both positions – the position i of the query and the position j of the key. However, the authors choose a set of matrices that mathematically speaking form something resembling a one-parameter subgroup, i.e. having the property that

R_{\Theta, i} R_{\Theta, j}^t = R_{\Theta, i - j}

so that the attention score is actually

q_i R_{\Theta, i - j}  k_j^t

which does in fact only depend on the difference i – j, i.e. on the relative position of query and key. Rotary embeddings have also been applied in the GPT-NeoX model by EleutherAI.
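
To make this a bit more tangible, here is a minimal sketch of how queries or keys could be rotated – each pair of dimensions (2j, 2j+1) of a vector at position p is rotated by an angle given by p times a frequency that depends on j. This is only meant to illustrate the idea, not to reproduce any particular implementation.

import torch

def rotate(x, base = 10000):
    # x has shape (L, D) with D even - one row per position
    L, D = x.shape
    positions = torch.arange(L, dtype = torch.float32).unsqueeze(dim = 1)      # (L, 1)
    theta = base ** (-torch.arange(0, D, 2, dtype = torch.float32) / D)        # (D / 2,)
    angles = positions * theta                                                 # (L, D / 2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.zeros_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

#
# Rotate queries and keys before taking the dot product - the resulting
# attention scores then only depend on the relative positions
#
q = rotate(torch.randn(8, 16))
k = rotate(torch.randn(8, 16))
scores = q @ k.t()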

We now have a broad selection of different embedding methods at our disposal and could go ahead and apply them to train a transformer-based decoder-only model. However, there is still one challenge that we have not touched upon – the word encodings. So far, our word encodings were straightforward – we either chose to turn each individual character into a token or used word-based encodings where our vocabulary consists of all words found in the training data. Unfortunately, both approaches have certain drawbacks. Character-based encodings imply that the model also has to learn how to assemble words from characters, and they make learning long-range relations more difficult. Word-based encodings suffer from the problem that the number of unique words in the training data can become huge, which will blow up our model size.

For that reason, many models use a third approach known as subword tokenization, where tokens are built from parts of words instead of individual characters or entire words. In the next post, we will therefore study the tokenization mechanism used by the original transformer – byte-pair encoding (BPE).

References:

[1] A. Vaswani et al., Attention is all you need
[2] P. Shaw et al., Self-Attention with Relative Position Representations
[3] P. Dufter, M. Schmitt, H. Schütze; Position Information in Transformers: An Overview. Computational Linguistics 2022; 48 (3): 733–763
[4] A. Radford et al., Improving Language Understanding by Generative Pre-Training
[5] C. Huang et al, Music Transformer: generating music with long-term structure
[6] C. Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[7] J. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding

Mastering large language models – Part X: Transformer blocks

Today, we will start to look in greater detail at the transformer architecture. At the end of the day, a transformer is built out of individual layers, the so-called transformer blocks, stacked on top of each other, and our objective for today is to understand how these blocks are implemented.

Learning about transformer architectures can be confusing. If you read the original paper Attention is all you need by Vaswani et al. (and I highly recommend actually taking a look at the paper), you will, already on page 3, stumble upon a rather complicated diagram which is actually a bit intimidating if all of this is new to you. This paper talks about encoders and decoders, which is a language that we understand thanks to a previous post. On the other hand, you will hear statements that popular models like GPT-2 are in fact decoder-only models, without having a clear definition at hand of what this actually means. To make things worse, decoder-only models are built out of PyTorch modules which, confusingly enough, are called torch.nn.TransformerEncoder. So before getting into too many details, let us first try to clean up that mess a bit.

The most general transformer models like the one described in the original paper or popular models like T5 are in fact encoder-decoder models. The general architecture is not so different from the encoder-decoder models that we have already seen in the context of RNNs. There is a first network, the encoder, which translates a source sentence into a sequence of embeddings. Then there is a second network, the decoder, that uses the output of the encoder as a context – typically called the memory in transformer-based architectures – to generate the target sequence. A crucial difference, though, is that the decoder reads the output of the encoder via an attention layer, which gives the decoder network access to all time steps of the encoder at the same time. So, very high-level, the picture is as follows.

Let us now dig a bit deeper. In fact, both the encoder and the decoder are built out of layers called transformer blocks. For the encoder, every transformer block has one input and one output. For the decoder, every transformer block has two inputs and one output. One input reads, via an attention mechanism, data from the encoder (more precisely, keys and values are taken from the encoder, while queries are taken from the decoder). The second input is taken from the previous transformer block of the decoder. So on the next level of detail, the diagram looks like this.

This explains the first (and most important) architectural difference between an encoder transformer block (torch.nn.TransformerEncoderLayer in PyTorch) and a decoder transformer block (torch.nn.TransformerDecoderLayer) – a decoder block has two inputs, an encoder block only one.

The second difference is the attention mask. Attention is actually applied at all inputs of a transformer block. For an encoder, masking is typically not done, so that the encoder can look at all tokens in the input. For the decoder, masking is also usually not done for the attention layer connecting the encoder and the decoder, so that the decoder has access to all time steps of the encoder output. For the second attention layer of a decoder, however, which is connected to the decoder input or the output of the previous layer, a causal self-attention mask is usually applied, i.e. the decoder is not allowed to peek ahead, as discussed in the last post. This difference, however, is not hardcoded into the PyTorch classes, as the attention mask in PyTorch is a parameter.

Finally, there are decoder-only transformers, and this is the class of transformer on which we will focus in the next few posts. These are transformers that are built from encoder blocks, each of which receives only one input, so there is no encoder. However, the attention mask used for the input is the one used for the decoder, so that we can apply teacher forcing to train such a model. Thus a decoder-only transformer receives a sequence of length L as input, encoded as a matrix of shape (L, D) where D is the embedding dimension, and produces an output of the same shape. If we add an embedding layer and an output layer, we can apply such a model for language generation very much in the same manner as an RNN, and this is the use case we will study first.

So let us talk about transformer blocks (more specifically, transformer blocks with only one input, i.e. encoder blocks in the PyTorch terminology) in more detail. Instead of jumping directly to the paper, let us take a different approach and look at the source code of the corresponding PyTorch module. The key lines in the forward method are

 x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))
 x = self.norm2(x + self._ff_block(x))

Let us see what this does. First, we feed the input through a self-attention block. Note, however, that it is on us to specify a causal attention mask, as the model will not do this automatically.

Next, we apply a residual connection, i.e. we take the output of the self attention and add the original input to it. This helps to avoid vanishing gradients, as it allows a part of the gradient during backpropagation to flow directly back to the previous layer. Finally, we apply a layer normalization to the output, i.e. to the sum.

We then repeat the same pattern. This time, however, we run the result of the previous step through a feed-forward network before we again add the input to establish a residual connection and finally apply a normalization. Thus our full flow looks as follows.

So we repeat the pattern

\textrm{LayerNorm}(x + \textrm{Sublayer}(x))

twice, where first the sublayer is a self-attention layer and then the sublayer is a feed-forward network. This is, in fact, exactly the expression that we find in section 3.1 of the original paper.

Before we proceed, let us talk about dimensions. When working with transformers, it is common to present the input in the format (L, B, D), where L is the sequence length, B is the batch size and D is the dimension of the model. Each sublayer produces an output of the same shape, and so does the layer norm, so that the entire transformer block again produces an output of that shape.

In addition, a more careful look at the code shows that at the output of each sublayer, there is a dropout layer to avoid overfitting. As usual, the dropout is deactivated during inference and evaluation, but active during training.
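
To summarize the pattern, here is a minimal sketch of a transformer block implemented by hand, using post-layer normalization as in the original paper – in practice, we would of course simply use torch.nn.TransformerEncoderLayer.

import torch

class SimpleTransformerBlock(torch.nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward, dropout = 0.1):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, nhead, dropout = dropout)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, dim_feedforward),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.dropout1 = torch.nn.Dropout(dropout)
        self.dropout2 = torch.nn.Dropout(dropout)

    def forward(self, x, mask = None):
        #
        # First sublayer: self-attention, followed by dropout,
        # residual connection and layer normalization
        #
        attn_out = self.self_attn(x, x, x, attn_mask = mask)[0]
        x = self.norm1(x + self.dropout1(attn_out))
        #
        # Second sublayer: feed-forward network, again followed by
        # dropout, residual connection and layer normalization
        #
        x = self.norm2(x + self.dropout2(self.ff(x)))
        return x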

So far, so good, but we are still missing an important point in our overall approach. To spot this, let us go ahead and quickly try out a transformer block in PyTorch, then permute the tokens in the input, i.e. change the order of the input vectors, and run the new input through the transformer block again. Here is a short piece of code doing this (you can also have a look at the notebook that I prepared for this post).

# Model dimension
D = 8
# Length of encoder input
L = 4
#
# Create random input
#
X = torch.randn(L, D)
#
# and feed through an encoder block
#
block = torch.nn.TransformerEncoderLayer(d_model = D, nhead = 4)
block.eval()
Y = block(X).detach()
#
# Now permute the input, recompute
#
Xp = X.flip(dims = [0]).detach()
Yp = block(Xp).detach()
#
# Verify that Yp is simply the permutation of Y
#
print(f"{torch.allclose(Yp, Y.flip(dims = [0]))}")

We find that permuting the input to a transformer block simply results in the same permutation applied to the output. This is not surprising after a closer look at the formula for the attention scores, but it does not really match our intuition of a good encoding of a sentence. After all, the two sentences

The dog chases the cat

and

The cat chases the dog

are permutations of each other, but have completely different meanings, so that from a good encoding, we would expect more than just a permutation as well. The problem becomes even more apparent if we now simulate the first step of processing that encoder output in a decoder, i.e. passing it through the attention layer connecting encoder and decoder. Recall that this attention layer uses keys and values from the encoder output, but the queries from the decoder input. Let us simulate this.

# length of target sequence, i.e. decoder input
T = 3 
queries = torch.randn(T, D)
attn = torch.nn.MultiheadAttention(embed_dim = D, num_heads = 4)
#
# Put into eval mode to avoid dropout
#
attn.eval()
#
# Values and keys are both the encoder output
#
out = attn(queries, Y, Y)[0].detach()
outp = attn(queries, Yp, Yp)[0].detach()
#
# Compare
#
print(f"{torch.allclose(out, outp)}")

This is even worse – the output is exactly the same, even after permuting the encoder input. That means that a transformer trying to translate the two sentences above would most likely produce the same target sentence, which we can also verify directly by using PyTorchs transformer module instead of its individual building blocks.

transformer = torch.nn.Transformer(d_model = D, nhead = 1)
transformer.eval()
tgt = torch.randn(T, D)
src = torch.randn(L, D)
src_permuted = src.flip(dims = [0])
out = transformer(src, tgt)
_out = transformer(src_permuted, tgt)
print(out)
print(_out)
print(f"{torch.allclose(out, _out)}")

So transformer models as discussed so far are not sensitive to the order of words in the input. To use them for NLP tasks, we therefore need an additional trick – positional embeddings – which we will discuss in the next post.