A large part of the success of GPT-3.5 and GPT-4 is attributed to the fact that these models underwent, in addition to pre-training and supervised instruction fine-tuning, a third phase of learning called reinforcement learning from human feedback. There are many posts on this, but unfortunately most of them stay at the surface and fall short of explaining how the training method really works. If you want to know more, read on – we will close this gap today, which also concludes this series on large language models. But buckle up, this is going to be a long post.
What is reinforcement learning?
Pretraining a language model using teacher forcing and fine-tuning using supervised training have one thing in common – a ground truth, i.e. a label, is known at training time. Suppose for instance we train a model using teacher forcing. We then feed the beginning of a sentence that is part of our training data into the model and, at position n, teach the model to predict the token at position n + 1. Thus we have a model output and a label, calculate a loss function that measures the difference between those two and use gradient descent to minimize the loss. Similarly, if we fine-tune on a set of instruction-response pairs, we again have a model output and a label that we can compare in exactly the same way.
However, language is more complicated than this. Humans do not only rate a reply to a question on the level of individual tokens, but also by characteristics that depend on the sentence as a whole – whether a reply is considered on topic or not is not only a function of the individual tokens that appear in it.
So if we want to use human feedback to train our model, we will have to work with feedback, maybe in the form of some score, that is attached not to an individual token, but to the entire sequence of tokens that make up a reply. To be able to apply methods like gradient descent, we need to somehow propagate that score down to the level of individual tokens, i.e. individual sampling steps. Put differently, we will need to deal with delayed feedback, i.e. constituents of the loss function that are not known after sampling an individual token but only at a later point in time when the entire sentence is complete. Fortunately, the framework of reinforcement learning gives us a few tools to do this.
To understand the formalism of reinforcement learning, let us take a look at an example. Suppose that you wanted to develop a robot that acts as an autonomous vacuum cleaner, able to navigate through your apartment, clean as much of it as possible and eventually return to a docking station to re-charge its batteries. While moving through the apartment, the robot constantly needs to take decisions like moving forward or backward or rotating to change direction. However, sometimes the benefit of a decision is not fully apparent at the point in time when it has to be taken. When it reaches a new room, for instance, the robot might enter it and try to clean it as well, but doing this might move it too far away from the charging station, and it might run out of energy on the way back.
This is an example where the robot has to take a decision which is a trade-off – do we want to achieve a short-term benefit by cleaning the additional room, or do we consider the benefit of being recharged without manual intervention as more important and head back to the charging station? Even worse, the robot might not even know the surface area of the additional room and would have to learn by trial and error whether it can cover the room without running out of power.
Let us formalize this situation a bit. Reinforcement learning is about an agent like our vacuum cleaning robot that operates in an environment – our apartment – and, at each point in time, has to select one out of a set of possible actions, like moving forward, backward, left or right. When selecting an action, the robot receives feedback from the environment called a reward, which is simply a scalar value that signals to the agent whether the decision it has taken is considered beneficial or not. In our example, for instance, we could give the robot a small positive reward for every square meter it has cleaned, a larger positive reward for reaching the charging station and a small negative reward, i.e. a penalty, when it bumps into a wall.
In every time step, the agent can select one of the available actions. The environment will then provide the reward and, in addition, update the state of the system (which, in our case, could be the location of the agent in the apartment and the charging state of the battery). The agent then receives both the reward and the new state and proceeds with the next time step.
Let us now look at two ways to formalize this setting – mathematics and code. In mathematical terms, a problem in reinforcement learning is given by the following items. First, there is a set of states, which we typically assume to be finite, and a set of possible actions, which we assume to be finite as well. In addition, there is a set of rewards.
The reward that the agent receives for a specific action and the next state depend on the current state and the action. We could now proceed and define this as a mapping

$$(s, a) \mapsto (r, s')$$
However, in many cases, this assignment has a probabilistic character, i.e. the reward and the next state are not fully deterministic. Therefore, the framework of a Markov decision process (MDP) that we use here describes rewards and next state via a probability distribution conditioned on the current state and the action, i.e. as

$$P(S_{t+1} = s', R_{t+1} = r \,|\, S_t = s, A_t = a)$$
where $S_{t+1}$ is the state at time $t+1$, $S_t$ is the state at time $t$, $A_t$ is the action taken at time $t$ and $R_{t+1}$ is the reward that the agent receives at time $t$ (this convention for the index of the reward might appear a bit unusual, but is in fact quite common in the field). Collectively, these conditional probabilities are called transition probabilities.
Note that we are making two crucial assumptions here. The first one is that the reward and next state depend only on the action and the current state, not on any previous states and not on the history of events so far (the Markov property). This is a restriction on how we model a real-world problem rather than a restriction on the set of problems to which we can apply the framework, as we could simply redefine our state space to be the full history up to the current time step. Second, we assume that the agent has access to the full state on which rewards and next states depend. There are also more general frameworks in which the agent only receives an observation that reflects a partial view of the actual state (this is the reason why some software packages use the term observation instead of state).
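To make the structure of these transition probabilities a bit more tangible, here is a small, purely illustrative Python sketch – the states, actions and numbers are made up for the vacuum-cleaner example and are not part of any real framework.

```python
import random

# Purely illustrative: a tiny MDP for the vacuum-cleaner robot. For every
# (state, action) pair we store a list of (probability, next_state, reward)
# triples, i.e. the conditional distribution of next state and reward.
transitions = {
    ("dirty", "clean"): [(0.9, "clean", 1.0), (0.1, "dirty", 0.0)],
    ("dirty", "dock"):  [(1.0, "docked", 0.5)],
    ("clean", "clean"): [(1.0, "clean", 0.0)],
    ("clean", "dock"):  [(1.0, "docked", 1.0)],
}

def sample_step(state, action):
    """Sample a next state and a reward from the transition probabilities."""
    outcomes = transitions[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

print(sample_step("dirty", "clean"))
```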
Let us now look at this from a different angle and see how this is modeled as an API. For that purpose, we will look at the API of the popular Gymnasium framework. Here, the central object is an environment that comes with two main methods – step and reset. The method reset resets the environment to an initial state and returns that state. The method step is more interesting. It accepts an action and returns a reward and the new state (called an observation in the Gym framework), exactly as indicated in the diagram above.
Thus an agent that uses the framework would essentially sit in a loop. In every step, it would pick an action that it wants to perform and then call the step method of the environment, which would trigger the transition into a new state, depending on the action passed as argument. From the returned data, the agent would learn about the reward it has received as well as about the next state.
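In code, such a loop could look roughly as follows. This is a minimal sketch against the Gymnasium API, using the CartPole-v1 toy environment as a stand-in and an agent that simply picks random actions – a real agent would consult its policy instead.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset()

total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # pick an action (here: randomly)
    # the environment returns the new state (observation) and the reward
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")
env.close()
```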
When would the loop end? This does of course depend on the problem at hand, but many problems are episodic, meaning that at some point in time, a terminal state is reached which is never left again and in which no further rewards are granted. The series of time steps until such a terminal state is reached is called an episode. We will see later that in applications to language modelling, an episode could be a turn in a dialogue or a sentence until an end-of-sentence token is sampled. However, there are also continuing tasks which potentially go on forever. An example could be a robot that learns to walk – there is no obvious terminal state in this task, and we would even encourage the robot to continue walking as long as possible.
Policies and value functions
We have now described a mathematical framework for states and actions. What we have not yet discussed, however, is how the agent actually takes decisions. Let us make the assumption that the agent's decisions only depend on the state (and not, for instance, on the time step). The rule which assigns an action to each state is called the policy of the agent. Again, we could model this in a deterministic way, i.e. as a function

$$\pi \colon S \rightarrow A$$
but it turns out to be useful to also allow for stochastic policies. A policy is then again a probability distribution which describes the probability that the agent will take a certain action, conditioned on the current state. Thus, a policy is a conditional probability

$$\pi(a \,|\, s) = P(A_t = a \,|\, S_t = s)$$
Note that given a policy, we can calculate the probability to move into a certain state s' starting from a state s as

$$P(S_{t+1} = s' \,|\, S_t = s) = \sum_a \pi(a \,|\, s) \, P(S_{t+1} = s' \,|\, S_t = s, A_t = a)$$
giving us a classical Markov chain.
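Here is a small numerical illustration of this marginalization over the actions; the transition probabilities and the policy below are made-up numbers for a toy problem with two states and two actions.

```python
import numpy as np

# P[a, s, t] is the probability of moving from state s to state t under action a,
# pi[s, a] is a stochastic policy pi(a | s). All numbers are illustrative only.
P = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # transition probabilities for action 0
    [[0.5, 0.5], [0.1, 0.9]],   # transition probabilities for action 1
])
pi = np.array([[0.6, 0.4],
               [0.2, 0.8]])

# Summing over the actions, weighted by the policy, gives the transition
# matrix of the induced Markov chain - each row sums to one.
markov_chain = np.einsum("sa,ast->st", pi, P)
print(markov_chain)
```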
What makes a policy a good policy? Usually, the objective of the agent is to maximize the total reward. More precisely, for every time step t, the reward $R_{t+1}$ at this time step is a random variable of which we can take the expectation value. To measure how beneficial being in a certain state s is, we could now try to use the sum

$$\sum_{t=0}^{\infty} \mathbb{E}\left[ R_{t+1} \,|\, S_0 = s \right]$$
of all these conditional expectation values as a measure for the total reward. Note that the conditioning simply means that in order to calculate the expectation values, we only consider trajectories that start at the given state s.
However, it is not obvious that this sum converges (and it is of course easy to come up with examples where it does not). Therefore, one usually builds a discounting factor $\gamma \in (0, 1)$ into this equation and defines the value of a state s to be the conditional expectation value

$$v(s) = \mathbb{E}\left[ \left. \sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\right|\, S_0 = s \right]$$
Note that the discounting factor plays two roles in this definition. First, it makes sure that under mild conditions, for instance a bound on the rewards, the sum is finite. Second, the value of a reward received early in the trajectory is higher than the value of the same reward at a later time step, so the discounting encourages the agent to collect rewards as early as possible. Whether this is reasonable for the problem at hand is a matter of modelling, and in real implementations, we often see discounting factors close to one.
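As a quick illustration, this is how the discounted sum of rewards along one observed trajectory could be computed – it is this random quantity whose conditional expectation defines the value of the starting state. The reward sequence in the example is made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a sequence of observed rewards, sum_t gamma^t * r_t."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0]))
```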
The mapping that assigns to each state s the value v(s) defined above is called the value function. Note that the expectations are taken with respect to the state transition probabilities induced by the policy $\pi$, and in fact the value function depends on the policy. In general, the objective of the agent will be to find a policy which maximizes the value function for all states or for a distinguished state.
In addition to this value function that assigns a value to a state, there are also two additional functions that we will have to use. First, instead of assigning a value to a state, we could as well assign a value to a state and a given action. This action-value function is usually denoted by q and defined as

$$q(s, a) = \mathbb{E}\left[ \left. \sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\right|\, S_0 = s, A_0 = a \right]$$
In other words, this is the expected discounted return if we start in state s and choose a as the first action before we start following the policy. Finally, the advantage function

$$A(s, a) = q(s, a) - v(s)$$
essentially measures the additional benefit (or penalty) of choosing a as the first action in state s compared to the average under the policy.
PPO
To explain PPO, let us simplify things a bit and assume that we have picked a dedicated starting state $s_0$ and are aiming at maximizing the outcome for this state only, i.e. we are trying to find a policy that maximizes the value of this state. Suppose that we start with a randomly chosen policy $\pi$ and are striving to identify a better policy $\pi'$. As a starting point, we can relate the value of our state under the policy $\pi'$ to that under $\pi$ and find, after some calculations that I will not reproduce here and that you can find in [1], that

$$v_{\pi'}(s_0) = v_{\pi}(s_0) + \sum_{s} \rho_{\pi'}(s) \sum_a \pi'(a \,|\, s) \, A_{\pi}(s, a)$$
Here, the outer sum is an expectation value taken with respect to $\pi'$, more precisely with respect to the so-called discounted state distribution defined by

$$\rho_{\pi'}(s) = \sum_{t=0}^{\infty} \gamma^t \, P(S_t = s \,|\, \pi', S_0 = s_0)$$

We will not go through the proof, but intuitively, this makes sense – note that in order to maximize the right-hand side, we need to amplify those probabilities $\pi'(a \,|\, s)$ for which the advantage taken with respect to $\pi$ is positive, i.e. those actions which represent an improvement over the current policy $\pi$.
Let us now suppose that we model the policy $\pi'$ using a neural network (spoiler: later, this will be our language model). We could then try to maximize the right-hand side of this equation using gradient descent (or rather gradient ascent), and this relation would tell us that by doing so, we would obtain a better policy $\pi'$. Unfortunately, there are two problems with this – the sampling and the unknown advantage function.
Let us first take a look at the advantage function. To calculate the loss function (or rather the objective, as we want to maximize it, not minimize it), we need the values of the advantage function. To obtain them, PPO uses the so-called actor-critic approach: in addition to the model describing the policy (called the actor model) which we want to optimize, we maintain a second model, called the critic, which is trained to learn the value function. We will later see that in each iteration of the PPO algorithm, we sample a few trajectories using the current policy, and during this phase, we can observe the resulting rewards. Together with the critic's value estimates, these rewards give us an estimate of the advantage function, and the observed returns serve as ground truth for the critic model, so that we can apply gradient descent to train the critic to approximate the value function as closely as possible. In practice, the actor model and the critic model have the same architecture and only differ in the top-level layer, or even share layers and weights.
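To make this more concrete, here is a minimal sketch of generalized advantage estimation (GAE, see [2]), the method that is commonly used to turn observed rewards and the critic's value estimates into advantage estimates; the function and parameter names are my own, and the toy numbers at the end are made up.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates for one sampled trajectory via GAE.

    rewards: tensor of shape (T,) with the observed rewards
    values:  tensor of shape (T+1,) with the critic's value estimates,
             including the value of the state reached after the last step
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # one-step temporal difference error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of the future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# toy trajectory with made-up numbers
print(gae_advantages(torch.tensor([0.0, 0.0, 1.0]),
                     torch.tensor([0.1, 0.2, 0.4, 0.0])))
```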
The second problem is more difficult to solve. When looking at the equation above, we see that the policy $\pi'$ – which we want to tune to maximize our objective – appears in two different places, namely inside the sum over the actions but also in the probability distribution that we use to define the expectation value. To estimate the expectation value, we would have to sample from this distribution, but after the first iteration of gradient ascent, we have changed $\pi'$, and all our samples, being taken with the previous value of the weights, would become stale. Even worse, sampling is a non-differentiable operation, so we cannot simply apply a backward pass to calculate the gradient that we need to apply to the weights.
In [1], an algorithm that has become known as TRPO (Trust region policy optimization) was proposed that deals with this problem by making an approximation. Specifically, instead of using the loss function (note the minus sign to be able to use gradient descent)

$$L(\pi') = - \sum_{s} \rho_{\pi'}(s) \sum_a \pi'(a \,|\, s) \, A_{\pi}(s, a)$$
it uses the loss function

$$L(\pi') = - \sum_{s} \rho_{\pi}(s) \sum_a \pi'(a \,|\, s) \, A_{\pi}(s, a)$$
Note the subtle but important difference – we take the expectation value with respect to the distribution $\rho_{\pi}$, not $\rho_{\pi'}$ as before. This approximation turns out to work well as long as $\pi$ and $\pi'$ are not too different (in [1], this is spelled out using the so-called Kullback-Leibler divergence; we will see soon that PPO uses a slightly different approach to enforce this). Trivially, this is the same as

$$L(\pi') = - \sum_{s} \rho_{\pi}(s) \sum_a \pi(a \,|\, s) \, \frac{\pi'(a \,|\, s)}{\pi(a \,|\, s)} \, A_{\pi}(s, a)$$
which we can write as an expectation value over the actions as well, namely

$$L(\pi') = -\, \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi}\left[ \frac{\pi'(a \,|\, s)}{\pi(a \,|\, s)} \, A_{\pi}(s, a) \right]$$
This is not yet exactly the loss function that PPO actually uses, but let us pause there for a moment and use this preliminary loss function to discuss the overall structure of the algorithm.
First, we initialize both models – the actor model that will learn our policy and the critic model that will learn the value function. Next, we perform the actual training iterations. In each iteration, we go through two phases as indicated in the diagram above – the sampling phase and the optimization phase.
In the sampling phase, we use our current policy $\pi$ to sample a few trajectories until we have collected data from a pre-defined number of state transitions, typically a few hundred or thousand. However, we do not process these state transitions directly but first store them in a buffer.
Once this is done, we enter the second phase, the optimization phase. In this phase, we randomly sample mini-batches from the previously filled buffer. Each item in one of the batches will give us one term of the form

$$\frac{\pi'(a \,|\, s)}{\pi(a \,|\, s)} \, A_{\pi}(s, a)$$
i.e. one term in the loss function (behind the scenes, we use the critic model and a method known as GAE (see [2]) to calculate the advantage term). We add up these terms for a batch, obtain the loss function and apply a gradient descent step. This gives us new, updated weights of the actor model and therefore a new policy $\pi'$. Note, however, that the samples are still from the old policy $\pi$. We then repeat this for a given number of batches and epochs.
At this point, one iteration is complete. We now repeat the procedure for the next iteration, using $\pi'$ as the starting policy from which we sample. In each iteration, we obtain a slightly improved policy $\pi'$, and we repeat this until we have reached convergence or a predefined number of iterations.
This already looks like a meaningful approach, but we have ignored one important point. Our loss function is an approximation, which gets worse within each iteration as $\pi'$ starts to deviate significantly from $\pi$. To solve this, the PPO algorithm proposed in [3] uses the ratio

$$r(s, a) = \frac{\pi'(a \,|\, s)}{\pi(a \,|\, s)}$$
as a measure for the distance between $\pi$ and $\pi'$. To control this distance and thus the error that we make by using the approximation for the loss function, PPO clips this ratio, i.e. restricts it to an interval of the form $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a hyperparameter that is often chosen to be 0.2. This additional clipping yields the full loss function that PPO uses – in reality, the loss function is even a bit more complicated, as the clipping logic depends on the sign of the advantage function, and reads as follows.

$$L(\pi') = -\, \mathbb{E}\left[ \min\left( r(s, a) \, A_{\pi}(s, a),\ \mathrm{clip}\big(r(s, a),\, 1 - \epsilon,\, 1 + \epsilon\big) \, A_{\pi}(s, a) \right) \right]$$
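In code, the clipped loss for one minibatch could look roughly like the following PyTorch sketch; the function name and toy numbers are mine, and real implementations typically add a value-function loss and an entropy bonus on top.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped PPO surrogate loss for one minibatch of sampled transitions.

    log_probs_new: log pi'(a|s) under the current weights of the actor
    log_probs_old: log pi(a|s) recorded when the transitions were sampled
    advantages:    corresponding advantage estimates
    """
    ratio = torch.exp(log_probs_new - log_probs_old)   # pi'(a|s) / pi(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # elementwise minimum of unclipped and clipped term, negated and averaged
    # so that gradient descent on the loss maximizes the objective
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# toy minibatch with made-up numbers
print(ppo_clip_loss(torch.tensor([-0.9, -1.2]),
                    torch.tensor([-1.0, -1.0]),
                    torch.tensor([0.5, -0.3])))
```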
PPO and language models
Having discussed PPO in a more general setting, let us now translate the terminology used in the previous few sections to the world of language models. First, we need to define what states and actions we will use to describe the training of a language model as a problem in reinforcement learning.
Intuitively, our state should correspond to the state of the conversation that a human has with the model. Therefore, a state is given by a prompt (which we would in practice sample from a set of prompts serving as test data) along with a partial completion, i.e. a sequence of tokens following the prompt. The actions are simply the tokens in the vocabulary, and the state transition is deterministic and modelled by simply appending the token. Here is an example.
In this case, the initial state is the prompt “Hi, how are you today?”. When we sample a trajectory starting at this state, our first action could be the token “I”. We now append this token to the existing prompt and end up in the state given by the sequence of tokens “Hi, how are you today? I”. Next, we might sample the token “am”, which takes us to the state “Hi, how are you today? I am” and so forth. A state is considered a terminal state if the last token is a dedicated end-of-sentence token used to indicate that the reply of the model is complete. Thus, an episode starts with a prompt and ends with a full reply from the model – in other words, an episode is a turn in the conversation with the model.
With this, it should also be clear what our policy is. A policy determines the probability for an action given a state, i.e. the probability of the next token given a sequence of tokens consisting of the initial prompt and the already sampled tokens – and this is of course exactly what a language model does. So our policy is simply the language model we want to train.
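To illustrate how the language model acts as the policy, here is a sketch of how one could sample such a trajectory with a Hugging Face causal language model; "gpt2" is only a small stand-in model, and the per-token log probabilities are recorded because PPO will need them later.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# the initial state is the prompt, encoded as a sequence of tokens
state = tokenizer("Hi, how are you today?", return_tensors="pt").input_ids
log_probs = []
for _ in range(20):
    logits = model(state).logits[:, -1, :]        # distribution over the next token
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                        # the action: one sampled token
    log_probs.append(dist.log_prob(action))       # log pi(a | s), needed for PPO
    state = torch.cat([state, action.unsqueeze(0)], dim=-1)  # append token -> new state
    if action.item() == tokenizer.eos_token_id:   # terminal state: end-of-sentence token
        break

print(tokenizer.decode(state[0]))
```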
To implement PPO, we also need a critic model. We could, for instance, again use a transformer that might or might not share some weights with the policy model, equipped with a single output head that models the value function.
Finally, we need a reward. Recall that the reward function is a (stochastic) function of states and actions. Thus, given an initial prompt x and a partial completion y, of which the last token is the selected action, we want to assign a reward r(x, y). This reward has two parts. The first part is only non-zero if y is complete, i.e. at the end of an episode, and – for reasons that will become clear in a minute – we will denote this part by RM(x, y).
The second part is designed to address the issue that during reinforcement learning, the model might learn a policy that decreases the overall quality of the language model as it deviates too much from the initial model. To avoid this, this second term is a penalty based on the Kullback-Leibler divergence between the current policy and a reference policy (if you have never heard the term Kullback-Leibler divergence before, think of it as a distance between two probability distributions). This penalty term is

$$- \beta \, \log \frac{\pi(y \,|\, x)}{\pi_{\mathrm{ref}}(y \,|\, x)}$$
where $\beta$ is a hyperparameter which might be dynamically adapted during training.
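Putting the two parts together, the per-token rewards for one episode could be assembled roughly as in the sketch below; the function name, the tensor shapes and the exact bookkeeping are assumptions for illustration and differ between implementations.

```python
import torch

def rlhf_rewards(log_probs, ref_log_probs, reward_model_score, beta=0.1):
    """Per-token rewards for one complete episode (one reply).

    log_probs:          log pi(a_t | s_t) of the policy for each sampled token
    ref_log_probs:      the same log probabilities under the frozen reference model
    reward_model_score: the scalar score RM(x, y) for the complete reply
    beta:               weight of the Kullback-Leibler style penalty
    """
    # the penalty is applied at every token and keeps the policy
    # close to the reference model
    rewards = -beta * (log_probs - ref_log_probs)
    # the reward model's score is only granted at the end of the episode
    rewards[-1] = rewards[-1] + reward_model_score
    return rewards

print(rlhf_rewards(torch.tensor([-1.0, -2.0, -0.5]),
                   torch.tensor([-1.1, -1.5, -0.7]),
                   reward_model_score=0.8))
```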
Let us now talk about the actual reward RM(x, y). In an ideal world, this could be a reward determined by a human labeler that reflects whether the output is aligned with certain pre-defined criteria like accuracy, consistency or the absence of inappropriate language. Unfortunately, this labeling step would be an obvious bottleneck during training and would restrict us to a small number of test data items.
To solve this, OpenAI (building on previous work of other groups) trained a separate model, the so-called reward model, to predict the reward that a human labeler would assign to the reply. This reward model was first trained using labeled test data, i.e. episodes rated by a workforce of human labelers. Then, the output of this model (called RM(x,y) above) was used as part of the reward function for PPO. Thus, there are actually four models involved in the training process, as indicated in the diagram below.
First, there is the model that we train, also known as the actor model or the policy. This is the policy from which we sample, and we also use the probabilities that it outputs as inputs to the loss function. Next, there is, as always when using the PPO algorithm, the critic model which is only used during training and is trained to estimate the value of a state. The reward model acts as a substitute for a human labeler and assigns rewards to complete trajectories, i.e. episodes.
The reference model is a copy of the policy model which is frozen at the beginning of the training. This model is sometimes called the SFT model, where SFT stands for supervised fine tuning, as in the full training cycle used by OpenAI described nicely in this blog post, it is the result of a pretraining on a large data set plus fine tuning on a collection of prompts and gold standard replies created by a human workforce. Its outputs are used to calculate the Kullback-Leibler divergence between the current version of the policy and the reference model. All these ingredients are then put together in the loss function. From the loss function, gradients are derived and the critic as well as the actor are updated.
This completes our discussion of how PPO can be used to align language models with human feedback, an approach that has become known as reinforcement learning from human feedback (RLHF). Even though this is a long post, there are many details that we have not yet covered, but armed with this post and the references below, you should now be able to dig deeper if you want. If you are new to reinforcement learning, you might want to consult a few chapters from [6] which is considered a standard reference, even though I find the mathematics a bit sloppy at times, and then head over to [1] and [3] to learn more about PPO. If you have a background in reinforcement learning, you might find [3] as well as [4] and [5] a good read.
This post is also the last one in my series on large language models. I hope I could shed some light on the mathematics and practical methods behind this exciting field – if yes, you are of course invited to check out my blog once in a while, where I will most likely continue to write about this and other topics from machine learning.
References
[1] J. Schulman et al., Trust Region Policy Optimization, available as arXiv:1502.05477
[2] J. Schulman et al., High-Dimensional Continuous Control Using Generalized Advantage Estimation, available as arXiv:1506.02438
[3] J. Schulman et al., Proximal Policy Optimization Algorithms, available as arXiv:1707.06347
[4] D. Ziegler et al., Fine-Tuning Language Models from Human Preferences, available as arXiv:1909.08593
[5] N. Stiennon et al., Learning to summarize from human feedback, available as arXiv:2009.01325
[6] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, available online