Mastering large language models – Part XI: encoding positions

In our last post, we saw that the attention mechanism is invariant to position, i.e. reordering the words in a sentence yields the same output, which implies that position information is lost. Clearly, position information does matter in NLP, and therefore transformer-based models apply positional embeddings in addition to the pure word embeddings.

Before explaining positional embeddings, let us first recall word embeddings in general. Suppose we are working with sentences consisting of L tokens taken out of a fixed vocabulary. Then creating vectors out of these tokens amounts to mapping each token

\textrm{token}_1, \textrm{token}_2, \textrm{token}_3, \dots, \textrm{token}_L

to a vector v_i. This encoding, however, is usually position independent, meaning that if the same token appears at two different positions in the sentence, the resulting vector will be the same for both instances of the token. Technically, the embedding can be described by a matrix of dimension (D, V), where D is the model dimension and V is the size of the vocabulary, and each column in this matrix represents the embedding for one token.
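To make this a bit more concrete, here is a minimal sketch in PyTorch of a word embedding as a trainable lookup table, using made-up values for V and D. Note that nn.Embedding stores the transpose of the matrix described above, i.e. one row per token.

import torch
import torch.nn as nn

V = 1000                                  # vocabulary size (made-up value)
D = 32                                    # model dimension (made-up value)
#
# An embedding layer is essentially a trainable matrix with one row per token
#
embedding = nn.Embedding(V, D)
#
# A single "sentence" of length 5, given as token indices - note that the
# token with index 3 appears twice
#
token_ids = torch.tensor([[3, 17, 3, 42, 7]])
#
# The lookup returns one vector of dimension D per token, and the two
# occurrences of token 3 are mapped to identical vectors
#
vectors = embedding(token_ids)
print(vectors.shape)                      # torch.Size([1, 5, 32])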

In the original transformer paper [1], these word embeddings are augmented by an additional positional embedding. Similar to word embeddings, this again attaches a vector of dimension D to every token in the input. However, this embedding now solely depends on the position of the token, and not on the actual token itself, so that different tokens at the same position yield the same embedding. This additional embedding is then added to the word embedding to obtain one vector per token, which is then fed into the actual model.

But what is the vector that we want to add to the word embedding at position p? Obviously, there are many choices (reference [3] is a good survey of existing approaches). The approach taken in the original transformer is to use fixed values that are not learned or modified during training and are given by the equations

\textrm{PE}(p, i) = \begin{cases}     \sin(C^{i / D} p) & i \, \textrm{even} \\     \cos(C^{(i-1) / D} p) & i \, \textrm{odd} \end{cases}

where C is a fixed constant (actually 1 / 10000), p is the position of the token in the sentence and i is the vector index, so that PE_p, or PE[p, :] in Python notation, is the vector added to the word embedding of a token at position p. Here we assume that the model dimension D is an even number.

The motivation provided for this specific choice in [1] is that for a fixed offset k, the positional encoding vectors PE_{p+k} and PE_p are related by a linear transformation, so that, as the authors phrase it in section 3.5 of [1], “it would allow the model to easily learn to attend by relative positions”. Let us quickly verify this. For any i, component 2i of the vector PE_{p+k} is given by

\begin{aligned}PE_{p+k, 2i} &= \sin(C^{2i / D} (p+k)) \\ &= \sin(C^{2i/D}p) \cos(C^{2i/D}k) + \cos(C^{2i/D}p) \sin(C^{2i/D}k) \\ &= \cos(C^{2i/D}k) PE_{p,2i} + \sin(C^{2i/D}k) PE_{p,2i+1} \end{aligned}

and similarly, the component 2i+1 is given by

\begin{aligned} PE_{p+k,2i+1} &= \cos(C^{2i / D} (p+k)) \\ &= \cos(C^{2i / D}p)\cos(C^{2i / D}k) -  \sin(C^{2i / D}p)\sin(C^{2i / D}k) \\ &= \cos(C^{2i / D}k)PE_{p,2i+1} - \sin(C^{2i / D}k)PE_{p,2i} \end{aligned}

proving our claim. Note that these two relations can nicely be organized into one equation as

\begin{pmatrix} PE_{p+k, 2i} \\ PE_{p+k, 2i+1}  \end{pmatrix} = \begin{pmatrix} \cos(C^{2i / D}k) & \sin(C^{2i / D}k) \\ - \sin(C^{2i / D}k) & \cos(C^{2i / D}k) \end{pmatrix} \begin{pmatrix} PE_{p, 2i} \\ PE_{p, 2i+1}  \end{pmatrix}

showing that the two vectors are related by a rotation (we will get back to this point later when we talk about relative embeddings).
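If you want to convince yourself that this really works, here is a quick numerical check, using made-up values for D, the position p, the offset k and the component index i, which builds the components 2i and 2i+1 of PE_p and PE_{p+k} and verifies the rotation identity above.

import math
import torch

D, C = 16, 1.0 / 10000                    # model dimension and the constant C (example values)
p, k, i = 5, 3, 2                         # position, offset and component index (example values)

theta = C ** (2 * i / D)

def pe_pair(pos):
    #
    # Components 2i and 2i+1 of the positional encoding at position pos
    #
    return torch.tensor([math.sin(theta * pos), math.cos(theta * pos)])

#
# Rotation matrix from the equation above - it only depends on the offset k
#
R = torch.tensor([[math.cos(theta * k), math.sin(theta * k)],
                  [-math.sin(theta * k), math.cos(theta * k)]])

print(torch.allclose(R @ pe_pair(p), pe_pair(p + k)))    # True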

Let us quickly see how we could code this in Python. As we need one embedding vector of dimension D for each position, the embeddings form a matrix of shape (L, D), where each row represents one position. Applying the word embeddings to an input sequence of length L also results in a matrix of shape (L, D), and we can simply add these two matrices together. Of course, we would not recalculate the sine and cosine values above for every input, but store them in a global variable, an attribute or (better) a PyTorch buffer. Here is a piece of code to calculate the embeddings, taken from the corresponding section in d2l.ai (note that this only works if D is even, which, however, is usually the case – in fact, dimensions are often multiples of 8, 16 or 32 to better match the inner workings of modern GPUs).

import torch

#
# Hyperparameters: context size L, model dimension D (even) and the constant C
# (the values for L and D are just examples)
#
L, D, C = 128, 32, 1.0 / 10000
#
# Values for p and 2*i / D
#
_p = torch.arange(L, dtype=torch.float32).unsqueeze(dim=1)
_i = torch.arange(0, D, step=2) / D
#
# x = C**(2*i/D) * p
#
x = torch.pow(C, _i) * _p
#
# Apply sine and cosine and organize into a matrix pe of shape (L, D)
#
pe = torch.zeros(L, D)
pe[:, 0::2] = torch.sin(x)
pe[:, 1::2] = torch.cos(x)
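To see how this matrix would typically be used, here is a sketch (my own, not taken from d2l.ai or any particular library) of a small module that stores the precomputed values in a buffer and adds them to the output of a word embedding layer.

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):

    def __init__(self, D, L, C=1.0 / 10000):
        super().__init__()
        _p = torch.arange(L, dtype=torch.float32).unsqueeze(dim=1)
        _i = torch.arange(0, D, step=2) / D
        x = torch.pow(C, _i) * _p
        pe = torch.zeros(L, D)
        pe[:, 0::2] = torch.sin(x)
        pe[:, 1::2] = torch.cos(x)
        #
        # Register as a buffer, so that pe is moved along with the module
        # (e.g. to the GPU) but is not a trainable parameter
        #
        self.register_buffer("pe", pe)

    def forward(self, x):
        #
        # x has shape (batch size, sequence length, D) - add the encodings
        # for the first x.shape[1] positions
        #
        return x + self.pe[:x.shape[1], :]

#
# Usage: combine with a word embedding layer (example values for V, D, L)
#
V, D, L = 1000, 32, 128
embedding = nn.Embedding(V, D)
pos_encoding = PositionalEncoding(D, L)
token_ids = torch.randint(0, V, (4, 20))
out = pos_encoding(embedding(token_ids))   # shape (4, 20, 32)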

For more on this and a few visualisations, you might want to take a look at the notebook for this post.

Note that at this point, we need to make an assumption on the maximum length of the input sequence that we will be using, called the context size. The context size is also an important parameter for training and inference, as the attention matrices have shape (L, L), so that a large context size increases memory usage and computational complexity.

As already mentioned, this way to construct embeddings is not the only one. A second method that has gained some popularity and is for instance applied in the GPT models ([4]) is to use learned embeddings. With this approach, the positional embeddings are treated similarly to the word embeddings, i.e. they are simply parameters that are initialized with random values and then updated during the backward passes like any other parameter of the network.
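Conceptually, this can be realized with the same machinery as the word embedding itself, just indexed by positions instead of token IDs. Here is a minimal sketch (not the actual GPT code, just an illustration with made-up dimensions).

import torch
import torch.nn as nn

V, D, L = 1000, 32, 128                    # vocabulary size, model dimension, context size (example values)

token_embedding = nn.Embedding(V, D)       # indexed by token IDs
position_embedding = nn.Embedding(L, D)    # indexed by positions, learned during training

token_ids = torch.randint(0, V, (4, 20))   # batch of 4 sequences of length 20
positions = torch.arange(20)               # 0, 1, ..., 19
#
# Both embeddings are simply added - the positional part is updated during
# the backward pass like any other parameter
#
x = token_embedding(token_ids) + position_embedding(positions)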

Both the sinusoidal embeddings used in the original transformer paper as well as the learned embeddings used by GPT are absolute embeddings, i.e. the value of an embedding depends on the absolute position of a token in a sentence. In [2], a different approach is taken which is known as relative embeddings and relies on a modification of the self-attention mechanism to incorporate positional information rather than on absolute positional embeddings added to the word embeddings before processing them in the attention layer.

More specifically, when calculating attention scores and before applying the softmax, the usual dot product between the query at position i and the key at position j is modified by adding to the key a bias vector w_{j-i} that depends only on the relative position, i.e. the difference j – i:

q \cdot \left( k + w_{j-i}\right)^t

These bias vectors are then learned, similar to the word embeddings. This modification is applied not only in the first layer, but in all layers. In addition, to avoid having to learn 2L – 1 embeddings for each layer and each attention head, the authors clip the relative position, i.e. instead of using the index j – i, which can range from -(L – 1) to L – 1, they use the modified index

\max(-k, \min(k, j-i))

so that the relative position is clipped to values between -k and k, where k is a hyperparameter. For an efficient implementation, see [5].
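To illustrate how the clipping and the bias vectors enter the attention scores, here is a schematic sketch for a single attention head, computed naively for clarity. All names and dimensions are made up, and the actual implementation in [5] is considerably more efficient.

import torch

L, D_head, clip = 6, 8, 2                 # sequence length, head dimension, clipping parameter (example values)

q = torch.randn(L, D_head)                # queries, one row per position
k = torch.randn(L, D_head)                # keys, one row per position
#
# One learnable bias vector per clipped relative position -clip, ..., clip
#
w = torch.nn.Parameter(torch.randn(2 * clip + 1, D_head))
#
# Matrix of relative positions rel[i, j] = j - i, clipped and shifted
# into the range 0, ..., 2*clip for indexing into w
#
rel = torch.arange(L).unsqueeze(0) - torch.arange(L).unsqueeze(1)
rel = rel.clamp(-clip, clip) + clip
#
# scores[i, j] = q_i . (k_j + w_{j-i}) - these are the values that would
# then go into the softmax
#
scores = torch.einsum("id,ijd->ij", q, k.unsqueeze(0) + w[rel])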

Many variations of this approach are possible. An interesting example is the T5 model described in [6] which we will look at in more detail in a later post. In this model, the position dependent vector added to the key before taking the dot product with the query is replaced by a scalar added after taking the dot product, so that the dot product is replaced by

q \cdot k^t + a_{j-i}

where a_{j-i} is now a scalar. These scalars are again learned parameters, shared between different layers but unique for each head. Instead of clipping, buckets are formed, each corresponding to a range of values of j – i, so that the number of parameters that need to be learned is simply the number of buckets times the number of heads. It is instructive to take a look at the T5 implementation in the Hugging Face Transformers library to see how this works in practice – the notebook for this post contains a simplified and commented version of this code.
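The following sketch captures the core idea with a deliberately simplified bucketing scheme, namely plain clipping of the relative position, whereas the actual T5 code uses logarithmic buckets for larger distances. Again, all names and dimensions are made up.

import torch

L, D_head, n_heads, n_buckets = 6, 8, 2, 5    # example values

q = torch.randn(n_heads, L, D_head)           # queries, one matrix per head
k = torch.randn(n_heads, L, D_head)           # keys, one matrix per head
#
# One learned scalar per bucket and head
#
a = torch.nn.Parameter(torch.randn(n_buckets, n_heads))
#
# Simplified bucketing: clip the relative position j - i and shift it
# into the range 0, ..., n_buckets - 1
#
rel = torch.arange(L).unsqueeze(0) - torch.arange(L).unsqueeze(1)
bucket = rel.clamp(-(n_buckets // 2), n_buckets // 2) + n_buckets // 2
#
# scores[h, i, j] = q_i . k_j + a_{bucket(j - i), h}
#
scores = torch.einsum("hid,hjd->hij", q, k) + a[bucket].permute(2, 0, 1)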

Before moving on, let us quickly discuss another type of relative positional embeddings known as rotary embeddings, originally proposed in [7]. In contrast to the approaches that we have seen so far, this method does not modify the self-attention score by adding a vector or scalar, but by rotating queries and keys before forming the dot product. Specifically, the authors consider a set of orthogonal matrices

R_{\Theta, i}

depending on a fixed parameter Θ and the position i and replace the usual attention dot product between query and key by

q_i R_{\Theta, i} (k_j R_{\Theta, j})^t = q_i R_{\Theta, i} R_{\Theta, j}^t k_j^t

At first glance, this does not seem to be a relative embedding, as it appears to depend on both positions – the position i of the query and the position j of the key. However, the authors choose a set of matrices which, mathematically speaking, form something resembling a one-parameter subgroup, i.e. which have the property that

R_{\Theta, i} R_{\Theta, j}^t = R_{\Theta, i - j}

so that the attention score is actually

q_i R_{\Theta, i - j}  k_j^t

which does in fact only depend on the difference i – j, i.e. on the relative position of query and key. Rotary embeddings have also been applied in the GPT-NeoX model by EleutherAI.
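To make this a bit more tangible, here is a simplified sketch that rotates each pair of components (2i, 2i+1) of the queries and keys before the dot product is formed. This is meant as an illustration of the idea, not as the optimized implementation found in real models.

import torch

def rotate(x, C=1.0 / 10000):
    #
    # x has shape (L, D) with D even - rotate each pair of components
    # (2i, 2i+1) of the vector at position p by the angle C**(2i/D) * p
    #
    L, D = x.shape
    p = torch.arange(L, dtype=torch.float32).unsqueeze(dim=1)
    theta = torch.pow(C, torch.arange(0, D, step=2) / D) * p    # shape (L, D/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    out = torch.zeros_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

L, D = 10, 16                              # example values
q, k = torch.randn(L, D), torch.randn(L, D)
#
# Attention scores based on the rotated queries and keys - entry (i, j)
# depends on the positions only via the difference i - j
#
scores = rotate(q) @ rotate(k).T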

We now have a broad selection of different embedding methods at our disposal and could go ahead and apply them to train a transformer-based decoder-only model. However, there is still one challenge that we have not touched upon – the word encodings. So far, our word encodings were straightforward – we either turned each individual character into a token or used word-based encodings where the vocabulary consists of all words found in the training data. Unfortunately, both approaches have certain drawbacks. Character-based encodings imply that the model also has to learn how to assemble words from characters, and they make learning long-range relations more difficult. Word-based encodings suffer from the problem that the number of unique words in the training data can become huge, which blows up the model size.

For that reason, many models use a third approach known as subword tokenization, where tokens are built from parts of words instead of individual characters or entire words. In the next post, we will therefore study the tokenization mechanism used by the original transformer – byte-pair encoding (BPE).

References:

[1] A. Vaswani et al., Attention is all you need
[2] P. Shaw et al., Self-Attention with Relative Position Representations
[3] P. Dufter, M. Schmitt, H. Schütze, Position Information in Transformers: An Overview, Computational Linguistics 2022, 48 (3), 733–763
[4] A. Radford et al., Improving Language Understanding by Generative Pre-Training
[5] C. Huang et al., Music Transformer: Generating Music with Long-Term Structure
[6] C. Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[7] J. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding
