Mastering large language models – Part IX: self-attention with PyTorch

In the previous post, we have discussed how attention can be applied to avoid bottlenecks in encoder-decoder architectures. In transformer-based models, attention appears in different flavours, the most important being what is called self-attention – the topic of todays post. Code included.

Before getting into coding, let us first describe the attention mechanism presented in the previous blog in a bit more general setting and, along the way, introduce some terminology. Suppose that a neural network has access to a set of vectors vi that we call the values. The network is aiming to process the information contained in the vi in a certain context and, for that purpose, needs to condense the information specific to that context into a single vector. Suppose further that each vector vi is associated with a vector ki called the key, that somehow captures which information the vector vi contains. Finally assume that we are able to form a query vector q that specificies what information the network needs. The attention mechanism will then assemble a linear combination

\textrm{attn}(q, \{k_i\}, \{v_i\}) = \sum_i \alpha(q, k_i) v_i

called the attention vector.

To make this more tangible, let us look at an example. Suppose an encoder processes a sequence of words, represented by vectors xi. While encoding a part of the sentence, i.e. a specific vector, the network might want to pull in some data from other parts of the sentence. Here, the query would be the currently processed word, while keys and values would come from other parts of the same sentence. As keys, queries and values all come from the same sequence, this form of attention is called self-attention. Suppose, for example, that we are looking at the sentence (once again taken from “War and peace”)

The prince answered nothing, but she looked at him significantly, awaiting a reply

When the model is encoding this sentence, it might help to pull in the representation of the word “prince” when encoding “him”, as here, “him” refers to “the prince”. So while processing the token for “him”, the network might put together an attention vector focusing mostly on the token for “him” itself, but also on the token for “prince”.

In this example, we would compute keys and values from the input sequence, and the query would be computed from the word we are currently encoding, i.e. “him”. A weight is then calculated for each combination of key and query, and these weights are used to form the attention vector, as indicated in the diagram above (we will see in a minute how exactly this works).

Of course, this only helps if the weights that we use in that linear combination are somehow helping the network to focus on those values that are most relevant for the given key. At the same time, we want the weights to be non-negative and to sum up to one (see this paper for a more in-depth discussion of these properties of the attention weights which are a bit less obvious). The usual approach to realize this is to first define a scoring function

\textrm{score}(q, k_i)

and then obtain the attention weights as a softmax function applied to these scores, i.e. as

\alpha(q, k_i) = \frac{\exp(\textrm{score}(q, k_i))}{\sum_j \exp(\textrm{score}(q, k_j))}

For the scoring function itself, there are several options (see Effective Approaches to Attention-based Neural Machine Translation by Luong et al. for a comparison of some of them). Transformers typically use a form of attention called scaled dot product attention in which the scores are computed as

\textrm{score}(q, k) = \frac{q k^t}{\sqrt{d}}

i.e. as the dot product of query and key divided by the square root of the dimension d of the space in which queries and keys live.

We have not yet explained how keys, values and queries are actually assembled. To do this and to get ready to show some code, let us again focus on self attention, i.e. queries, keys and values all come from the same sequence of vectors xi which, for simplicity, we combine into a single matrix X with shape (L, D), where L is the length of the input and D is the dimensionality of the model. Now the queries, keys and values are simply derived from the input X by applying a linear transformation, i.e. by a matrix multiplication, using a learnable set of matrices weights WQ (for the queries), WV (for the values) and WK (for the keys).

Q = X W^Q
K = X W^K
V = X W^V

Note that the individual value, key and query vectors are now the rows of the matrices V, K and Q. We can therefore conveniently calculate all the scaled dot products in one step by simply doing the matrix multiplication

\frac{Q \cdot K^T}{\sqrt{d}}

This gives us a matrix of dimensions (L, L) containing the scores. To calculate the attention vectors, we now still have to apply the softmax to this and multiply by the matrix V. So our final formula for the attention vectors (again obtained as rows of a matrix of shape (L, D)) is

\textrm{attn}(Q, K, V) = \textrm{softmax}\left(  \frac{Q \cdot K^T}{\sqrt{d}} \right) \cdot V

In PyTorch, this is actually rather easy to implement. Here is a piece of code that initializes weight matrices for keys, values and queries randomly and defines a forward function for a self-attention layer.

wq = torch.nn.Parameter(torch.randn(D, D))
wk = torch.nn.Parameter(torch.randn(D, D))
wv = torch.nn.Parameter(torch.randn(D, D))
# Receive input of shape L x D
def forward(X):
    Q = torch.matmul(X, wq)
    K = torch.matmul(X, wk)
    V = torch.matmul(X, wv)
    out = torch.matmul(Q, K.t()) / math.sqrt(float(D))
    out = torch.softmax(out, dim = 1)
    out = torch.matmul(out, V)
    return out

However, this is not yet quite the form of attention that is typically used in transformers – there is still a bit more to it. In fact, what we have seen so far is what is usually called an attention head – a single combination of keys, values and queries used to produce an attention vector. In real-world transformers, one typically uses several attention heads to allow the model to look for different patterns in the input. Going back to our example, there could be one attention head that learns how to model relations between a pronoun and the noun to which it refers. Other heads might focus on different syntactic or semantic aspects, like linking a verb to an object or a subject, or an adjective to the noun described by it. Let us now look at this multi-head attention.

The calculation of a multihead attention vector proceeds in three steps. First, we go through each of the heads which each has its own set of weight matrices (WQi, WKi, WVi) as before, and apply the ordinary attention mechanism that we have just seen to obtain a vector headi as attention vector for this head. Note that the dimension of this vector is typically not the model dimension D, but a head dimension dhead, so that the weight matrices now have shape (D, dhead). It is common, though not absolutely necessary, to choose the head dimension as the model dimension divided by the number of heads.

Next, we concatenate the output vectors of each head to form a vector of dimension nheads * dhead, where nheads is of course the number of heads. Finally, we now apply a linear transformation to this vector to obtain a vector of dimension D.

Here is a piece of Python code that implements multi-head attention as a PyTorch module. Note that here, we actually allow for two different head dimensions – the dimension of the value vector and the dimension of the key and query vectors.

class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(self, D, kdim = None, vdim = None, heads = 1):
        self._D = D
        self._heads = heads
        self._kdim = kdim if kdim is not None else D // heads
        self._vdim = vdim if vdim is not None else D // heads
        for h in range(self._heads):
            wq_name = f"_wq_h{h}"
            wk_name = f"_wk_h{h}"
            wv_name = f"_wv_h{h}"
            wq = torch.randn(self._D, self._kdim)
            wk = torch.randn(self._D, self._kdim)
            wv = torch.randn(self._D, self._vdim)
            setattr(self, wq_name, torch.nn.Parameter(wq))
            setattr(self, wk_name, torch.nn.Parameter(wk))
            setattr(self, wv_name, torch.nn.Parameter(wv))
        wo = torch.randn(self._heads*self._vdim, self._D)
        self._wo = torch.nn.Parameter(wo)
    def forward(self, X):
        for h in range(self._heads):
            wq_name = f"_wq_h{h}"
            wk_name = f"_wk_h{h}"
            wv_name = f"_wv_h{h}"
            Q = X@getattr(self, wq_name)
            K = X@getattr(self, wk_name)
            V = X@getattr(self, wv_name)
            head = Q@K.t() / math.sqrt(float(self._kdim))
            head = torch.softmax(head, dim = -1)
            head = head@V
            if 0 == h:
                out = head
                out =[out, head], dim = 1)
        return out@self._wo       

We will see in a minute that this implementation works, but the actual implementation in PyTorch is a bit different. To see why, let us count parameters. For each head, we have three weight matrices of dimensionality (D, dhead), giving us in total

3 x D x dhead x nheads

parameters. If the head dimension times the number of heads is equal to the model dimension, this implies that the number of parameters is in fact 3 x D x D and therefore the same as if we head a single attention head with dimension D. Thus, we can organize all weights into a single matrix of dimension (D, D), and this is what PyTorch does, which makes the calculation much more efficient (and is also better prepared to process batched input, which our simplified code is not able to do).

It is instructive to take a look at the implementation of multi-head attention in PyTorch to see what happens under the hood. Essentially, PyTorch reshuffles the weights a bit to treat the head as an additional batch dimension. If you want to learn more, you might want to take a look at this notebook in which I go through the code and also demonstrate how we can recover the weights of the individual heads from the parameters in a PyTorch attention layer.

If you actually do this, you might stumble upon an additional feature that we have ignored so far – the attention mask. Essentially, the attention mask allows you to forbid the model to look at certain parts of the input when processing other parts of the input. To see why this is needed, let us assume that we want to use attention in the input of a decoder. When we train the decoder with the usual teacher forcing method, we provide the full sentence in the target language to the model. However, we of course need to prevent the model from simply peaking ahead by looking at the next word, which is the target label for the currently processed word, otherwise training is trivially successful but the network has not learned anything useful.

In an RNN, this is prevented by the architecture of the network, as in each time step, we only use the hidden state assembled from the previous time steps, plus the current input, so that network does not even have access to future parts of the sentence. In an attention-based model, however, the entire input is usually processed in parallel, so that the model could actually look at later words in the sentence. To solve this, we need to mask all words starting at position i + 1 when building the attention vector for word i, so that the model can only attend to words at positions less than or equal to i.

Technically, this is done by providing an additional matrix of dimension (L, L) as input to the self attention mechanism, called the attention mask. Let us denote this matrix by M. When the attention weights are computed, PyTorch does then not simply take the matrix product

Q \cdot K^T

but does in fact add the matrix M before applying the softmax, i.e. the softmax is applied to (a scaled version of)

Q \cdot K^T + M

To prevent the model from attending to future words, we can now use a matrix M which is zero on the diagonal and below, but minus infinity above the diagonal, i.e. for the case L = 4:

M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0  \end{pmatrix}

As minus infinity plus any other floating point number is again minus infinity and the exponential in the softmax turns minus infinity into zero, this implies that the final weight matrix after applying the softmax is zero above the diagonal. This is exactly what we want, as it implies that the attention weights have the property

\alpha_{ij} = 0 \, \textrm{for } i < j

In other words, the attention vector for a word at position i is only assembled from this word itself and those to the left of it in the sentence. This type of attention mask is often called a causal attention mask and, in PyTorch, can easily be generated with the following code.

mask = torch.ones(L, L)
mask = torch.tril(mask*(-1)*float('inf'), diagonal = -1)
mask = mask.t()

This closes our post for today. We now understand attention, which is one of the major ingredients to a transformer model. In the next post, we will look at transformer blocks and explain how they are combined into encoder-decoder architectures.

Mastering large language models – Part VIII: encoder-decoder architectures and attention

The examples for LSTMs and RNNs that we have studied so far have one feature in common – the input and the output have the same length. We have seen that this is a natural choice for tasks like creating sentences based on a given prompt or attaching labels to each word in a sentence. There are, however, tasks for which this simple architecture cannot be used as we need to create a sentence based on an input of different length. To address this class of problems, encoder-decoder architectures have been developed which we will discuss today.

A good example to illustrate the challenge is machine translation. Here, we are given a source sequence, i.e. a sequence of token (x1, …, xS), maybe already vectorized, as input. The objective is to create a target sequence (y1, …, yT) as output representing the translation of the source sentence into a target language. Typically, the length T of the target sequence differs from the length S of the source sequence, which a single LSTM cannot easily cover.

Put differently, the objective is now no longer to model conditional probabilities for the completion of a sentence, i.e. probabilities of the form

P(w | w_1, \dots, w_n)

but conditional probabilities for a target sequence given the source sequence, i.e. the probability distribution

P(y_1, \dots, y_T | x_1, \dots, x_S)

The idea behind encoder-decoder architectures is to split this task into two subtasks, each of which is taken over by a dedicated network. First, there is a network called the encoder, that translates the source sequence into a single vector (or a concatentation of multiple vectors) called the context. The idea is that this vector represents, in a form independent of the length of the source sequence, an internal representation of the source sequence that somehow captures its meaning. Then a second network, the decoder, takes over. This network has access to the context and, given the context, creates the target sequence, as indicated in the diagram below..

In this general form, the idea is not bound to a specific type of network. Often, encoder and decoder are both LSTMs, but we will see in a later post that also transformer networks can be used in this way.

Let us now be a bit more specific on how we can bring this idea to live with LSTMs serving as decoders and encoders and specifically how these models are trained. For the encoder, this is easy – as we want the model to compress the meaning of the source sentence in the context, we of course need to feed the source sentence into the encoder. For the decoder, it is common to again apply teacher forcing. However, to create the first word of the target sentence, we need a starting point for the model. Therefore, the encoded sentences typically contain a “beginning of sentence” token, and it is also common practice to conclude the source sentence with a “end of sentence token”.

During inference, the model receives the source sequence as input and the first item of the target sequence, i.e. the beginning-of-sentence token. We can then apply any of the sampling methods discussed in an earlier post to successively generate the next word, until the model emits an end-of-sentence marker and the translation is complete.

Unfortunately, there is a fundamental problem with this architecture. If you look at the diagram, you will see that the context is the only connection between the encoder and the decoder. Consequently, the network has to compress all information necessary for the generation of the target sequence into the context, which is typically a vector of fixed size. Especially for longer source sentences, this quickly creates a bottleneck.

In 2015, Bahdanau, Cho and Bengio proposed a mechanism called attention to solve this issue. In their paper, they describe an approach which, instead of using a fixed-length context vector shared by all time steps of the decoder, uses a time-dependent context vector.

Specifically, each time step of the decoder receives three different inputs: the decoder input xt (which, as in the diagram above, is the previous word of the target sentence), the previous hidden state ht-1 and, in addition, a step-specific context ct. This context is assembled from the outputs os of the encoder network (these are in fact not exactly the hidden states, as the architecture proposed here uses a so-called bidirectional RNN as encoder, but let us ignore this for the time being). More precisely, each context vector ct is a weighted linear combination

c_t = \sum_s \alpha_{ts} o_s

of all outputs of the encoder network at all time steps. Thus, each decoder step has access to the full output of the encoder. However, and this is the crucial point, the way how the weights are calculated is governed by learned parameters which the network can adapt during training. This allows the decoder to focus on specific parts of the input sequence, depending on the current time step, i.e. to put more attention on specific time steps, lending the mechanism its name. Or, as the original paper puts it nicely in section 3.1:

The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed length vector.

To illustrate this, the authors include some examples for the weights used by the network. Let us look at the first of them, which illustrates the weights while translating the english sentence “The agreement on the European Economic Area was signed in 1992.”

We see that, as envisioned, the network learns to focus on the word that is currently being translated, even though the order of words in target and source sentence are different.

Attention turned out to be a very powerful idea. In fact, attention is so useful that it gave rise to a new class of language models which works without the need to have a hidden state which is updated sequentially over time. This new generation of networks is called transformers, and we will start to look at transformers and at attention scores in more detail in the next post.


[1] Sequence to Sequence learning with Neural Networks, Sutskever et al.
[2] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al.
[3] Speech and Language Processing by Jurafsky and Martin, specifically chapter 9
[4] Neural Machine Translation and Sequence-to-sequence Models by Neubig
[5] Neural Machine Translation by Jointly Learning to Align and Translate, D. BahdanauK.ChoY. Bengio

Mastering large language models – Part VII: War and Peace

This blog is all about code – we will implement and train an LSTM on Tolstoys War and Peace and use it to generate new text snippets that (well, at least very remotely) resemble pieces from the original novel. All the code can be found here in my GitHub repository for this series.

First, let us talk about the data we are going to use. Fortunately, the original novel is freely available in a text-only format at Project Gutenberg. The text file in UTF-8 encoding contains roughly 3.3 million characters.

After downloading the book, we need to preprocess the data. The respective code is straightforward – we tokenize the text, build a vocabulary from the list of token and save the vocabulary on disk. We then encode the entire text and save 90% of the data in a training data set and 10% of the data in a validation data set.

The tokenizer requires some thoughts. We cannot simply use the standard english tokenizer that comes with PyTorch, as we want to split the text down to the level of the individual characters. In addition, we want to compress a sequence of multiple spaces into one, remove some special characters and replace line breaks by spaces. Finally, we convert the text into a list of characters. Here is the code for a simple tokenizer that does all this.

def tokenize(text):
    text = re.sub(r"[^A-Za-z0-9 \-\.;,\n\?!]", '', text)
    text = re.sub("\n+", " ", text)
    text = re.sub(" +", " ", text)
    return [_t for _t in text]

The preprocessing step creates three files that we will need later – train.json which contains the encoded training data in JSON format, val.json which contains the part of the text held back for validation and which is the vocabulary.

The next step is to assemble a dataset that loads the encoded book (either the training part or the validation part) from disk and returns the samples required for training. Teacher forcing is implemented in the __getitem__ method of the dataset by setting the targets to be the input shifted by one character to the right.

def __getitem__(self, i):
    if (i < self._len):
        x = self._encoded_book[i: i + self._window_size]
        y = self._encoded_book[i + 1: i + self._window_size + 1]
        return torch.tensor(x), torch.tensor(y)
        raise KeyError

For testing, it is useful to have an implementation of the dataset that uses only a small part of the data. This is implemented by an additional parameter limit that, if set, restricts the length of the dataset artificially to a given number of token.

Next let us take a look at our model. Nothing here should come as a surprise. We first use an embedding layer to convert the input (a sequence of indices) into a tensor. The actual network consists of an LSTM with four layers and an output layer which converts the data back from the hidden dimension to the vocabulary, as we have used it before. Here is the forward method of the model which again optionally accepts a previously computed hidden layer – the full model can be found here.

def forward(self, x, hidden = None):
    x = self._embedding(x) # shape is now (L, B, E)
    x, hidden = self._lstm(x, hidden) # shape (L, B, H)
    x = self._out(x)
    return x, hidden

Finally, we need a training loop. Training proceeds as in the examples we have seen before. After each epoch, we perform a validation run, record the validation loss and save a model checkpoint. Note, however, that training will take some time, even on a GPU – be prepared for up to an hour on an older GPU. In case you want to skip training, I have included a pretrained model checkpoint in the GitHub repository, so if you want to run the training yourself, remove this first. Assuming that you have set up your environment as described in the README, training therefore proceeds as follows.

cd warandpeace
# Remove existing model unless you want to pick up from previous model
rm -f

Note that the training will look for an existing model on disk, so you can continue to train using a previously saved checkpoint. If you start the training from scratch and use the default parameters, you will have a validation loss of roughly 1.27 and a training loss of roughly 1.20 at the end of the last epoch. The training losses are saved into a file called losses.dat that you can display using GNUPlot if that is installed on your machine.

In case you want to reproduce the training but do not have access to a GPU, you can use this notebook in which I have put together the entire code for this post and run it in Google Colab, using one of the freely available GPUs. Training time will of course depend on the GPU that Google will assign to your runtime, for me training took roughly 30 minutes.

After training has completed (or in case you want to dive right away into inference using the pretrained model), we can now start to sample from the model.


The script has some parameters that you can use to select a sampling method (greedy search, temperature sampling, top-k sampling or top-p sampling), the parameters (temperature, k, p), the length of the sample and the prompt. Invoke the script with

python3 --help

to get a summary of the available parameters. Here are a few examples created with the standard settings and a length of 500 characters, using the model checked into the repository. Note that sampling is actually fast, even on a CPU, so you can try this also in case you are not working on a CUDA-enabled machine.

She went away and down the state of a sense of self-sacrifice. The officers who rode away and that the soldiers had been the front of the position of the room, and a smile he had been free to see the same time, as if they were struck with him. The ladies galloped over the shoulders and in a bare house. I should be passed about so that the old count struck him before. Having remembered that it was so down in his bridge and in the enemys expression of the footman and promised to reach the old…

Thats a wish to be a continued to be so since the princess in the middle of such thoughts of which it was all would be better. Seeing the beloved the contrary, and do not about that a soul of the water. The princess wished to start for his mother. He was the princess and the old colonel, but after she had been sent for him when they are not being felt at the ground of her face and little brilliant strength. The village of Napoleon was stronger difficult as he came to him and asked the account

Of course this is not even close to Tolstoy, but our model has learned quite a bit. It has learned to assemble characters into words, in fact all the words in the samples are valid english words. It has also learned that at the start of a sentence, words start with a capital letter. We even see some patterns that resemble grammar, like “Having remembered that” or “He was the princess”, which of course does not make too much sense, but is a combination of a pronoun, a verb and an object. Obviously our model is small and our training data set only consists of a few million token, but we start to see that this approach might take us somewhere. I encourage you to play with additional training runs or changed parameters like the number of layers or the model dimension to see how this affects the quality of the output.

In the next post, we will continue our journey through the history of language models and look at encoder-decoder architectures and the rise of attention mechanisms.

Mastering large language models – Part V: LSTM networks

In the last post, we have seen how we can implement and train an RNN on a very simple task – learning how to count. In the example, I have chosen a sequence length of L = 6. It is tempting to play around with this parameter to see what happens if we increase the sequence length.

In this notebook, I implemented a very simple measurement. I did create and train RNNs on a slightly different task – remembering the first element of a sequence. We can use the exact same code as in the last post and only change the line in the dataset which prepares the target to

 targets = torch.tensor(self._L*[index], dtype=torch.long)

I also changed the initialization approach compared to our previous post by using a random uniform distribution, similar to what PyTorch is doing. I then did training runs for different values of the sequence length, ranging from 4 to 20, and measured the accuracy on the training set after each training run.

The blue curve in the diagram below displays the result, showing the sequence length on the x-axis and the accuracy on the y-axis (we will get to the meaning of the upper curve in a second).

Clearly, the efficiency of the network decreases with larger sequence length, starting at a sequence length of approximately 10, and accuracy drops to less than 60%. I also repeated this with our custom build RNN replaced by PyTorchs RNN to rule out issues with my code, and got similar results.

This seems to indicate that there is a problem with higher sequence lengths in RNNs, and in fact there is one (you might want to take a look at this paper for a more thorough study, which, however, arrives at a similar result – RNNs have problems to learn dependencies in time series with a range of more than 10 or 12 time steps). Given that our task is to simply remember the first item, it appears that an RNNs memory is more of a short-term memory and less of a long term memory.

The upper, orange curve looks much more stable and maintains an accuracy of 100% even for long sequences, and in fact this curve has been generated using a more advanced network architecture called LSTM, which is the topic of todays post.

Before we explain LSTMs, let us try to develop an intuition what the problem is they are trying to solve. For that purpose, it helps to look at how backpropagation actually works for RNNs. Recall that in order to perform automatic differentiation, PyTorch (and similar frameworks) build a computational graph, starting at the inputs and weights and ending at the loss function. Now look at the forward function of our network again.

for t in range(L):
  h = torch.tanh(x[t] @ w_ih.t() + b_ih + h @ w_hh.t() + b_hh)

Notice the loop – in every iteration of the loop, we continue our calculation, using the values of the previous loop (the hidden layer!) as one of the inputs. Thus we do not create L independent computational graphs, but in fact one big computational graph. The depth of this graph grows with larger L, and this is where the problem comes from.

Roughly speaking, during backpropagation, we need to go back the entire graph, so that we move “backwards in time” as well, going back through all elements of the graph added during each loop iteration (this is why this procedure is sometimes called backpropagation in time). So the effective length of the computational graph corresponds to that of a deep neural network with L layers. And we therefore face one of the problems that these networks have – vanishing gradients.

In fact, when we run the backpropagation algorithm, we apply the chain rule at least once for every time step. The chain rule involves a multiplication by the derivative of the activation function. As these derivatives tend to be smaller than one, this can, in the worst case, imply that the gradient gets smaller and smaller with each time step, so that eventually the error signal vanishes and the network stops learning. This is the famous vanishing gradient problem (of course this is by no means a proof that this problem really occurs in our case, however, this is at least likely, as it disappears after switching to an LSTM, so let us just assume that this is really the case….).

So what is an LSTM? The key difference between an LSTM and an ordinary RNN is that in addition to the hidden state, there is a second element that allows the model to remember information across time steps called the memory cell c. At time t, the model has access to the hidden state from the previous time step and, in addition, to the memory cell content from the previous time step, and of course there is an input x[t] for the current time step, like a new word in our sequence. The model then produces a new hidden state, but also a new value for the memory cell at time t. So the new value of the hidden state and the new value of the cell are a function of the input and the previous values.

(h_t, c_t) = F(x_t, h_{t-1}, c_{t-1})

Here is a diagram that displays a single processing step of an LSTM – we fill in the details in the grey box in a minute.

So far this is not so different from an ordinary RNN, as the memory cell is treated similarly to a hidden layer. The key difference is that a memory cell is gated, which simply means that the cell is not directly connected to other parts of the network, but via learnable matrices which are called gates. These gates determine what content present in the cell is made available to the current time step (and thus subject to the flow of gradients during backward processing) and which information is kept back and used either not at all or during the next time step. Gates also control how new information, i.e. parts of the current input, are fed into the cells. In this sense, the memory cell acts a as more flexible memory, and the network can learn rules to erase data in the cell, add data to the cell or use data present in the cell.

Let us make this a bit more tangible by explaining the first processing step in an LSTM time step – forgetting information present in the cell. To realize this, an LSTM contains an additional set of weights and biases that allow it to determine what information to forget based on the current input and the previous value of the hidden layer, called (not surprisingly) the input-forget weight matrix Wif, the input-forget bias bif, and similarly for Whf and bhf. These matrices are then combined with inputs and hidden values and undergo a sigmoid activation to form a tensor known as the forget gate.

f_t = \sigma(W_{if} x_t  + b_{if} + W_{hf} h_{t-1} + b_{hf})

In the next step, this gate is multiplied element-wise with the value of the memory cell from the previous time step. Thus, if a dimension in the gate f is close to 1, this value will be taken over. If, however, a dimension is close to zero, the information in the memory cell will be set to zero, will therefore not be available in the upcoming calculation and in the output and will in that sense be forgotten once the time step completes. Thus the tensor f controls which information the memory cell is supposed to forget, explaining its name. Expressed as formula, we build

f_t \odot c_{t-1}

Let us add this information to our diagram, using the LaTex notation (a circle with a dot inside) to indicate the component-wise product of tensors, sometimes called the Hadamard-Product.

The next step of the calculation is to determine what is sometimes called the candidate cell, i.e. we want to select data derived from the input and the previous hidden layer that is added to the memory cell and thus remembered. Even though I find this a bit confusing I will instead follow the convention employed by PyTorch and other sources to use the letter g to denote this vector and call it the cell gate (the reason I find this confusing is that it is not exactly a gate, but, as we will see in a minute, input that is passed through the gate into the cell). This vector is computed similarly to the value of the hidden layer in an ordinary RNN, using again learnable weight matrices and bias vectors.

g_t = \tanh(W_{ig} x_t  + b_{ig} +  W_{hg} h_{t-1} + b_{hg})

However, this information is not directly taken over as next value of the memory cell. Instead, a second gate, the so-called input gate is applied. Then the result of this operation is added to the output of the forget gate times the previous cell value, so that the new cell value is now the combination of a part that survived the forget-gate, i.e. the part which the model remembers, and the part which was allowed into the memory cell as new memorizable content by the input gate. As a formula, this reads as

i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
c_t = f_t \odot c_{t-1} + i_t \odot g_t

Let us also update our diagram to reflect how the new value of the memory cell is computed.

We still have to determine the new value of the hidden layer. Again, there is a gate involved, this time the so-called output gate. This gate is applied to the value of the tanh activation function on the new memory cell value to form the output value, i.e.

o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
h_t = o_t \odot \tanh(c_t)

Adding this to our diagram now yields a complete picture of what happens in an LSTM processing step – note that, to make the terminology even more confusing, sometimes this entire box is also called an LSTM cell.

This sounds complicated, but actually an implementation in PyTorch is more straighforward than you might think. Here is a simple forward method for an LSTM. Note that in order to make the implementation more efficient, we combine the four matrices Wif, Wig, Wio and Wii into one matrix by concatenating them along the first dimension, so that we only have to multiply this large matrix by the input once, and similarly for the hidden layer. The convention that we use here is the same that PyTorch uses – the input gate weights come first, followed by the forget gate, than the cell gate and finally the output gate.,

def forward(x):    
    L = x.shape[0]
    hidden = torch.zeros(H)
    cells = torch.zeros(H)
    _hidden = []
    _cells = []
    for i in range(L):
        _x = x[i]
        # multiply w_ih and w_hh by x and h and add biases
        A = w_ih @ _x 
        A = A + w_hh @ hidden 
        A = A + b_ih + b_hh
        # The value of the forget gate is the second set of H rows of the result
        # with the sigmoid function applied
        ft = torch.sigmoid(A[H:2*H])
        # Similary for the other gates
        it = torch.sigmoid(A[0:H])
        gt = torch.tanh(A[2*H:3*H])
        ot = torch.sigmoid(A[3*H:4*H])
        # New value of cell
        # apply forget gate and add input gate times candidate cell
        cells = ft * cells + it * gt
        # new value of hidden layer is output gate times cell value
        hidden = ot * torch.tanh(cells)
        # append to lists
    return torch.stack(_hidden)

In this notebook, I have implemented this (with a few tweaks that allow the processing of previous values of hidden layer and cell, as we have done it for RNNs) and verified that the implementation is correct by comparing the outputs with the outputs of the torch.nn.LSTM class which is the LSTM implementation that comes with PyTorch.

The good news it that even though LSTMs are more complicated than ordinary RNNs, the way they are used and trained is very much the same. You can see this nicely in the notebook that I have used to create the measurements presented above – using an LSTM instead of an RNN is simply a matter of changing one call, all the other parts of the data preparation, training loop and inference remain the same. This measurement also shows that LSTMs do in fact achieve a much better performance when it comes to detecting long-range dependencies in the data.

LSTMs were first proposed in this paper by Hochreiter and Schmidhuber in 1997. If you consult this paper, you will find that this original version did in fact not contain the forget gate, which was added by Gers, Schmidhuber and Cummins three years later. Over time, other versions of LSTM networks have been developed, like the GRU cell.

Also note that LSTM layers are often stacked on top of each other. In this architecture, the hidden state of an LSTM layer is not passed on directly to an output layer, but to a second LSTM layer, playing the role of the input from this layers point of view.

I hope that this post has given you an intuition for how LSTMs work. Even if our ambition is mainly to understand transformers, LSTMs are still relevant, as the architectures that we will meet when discussing transformers like encoder-decoder model and attention layers, and training methods like teacher forcing have been developed based on RNNs and LSTMs .

There is one last ingredient that we will need to start working on our first project – sampling, i.e. using the output of a neural network to create the next token or word in a sequence. So far, our approach to this has been a bit naive, as we just picked the token with the highest probability. In the next post, we will discuss more advanced methods to do this.

Mastering large language models – Part IV: learning how to count

In the previous post, we have discussed recurrent neural networks in the context of language processing, but in fact, they can be used to learn any type of data structured as a time series. To make sure that we really understand how this works before proceeding to more complex models, we will spent some time today and teach a simple RNN on a very specific sequence – we will teach it how to count.

As usual, this post comes with a notebook that you can either run locally (please follow the instructions in the README to set up a local environment, no GPU needed for today) or in Google CoLab using this link. So I will not go through all the details in the code, but focus on the less obvious parts of it.

First, let us discuss our dataset. Today we will in fact not ta ckle language related tasks, but simply train a model to predict the next element in a sequence. These elements are numbers between 0 and 127, and the sequences that make up our training data set are simply all sequences of six consecutive numbers in this range, like [3,4,5,6,7,8] or [56,57,58,59,60,61]. The task that the model will learn is to predict the next element in a given sequence. If, for example, we present it the sequence [3,4,5], we expect it to predict that the next element is 6.

So our dataset is really simple, the only actual work that we have to do is to convert our data items into tensors. Our input data will be one-hot encoded, so that a sequence of length L has shape (L,V) where V = 128. Our targets will just be labels, so the targets for a sequence of length L will be L labels, i.e. a tensor of shape L. Here is the code to generate an item in the dataset.

# Input at index is the sequence of length L 
# starting at index
inputs = torch.arange(index, index + self._L, dtype = torch.long)
targets = torch.arange(index + 1, index + self._L + 1, dtype = torch.long)    
# Convert inputs to one-hot encoding
inputs = torch.nn.functional.one_hot(inputs, num_classes = self._V)
inputs =

Next, let us discuss our model. The heart of the model will of course be an RNN. The input dimension will be V, as we plan to present the input as one-hot encoded vectors. We have already seen the forward function of the RNN in the last blog post, and it is not difficult to put this into a class that is a torch.nn.Module. Keep in mind, however, that the weights need to be wrapped into instances of torch.nn.Parameter so that they are detected by the optimizer during learning.

The output of the RNN will be the output of the hidden layer and will be of shape (L,D), where L is the length of the input sequence and D is the inner dimension of the model. To predict the next elements of the sequence from this, we add a linear layer that maps this back into a tensor of shape (L, V). We then take the last element of the output, which is a tensor of shape V, and apply a softmax to get a probability distribution. To make a prediction, we could now either sample according to this multinomial distribution, or just take the element with the highest probability weight – we will discuss more advanced sampling methods in a later post.

So here is the code for our model – note that we again allow a previous hidden layer value to be used as optional input.

class MyModel(torch.nn.Module):

  def __init__(self, d_in, d_hidden):
    self._d_hidden = d_hidden
    self._d_in = d_in
    self._rnn = RNN(d_in = d_in, d_hidden = d_hidden)
    self._linear = torch.nn.Linear(in_features = d_hidden, out_features = d_in)

  def forward(self, x, h = None):
    rnn_out, hidden = self._rnn(x, h)
    out = self._linear(rnn_out)
    return out, hidden  

We can now train our model by putting the logic for the generation of samples above into a Torch dataset, firing up a data loader, instantiating a model and going through the usual training procedure. Our data set is sufficiently small to make the model converge quickly (this is of course a massive case of overfitting, but for our purposes this is good enough). I used a hidden dimension of 32 and a batch size corresponding to half of the dataset, so that one epoch involves two gradient updates. Here is a diagram showing the training loss per epoch over time.

Having trained our model, we can now go ahead and make predictions. We have already indicated how this works. To predict the next item of a given sequence, we feed the sequence into the model – note that this sequence can be longer or shorter than those used during training. The output of the model will be a tensor of shape (L, V). We only use the last time step for prediction, apply a softmax to it and pick the element with the highest weight.

# Input is the sequence [7,8,9,10]
input = torch.arange(7, 11, dtype=torch.long)
input = torch.nn.functional.one_hot(input, num_classes = V)
input =
out, hidden = model(
# Output has shape (L, V) 
# Strip off last output and apply softmax
# to obtain a probability distribution p of length V
p = torch.softmax(out[-1], dim = -1)
# Predict
guess = torch.argmax(p).item()

If everything worked, the result will be 11, as expected, so our model learns what it is supposed to learn.

There is an important lesson to learn from this simple example. During inference, the output that we actually use is the output of the last time step, i.e. of the last iteration inside the forward method (this is not the case for all tasks on which RNNs are typically trained, but for many of them). At this point, the model has only access to the last input x[t], so that all information about previous time steps that the model needs to make a prediction have to be part of the hidden layer. In that sense, the hidden layer really serves as a memory and helps the model to remember previously seen input in the same sequence.

Of course our example is a bit of an exception, as the model only needs the last value to make the actual prediction. In the next post, we will challenge the model a bit more and ask it to make a prediction that really requires a memory, namely to predict the first element of a sequence, which the model thus needs to remember until the very last element is processed. We will be able to see nicely that this gets more difficult as the sequence length grows and discuss a special type of RNNs called long-short term memory neural networks (LSTM for short) that have been designed to increase the ability of a network to learn long-range dependencies.

Mastering large language models – Part III: recurrent neural networks

When designing neural networks to handle language, one of the central design decisions you have to make is how you model the flow of time. The words in a sentence do not have a random order, but the order in which they appear is central to their meaning and correct grammar. Ignoring this order completely will most likely not allow a network to properly learn the structure of language.

We will later see that the transformer architecture uses a smart approach to incorporate the order of words, but before these architectures were developed, a different of models was commonly used to deal with this challenge – recurrent neural networks (RNNs). The idea of a recurrent neural network is to feed the network one word at a time, but to allow it to maintain a memory of words it has seen, or, more precisely, of their representations. This memory is called the hidden state of the model.

To explain the basic structure of an RNN, let us see how they are used to solve a very specific, but at the same time very fundamental problem – text generation. To generate text with a language model, you typically start with a short piece of text called the prompt and ask the model to generate a next word that makes sense in combination with the prompt. From a mathematical point of view, a word might be a good candidate for a continuation of the prompt if the probability that this word appears after the words in the prompt is high. Thus to build a model that is able to generate language, we train a model on conditional probabilities and, during text generation, choose the next word based on the probabilities that the model spits out. This is what large language models do at the end of the day – they find the most likely continuation of a started sentence or conversation.

Suppose, for instance, that you are using the prompt


Let us denote this word by w1. We are now asking the model to predict, for every word w in the vocabulary, the conditional probability

P(w | w_1)

that this word appears after the prompt. We then sample a word according to this probability distribution, call it w2 and again calculate all the conditional probabilities

P(w | w_1, w_2)

to find the next word in the sentence. We call this w3 and so forth, until we have found a sentence of the desired length.

Let us now think about a neural network that we could use for this purpose. One approach could be an architecture similar to the one that we used for word2vec. We start with an embedding, followed by a hidden layer of a certain size (the internal dimension of the model), and add an output layer to convert back into V dimensions, were, as before, V is the size of our vocabulary.

This model looks like a good fit for the first step of the word generation. If our prompt consists of one word w1, and we need to calculate the conditional probabilities P(w | w1), we could do this by feeding w1 into the model and train it to predict the probabilities P(w | w1) in its softmax layer, very much like a classical classification problem. However, things become more difficult if we have already generated two words and therefore need to calculate a conditional probability P(w | w1, w2) that depends on the two previous words. We could of course simply ignore w1 and only feed w2, which amounts to approximating

P(w | w_1, w_2) \approx P(w | w_1)

This, however, is a very crude approximation (sometimes called a unigram model), in general, the next sentence in a word will not only depend on the previous word, but on all previous words. Just think of the two sentences

Dinosaurs are big animals

Mice are small animals

If we wanted to determine the most likely word after “are”, the first word clearly does make a difference. So we need to find a way to make the previous words somehow available to our model. The approach that an RNN (or, more precisely, an Elman RNN) takes is to use a hidden layer which does not only have one input (the output of the lowest layer) but two inputs. One input is the output of the lowest layer, and the second input is the output of the hidden layer during the previous time step.

Let us go through this one by one. First, we feed word w1 into the model as before. This goes through the input layer, through the hidden layer and finally through the output layer. However, this time, we save the output of the hidden layer and call it h1. In the second step, we feed w2 into the model. Again, this goes through the input layer. But now, we combine the value of the input layer with our previously saved h1 and feed this combined value into the hidden layer to get an output h2. We again save this, but also feed it into the output layer. In the next step, we read word w3, combine the result of the input layer with h2, feed this into the hidden layer and so forth. So at step t, the input the hidden layer is a function of ht-1 (the value of the hidden layer in the previous step) and it (the value of the input layer at time t).

I actually found it difficult to understand this when I first read about RNNs, but sometimes code can help. So let us code this in Python. But before we do this, we need to make it more precise what it means to “combine” the value of the hidden layer from the previous time step with the value of the input layer. Combination here means essentially concatenation. More precisely, we have two weight matrices that determine the value of the hidden layer. The first matrix that we will call Wih (the indices mean input-to-hidden) is applied to the inputs. The second one that we will call Whh (hidden-hidden) is applied to the value of the hidden layer from the previous time step. Taking activation functions and bias into account, the value of the hidden layer at time t is therefore

h_t = tanh(x_t W_{ih}^t + b_{ih} + h_{t-1}W_{hh}^t + b_{hh})

where xt is the input at time t. Note how the output from the previous step ht-1 sneaks into this equation and realizes the feedback loop indicated in the diagram above.

Armed with this understanding, let us implement a simple RNN forward method that accepts L words (or, more generally, input vectors) and applies the above processing steps, i.e. processes L time steps.

 def forward(x):
    L = x.shape[0]
    h = torch.zeros(d_hidden)
    out = []
    for t in range(L):
        h = torch.tanh(x[t] @ w_ih.t() + b_ih + h @ w_hh.t() + b_hh)
    return torch.stack(out)

Here, d_hidden is the dimension of the hidden layer. Note that we not not include the output layer which is actually common practice (the RNN model class that comes with PyTorch does not do this either). We can clearly see that the hidden state h is saved between the individual time steps and can therefore help the model to remember information between time steps, for instance some features of the previous words that are helpful to determine the next word in the sequence.

The output of our forward function has shape (L, d_hidden), so that we have one hidden vector for every time step. We could feed this into an output layer, followed for instance by a softmax layer. Typically, however, we are only interested in the output of the last time step, which we would interpret as the conditional probability for the next word as discussed above.

Unfortunately, our forward function still has a problem. Suppose we start with a prompt with L = 3 words. We encode the words, call the forward function and obtain an output. We take the last element of the output, feed it into a feed forward layer to map back into the vocabulary, apply softmax and sample to get the next word. When we now want to determine the next word, however we would have to repeat the same procedure for the first three words to get our hidden state back that we need for the last time step. To avoid this, it would be useful if we could pass a previous hidden state as optional parameter into our model so that the forward function can pick up where it left. Similarly, we want our forward function to return the hidden layer value at the last time step as well to be able to feed it back with the next call. Here is our modified forward function.

def forward(x, h = None):
    L = x.shape[0]
    if h is None:
        h = torch.zeros(d_hidden)
    out = []
    for t in range(L):
        h = torch.tanh(x[t] @ w_ih.t() + b_ih + h @ w_hh.t() + b_hh)
    return torch.stack(out), h

This looks simple enough, but it is actually a working RNN. It is instructive to compare the output that this simple forward function produces with the output of PyTorchs torch.nn.RNN module. The GitHub repository for this series contains a simple notebook doing this, which you can again either run locally (no GPU needed) or in Google Colab using this link. This notebook also shows how to extract the weight and bias vectors from an RNN in PyTorch.

We have not yet explained how an RNN is trained. Suppose again we wanted to train an RNN for text generation. We would then pick a large corpus of text as a training dataset to obtain a large number of sentences. Each sentence would create a training sample. To force the model to predict the next word, the targets that we use to calculate the loss function are obtained by shifting the input sequence by one position to the right. If, for instance, our dataset contains the sentence “I love my cute dog”, we would turn this into a pair of inputs and targets as follows.

inputs = ["I", "love", "my", "cute"]
targets = ["love", "my", "cute", "dog"]

In particular, the target label does not depend on the prediction that the model makes during training, but only on the actual data (the “ground truth”). If, for instance, we feed the first word “I” and the model does not predict “love” but something else, we would still continue use the label “my,”, even if this does not make sense in combination with the actual model output.

This method of training is often called teacher forcing. The nice thing about this approach is that it does not require any labeled data, i.e. this is a form of unsupervised learning. The only data that we need is a large number of sentences to learn the probabilities, which we can try to harvest from public data available on the internet, like Wikipedia, social media or projects like Project Gutenberg that make novels freely available. We will actually follow this approach in a later post to train an RNN on a novel.

This concludes our post for today. In the next post, we will make all this much more concrete by implementing a toy model. This model will learn a very simple sequence, in fact we will teach the model to count, which will allow us to demonstrate the full cycle of implementing an RNN, training it using teacher forcing and sampling from it to get some results.

Mastering large language models – Part II: Words and vector spaces

In the last post, we have looked at how a text is pre-processed to make it accessible for a neural network and have seen that the first step is to convert a text into a sequence of numbers, where each number is the index of the corresponding word in a vocabulary. Let us now discuss how we can convert each of these numbers into a vector.

Most machine learning models which are used for natural language processing have a property called the model dimension which we will abbreviate by D. A model dimension of, say, 768, simply means that internally, all words are represented by vectors in a vector space of dimension 768. Thus, a single word is a one-dimensional tensor of length D = 768. Our task is therefore to assign to each word in a vocabulary of size V a vector in a D-dimensional space. This assignment, called the embedding, can be nicely represented as a matrix of dimension D x V, so that the column at position i represents the word with index i in the vocabulary.

Of course there are endless possibilities to construct such an embedding. In most cases, the embedding is a learned parameter, i.e. we start training with a randomly initialized embedding and then apply gradient descent to the embedding matrix as to any other parameter during learning. However, it has become increasingly popular to use an embedding which has already been pre-trained so that training does not start from zero and the model hopefully converges faster. One method that facilitates such a pre-training is called the word2vec algorithm.

The idea of the word2vec algorithm (and of many other approaches to constructing embeddings) is to start with a larger model that contains the embedding we wish to train and a second model part that is adapted to a certain downstream task. We then train the entire model on this downstream task, hoping that the embedding layer will capture not only the specific information required for the downstream task, but more general patterns that are useful for other tasks as well. We then throw away the upper part of the model and reuse the embedding layer for other tasks.

The diagram above shows this architecture. The model consists of an embedding layer which translates a word represented by an index between 0 and V – 1 into a vector of dimension D, the internal model dimension. You can think of this as the combination of a one-hot encoding that turns the index into a vector of dimension V and a linear layer without bias that projects onto D dimensions. This part of the model is, once trained, the actual artefact that we might reuse in other, more complex models.

The upper part of the model is adapted to the specific downstream task on which word2vec has been trained. The original paper actually explains two downstream tasks called CBOW and Skipgram, we will focus on CBOW in this post.

Before describing CBOW, let us first try to explain the underlying objective of the training. We want to construct embeddings that capture not only a word itself, but the meaning of a word. Put differently, we want words that have a similar meaning to end up as nearby vectors. To make this precise, we have to define a notion of similarity for our embeddings, i.e. for D-dimensional vectors, and for words.

For vectors, this is easy. In two-dimensional linear algebra, we would call two vectors similar if they point roughly in the same direction, i.e. if the angle between them is small, or in other words if the cosine of the angle is close to one. There is no good notion of an angle in D-dimensional space, but there is a good replacement for the cosine, namely the dot product. So to measure the similary of two vectors x and y, we can take the normed dot product and simply define this to be the cosine

\textrm{cos}(x,y) =  \frac{\langle x, y \rangle}{|x| \cdot |y|}

Defining similarity between words is a bit more complicated. The approach that word2vec takes is to assume that two words have a similar meaning if they tend to appear in the same context, i.e. surrounded by similar sets of words.

To explain this, let us consider a simple sentence (I did not make this up, this sentence actually appears almost verbatim in the training data that we will use later).

“A team of 24 players was selected from an initial pool of 49 candidates”

Let us pick a word in this sentence, say “from”. We can then define the context of this center word to be the set of all words in the sentence that appear within in certain range around the center word. For example if we choose a window size of four, the region that makes up the context extends by two words to the left and two words to the right of the center word. Thus, the context words for the center word “from” are

“was”, “selected”, “an”, “initial”

So the context of a center word is simply the set of all words in the window around the center without the center word itself. The idea of word2vec is that the meaning of a word can be captured by the context words that appear in combination with it. If two center words appear most of the time surrounded by the same context words, then they are considered to have a similar meaning.

To see that this makes sense, consider another example – the sentence “The mighty king is sitting on a golden throne”. If we replace king by “ruler”, the resulting sentence would still be likely to appear in a large corpus of text. As the words “king” and “ruler” can replace each other in their respective context while still making sense, we would consider them to have a similar meaning.

To turn this idea into a training objective, the CBOW algorithm proceeds as follows. First, we go through our data and for each center word, we determine the context as above. Each pair of center word and context will constitute one training sample. We now train the model to predict the center word from the given context. More specifically, we first turn the context into a single vector by feeding each context word into the embedding layer and taking the average of the resulting vectors (this is why the model is called CBOW which is the abbreviation for “continuous bag of words”, as taking the average ignores the position of the word in the context). We now have a single vector of dimension D which we can use as input for a second linear layer which turns it back into a vector of dimension V. This vector is then the input for a softmax so that we eventually obtain an index in the range between 0 and V – 1. The target is our center word and we can apply the usual cross entropy loss function. So CBOW is essentially a classification problem in which the label is the center word and the input is the averaged context.

Note that both the embedding layer and the linear layer do not have a bias, so that they are fully determined by their weight matrices, say U and V. The function to which we apply the softmax is then essentially the matrix product of U with the transpose of V. This in turn is the dot product of the rows of U and the rows of V, which we can both interpret as embeddings. If we write this out in terms of scalar products, you see the cosines emerging and develop an intuition why this training objective does indeed foster the learning of similarities. But instead of diving deeper into this, let us go ahead and discuss the implementation in PyTorch.

To implement the embedding layer without having to make the one-hot encoding explicit, we can use the torch.nn.Embedding class provided by PyTorch. The forward method of this module accepts an index or a sequence of indices and returns a vector or a sequence of vectors, which is exactly what we need. The output layer is an ordinary linear layer. Following the usual practice, we do not add the final softmax layer to our model but use the cross entropy loss function which includes the calculation of the softmax. With this, our model is rather simple:


class CBOW(torch.nn.Module):
    def __init__(self, model_dim, V, bias = False):
        self.embedding = torch.nn.Embedding(
        self.linear = torch.nn.Linear(
            bias = bias
    def forward(self, Y):
        E = self.embedding(Y).mean(axis=1)
        U = self.linear(E)
        return U

Note the max_norm parameter which re-normalizes the embeddings if they exceed a certain size. Setting this parameter turns out to be helpful during training (a fact that I discovered after reading this excellent blog post by O. Chernytska which turned out to be a very valuable resource when training my model).

We see that the forward method does what we have sketched earlier – we first apply the embeddings which will give us a tensor of shape (B, W, D) where B is the batch size, W is the window size and D is the model dimension. We then take the mean along the middle dimension, i.e. the mean of embedding of all words in the context, which will give us a tensor of shape (B, D). We then apply the output layer to get a batch of shape (B, V) which we use as input to our loss function.

Let us now discuss our data and the preprocessing. To train our model, we will use the WikiText2 dataset which consists of roughly 2 million token taken from Wikipedia and is available via the Torchtext library. Each item in the dataset is a paragraph. Some paragraphs consist of a title only which we remove. We then apply a tokenizer to the remaining paragraphs and collect them in one large list, in which each item is again a list of token.

ds = torchtext.datasets.WikiText2(split="train")
tokenizer ="basic_english")
paragraphs = []
for item in ds:
    # Remove trailing whitespace and special characters
    item = re.sub("^\s+", "", item)
    item = re.sub("@", "", item)
    if not re.match("^=", item):
        p = tokenizer(item)
        if len(p):

Next, we build a vocabulary using again the torchtext library. We add a special token “<unk>” to the vocabulary that stands for an unknown word (and is in fact already present in the input data). We also only add token to the vocabulary which appear more than a given number of times, i.e. have a certain minimum frequency.

vocab = torchtext.vocab.build_vocab_from_iterator(paragraphs, 

We can then encode the paragraphs as usual using our vocabulary. Next, we need to create pairs of center word and context out of our training data. Here, it is helpful that we maintain the paragraph structure, as a context that spans across a paragraph is probably less useful, so we want to avoid this. One way to generate the center/context pairs is using a Python generator function, like this.

def yield_context(paragraphs, window_size = 8):
    for p in paragraphs:
        half = window_size // 2
        # If we are not yet at the last token in the paragraph, 
        # yield window and advance center. 
        for index, center in  enumerate(p):
            context = p[max(0, index - half):index]
            context.extend(p[index + 1:min(len(p), index + half + 1)])
            yield center, context

Here we visit each position in each paragraph and use it as center position. We then carve out half of the window size to the left of the center token and half of the window size to the right and concatenate the result lists, which we return along with the center.

To train our model, we can put all this code into a PyTorch dataset so that we can use it along with a PyTorch data loader. Training is now straightforward. I have used an initial learning rate of 0.1 for a model dimension of 300 and a batch size of 20000. I apply a linear rate scheduler and use an Adam optimizer.

To make it a bit easier for you to try this out, I put together all the code in a notebook that is available in the GitHub repository for this series. To run this, you have several options. First, you can run it locally (or in your favorite cloud environment, of course) by cloning the repository and following the instructions in the README, i.e.

git clone
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 install ipykernel
python3 -m ipykernel install --user --name=MLLM

Then navigate to the notebook in the word2vec directory and run it. Alternatively, you can run it in Google Colab by simply clicking on this link. Note that the second cell will install the portalocker package if not yet present, so you might have to restart the runtime afterwards to make sure that the package can be used (use Runtime – restart and run all if you get an error in the third cell saying that portalocker cannot be found).

With the chosen parameters, training is rather smooth and takes less than 10 epochs to achieve reasonable results. In the notebook, I train for 7 epochs to achieve a mean training loss of a bit more than 4.5 in the last epoch (to keep things simple, I do not measure the validation loss).

Once we have trained the model, we can check that it does what we are up to – making sure that similar words or semantically related words receive similar embeddings. To verify this, let us pick a word that appears in the vocabulary and try to find those words in the vocabulary that are closest to it, i.e. have the largest cosines. To do this, we can extract the weights (which PyTorch stores internally with shape (V, D) so that the rows are the embeddings) using the attribute weight of torch.nn.Embedding. Given the embedding of a fixed token, we now need to take the dot product of all rows of the weight matrix with the vector representing the fixed token, which can be conveniently organized as a matrix multiplication. We can then sort the resulting vector and extract the five largest entries. Here is a short piece of code doing this.

 def print_most_similar(token, embeddings, vocab):
    # Normalize embeddings
    _embeddings = torch.nn.functional.normalize(embeddings, dim = 1)
    # get u, the embedding of our token
    u = _embeddings[vocab[token], :]
    # do dot products as one large matrix multiplication
    v = torch.matmul(_embeddings, u)
    # Sort this
    values, indices = torch.sort(v, descending=True)
    print(f"Most similar token for {token}")
    for i in range(5):
        print(f"      {vocab.lookup_token(indices[i])} -- {values[i]}")

If we run this after training for the word “king”, the results we get (which are to some extent random, so you might get slightly different results if you run this yourself) are

king -- 0.9999999403953552
son -- 0.5954159498214722
earl -- 0.59091717004776
archbishop -- 0.57264244556427
pope -- 0.5617164969444275

This is not bad for two minutes of training! Except the second one, all others are clearly some sort of ruler and in that sense will probably appear in similar semantic roles as the word “king”.

There is one more experiment that we can make. Remember that the second layer of our model converts the vectors from the internal dimension back into the vocabulary. This is a linear layer with a weight matrix that has the same shape as that of the embedding! Thus we actually learn two embeddings – the embeddings that we have modelled as a torch.nn.Embedding layer and that we apply to the context vectors (the context embedding) and the embedding that is implicit in the linear layer of type torch.nn.Linear that one might call the center embedding. We can repeat the test above with the center embedding (again, look at the notebook for the details) and get a very similar output.

 king -- 1.0
 pope -- 0.6784377694129944
 lord -- 0.6100903749465942
 henry -- 0.5989689230918884
 queen -- 0.5779016017913818

With this, let us close this post for today. If you want to read more on word2vec, the Skipgram mechanism that we have not presented and more advanced versions of the algorithm, I have listed a few valuable reads below. In this series, we will continue with an introduction to RNNs, a generation of network architectures that are important to understand as many training methods and terms used for transformers as well go back to them.

[1] Chapter 6 of “Speech and Language Processing” by Jurafsky and Martin
[2] Section 15.1 of “Dive into Deep Learning”
[3] The original paper introducing word2vec available here
[4] A PyTorch implementation by O. Chernytska
[5] The illustrated word2vec by J. Alammar

Mastering large language models – Part I: Overview

In the history of AI, progress has always come from several sources – more powerful hardware, more high-quality training data or refined training methods. And sometimes, we have seen a step change triggered by a new and innovative generation of models.

Some of you might remember the time when the term Deep Learning was coined, referring to machine learning models consisting of several stacked layers of neural networks. Later, convolutional neural networks (CNNs) took computer vision and image recognition by storm, and recurrent neural networks and in particular LSTMs where widely applied to boost applications in natural language processing like machine translation or sentiment analysis.

Similarly, the recent revolution in natural language processing has been triggered by a novel type of neural networks called Transformers which is very well adapted to the specific challenges of processing long sentences. In fact, all of the currently hyped language models like GPT , Googles LaMDA or Facebooks LLaMA are based on the transformer architecture – so clearly that is something that everybody interested in machine learning should probably understand.

This is what we will do in this series – take a deeper look at the transformer architecture, understand how these models are implemented and trained and learn how pre-trained, publicly available models can easily be downloaded and used. We will cover both the theory, referring to the original papers on the subject if that makes senses, and practice, using the PyTorch framework to implement some networks and training scripts ourselves.

More specifically, we will cover the following topics in this series:

  • the basics of NLP – tokenization, vocabularies and language modelling tasks
  • representing words as vectors and word similarity
  • a brief look at the generation of models preceding transformers, i.e. recurrent neural networks, LSTM networks and encoder-decoder architectures
  • Project I: training a character level LSTM on Tolstoys War and Peace
  • the attention mechanism and how it is used in transformer models
  • transformer blocks, encoders and decoders
  • Subword tokenization and BPE
  • Project II: training a transformer on Wikipedia
  • Using pretrained transformers like GPT-2 or DialoGPT
  • Project III: building a chatbot using DialoGPT and the Huggingface Transformer library
  • Instruction fine-tuning and FLAN-T5
  • Proximal Policy Optimization (PPO) and reinforcement learning

Obviously, we cannot start from zero to be able to cover all this. So I assume some familiarity with machine learning and neural networks (if this is new to you, you might want to read some of the excellent available introductions like chapter 7 of Speech and Language processing by Jurafsky and Martin, which also covers parts of what we will discuss, chapter 3 and 4 of Dive into Deep Learning or chapter 5 – 7 of Machine learning with neural networks by B. Mehlig). I also assume that you understand the basics of PyTorch, if not, I recommend the excellent introduction to PyTorch which is part of the official documentation.

Tokenization, vocabularies and basis tasks in NLP

After this short outlook of what’s ahead, let us dive right into the content. We start by explaining some terms that you will see over and over again if you get into the field of natural language processing (NLP).

First, let us discuss some of the common problems that NLP tries to solve. Some of these problems can be described as a classification task. The input that machine learning models receive is a text, consisting of a sequence of words or, more generally, token (we get to this in a minute), and the output is a label. An example for this type of task is sentiment analysis. The model is given a text, for instance a review of a movie or a book, and is asked to predict whether the overall sentiment of the review is positive or negative. Note that the input to the model, the text, has a linear structure – each word has a position, so there is a notion of time in the input – while the output is simply a label.

A second class of tasks is sometimes called sequence-to-sequence and consists of receiving a sequence of words as an input and providing a sequence of words as output. The prime example is machine translation, which receives a sequence of words in the source language as input and produces a translation into the target language. Note that the length of the target and length of the source are, in general, different.

Finally, a third class of problems (there are many more in NLP) that we will be concerned with is text generation. Here the task is to create a text which appears natural and fluent (and of course grammatically correct) from scratch, maybe be completing a short piece of text fed into the model called the prompt. We will see later that machine translation can actually be expressed as conditional text generation, i.e. text generation giving some context which, in the case of a translation task will be an encoding of the input sequence.

For all of these tasks, we will have to find a reasonable way to encode a sequence of words as number or, more precisely, as vectors. Typically, this proceeds in several steps. The first step is tokenization which means that we break down our text into a list of tokens. Initially, a token will be a word or a punctuation character, but later we will turn to more general tokens which can be parts of words or even individual characters. The most straightforward way to do this is to split a text along spaces, i.e. to do something like

text = "My name is John. What is your name?"
token = text.split()

which would give the toutput

['My', 'name', 'is', 'John.', 'What', 'is', 'your', 'name?']

This is simple, maybe too simple. We combine, for instance, a punctuation mark along with the word after which it follows, which might not be a good idea as a punctuation mark has an independent syntactical meaning. We also do not convert our words to lower- or uppercase, so that “Name” and “name” would be different token. There are many ready-to-use implementations of more sophisticated approaches. For now, we will use the tokenizer that is part of the Torchtext library. To use this, please make sure that you have torch and torchtext installed, I have used version 2.0.0 of PyTorch and version 0.15.1 of Torchtext, but older versions should work as well. We can than tokenize the same text as before as follows.

import torchtext

tokenizer ="basic_english")

This time, the output that we get is

['my', 'name', 'is', 'john', '.', 'what', 'is', 'your', 'name', '?']

We see that the tokenizer has converted all words into lower-case and has translated punctuation marks into individual token.

The next stage consists of building a list of all known token that appear in our text, i.e. of all unique token. This can conveniently be done using a counter. Here is a short code snippet that creates a list of all unique token.

import collections
token = tokenizer(text)
counter = collections.Counter(token)
vocabulary = [t for t in counter.keys()]
# Output: ['my', 'name', 'is', 'john', '.', 'what', 'your', '?']

So far, our token are still words. To feed them into a neural network, we will have to encode them as numbers. For that purpose, we replace each token in the original text by its index in the vocabulary, so that the initial text is turned into a sequence of numbers. Note that this sequence still preserves the sequential structure, i.e. the order of numbers is the same as the order of the corresponding words in the original sentence (there are other models, commonly referred to as bag-of-word models, in which only the unordered set of token is considered).

stois = dict()
for idx, t in enumerate(vocabulary):
    stois[t] = idx
encoded_text = [stois[t] for t in token]
# Output: [0, 1, 2, 3, 4, 5, 2, 6, 1, 7]

Of course, we can revert this process by replacing each index in the list by the corresponding token, a process known as decoding.

decoded_text = " ".join([vocabulary[idx] for idx in encoded_text])
# Output: my name is john . what is your name ?

Most tokenizers will, in addition to the token generated by identifying words in the text, use additional special token that represent of instance unknown words (i.e. words which are not in the vocabulary as they have not been part of the text used to build the vocabulary) or the end or beginning of a sentence.

At this point, we have converted text into a sequence of numbers. In order to be meaningful as input to a neural network, we now have to turn each of these numbers into a vector. A straightforward approach would be to use one-hot encoding. Suppose that our vocabulary has V items. Then we can turn an index i into a vector in V-dimensional space which is one at position i and zero at all other positions. In other words, the encodings form a base of the vector space on which the model will then operate. This encoding is simply, but has two major drawbacks. First, it treats all words in the same way, regardless of their meaning. It would be nice to have an embedding that translates words into vectors in such a way that similar words end up as somehow similar vectors. Second, the vector space become huge. A vocabulary can easily be as big as 50 k or more token, so our vector space would have 50.000 dimensions, blowing up the model unnecessarily. For those reasons, other procedures to turn words into vectors are more common, which will be the topic of the next post. If you want to try out what we have discussed today, you can download a notebook here and play with it.

The EM algorithm and Gaussian mixture models – part I

In the last few posts on machine learning, we have looked in detail at restricted Boltzmann machines. RBMs are a prime example for unsupervised learning – they learn a given distribution and are able to extract features from a data set, without the need to label the data upfront.

However, there are of course many other, by now almost classical, algorithms for unsupervised learning. In this and the following posts, I will explain one of the most commonly applied algorithms in this field, namely the EM algorithm.

As a preparation, let us start with a very fundamental exercise – clustering. So suppose that we are given a data set which we suspect to consist of K clusters. We want to identify these clusters, i.e. to each point in the dataset, we want to assign the number of the cluster to which the point belongs.

More precisely, let us suppose that we are given a set {\mathcal D} = \{ x_i, \dots, x_N\} of points in some euclidian space and a number K of clusters. We want to identify the centre \mu_i of each cluster and then assign the data points xi to some of the \mu_j, namely to the \mu_j which is closest to xi. We can encode the assignment of data points to clusters in a matrix Rij where Rij = 1 if data point xi belongs to cluster j with centre \mu_j. Thus each row of R corresponds to a data point and is non-zero in exactly one column, where it is one. This is known as 1-of-K encoding.

For each value of i, the squared distance of xi to the centre of the cluster to which it is assigned is then given by

\sum_j R_{ij} \| x_i - \mu_j |^2

where we note that only one of the summands is different from zero. Assigning the data points to the clusters and positioning the centre points of the clusters then amounts to minimizing the loss function

l(\mu, R) = \sum_i \sum_j R_{ij} \langle x_i - \mu_j , x_i - \mu_j \rangle

where the brackets denote the usual euclidean scalar product.

Now let us see how we can optimize this function. For a fixed vector \mu of cluster centers, we can easily minimize Rij by assigning each data point to the cluster whose center is closest to xi. Thus we set

R_{ij} =  \begin{cases} 1 & \text{if} \, j = \arg \min \| x_i - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

Conversely, given a matrix R which we hold fixed, it is likewise easy to minimize \mu_j. As there are no further constraints on \mu_j, we can find the minimum by differentiating the loss function. We find that the derivative is zero if and only if

0 = \sum_i R_{ij}(x_i - \mu_j)

holds for all j. Assuming for a moment that each cluster has at least one data point assigned to it, i.e. that none of the columns of R contains zeroes only, we can solve this by

\mu_j = \frac{\sum_i x_i R_{ij}}{\sum_i R_{ij}}

which is also easily seen to be a local minimum by taking the second derivative.

Note that this term has an obvious geometric interpretation. The denominator in this expression is the number of data points that are currently assigned to cluster j. The numerator is the sum of all data points assigned to this cluster. Thus the minimum is the mean value of the positions of the data points that are currently assigned to the cluster (a physicist would call this the center of gravity of the cluster). This is the reason why this method is called the k-means algorithm. If no data points are assigned to the cluster, the loss function does not depend on \mu_j and we can choose any value that we want.

The algorithm now works as follows. First, we choose centers \mu_j at will – it is common to use some of the data points for this purpose. Then we assign each data point to a center using the formula for Rij derived above. We then adjust the center points \mu_j and reallocate the points to the new centers and so forth.

From our discussion above, it is clear that each full iteration of this procedure will reduce the loss function. This does of course not imply that the algorithm converges to a global minimum, as it might get stuck in local minima or saddle points. In practice, the algorithm is executed until the cluster assignments and centers do not change any more substantially or for a given number of steps.


The diagram above shows the result of applying this algorithm to a set of points that is organized in two clusters. To generate the data, 100 samples were drawn from 2-dimensional Gaussian distributions. On the left hand side, half of the the samples were centered at the point (5,1), the other samples at (1,4), and both had standard deviation 0.6. On the right hand side, the same centers were used, but only a small number of samples were drawn from the second distribution which had standard deviation 0.5, whereas most samples came from the first distribution with standard deviation 0.8. Then 10 iterations of the k-means algorithm were applied to the resulting sample set. The points in the sample were then plotted with a color indicating the assignment to clusters resulting from the matrix R. The actual cluster from which the sample was drawn is indicated by the shape – a point is cluster one, a diamond is cluster two.

We see that in the example on the left hand side, the algorithm has correctly assigned all points to their original cluster. For the example on the right hand side, the situation is different – the algorithm did actually assign significantly more points to the blue cluster, i.e. there are many wrong assignments (blue points). This does not change substantially if we increase the number of iterations, even with 100 iterations, we still see many wrong assigments for this example. If you want to reproduce the example and play a bit with the parameters, you can get the sourcecode for a straightforward implementation in Python from my GitHub repository.

The K-means algorithm is very simple and straightforward, but seems to have limitations because it cannot determine the shape of a distribution, only its center. It turns out that K-means is a special case of a broader class of algorithms that we now study, hoping to find more robust algorithms.

In our example, we have generated sample data as a combination of two Gaussian distributions. What if we just change the game and simply assume that our data is of this type? In other words, we assume an underlying probabilistic model for our sample data. Once we have that, we can pull all the tricks that statistical inference can offer – we can calculate maximum likelihood and maximum posterior probability, we can try to determine the posterior or even sample from the model.

Thus let us try to fit our data to a model of the form

P(x) = \sum_k \pi_k {\mathcal N}(x ; \mu_k, \Sigma_k)

where {\mathcal N}(x ; \mu_k, \Sigma_k) is a multivariate Gaussian distribution with mean \mu_k and covariance matrix \Sigma_k, i.e.

{\mathcal N}(x ; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp (-\frac{1}{2} \langle x - \mu_k, \Sigma_k^{-1}(x - \mu_k)\rangle)

and the \pi_k are non-negative real numbers with \sum_k \pi_k = 1.

Let us now see how this equation looks like if we use a 1-of-K encoding. We introduce a random variable Z that takes values in \{ 0, 1\}^K with the additional constraint that only one of the Zk is allowed to be different from zero. We interpret \pi_k as the probability

\pi_k = P(Z_k = 1)


P(Z = z) = \prod_k \pi_k ^{z_k}

and we can write

P(X=x) = \sum_z P(Z=z) P(X=x | Z=z) = \sum_z P(x,z)

where P(z) is as above and

P(X = x | Z = z) = \prod_k {\mathcal N}(x ; \mu_k, \Sigma_k)^{z_k}

This is a very general type of distribution which reflects a common pattern in machine learning, namely the introduction of so called latent or hidden variables. In general, latent or hidden variables are random variables that are a part of the model which cannot be observed, i.e. are not part of the input or the output of the model. We have seen latent variables in action several times – adding hidden units to a neural network introduces latent variables and makes the model much more powerful, the hidden layer of a restricted Boltzmann machine serves as memory to learn features, and latent variables that are used to construct a mixture of Gaussians as above allow us to model a much broader class of distributions than a model with just one Gaussian.

Intuitively, it is also clear how to sample from such a model. In a first step, we sample from Z, in other words we determine the index k randomly according to the distribution given by the \pi_k. Once we have k, we then sample from the conditional distribution P(X = x | Z = z). As we already have k, this amounts to sampling from the Gaussian distribution {\mathcal N}(x ; \mu_k, \Sigma_k) with mean \mu_k and covariance matrix \Sigma_k.


In the example above, we have first applied this procedure to a one-dimensional Gaussian mixture with K=2. The histogram shows the result of the sampling procedure, the solid lines are the probability density functions of the two individual Gaussian distributions, multiplied by the respective weight. On the right hand side, the result of sampling a Gaussian mixture in two dimensions is displayed (you can download the notebook used to produce this example here).

When we now try to fit this model to the sample data, we again have to calculate the likelihood function and try to maximize it. However, it turns out that the gradient of the likelihood is not so easy to calculate, and it might very well be that there is no closed form solution or that we obtain rather complicated expressions for the gradient. Fortunately, there is an alternative that works for very general models and does not require knowledge of the gradient – the EM algorithm. In the next post, I will present the algorithm in this general setup, before we apply it to our original problem and compare the results with the k-means algorithm.


1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38

Why you need statistics to understand neuronal networks

When I tried to learn about neuronal networks first, I did what probably most of us would do – I started to look for tutorials, blogs etc. on the web and was surprised by the vast amount of resources that I found. Almost every blog or webpage about neuronal networks has a section on training a simple neuronal network, maybe on the MNIST data set, using a framework like TensorFlow, Theano or MXNET. When you follow such a tutorial, a network is presented as a collection of units and weights. You see how the output of the network is calculated and then an error function – sometimes least squares, sometimes something else – is presented. Often, a regulation term is applied, and then you are being told that the automatic gradient calculation features of the framework will do the gradient descent algorithm for you and you just have to decide on an optimizer and run the network and enjoy the results.

Sooner or later, however, you will maybe start to ask a few questions. Why that particular choice of the loss function? Where does the regulator come from? What is a good initial value for the weights and why? Where does the sigmoid function come from? And many, many other questions….

If you then decide to dig deeper, using one of the many excellent textbooks or even try to read some of the original research papers (and some are actually quite readable), you will very soon be confronted with terms like entropy, maximum likelihood, posterior distribution, Gaussian mixtures and so on, and you will realize that the mathematics of neuronal networks has a strong overlap with mathematical statistics. But why? Why is that a good language to discuss neuronal networks, and why should you take the time to refresh your statistics knowledge if you really want to understand neuronal networks? In this post, I will try to argue that statistical inference comes up very naturally when you try to study neuronal networks.

Many neuronal networks are designed to excel at classification tasks. As an example, suppose you wanted to design and train a neuronal network that, given data about an animal, classifies the animal as either a bird or some other animal (called a “non-bird” for convenience). So our starting point is a set modelling all possible objects that could be presented to the network. How exactly we model this set is not so important, more important is that in general, the network will not have access to all the data about the animal, but only to certain attributes of elements in the set called features. So there could be a feature which we call X1 and which is defined as

X_1 = \text{the animal can fly}

taking values in \{0,1\}. Another data point the network could get is

X_2 = \text{length of animal in cm}

taking values in the reals and so forth. More generally, we assume that on the set of all possible objects, we have certain functions Xi taking values in, say, the real numbers. Based on these numbers, the network will then try to take a decision whether a given animal is a bird or not. Thus we do not have to deal directly with our space of objects, but use the functions Xi as primary objects.

If the network had a chance to look at every possible animal, this would be easy, even though it would cost a lot memory – it could simply remember all possible combinations of features and for each feature, store the correct answer. In reality however, this does not work. Instead, we have access to a small subset of the data – a sample – for which we can evaluate the Xi. Based on this subset, we then have to derive a model which gives the right answer in as many cases as possible. Thus we try to make a statement about the full space of things that could be presented to our network for classification based on a small sample.

This is where probabilities come into play very naturally. We need to assume that our sample has been chosen randomly, but still we need to make assertions about the full set. This is exactly what inferential statistics is doing. The fact that our sample is chosen randomly turns our Xi into random variables. Similarly, the variable

Y = \text{is a bird}

taking values in \{0,1\} is a random variable, and we try to gain information on the distribution of Y across the full population based on its values on a given set of labelled samples, i.e. a set of samples where Y is known. Thus Y would represent the labels or targets in the language of neuronal networks. Applying the methods of statistical inference to this situation would typically start by choosing a statistical model and than using estimators or hypothesis testing to make deductions.

Apart from the fact that we have to derive information on the full population based on a sample, there is another reason why probabilities appear naturally in the theory of machine learning. In many cases, the available input – being a reduction of the full set of data – is not sufficient to classify the sample with full certainty. To see this, let us go back to our examples. How would you derive the property “bird” from the given data “can fly” and “length”? Not all animals than can fly are birds – and not all birds can fly. So we have to try to distinguish for instance a butterfly from a hummingbird based on the length. The smallest hummingbird – a bee hummingbird – is about 5 cm in length. The largest known butterfly – the Queen’s Alexandra birdwing – can be as long as 8 cm (both informations taken from Wikipedia). Thus our data is not sufficient to clearly distinguish butterflies and birds in all cases!

However, very small birds and very large butterflies have one thing in common – they are rare. So chances are that a given animal that can fly and is larger than 5 cm is actually a bird (yes, I know, there are bats….). In other words, if again Y denotes the variable which is 1 on birds and 0 on all other animals, we can in general not hope that Y is a function of the Xi, but we can hope that given some values of the Xi, the probability P(Y=1) to be a bird is a function of the Xi. In other words, using the language of conditional probabilities,

P(Y=1 | X = x) = f(x)

with some unknown function f. In a Bayesian interpretation of probability, the certainty with which can say “this animal is a bird” is a function of the values xi of the observable variables Xi.

With these considerations, we now arrive at the following mathematical model for what a classification algorithm is about. We are given a probability space (P, \Omega) with a vector valued random variable X. The attributes of a sample are described by the feature vector X in some subset of m-dimensional euclidian space, where m is the number of different features. In our example, m=2, as we try to classify animals based on two properties. The result of the classification is described by a random variable Y taking – for the simple case of a binary classification problem – values in \{0,1\}. We then assume that

P(Y =1 | X=x) = f(x;w_0)

where f(\cdot;w) is a function parametrized by some parameter w that we call the weights of the model. The actual value w0 of w is unknown. Based on a sample for X and Y, we then try to fit the model, i.e. we try to find a value for w such that f(\cdot, w) models the actual conditional distribution of Y as good as possible. Once the fitting phase is completed, we can then use the model to derive predictions about objects which are not in our initial sample set.

This model sounds a bit abstract, but many feed forward neuronal networks can be described with this or similar models. And we can now apply the full machinery of mathematical statistics – we can calculate cross entropies and maximum likelihood, we can analyse converge and variance, we can apply the framework of Bayesian statistics and Monte Carlo methods. This is the reason why statistics is so essential when it comes to describing and analyzing neuronal networks. So on the next rainy Sunday afternoon, you might want to grab a steaming hot cup of coffee, head towards your arm chair and spent some time with one of the many good exposures on this topic, like chapter IV in MacKays book on Machine Learning, or Bishops “Pattern recognition and machine learning” or chapter 3 of the deep learning book by Goodfellow, Bengio and Courville.