Mastering large language models – Part V: LSTM networks

In the last post, we saw how to implement and train an RNN on a very simple task – learning how to count. In that example, I chose a sequence length of L = 6. It is tempting to play around with this parameter and see what happens as we increase the sequence length.

In this notebook, I have implemented a very simple measurement: I created and trained RNNs on a slightly different task – remembering the first element of a sequence. We can use the exact same code as in the last post and only change the line in the dataset that prepares the target to

 targets = torch.tensor(self._L*[index], dtype=torch.long)

I also changed the initialization approach compared to the previous post and used a random uniform distribution, similar to what PyTorch does. I then ran training for different values of the sequence length, ranging from 4 to 20, and measured the accuracy on the training set after each training run.
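For illustration, here is a minimal sketch of this kind of initialization – PyTorch draws the weights of an RNN uniformly from [-1/sqrt(H), 1/sqrt(H)], where H is the hidden dimension (the sizes below match our counting example):

import math
import torch

H, V = 32, 128    # hidden and input dimension as in our counting example
stdv = 1.0 / math.sqrt(H)
w_ih = torch.empty(H, V).uniform_(-stdv, stdv).requires_grad_()
w_hh = torch.empty(H, H).uniform_(-stdv, stdv).requires_grad_()
b_ih = torch.empty(H).uniform_(-stdv, stdv).requires_grad_()
b_hh = torch.empty(H).uniform_(-stdv, stdv).requires_grad_()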

The blue curve in the diagram below displays the result, showing the sequence length on the x-axis and the accuracy on the y-axis (we will get to the meaning of the upper curve in a second).

Clearly, the performance of the network decreases with larger sequence lengths: starting at a sequence length of approximately 10, the accuracy drops to less than 60%. I also repeated the experiment with our custom-built RNN replaced by PyTorch's RNN to rule out issues with my code, and got similar results.

This seems to indicate that RNNs have a problem with longer sequences, and in fact they do (you might want to take a look at this paper for a more thorough study, which arrives at a similar result – RNNs have problems learning dependencies in time series that span more than 10 or 12 time steps). Given that our task is simply to remember the first item, it appears that an RNN's memory is more of a short-term memory and less of a long-term memory.

The upper, orange curve looks much more stable and maintains an accuracy of 100% even for long sequences. In fact, this curve has been generated using a more advanced network architecture called LSTM, which is the topic of today's post.

Before we explain LSTMs, let us try to develop an intuition for the problem they are trying to solve. For that purpose, it helps to look at how backpropagation actually works for RNNs. Recall that in order to perform automatic differentiation, PyTorch (and similar frameworks) builds a computational graph, starting at the inputs and weights and ending at the loss function. Now look at the forward function of our network again.

for t in range(L):
  h = torch.tanh(x[t] @ w_ih.t() + b_ih + h @ w_hh.t() + b_hh)
  out.append(h)

Notice the loop – in every iteration, we continue the calculation, using the result of the previous iteration (the hidden layer!) as one of the inputs. Thus we do not create L independent computational graphs, but one big computational graph whose depth grows with L – and this is where the problem comes from.

Roughly speaking, during backpropagation, we need to traverse the entire graph, so we also move "backwards in time", passing through all the elements of the graph added during each loop iteration (this is why the procedure is sometimes called backpropagation through time). The effective length of the computational graph therefore corresponds to that of a deep neural network with L layers, and we face one of the problems that such networks have – vanishing gradients.

In fact, when we run the backpropagation algorithm, we apply the chain rule at least once for every time step. Each application of the chain rule involves a multiplication by the derivative of the activation function. As these derivatives tend to be smaller than one, the gradient can, in the worst case, shrink with every time step, so that eventually the error signal vanishes and the network stops learning. This is the famous vanishing gradient problem. Of course this is by no means a proof that the problem really occurs in our case, but it is at least likely, as the issue disappears after switching to an LSTM – so let us assume that this is really what happens.
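We can observe this effect in isolation with a small toy experiment (not part of the notebook) – unroll a random recurrence for L steps and measure how much the final output still depends on the initial hidden state:

import torch

torch.manual_seed(0)
H, L = 32, 20
w_hh = torch.randn(H, H) * 0.3     # recurrent weights, scaled down for illustration
h0 = torch.zeros(H, requires_grad=True)
x = torch.randn(L, H)
h = h0
for t in range(L):
    h = torch.tanh(x[t] + h @ w_hh.t())
# gradient of the final output with respect to the initial hidden state
grad = torch.autograd.grad(h.sum(), h0)[0]
print(grad.norm())     # typically a very small number for large L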

So what is an LSTM? The key difference between an LSTM and an ordinary RNN is that in addition to the hidden state, there is a second element, called the memory cell c, that allows the model to remember information across time steps. At time t, the model has access to the hidden state and the memory cell content from the previous time step, and of course to the input x[t] for the current time step, like a new word in our sequence. The model then produces a new hidden state, but also a new value for the memory cell at time t. So the new value of the hidden state and the new value of the cell are a function of the input and the previous values.

(h_t, c_t) = F(x_t, h_{t-1}, c_{t-1})

Here is a diagram that displays a single processing step of an LSTM – we fill in the details in the grey box in a minute.

So far this is not so different from an ordinary RNN, as the memory cell is treated similarly to a hidden layer. The key difference is that the memory cell is gated, which simply means that the cell is not directly connected to other parts of the network, but via learnable matrices called gates. These gates determine which content of the cell is made available to the current time step (and is thus subject to the flow of gradients during the backward pass) and which information is held back, to be used either not at all or during a later time step. Gates also control how new information, i.e. parts of the current input, is fed into the cell. In this sense, the memory cell acts as a more flexible memory, and the network can learn rules to erase data in the cell, add data to the cell or read the data present in the cell.

Let us make this a bit more tangible by explaining the first processing step within an LSTM time step – forgetting information present in the cell. To realize this, an LSTM contains an additional set of weights and biases that allow it to determine which information to forget, based on the current input and the previous value of the hidden layer – called (not surprisingly) the input-forget weight matrix Wif and the input-forget bias bif, and similarly Whf and bhf for the hidden state. These matrices are combined with the input and hidden values and passed through a sigmoid activation to form a tensor known as the forget gate.

f_t = \sigma(W_{if} x_t  + b_{if} + W_{hf} h_{t-1} + b_{hf})

In the next step, this gate is multiplied element-wise with the value of the memory cell from the previous time step. Thus, if a dimension of the gate f is close to 1, the corresponding cell value is carried over. If, however, a dimension is close to zero, the corresponding information in the memory cell is set to zero, is therefore not available in the upcoming calculation or in the output, and is in that sense forgotten once the time step completes. The tensor f thus controls which information the memory cell forgets, explaining its name. Expressed as a formula, we build

f_t \odot c_{t-1}

Let us add this information to our diagram, using the symbol \odot (a circle with a dot inside) to indicate the component-wise product of tensors, sometimes called the Hadamard product.
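As a tiny numeric illustration of this element-wise gating (values made up):

import torch

f = torch.tensor([0.95, 0.01])        # forget gate, one entry per cell dimension
c_prev = torch.tensor([2.0, 3.0])     # previous value of the memory cell
print(f * c_prev)                     # tensor([1.9000, 0.0300])

The first component is carried over almost unchanged, while the second is almost completely erased.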

The next step of the calculation is to determine what is sometimes called the candidate cell, i.e. the data derived from the input and the previous hidden layer that will be added to the memory cell and thus remembered. Even though I find this a bit confusing, I will follow the convention employed by PyTorch and other sources, use the letter g to denote this vector and call it the cell gate (the reason I find this confusing is that it is not exactly a gate but, as we will see in a minute, input that is passed through a gate into the cell). This vector is computed similarly to the value of the hidden layer in an ordinary RNN, again using learnable weight matrices and bias vectors.

g_t = \tanh(W_{ig} x_t  + b_{ig} +  W_{hg} h_{t-1} + b_{hg})

However, this information is not directly taken over as the next value of the memory cell. Instead, a second gate, the so-called input gate, is applied first. The result of this operation is then added to the output of the forget gate times the previous cell value, so that the new cell value is a combination of the part that survived the forget gate, i.e. what the model remembers, and the part that the input gate allowed into the memory cell as new content. As a formula, this reads

i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
c_t = f_t \odot c_{t-1} + i_t \odot g_t

Let us also update our diagram to reflect how the new value of the memory cell is computed.

We still have to determine the new value of the hidden layer. Again, a gate is involved, this time the so-called output gate. This gate is applied to the tanh of the new memory cell value to form the new hidden state, i.e.

o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
h_t = o_t \odot \tanh(c_t)

Adding this to our diagram now yields a complete picture of what happens in an LSTM processing step – note that, to make the terminology even more confusing, sometimes this entire box is also called an LSTM cell.

This sounds complicated, but an implementation in PyTorch is actually more straightforward than you might think. Here is a simple forward method for an LSTM. Note that in order to make the implementation more efficient, we combine the four matrices Wif, Wig, Wio and Wii into one matrix by concatenating them along the first dimension, so that we only have to multiply this large matrix by the input once, and similarly for the hidden layer. We use the same convention as PyTorch – the input gate weights come first, followed by the forget gate, then the cell gate and finally the output gate.

def forward(x):    
    L = x.shape[0]
    hidden = torch.zeros(H)
    cells = torch.zeros(H)
    _hidden = []
    _cells = []
    for i in range(L):
        _x = x[i]
        #
        # multiply w_ih and w_hh by x and h and add biases
        # 
        A = w_ih @ _x 
        A = A + w_hh @ hidden 
        A = A + b_ih + b_hh
        #
        # The value of the forget gate is the second set of H rows of the result
        # with the sigmoid function applied
        #
        ft = torch.sigmoid(A[H:2*H])
        #
        # Similarly for the other gates
        #
        it = torch.sigmoid(A[0:H])
        gt = torch.tanh(A[2*H:3*H])
        ot = torch.sigmoid(A[3*H:4*H])
        #
        # New value of cell
        # apply forget gate and add input gate times candidate cell
        #
        cells = ft * cells + it * gt
        #
        # new value of hidden layer is output gate times cell value
        #
        hidden = ot * torch.tanh(cells)
        #
        # append to lists
        #
        _cells.append(cells)
        _hidden.append(hidden)
    return torch.stack(_hidden)

In this notebook, I have implemented this (with a few tweaks that allow passing in previous values of the hidden layer and the cell, as we did for RNNs) and verified that the implementation is correct by comparing its outputs with those of the torch.nn.LSTM class, the LSTM implementation that comes with PyTorch.
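Here is a sketch of how such a comparison might look, reusing the forward function and the global names from above (and assuming a PyTorch version recent enough to accept unbatched LSTM inputs of shape (L, D)):

import torch

D, H, L = 8, 16, 5                      # dimensions chosen for illustration
lstm = torch.nn.LSTM(input_size=D, hidden_size=H)
# reuse PyTorch's parameters in our own forward function from above
w_ih, w_hh = lstm.weight_ih_l0, lstm.weight_hh_l0
b_ih, b_hh = lstm.bias_ih_l0, lstm.bias_hh_l0
x = torch.randn(L, D)
out, (h, c) = lstm(x)                   # unbatched input of shape (L, D)
assert torch.allclose(out, forward(x), atol=1e-6)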

The good news is that even though LSTMs are more complicated than ordinary RNNs, the way they are used and trained is very much the same. You can see this nicely in the notebook that I used to create the measurements presented above – using an LSTM instead of an RNN is simply a matter of changing one call; all other parts of the data preparation, training loop and inference remain the same. The measurement also shows that LSTMs do in fact achieve much better performance when it comes to detecting long-range dependencies in the data.

LSTMs were first proposed in this paper by Hochreiter and Schmidhuber in 1997. If you consult that paper, you will find that the original version did not yet contain the forget gate, which was added by Gers, Schmidhuber and Cummins three years later. Over time, other variations of the architecture have been developed, like the GRU cell.

Also note that LSTM layers are often stacked on top of each other. In this architecture, the hidden state of an LSTM layer is not passed directly to an output layer, but serves as the input of a second LSTM layer.
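With torch.nn.LSTM, stacking is just a matter of the num_layers parameter – a quick sketch (dimensions made up, again assuming a PyTorch version that accepts unbatched inputs):

import torch

lstm = torch.nn.LSTM(input_size=128, hidden_size=32, num_layers=2)
x = torch.randn(6, 128)     # an unbatched sequence of length 6
out, (h, c) = lstm(x)
print(out.shape)            # torch.Size([6, 32]) - hidden states of the last layer
print(h.shape)              # torch.Size([2, 32]) - final hidden state of each layer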

I hope that this post has given you an intuition for how LSTMs work. Even if our ambition is mainly to understand transformers, LSTMs are still relevant, as the architectures that we will meet when discussing transformers, like encoder-decoder models and attention layers, and training methods like teacher forcing, were originally developed for RNNs and LSTMs.

There is one last ingredient that we will need to start working on our first project – sampling, i.e. using the output of a neural network to create the next token or word in a sequence. So far, our approach to this has been a bit naive, as we just picked the token with the highest probability. In the next post, we will discuss more advanced methods to do this.

Mastering large language models – Part IV: learning how to count

In the previous post, we discussed recurrent neural networks in the context of language processing, but in fact they can be used to learn any type of data structured as a time series. To make sure that we really understand how this works before proceeding to more complex models, we will spend some time today training a simple RNN on a very specific task – we will teach it how to count.

As usual, this post comes with a notebook that you can either run locally (please follow the instructions in the README to set up a local environment, no GPU needed for today) or in Google Colab using this link. So I will not go through all the details in the code, but focus on the less obvious parts of it.

First, let us discuss our dataset. Today we will in fact not tackle language-related tasks, but simply train a model to predict the next element in a sequence. These elements are numbers between 0 and 127, and the sequences that make up our training data set are simply all sequences of six consecutive numbers in this range, like [3,4,5,6,7,8] or [56,57,58,59,60,61]. The task that the model will learn is to predict the next element in a given sequence. If, for example, we present it the sequence [3,4,5], we expect it to predict that the next element is 6.

So our dataset is really simple, the only actual work that we have to do is to convert our data items into tensors. Our input data will be one-hot encoded, so that a sequence of length L has shape (L,V) where V = 128. Our targets will just be labels, so the targets for a sequence of length L will be L labels, i.e. a tensor of shape L. Here is the code to generate an item in the dataset.

#
# Input at index is the sequence of length L 
# starting at index
#
inputs = torch.arange(index, index + self._L, dtype = torch.long)
targets = torch.arange(index + 1, index + self._L + 1, dtype = torch.long)    
#
# Convert inputs to one-hot encoding
#
inputs = torch.nn.functional.one_hot(inputs, num_classes = self._V)
inputs = inputs.to(torch.float32)

Next, let us discuss our model. The heart of the model will of course be an RNN. The input dimension will be V, as we plan to present the input as one-hot encoded vectors. We have already seen the forward function of the RNN in the last blog post, and it is not difficult to put this into a class that is a torch.nn.Module. Keep in mind, however, that the weights need to be wrapped into instances of torch.nn.Parameter so that they are detected by the optimizer during learning.

The output of the RNN will be the output of the hidden layer and will be of shape (L,D), where L is the length of the input sequence and D is the inner dimension of the model. To predict the next elements of the sequence from this, we add a linear layer that maps this back into a tensor of shape (L, V). We then take the last element of the output, which is a tensor of shape V, and apply a softmax to get a probability distribution. To make a prediction, we could now either sample according to this multinomial distribution, or just take the element with the highest probability weight – we will discuss more advanced sampling methods in a later post.

So here is the code for our model – note that we again allow a previous hidden layer value to be used as optional input.

class MyModel(torch.nn.Module):

  def __init__(self, d_in, d_hidden):
    self._d_hidden = d_hidden
    self._d_in = d_in
    super().__init__()
    self._rnn = RNN(d_in = d_in, d_hidden = d_hidden)
    self._linear = torch.nn.Linear(in_features = d_hidden, out_features = d_in)

  def forward(self, x, h = None):
    rnn_out, hidden = self._rnn(x, h)
    out = self._linear(rnn_out)
    return out, hidden  

We can now train our model by putting the logic for the generation of samples above into a Torch dataset, firing up a data loader, instantiating a model and going through the usual training procedure. Our dataset is sufficiently small to make the model converge quickly (this is of course a massive case of overfitting, but for our purposes this is good enough). I used a hidden dimension of 32 and a batch size corresponding to half of the dataset, so that one epoch involves two gradient updates. In code, the training loop might look roughly like this (a sketch only – the dataset object ds, the device and the epoch count are placeholders, and we assume a batch-aware version of the RNN that handles input of shape (B, L, V)):
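model = MyModel(d_in=128, d_hidden=32).to(device)
dl = torch.utils.data.DataLoader(ds, batch_size=len(ds) // 2, shuffle=True)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(epochs):
    for inputs, targets in dl:
        optimizer.zero_grad()
        out, _ = model(inputs.to(device))
        # flatten batch and time dimensions before applying the loss
        loss = loss_fn(out.reshape(-1, 128), targets.to(device).reshape(-1))
        loss.backward()
        optimizer.step()

Here is a diagram showing the training loss per epoch over time.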

Having trained our model, we can now go ahead and make predictions. We have already indicated how this works. To predict the next item of a given sequence, we feed the sequence into the model – note that this sequence can be longer or shorter than those used during training. The output of the model will be a tensor of shape (L, V). We only use the last time step for prediction, apply a softmax to it and pick the element with the highest weight.

#
# Input is the sequence [7,8,9,10]
#
input = torch.arange(7, 11, dtype=torch.long)
print(input)
input = torch.nn.functional.one_hot(input, num_classes = V)
input = input.to(torch.float32)
out, hidden = model(input.to(device))
#
# Output has shape (L, V) 
# Strip off last output and apply softmax
# to obtain a probability distribution p of length V
#
p = torch.softmax(out[-1], dim = -1)
#
# Predict
#
guess = torch.argmax(p).item()
print(guess)

If everything worked, the result will be 11, as expected, so our model learns what it is supposed to learn.

There is an important lesson to learn from this simple example. During inference, the output that we actually use is the output of the last time step, i.e. of the last iteration inside the forward method (this is not the case for all tasks on which RNNs are typically trained, but for many of them). At this point, the model only has access to the last input x[t], so all information about previous time steps that the model needs to make a prediction has to be contained in the hidden layer. In that sense, the hidden layer really serves as a memory and helps the model to remember previously seen input within the same sequence.

Of course our example is a bit of an exception, as the model only needs the last value to make the actual prediction. In the next post, we will challenge the model a bit more and ask it to make a prediction that really requires a memory, namely to predict the first element of a sequence, which the model thus needs to remember until the very last element has been processed. We will see nicely that this gets more difficult as the sequence length grows, and we will discuss a special type of RNN called long short-term memory networks (LSTMs for short) that have been designed to improve the ability of a network to learn long-range dependencies.

Mastering large language models – Part II: Words and vector spaces

In the last post, we saw how a text is pre-processed to make it accessible for a neural network: the first step is to convert the text into a sequence of numbers, where each number is the index of the corresponding word in a vocabulary. Let us now discuss how we can convert each of these numbers into a vector.

Most machine learning models which are used for natural language processing have a property called the model dimension which we will abbreviate by D. A model dimension of, say, 768, simply means that internally, all words are represented by vectors in a vector space of dimension 768. Thus, a single word is a one-dimensional tensor of length D = 768. Our task is therefore to assign to each word in a vocabulary of size V a vector in a D-dimensional space. This assignment, called the embedding, can be nicely represented as a matrix of dimension D x V, so that the column at position i represents the word with index i in the vocabulary.

Of course there are endless possibilities to construct such an embedding. In most cases, the embedding is a learned parameter, i.e. we start training with a randomly initialized embedding and then apply gradient descent to the embedding matrix as to any other parameter during learning. However, it has become increasingly popular to use an embedding which has already been pre-trained so that training does not start from zero and the model hopefully converges faster. One method that facilitates such a pre-training is called the word2vec algorithm.

The idea of the word2vec algorithm (and of many other approaches to constructing embeddings) is to start with a larger model that contains the embedding we wish to train and a second model part that is adapted to a certain downstream task. We then train the entire model on this downstream task, hoping that the embedding layer will capture not only the specific information required for the downstream task, but more general patterns that are useful for other tasks as well. We then throw away the upper part of the model and reuse the embedding layer for other tasks.

The diagram above shows this architecture. The model consists of an embedding layer which translates a word represented by an index between 0 and V – 1 into a vector of dimension D, the internal model dimension. You can think of this as the combination of a one-hot encoding that turns the index into a vector of dimension V and a linear layer without bias that projects onto D dimensions. This part of the model is, once trained, the actual artefact that we might reuse in other, more complex models.
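To see why an embedding lookup and the combination of a one-hot encoding with a bias-free linear layer are the same thing, here is a tiny sketch (sizes chosen arbitrarily for illustration):

import torch

V, D = 8, 3                       # tiny vocabulary and model dimension
E = torch.randn(D, V)             # embedding matrix of shape D x V
i = 5                             # index of a word in the vocabulary
one_hot = torch.nn.functional.one_hot(torch.tensor(i), num_classes=V).float()
# projecting the one-hot vector is the same as picking column i
assert torch.allclose(E @ one_hot, E[:, i])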

The upper part of the model is adapted to the specific downstream task on which word2vec is trained. The original paper actually describes two downstream tasks, called CBOW and Skipgram; we will focus on CBOW in this post.

Before describing CBOW, let us first try to explain the underlying objective of the training. We want to construct embeddings that capture not only a word itself, but the meaning of a word. Put differently, we want words that have a similar meaning to end up as nearby vectors. To make this precise, we have to define a notion of similarity for our embeddings, i.e. for D-dimensional vectors, and for words.

For vectors, this is easy. In two-dimensional linear algebra, we would call two vectors similar if they point roughly in the same direction, i.e. if the angle between them is small, or in other words if the cosine of the angle is close to one. In D-dimensional space, we cannot directly visualize an angle, but the dot product gives us a perfectly good replacement for its cosine. So to measure the similarity of two vectors x and y, we take the normed dot product and simply define this to be the cosine

\textrm{cos}(x,y) =  \frac{\langle x, y \rangle}{|x| \cdot |y|}
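In PyTorch, this quantity can be computed directly – a quick sketch with two random vectors:

import torch

x, y = torch.randn(768), torch.randn(768)
cos = torch.dot(x, y) / (x.norm() * y.norm())
# the built-in function computes the same value
assert torch.allclose(cos, torch.nn.functional.cosine_similarity(x, y, dim=0))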

Defining similarity between words is a bit more complicated. The approach that word2vec takes is to assume that two words have a similar meaning if they tend to appear in the same context, i.e. surrounded by similar sets of words.

To explain this, let us consider a simple sentence (I did not make this up, this sentence actually appears almost verbatim in the training data that we will use later).

“A team of 24 players was selected from an initial pool of 49 candidates”

Let us pick a word in this sentence, say "from". We can then define the context of this center word to be the set of all words in the sentence that appear within a certain range around it. For example, if we choose a window size of four, the region that makes up the context extends two words to the left and two words to the right of the center word. Thus, the context words for the center word "from" are

“was”, “selected”, “an”, “initial”

So the context of a center word is simply the set of all words in the window around the center without the center word itself. The idea of word2vec is that the meaning of a word can be captured by the context words that appear in combination with it. If two center words appear most of the time surrounded by the same context words, then they are considered to have a similar meaning.
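For the example above, extracting the context is a matter of two slices (a toy sketch, splitting naively along spaces):

sentence = "a team of 24 players was selected from an initial pool of 49 candidates".split()
center = sentence.index("from")
half = 2     # window size of four: two words on each side
context = sentence[center - half:center] + sentence[center + 1:center + half + 1]
print(context)     # ['was', 'selected', 'an', 'initial']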

To see that this makes sense, consider another example – the sentence "The mighty king is sitting on a golden throne". If we replace "king" by "ruler", the resulting sentence would still be likely to appear in a large corpus of text. As the words "king" and "ruler" can replace each other in their respective contexts while still making sense, we consider them to have a similar meaning.

To turn this idea into a training objective, the CBOW algorithm proceeds as follows. First, we go through our data and determine the context for each center word as above. Each pair of center word and context constitutes one training sample. We then train the model to predict the center word from the given context. More specifically, we first turn the context into a single vector by feeding each context word into the embedding layer and taking the average of the resulting vectors (this is why the model is called CBOW, short for "continuous bag of words" – taking the average ignores the positions of the words in the context). We now have a single vector of dimension D, which we use as input for a second, linear layer that turns it back into a vector of dimension V. Applying a softmax to this vector gives us a probability distribution over the vocabulary. The target is our center word, so we can apply the usual cross entropy loss function. In essence, CBOW is a classification problem in which the label is the center word and the input is the averaged context.

Note that neither the embedding layer nor the linear layer has a bias, so both are fully determined by their weight matrices, say U and V. The function to which we apply the softmax is then essentially the matrix product of U with the transpose of V, whose entries are dot products of rows of U with rows of V – and both sets of rows can be interpreted as embeddings. Writing this out in terms of scalar products, you can see the cosines emerging and develop an intuition for why this training objective fosters the learning of similarities. To spell this out for a single sample with context C: if u_c denotes the row of U belonging to context word c, and v_w the row of V belonging to a candidate center word w, then the pre-softmax score of w is
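\bar{u} = \frac{1}{|C|} \sum_{c \in C} u_c \qquad \textrm{score}(w) = \langle v_w, \bar{u} \rangle

Training increases the score of the true center word, thereby pushing its center embedding v_w towards the averaged context embedding – up to normalization, this is exactly the cosine similarity defined above. But instead of diving deeper into this, let us go ahead and discuss the implementation in PyTorch.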

To implement the embedding layer without having to make the one-hot encoding explicit, we can use the torch.nn.Embedding class provided by PyTorch. The forward method of this module accepts an index or a sequence of indices and returns a vector or a sequence of vectors, which is exactly what we need. The output layer is an ordinary linear layer. Following the usual practice, we do not add the final softmax layer to our model but use the cross entropy loss function which includes the calculation of the softmax. With this, our model is rather simple:

EMBED_MAX_NORM = 1

class CBOW(torch.nn.Module):
    
    def __init__(self, model_dim, V, bias = False):
        super().__init__()
        self.embedding = torch.nn.Embedding(
            num_embeddings=V,
            embedding_dim=model_dim,
            max_norm=EMBED_MAX_NORM,
        )
        self.linear = torch.nn.Linear(
            in_features=model_dim,
            out_features=V,
            bias = bias
        )
        
        
    def forward(self, Y):
        E = self.embedding(Y).mean(axis=1)
        U = self.linear(E)
        return U

Note the max_norm parameter which re-normalizes the embeddings if they exceed a certain size. Setting this parameter turns out to be helpful during training (a fact that I discovered after reading this excellent blog post by O. Chernytska which turned out to be a very valuable resource when training my model).

We see that the forward method does what we sketched earlier – we first apply the embedding, which gives us a tensor of shape (B, W, D), where B is the batch size, W is the window size and D is the model dimension. We then take the mean along the middle dimension, i.e. the mean of the embeddings of all words in the context, which gives us a tensor of shape (B, D). Finally, we apply the output layer to get a tensor of shape (B, V), which we use as input to our loss function.

Let us now discuss our data and the preprocessing. To train our model, we will use the WikiText2 dataset, which consists of roughly 2 million tokens taken from Wikipedia and is available via the Torchtext library. Each item in the dataset is a paragraph. Some paragraphs consist of a title only, which we remove. We then apply a tokenizer to the remaining paragraphs and collect them in one large list, in which each item is again a list of tokens.

ds = torchtext.datasets.WikiText2(split="train")
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")
paragraphs = []
for item in ds:
    # Remove leading whitespace and the special character "@"
    item = re.sub(r"^\s+", "", item)
    item = re.sub("@", "", item)
    if not re.match("^=", item):
        p = tokenizer(item)
        if len(p):
            paragraphs.append(p)

Next, we build a vocabulary, again using the Torchtext library. We add a special token "<unk>" to the vocabulary that stands for an unknown word (and is in fact already present in the input data). We also only add tokens to the vocabulary that appear at least a given number of times, i.e. have a certain minimum frequency.

vocab = torchtext.vocab.build_vocab_from_iterator(paragraphs, 
                                                  min_freq=min_freq,
                                                  specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
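We can then encode the paragraphs as usual using our vocabulary – lookup_indices maps a list of tokens to the corresponding list of indices:

encoded_paragraphs = [vocab.lookup_indices(p) for p in paragraphs]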

Next, we need to create pairs of center word and context out of our training data. Here it is helpful that we have maintained the paragraph structure, as a context that spans paragraph boundaries is probably less useful, so we want to avoid this. One way to generate the center/context pairs is a Python generator function, like this.

def yield_context(paragraphs, window_size = 8):
    for p in paragraphs:
        half = window_size // 2
        #
        # Use each token in the paragraph as a center word, carve out
        # the window around it and yield the center together with its context
        #
        for index, center in enumerate(p):
            context = p[max(0, index - half):index]
            context.extend(p[index + 1:min(len(p), index + half + 1)])
            yield center, context

Here we visit each position in each paragraph and use it as the center position. We then carve out half of the window size to the left of the center token and half of the window size to the right, and concatenate the resulting lists, which we yield along with the center.

To train our model, we can put all this code into a PyTorch dataset so that we can use it along with a PyTorch data loader. Training is now straightforward. I have used an initial learning rate of 0.1 for a model dimension of 300 and a batch size of 20000. I apply a linear rate scheduler and use an Adam optimizer.
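Here is roughly what this looks like in code – a sketch only, assuming that the dataset ds yields batches of center words and fixed-length context tensors, with device and epochs as placeholders and torch.optim.lr_scheduler.LinearLR standing in for the linear schedule:

model = CBOW(model_dim=300, V=len(vocab)).to(device)
dl = torch.utils.data.DataLoader(ds, batch_size=20000, shuffle=True)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.05, total_iters=epochs)
for epoch in range(epochs):
    for center, context in dl:
        optimizer.zero_grad()
        out = model(context.to(device))
        loss = loss_fn(out, center.to(device))
        loss.backward()
        optimizer.step()
    scheduler.step()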

To make it a bit easier for you to try this out, I put together all the code in a notebook that is available in the GitHub repository for this series. To run this, you have several options. First, you can run it locally (or in your favorite cloud environment, of course) by cloning the repository and following the instructions in the README, i.e.

git clone https://github.com/christianb93/MLLM
cd MLLM
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 install ipykernel
python3 -m ipykernel install --user --name=MLLM
jupyter-lab

Then navigate to the notebook in the word2vec directory and run it. Alternatively, you can run it in Google Colab by simply clicking on this link. Note that the second cell will install the portalocker package if not yet present, so you might have to restart the runtime afterwards to make sure that the package can be used (use Runtime – restart and run all if you get an error in the third cell saying that portalocker cannot be found).

With the chosen parameters, training is rather smooth and takes less than 10 epochs to achieve reasonable results. In the notebook, I train for 7 epochs to achieve a mean training loss of a bit more than 4.5 in the last epoch (to keep things simple, I do not measure the validation loss).

Once we have trained the model, we can check that it does what we are after – making sure that similar or semantically related words receive similar embeddings. To verify this, let us pick a word from the vocabulary and find the words in the vocabulary that are closest to it, i.e. have the largest cosines. To do this, we extract the weights (which PyTorch stores internally with shape (V, D), so that the rows are the embeddings) using the weight attribute of torch.nn.Embedding. Given the embedding of a fixed token, we then need to take the dot product of each row of the weight matrix with the vector representing the fixed token, which can conveniently be organized as a matrix multiplication. We can then sort the resulting vector and extract the five largest entries. Here is a short piece of code doing this.

def print_most_similar(token, embeddings, vocab):
    #
    # Normalize embeddings
    #
    _embeddings = torch.nn.functional.normalize(embeddings, dim = 1)
    #
    # get u, the embedding of our token
    #
    u = _embeddings[vocab[token], :]
    #
    # do dot products as one large matrix multiplication
    #
    v = torch.matmul(_embeddings, u)
    #
    # Sort this
    #
    values, indices = torch.sort(v, descending=True)
    print(f"Most similar token for {token}")
    for i in range(5):
        print(f"      {vocab.lookup_token(indices[i])} -- {values[i]}")

If we run this after training for the word “king”, the results we get (which are to some extent random, so you might get slightly different results if you run this yourself) are

king -- 0.9999999403953552
son -- 0.5954159498214722
earl -- 0.59091717004776
archbishop -- 0.57264244556427
pope -- 0.5617164969444275

This is not bad for two minutes of training! Except for the second one, all of the others clearly denote some sort of ruler and will in that sense probably appear in semantic roles similar to those of the word "king".

There is one more experiment that we can make. Remember that the second layer of our model converts the vectors from the internal dimension back into the vocabulary dimension. This is a linear layer with a weight matrix that has the same shape as that of the embedding! Thus we actually learn two embeddings – the embedding that we have modelled as a torch.nn.Embedding layer and apply to the context words (the context embedding), and the embedding implicit in the torch.nn.Linear layer, which one might call the center embedding. We can repeat the test above with the center embedding (again, look at the notebook for the details) and get a very similar output.
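Since the weight of a torch.nn.Linear layer has shape (V, D), its rows are exactly these center embeddings, and we can feed them directly into the function above:

center_embeddings = model.linear.weight.detach()
print_most_similar("king", center_embeddings, vocab)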

 king -- 1.0
 pope -- 0.6784377694129944
 lord -- 0.6100903749465942
 henry -- 0.5989689230918884
 queen -- 0.5779016017913818

With this, let us close this post for today. If you want to read more on word2vec, the Skipgram mechanism that we have not presented, and more advanced versions of the algorithm, I have listed a few valuable reads below. In this series, we will continue with an introduction to RNNs, a generation of network architectures that is important to understand, as many training methods and terms used for transformers go back to them.

[1] Chapter 6 of “Speech and Language Processing” by Jurafsky and Martin
[2] Section 15.1 of “Dive into Deep Learning”
[3] The original paper introducing word2vec available here
[4] A PyTorch implementation by O. Chernytska
[5] The illustrated word2vec by J. Alammar

Mastering large language models – Part I: Overview

In the history of AI, progress has always come from several sources – more powerful hardware, more high-quality training data or refined training methods. And sometimes, we have seen a step change triggered by a new and innovative generation of models.

Some of you might remember the time when the term Deep Learning was coined, referring to machine learning models consisting of several stacked layers of neural networks. Later, convolutional neural networks (CNNs) took computer vision and image recognition by storm, and recurrent neural networks, in particular LSTMs, were widely applied to boost applications in natural language processing like machine translation or sentiment analysis.

Similarly, the recent revolution in natural language processing has been triggered by a novel type of neural network called the Transformer, which is very well adapted to the specific challenges of processing long sentences. In fact, all of the currently hyped language models like GPT, Google's LaMDA or Facebook's LLaMA are based on the transformer architecture – so this is clearly something that everybody interested in machine learning should understand.

This is what we will do in this series – take a deeper look at the transformer architecture, understand how these models are implemented and trained, and learn how pre-trained, publicly available models can easily be downloaded and used. We will cover both the theory, referring to the original papers on the subject where that makes sense, and the practice, using the PyTorch framework to implement some networks and training scripts ourselves.

More specifically, we will cover the following topics in this series:

Obviously, we cannot start from zero to be able to cover all this. So I assume some familiarity with machine learning and neural networks (if this is new to you, you might want to read some of the excellent available introductions, like chapter 7 of Speech and Language Processing by Jurafsky and Martin, which also covers parts of what we will discuss, chapters 3 and 4 of Dive into Deep Learning, or chapters 5 – 7 of Machine Learning with Neural Networks by B. Mehlig). I also assume that you understand the basics of PyTorch; if not, I recommend the excellent introduction to PyTorch which is part of the official documentation.

Tokenization, vocabularies and basic tasks in NLP

After this short outlook on what's ahead, let us dive right into the content. We start by explaining some terms that you will encounter over and over again if you get into the field of natural language processing (NLP).

First, let us discuss some of the common problems that NLP tries to solve. Some of these problems can be described as a classification task. The input that machine learning models receive is a text, consisting of a sequence of words or, more generally, tokens (we will get to this in a minute), and the output is a label. An example of this type of task is sentiment analysis. The model is given a text, for instance a review of a movie or a book, and is asked to predict whether the overall sentiment of the review is positive or negative. Note that the input to the model, the text, has a linear structure – each word has a position, so there is a notion of time in the input – while the output is simply a label.

A second class of tasks is sometimes called sequence-to-sequence and consists of receiving a sequence of words as an input and providing a sequence of words as output. The prime example is machine translation, which receives a sequence of words in the source language as input and produces a translation into the target language. Note that the length of the target and length of the source are, in general, different.

Finally, a third class of problems (there are many more in NLP) that we will be concerned with is text generation. Here the task is to create a text which appears natural and fluent (and of course grammatically correct) from scratch, maybe by completing a short piece of text fed into the model called the prompt. We will see later that machine translation can actually be expressed as conditional text generation, i.e. text generation given some context, which in the case of a translation task will be an encoding of the input sequence.

For all of these tasks, we will have to find a reasonable way to encode a sequence of words as numbers or, more precisely, as vectors. Typically, this proceeds in several steps. The first step is tokenization, which means that we break down our text into a list of tokens. Initially, a token will be a word or a punctuation character, but later we will turn to more general tokens which can be parts of words or even individual characters. The most straightforward way to do this is to split the text along spaces, i.e. to do something like

text = "My name is John. What is your name?"
token = text.split()
print(token)

which would give the output

['My', 'name', 'is', 'John.', 'What', 'is', 'your', 'name?']

This is simple, maybe too simple. We keep, for instance, a punctuation mark attached to the word it follows, which might not be a good idea, as a punctuation mark has an independent syntactical meaning. We also do not convert our words to lower- or uppercase, so that "Name" and "name" would be different tokens. There are many ready-to-use implementations of more sophisticated approaches. For now, we will use the tokenizer that is part of the Torchtext library. To use this, please make sure that you have torch and torchtext installed; I have used version 2.0.0 of PyTorch and version 0.15.1 of Torchtext, but older versions should work as well. We can then tokenize the same text as before as follows.

import torchtext

tokenizer = torchtext.data.utils.get_tokenizer("basic_english")
print(tokenizer(text))

This time, the output that we get is

['my', 'name', 'is', 'john', '.', 'what', 'is', 'your', 'name', '?']

We see that the tokenizer has converted all words to lower case and has turned punctuation marks into individual tokens.

The next stage consists of building a list of all known tokens that appear in our text, i.e. of all unique tokens. This can conveniently be done using a counter. Here is a short code snippet that creates a list of all unique tokens.

import collections
token = tokenizer(text)
counter = collections.Counter(token)
vocabulary = [t for t in counter.keys()]
print(vocabulary)
# Output: ['my', 'name', 'is', 'john', '.', 'what', 'your', '?']

So far, our tokens are still words. To feed them into a neural network, we have to encode them as numbers. For that purpose, we replace each token in the original text by its index in the vocabulary, so that the initial text is turned into a sequence of numbers. Note that this sequence still preserves the sequential structure, i.e. the order of the numbers is the same as the order of the corresponding words in the original sentence (there are other models, commonly referred to as bag-of-words models, in which only the unordered set of tokens is considered).

stois = dict()
for idx, t in enumerate(vocabulary):
    stois[t] = idx
encoded_text = [stois[t] for t in token]
print(encoded_text)
# Output: [0, 1, 2, 3, 4, 5, 2, 6, 1, 7]

Of course, we can revert this process by replacing each index in the list by the corresponding token, a process known as decoding.

decoded_text = " ".join([vocabulary[idx] for idx in encoded_text])
print(decoded_text)
# Output: my name is john . what is your name ?

Most tokenizers will, in addition to the tokens generated by identifying words in the text, use additional special tokens that represent, for instance, unknown words (i.e. words which are not in the vocabulary as they were not part of the text used to build the vocabulary) or the end or beginning of a sentence.

At this point, we have converted text into a sequence of numbers. In order to be meaningful as input to a neural network, we now have to turn each of these numbers into a vector. A straightforward approach would be to use one-hot encoding. Suppose that our vocabulary has V items. Then we can turn an index i into a vector in V-dimensional space which is one at position i and zero at all other positions. In other words, the encodings form a basis of the vector space on which the model will then operate. Here is what this looks like for a tiny vocabulary:
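import torch

index = torch.tensor(3)     # index of a word in a vocabulary with V = 8 items
print(torch.nn.functional.one_hot(index, num_classes=8))
# tensor([0, 0, 0, 1, 0, 0, 0, 0])

This encoding is simple, but has two major drawbacks. First, it treats all words in the same way, regardless of their meaning. It would be nice to have an embedding that translates words into vectors in such a way that similar words end up as somehow similar vectors. Second, the vector space becomes huge. A vocabulary can easily comprise 50,000 or more tokens, so our vector space would have 50,000 dimensions, blowing up the model unnecessarily. For these reasons, other procedures to turn words into vectors are more common; they will be the topic of the next post. If you want to try out what we have discussed today, you can download a notebook here and play with it.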

More on Paperspace Gradient

It's been a few days since I started to play with Paperspace, and I have come across a couple of interesting features that the platform has – enough for a second post on this topic.

First, GIT integration. Recall that the usual process is to zip the current working directory and submit the resulting file along with the job; the ZIP file is then unzipped in the container in which the job runs, and its contents constitute the working directory. However, if you want to run code that requires, for instance, custom libraries, it is much easier to instruct Paperspace to pull the contents of the working directory from GitHub. You can do that by supplying a GIT URL using the --workspace switch. The example below, for instance, instructs Paperspace to pull my code for an RBM from GitHub and to run it as a job.

#!/bin/sh
#
# Run the RBM as a job on Paperspace. Assume that you have the paperspace NodeJS
# CLI and have done a paperspace login before to store your credentials
#
#
~/node_modules/.bin/paperspace jobs create  \
        --workspace "git+https://github.com/christianb93/MachineLearning" \
        --command "export MPLBACKEND=AGG ; python3 RBM.py \
        --N=28 --data=MNIST \
        --save=1 \
        --tmpdir=/artifacts \
        --hidden=128 \
        --pattern=256 --batch_size=128 \
        --epochs=40000 \
        --run_samples=1 \
        --sample_size=6,6 \
        --beta=1.0 --sample=200000 \
        --algorithm=PCDTF --precision=32" \
        --machineType K80 \
        --container "paperspace/tensorflow-python" \
        --project "MachineLearning"

Be careful – the URL must be spelled exactly like this to be recognized as a GIT URL, i.e. "git+https" followed by the hostname without the "www". If you use http instead of https, or http://www.github.com instead of github.com, the job will fail (the documentation could be better at this point, and I even had to look at the source code of the CLI to figure out the syntax). This is a nice feature: using it along with the job logs, I can easily reconstruct which version of the code has actually been executed, and it supports teams that share GitHub repositories well.

Quite recently, Paperspace apparently also added the option to use persistent storage in jobs to store data across job runs (see this announcement). Theoretically, the storage should be shared between notebooks and jobs in the same region, but as I have not yet found out how to start a notebook in a specific region, I could not try this out.

Another feature that I liked is that the container that you specify can actually be any container from the Docker Hub, for instance ubuntu. The only restriction is that Paperspace seems to overwrite the entrypoint in any case and will try to run bash inside the container to finally execute the command that you provide, so containers that do not have bash in the standard execution path will not work. Still, you could use this to prepare your own containers, maybe with pre-installed data sets or libraries, and ask Paperspace to run them.

Finally, for those of us who are Python addicts, there is also a Python API for submitting and managing jobs in Paperspace. Actually, this API offers you two ways to run a Python script on Paperspace. First, you can import the paperspace package into your script and then, inside the script, do a paperspace.run(), as in the following example.

import paperspace
paperspace.run()
print('This will only be running on Paperspace')

What will happen behind the scenes is that the paperspace module takes your code, removes any occurrences of the paperspace package itself, puts the code into a temporary file and submits that as a job to Paperspace. You can then work with that job as with any other job, like monitoring it on the console or via the CLI.

That is nice and easy, but not everyone likes to hardcode the execution environment into the code. Fortunately, you can also simply import the paperspace package and use it to submit an arbitrary job, much like the Node.js based CLI does. The code below demonstrates how to create a job using the Python API and download the output automatically (this script can also be found on GitHub).

import paperspace.jobs
from paperspace.login import apikey
import paperspace.config

import requests


#
# Define parameters
#
params = {}
# 
# We want to use GIT, so we use the parameter workspaceFileName
# instead of workspace
#
params['workspaceFileName'] = "git+https://github.com/christianb93/MachineLearning"
params['machineType'] = "K80"
params['command'] = "export MPLBACKEND=AGG ; python3 RBM.py \
                --N=28 --data=MNIST \
                --save=1 \
                --tmpdir=/artifacts \
                --hidden=128 \
                --pattern=256 --batch_size=128 \
                --epochs=40000 \
                --run_samples=1 \
                --sample_size=6,6 \
                --beta=1.0 --sample=200000 \
                --algorithm=PCDTF --precision=32"
params['container'] = 'paperspace/tensorflow-python'
params['project'] = "MachineLearning"
params['dest'] = "/tmp"

#
# Get API key
#
apiKey = apikey()
print("Using API key ", apiKey)
#
# Create the job. We do NOT use the create method as it cannot
# handle the GIT feature, but assemble the request ourselves
#
http_method = 'POST'
path = '/' + 'jobs' + '/' + 'createJob'


r = requests.request(http_method, paperspace.config.CONFIG_HOST + path,
                             headers={'x-api-key': apiKey},
                             params=params, files={})
job = r.json()
    
if 'id' not in job:
    print("Error, could not get jobId")
    print(job)
    exit(1)

jobId = job['id']
print("Started job with jobId ", jobId)
params['jobId']  = jobId

#
# Now poll until the job is complete

if job['state'] == 'Pending':
    print('Waiting for job to run...')
    job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Running'})

print("Job is now running")
print("Use the following command to observe its logs: ~/node_modules/.bin/paperspace jobs logs --jobId ", jobId, "--tail")

job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Stopped'})
print("Job is complete: ", job)

#
# Finally get artifacts
#
print("Downloading artifacts to directory ", params['dest'])
paperspace.jobs.artifactsGet(params)

There are some additional features that the Python API seems to have that I have not yet tried out. First, you can apparently specify an init script that will be run before the command that you provide (though the use of that is limited, as you could put this into your command as well). Second, and more important, you can provide a requirements file according to the pip standard to ask Paperspace to install any libraries that are not available in the container before running your command.

Overall, my impression is that these APIs make it comparatively easy to work with jobs on Paperspace. You can submit jobs, monitor them and get their outputs, and you enjoy the benefit that you are only billed for the actual duration of the job. So if you are interested in a job based execution environment for your Machine Learning models, it is definitely worth a try, even though it takes some time to get familiar with the environment.

First steps with Paperspace Gradient

So far, I have exclusively been using AWS EC2 when I needed access to a GPU – not because I have carefully compared the available offerings and taken a deliberate decision, but simply because I already had an EC2 account and know the platform.

However, I thought it would be interesting to try out other platforms as well. In this post, I will talk a bit about my experience with Paperspace. This provider has several offerings – Core, which is basically an IaaS service, and Gradient, which allows you to access Jupyter notebooks and run jobs in ready-made environments optimized for machine learning – and of course I wanted to try this.

It should be noted that some time passed between trying this out for the first time (roughly in May) and the publication of this post in July, so bear with me if some of the details have changed in the meantime – Paperspace is still under development.

First steps

After signing up, you are routed to a page where you can choose between two products – Paperspace Core and Paperspace Gradient. I did choose Gradient (after providing the requested credit card information). The first thing I did try was to bring up a Jupyter notebook.

When you select that option, you have to make two choices. First, Jupyter notebooks are started in Docker containers, and you have to pick one of the available containers. Second, and more important, you have to select a machine – you have a choice between several CPU based and several GPU based models with different fees associated with them.

After a few seconds, your notebook is up and running (with the base account, you can only have one notebook server at any point in time). If you hit “Open”, a new tab will open and you will see the usual Jupyter home screen.

Your notebook folder will be prepopulated with some tutorials. The one I tried first is one of the classical MNIST / CNN tutorials. Unfortunately, when I tried to run it, the kernel died several times in a row – not very encouraging (it worked two days later; overall, there seem to be a few sporadic errors that come and go over time).

Next, I could not resist the temptation to open a terminal. The Docker image seems to be based on a very basic Ubuntu distribution. I could successfully do an apt-get update && apt-get install git. So you could probably start to download things and work directly from the console – but of course this is not really the idea.

After playing for some time with the notebook, you can – again on the Paperspace notebook console – stop your notebook (make sure to do this, as you will be charged while the notebook is running). Once the notebook has stopped, you can click on the little arrow to the right of the notebook name, which will give you the option to download any files in the notebook directory that you created during your session.

Once stopped, there is no way to restart a notebook, but you can clone a notebook which will create a copy of the previous notebook and start that copy, so you can continue to work where you left off. This works, but is a bit tiresome as you have to delete the obsolete copy manually.

Jobs

The next thing I tried is to create a job. For that purpose, you will first have to install the Paperspace CLI which in turn requires node.js and npm. So here is what you need to do on Ubuntu:

$ cd ~
$ apt install nodejs
$ apt install npm
$ npm install paperspace-node
$ sudo ln -s /usr/bin/nodejs /usr/bin/node

This will create a directory node_modules in your home directory and within that directory, a directory .bin. To test the paperspace CLI, you can run a command like

$ ./node_modules/.bin/paperspace --version

Next, switch to an empty directory and in that directory, run

$ ~/node_modules/.bin/paperspace project init

This will initialize a new Paperspace project, i.e. it will create a subdirectory .ps_project containing a JSON configuration file. Next, you need an API key that you can get on your Paperspace home page. The API key is an authentication token that is used by the API – store that token in a safe place.

Once we have that token, it is time to start our first job.

$ ~/node_modules/.bin/paperspace jobs create --container Test-Container --command "nvidia-smi" --apiKey "xxxxx" --machineType K80

where xxxxx needs to be replaced by your API key. Instead of providing your API key with every command, you can also run

$ ~/node_modules/.bin/paperspace login

which will add your credentials to a file in the .paperspace directory in your home directory.

Essentially, what happens when you run a job is that the local directory and all its subdirectories are zipped into a file, a container is set up on a Paperspace server, the content of the ZIP file is extracted into this container, and the command that you have specified is executed.

You can now get a list of your processes and their status either on the Paperspace console, where you also have immediate access to the log output, or from the command line using

$ ~/node_modules/.bin/paperspace jobs list

At this point, I was again a bit disappointed – the job appeared to be running and was even displayed in the web console, but when it completed, I got an error “503 – Service unavailable” and no log output was provided. I raised a support request, and roughly two hours later the submission suddenly worked – I have not found out whether support actually did anything or whether part of the infrastructure was simply down at this point in time.

As a temporary workaround, I managed to run a job by redirecting error output and standard output to a file. For instance, to run the script KMeans.py, I used

$ ~/node_modules/.bin/paperspace jobs create --command "export MPLBACKEND=AGG ; python KMeans.py > /artifacts/log 2>&1" --machineType C2

Once the job is complete, you can download whatever it has added to the directory /artifacts using

$ ~/node_modules/.bin/paperspace jobs artifactsGet --jobId "js26y3pi6jk056"

where the job ID is the one displayed by the create command. Finally, the command jobs destroy --jobId=... can be used to delete a job after execution.

So far, I have to admit that I am not so happy with what I have seen. I hit upon several issues in the standard setup, and when playing around, I found that it can take a long time for a job to be scheduled on a GPU. This depends very much on the machine type – my impression was that for machine types for which Paperspace uses GCP, like K80 or P100, your job will run quickly, but for other types like GPU+ it can take a long time. In addition, as everything is running in a container, the initial steps in a job can be time consuming. TensorFlow, for instance, is known to take longer when it is started for the first time, and in a fresh container, every time is a first time, so you will see a significant startup time. This gets worse if you need to download data sets, as this will have to be repeated with every new run. It is apparently not yet possible to mount a permanent volume into your container to avoid this or to reuse a stopped container (update: as of July, Paperspace has announced that the /storage directory is a persistent storage available across notebooks and jobs, but I have not yet tried this).

But maybe this is a premature judgement, so I decided to continue trying it out. In one of the next posts, I will present some more advanced features, and in particular the Python API that Paperspace offers.

The EM algorithm and Gaussian mixture models – part II

In this post, I will discuss the general form of the EM algorithm to obtain a maximum likelihood estimator for a model with latent variables.

First, let us describe our model. We suppose that we are given some joint distribution of a random variable X (the observed variables) and a random variable Z (the latent variables) and are interested in maximizing the likelihood of an observed sample x of the visible variable X. We also assume that the joint distribution depends on a parameter \Theta – in practice, this could be weights, bias terms or any other parameters. To simplify things a bit, we will also assume that the latent variable Z takes only finitely many values. Our aim is to maximize the log-likelihood, which we can – under these assumptions – express as follows.

\ln P(x |\Theta) = \ln \sum_z P(x,z | \Theta)

Even if the joint distribution belongs to some exponential family, the fact that we need to consider the logarithm of the sum and not the sum of the logarithms makes this expression and its gradient difficult to calculate. So let us look for a different approach.

To do this, let us assume that we are given a value \Theta of the parameter and let us try to understand how the likelihood changes if we pass from \Theta to some other value \Theta'. For that purpose, we introduce a term that is traditionally called Q and defined as follows (all this is a bit abstract, but will become clearer later when we do an example):

Q(\Theta'; \Theta) = E \left[  \ln P(x,z | \Theta')  | x, \Theta \right]

That looks a bit complicated, so let me explain the notation a bit. We want to define a function Q that will be a function of the new parameter value \Theta'. This function will, in addition, depend on the current value \Theta which we consider as a parameter.

The right hand side is an expectation value. In fact, for each value of the visible variable x and the parameter \Theta, we have a probability distribution on the space in which Z lives, given by the conditional probability of Z given x and \Theta. Whenever we have a function depending on z, we can therefore form the expectation value as usual, i.e. as the weighted sum over all function values, weighted by the probability of z. In particular, we can do this for the function \ln P(x,z | \Theta') of z. Thus the right hand side is, spelled out

E \left[  \ln P(x,z | \Theta')  | x, \Theta \right] = \sum_z \ln P(x,z | \Theta') P(z | x, \Theta)

That is now again a sum of logarithms, not a logarithm of a sum, and we can hope to be able to deal with this much better.

This is nice, but so far we have only introduced a rather complicated additional object – what do we gain? It turns out that, essentially, maximizing Q will effectively maximize the likelihood. Let us make this bold statement a bit more precise. Suppose we are able to iteratively maximize Q. Expressed formally, this would mean that we are able to find a sequence \Theta^{(0)}, \Theta^{(1)}, \dots of parameters such that when passing from \Theta^{(t)} to \Theta^{(t+1)}, the value of Q does not decrease, i.e.

Q(\Theta^{(t+1)};\Theta^{(t)}) \geq Q(\Theta^{(t)} ; \Theta^{(t)})

Then this very same sequence will be a sequence of parameters for which the log-likelihood is non-decreasing as well, i.e.

\ln P(x | \Theta^{(t+1)}) \geq \ln P(x | \Theta^{(t)})

I will not include a proof of this in this post (the proof identifies the difference between any two subsequent steps as a Kullback-Leibler divergence and uses the fact that a Kullback-Leibler divergence is never negative; you can find a proof in the references, in particular in [2], or in my more detailed notes on the EM algorithm and Gaussian mixture models, where I also briefly touch on convergence). Instead, let us try to understand how this can be used in practice.

Suppose that we have already constructed some parameter \Theta^{(t)}. The algorithm then proceeds in two steps. First, we calculate Q as a function of the new parameter \Theta'. As Q is defined as an expectation value, this step is called the expectation step. Once we have that, we try to maximize Q, i.e. we try to find a value \Theta^{(t+1)} for the parameter such that Q(\Theta^{(t+1)}; \Theta^{(t)}) is maximized. This part of the algorithm is therefore called the maximization step. Then we start over, using \Theta^{(t+1)} as the new starting point. Thus the algorithm alternates between an expectation and a maximization step, leading to the name EM algorithm.

To see how this works in practice, let us now return to our original example – Gaussian mixtures. Here the parameter \Theta is given by

\Theta = (\mu, \pi, \Sigma)

We consider a random variable X which has N components, each of which is a vector Xn in a d-dimensional space, corresponding to one sample vector (so in the language of machine learning, N will be our batch size). Similarly, Z consists of N components zn which again are subject to the restriction that only one of the znk is different from zero. We assume that the joint distribution of our model is given by

P(x , z) = \prod_n \prod_k \pi_k^{z_{nk}} {\mathcal N} (x_n ; \mu_k , \Sigma_k)^{z_{nk}}

Now we need to compute Q. The calculation is a bit lengthy, and I will skip most of it here (you can find the details in the references or my notes for this post). To state the result, we have to introduce a quantity called responsibility which is defined as follows.

r_{nk} = \frac{\pi_k {\mathcal N}(x_n ; \mu_k, \Sigma_k)} {\sum_{j} \pi_j {\mathcal N}(x_n ; \mu_j, \Sigma_j) }

From this definition, it is clear that the responsibility is always between zero and one, and in fact it has an interpretation as the (conditional) probability that sample n belongs to cluster k. Note that the responsibility is a function of the model parameters \mu, \pi and \Sigma. Using this notation, we can now write down the result of calculating the Q function:

Q(\Theta'; \Theta) = \sum_n \sum_k r_{nk} \ln \pi'_k + \sum_n \sum_k r_{nk} \ln {\mathcal N}(x_n ; \mu'_k, \Sigma'_k)

where the responsibilities are calculated using the old parameter set \Theta = (\mu, \pi, \Sigma). For a fixed \Theta, this is a function of the new parameters \mu', \pi' and \Sigma', and we can now try to maximize this function with respect to these parameters. This calculation is not difficult, but again a bit tiresome (and requires the use of Lagrange multipliers, as there is a constraint on the \pi_k), and I again refer to my notes for the details. When the dust settles, we obtain three simple expressions. First, the new values for the cluster means are given by

\mu'_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

This starts to look familiar – this is the same expression that we obtained for the cluster centers in the k-means algorithm! In fact, if we arrange the rnk as a matrix, the denominator is the sum over column k and the numerator is the weighted sum of all data points. However, there is one important difference – it is no longer true that, given n, only one of the rnk is different from zero. Instead, the rnk are soft assignments that model the probability that the data point xn belongs to cluster k.

To write down the expression for the new value of the covariance matrix, we again need a notation first. Set

N_k = \sum_n r_{nk}

which can be interpreted as the number of soft assignments of points to cluster k. With that notation, the new value for the covariance matrix is

\Sigma'_k = \frac{1}{N_k} \sum_n r_{nk} (x_n - {\mu'}_k)(x_n - {\mu'}_k)^T

Finally, we can use the method of Lagrange multipliers to maximize with respect to the weights \pi_k (which always need to sum up to one), and again obtain a rather simple expression for the new value.

\pi'_k = \frac{N_k}{N}

Again, this is intuitively very appealing – the probability to be in cluster k is updated to be the number of points with a soft assignment to k, divided by the total number N of data points.

We now have all the ingredients in place to apply the algorithm in practice. Let us summarize how this will work. First, we start with some initial value for the parameters – the weights \pi, the covariance matrices \Sigma_k and the means \mu_k – which could, for instance, be chosen randomly. Then, we calculate the responsibilities as above – essentially, this is the expectation step, as it amounts to finding Q.

In the M-step, we then use these responsibilities and the formulas above to calculate the new values of the weights, the means and the covariance matrices. This involves a few matrix operations, which can be nicely expressed using the operations provided by the numpy library.
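As a small illustration, here is a minimal sketch of what both steps could look like in numpy – the function names e_step and m_step are mine, and I use scipy to evaluate the Gaussian densities; the actual implementation in my repository may differ in details.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
  # responsibilities r_nk are proportional to pi_k * N(x_n ; mu_k, Sigma_k)
  N, K = X.shape[0], len(weights)
  R = np.zeros((N, K))
  for k in range(K):
    R[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
  return R / R.sum(axis=1, keepdims=True)

def m_step(X, R):
  Nk = R.sum(axis=0)                 # soft counts N_k
  weights = Nk / X.shape[0]          # pi'_k = N_k / N
  means = (R.T @ X) / Nk[:, None]    # mu'_k = sum_n r_nk x_n / N_k
  covs = []
  for k in range(R.shape[1]):
    D = X - means[k]                 # data centered around the new mean
    covs.append((R[:, k, None] * D).T @ D / Nk[k])
  return weights, means, covs

Starting from random initial values, one would then simply alternate e_step and m_step until the parameters no longer change substantially.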

(Figure: GMM – results of k-means and EM clustering on two Gaussian mixture samples)

In the example above, we have created two sets of 500 sample points from different Gaussian mixture distributions and then applied the k-means algorithm and the EM algorithm. In the top row, we see the results of running the k-means algorithm. The color indicates the result of the algorithm, the shape of the marker indicates the original cluster to which the point belongs. The bottom row displays the results of the EM algorithm, using the same pattern.

We see that while the first sample (diagrams on the left) can be clustered equally well by both algorithms, the k-means algorithm is not able to properly cluster the second sample (diagrams on the right), while the EM algorithm is still able to assign most of the points to the correct cluster.

If you want to run this yourself, you can – as always – find the source code on GitHub. When playing with the code and the parameters, you will notice that the results can differ substantially between two consecutive runs. This is due to the random choice of the initial parameters, which has a huge impact on convergence (and sometimes the code will even fail because the covariance matrix can become singular; I have not yet fixed this). Using the switch --data=Iris, you can also apply the EM algorithm to the Iris data set (you need to have a copy of the Iris data file in the current working directory) and find again that the results vary significantly with different starting points.

The EM algorithm has the advantage of being very general, and can therefore be applied to a wide range of problems, not just Gaussian mixture models. A nice example is the Baum-Welch algorithm for training hidden Markov models, which is actually an instance of the EM algorithm. The EM algorithm can even be applied to classical multi-layer feed-forward networks, treating the hidden units as latent variables – see [3] for an overview and some results. So clearly this algorithm should be part of every data scientist's toolbox.

References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38
3. Shu-Kay Ng, G.J. McLachlan, Using the EM Algorithm to Train Neural Networks: Misconceptions and a New Algorithm for Multiclass Classification, IEEE Transactions on Neural Networks, Vol. 15, No. 3, May 2004

The EM algorithm and Gaussian mixture models – part I

In the last few posts on machine learning, we have looked in detail at restricted Boltzmann machines. RBMs are a prime example for unsupervised learning – they learn a given distribution and are able to extract features from a data set, without the need to label the data upfront.

However, there are of course many other, by now almost classical, algorithms for unsupervised learning. In this and the following posts, I will explain one of the most commonly applied algorithms in this field, namely the EM algorithm.

As a preparation, let us start with a very fundamental exercise – clustering. So suppose that we are given a data set which we suspect to consist of K clusters. We want to identify these clusters, i.e. to each point in the dataset, we want to assign the number of the cluster to which the point belongs.

More precisely, let us suppose that we are given a set {\mathcal D} = \{ x_1, \dots, x_N\} of points in some Euclidean space and a number K of clusters. We want to identify the centre \mu_j of each cluster and then assign each data point xi to the \mu_j which is closest to xi. We can encode the assignment of data points to clusters in a matrix Rij where Rij = 1 if data point xi belongs to cluster j with centre \mu_j. Thus each row of R corresponds to a data point and is non-zero in exactly one column, where it is one. This is known as 1-of-K encoding.

For each value of i, the squared distance of xi to the centre of the cluster to which it is assigned is then given by

\sum_j R_{ij} \| x_i - \mu_j \|^2

where we note that only one of the summands is different from zero. Assigning the data points to the clusters and positioning the centre points of the clusters then amounts to minimizing the loss function

l(\mu, R) = \sum_i \sum_j R_{ij} \langle x_i - \mu_j , x_i - \mu_j \rangle

where the brackets denote the usual euclidean scalar product.

Now let us see how we can optimize this function. For a fixed vector \mu of cluster centers, we can easily minimize the loss with respect to R by assigning each data point xi to the cluster whose center is closest to it. Thus we set

R_{ij} =  \begin{cases} 1 & \text{if} \; j = \arg \min_l \| x_i - \mu_l \|^2 \\ 0 & \text{otherwise} \end{cases}

Conversely, given a matrix R which we hold fixed, it is likewise easy to minimize the loss with respect to the \mu_j. As there are no further constraints on \mu_j, we can find the minimum by differentiating the loss function. We find that the derivative is zero if and only if

0 = \sum_i R_{ij}(x_i - \mu_j)

holds for all j. Assuming for a moment that each cluster has at least one data point assigned to it, i.e. that none of the columns of R contains zeroes only, we can solve this by

\mu_j = \frac{\sum_i x_i R_{ij}}{\sum_i R_{ij}}

which is also easily seen to be a local minimum by taking the second derivative.

Note that this term has an obvious geometric interpretation. The denominator in this expression is the number of data points that are currently assigned to cluster j. The numerator is the sum of all data points assigned to this cluster. Thus the minimum is the mean value of the positions of the data points that are currently assigned to the cluster (a physicist would call this the center of gravity of the cluster). This is the reason why this method is called the k-means algorithm. If no data points are assigned to the cluster, the loss function does not depend on \mu_j and we can choose any value that we want.

The algorithm now works as follows. First, we choose centers \mu_j at will – it is common to use some of the data points for this purpose. Then we assign each data point to a center using the formula for Rij derived above. We then adjust the center points \mu_j and reallocate the points to the new centers and so forth.

From our discussion above, it is clear that each full iteration of this procedure will reduce the loss function. This does of course not imply that the algorithm converges to a global minimum, as it might get stuck in local minima or saddle points. In practice, the algorithm is executed until the cluster assignments and centers no longer change substantially, or for a fixed number of steps.
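As a minimal sketch of how this loop could be implemented in numpy – the function name k_means and the initialization from randomly chosen data points are my choices for illustration:

import numpy as np

def k_means(X, K, steps=10):
  # initialize the centers with K randomly chosen data points
  mu = X[np.random.choice(len(X), K, replace=False)].copy()
  for _ in range(steps):
    # assignment step: find the index of the closest center for every point
    dist = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
    R = np.argmin(dist, axis=1)
    # update step: move each center to the mean of its assigned points
    for j in range(K):
      if np.any(R == j):
        mu[j] = X[R == j].mean(axis=0)
  return mu, R

Here R holds, for each point, the index of the assigned cluster, which is equivalent to the 1-of-K matrix Rij used above.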

(Figure: KMeans – cluster assignments produced by the k-means algorithm for two sample sets)

The diagram above shows the result of applying this algorithm to a set of points that is organized in two clusters. To generate the data, 100 samples were drawn from 2-dimensional Gaussian distributions. On the left hand side, half of the samples were centered at the point (5,1), the other samples at (1,4), and both had standard deviation 0.6. On the right hand side, the same centers were used, but only a small number of samples were drawn from the second distribution, which had standard deviation 0.5, whereas most samples came from the first distribution with standard deviation 0.8. Then 10 iterations of the k-means algorithm were applied to the resulting sample set. The points in the sample were then plotted with a color indicating the assignment to clusters resulting from the matrix R. The actual cluster from which the sample was drawn is indicated by the shape – a point is cluster one, a diamond is cluster two.

We see that in the example on the left hand side, the algorithm has correctly assigned all points to their original cluster. For the example on the right hand side, the situation is different – the algorithm did actually assign significantly more points to the blue cluster, i.e. there are many wrong assignments (blue points). This does not change substantially if we increase the number of iterations – even with 100 iterations, we still see many wrong assignments for this example. If you want to reproduce the example and play a bit with the parameters, you can get the source code for a straightforward implementation in Python from my GitHub repository.

The K-means algorithm is very simple and straightforward, but seems to have limitations because it cannot determine the shape of a distribution, only its center. It turns out that K-means is a special case of a broader class of algorithms that we now study, hoping to find more robust algorithms.

In our example, we have generated sample data as a combination of two Gaussian distributions. What if we just change the game and simply assume that our data is of this type? In other words, we assume an underlying probabilistic model for our sample data. Once we have that, we can pull all the tricks that statistical inference can offer – we can calculate maximum likelihood and maximum posterior probability, we can try to determine the posterior or even sample from the model.

Thus let us try to fit our data to a model of the form

P(x) = \sum_k \pi_k {\mathcal N}(x ; \mu_k, \Sigma_k)

where {\mathcal N}(x ; \mu_k, \Sigma_k) is a multivariate Gaussian distribution with mean \mu_k and covariance matrix \Sigma_k, i.e.

{\mathcal N}(x ; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_k)}} \exp \left( -\frac{1}{2} \langle x - \mu_k, \Sigma_k^{-1}(x - \mu_k)\rangle \right)

and the \pi_k are non-negative real numbers with \sum_k \pi_k = 1.
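As a side remark, this density is easy to evaluate in code. Here is a minimal sketch for a single point x – the helper name mvn_pdf is mine, and in practice one would rather use a library function like scipy.stats.multivariate_normal:

import numpy as np

def mvn_pdf(x, mu, Sigma):
  # multivariate normal density N(x ; mu, Sigma) at a single point x
  d = len(mu)
  diff = x - mu
  norm = np.sqrt((2 * np.pi)**d * np.linalg.det(Sigma))
  return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm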

Let us now see what this equation looks like if we use a 1-of-K encoding. We introduce a random variable Z that takes values in \{ 0, 1\}^K with the additional constraint that only one of the Zk is allowed to be different from zero. We interpret \pi_k as the probability

\pi_k = P(Z_k = 1)

Then

P(Z = z) = \prod_k \pi_k ^{z_k}

and we can write

P(X=x) = \sum_z P(Z=z) P(X=x | Z=z) = \sum_z P(x,z)

where P(z) is as above and

P(X = x | Z = z) = \prod_k {\mathcal N}(x ; \mu_k, \Sigma_k)^{z_k}

This is a very general type of distribution which reflects a common pattern in machine learning, namely the introduction of so-called latent or hidden variables. In general, latent or hidden variables are random variables that are part of the model but cannot be observed, i.e. are not part of the input or the output of the model. We have seen latent variables in action several times – adding hidden units to a neural network introduces latent variables and makes the model much more powerful, the hidden layer of a restricted Boltzmann machine serves as memory to learn features, and latent variables that are used to construct a mixture of Gaussians as above allow us to model a much broader class of distributions than a model with just one Gaussian.

Intuitively, it is also clear how to sample from such a model. In a first step, we sample from Z, in other words we determine the index k randomly according to the distribution given by the \pi_k. Once we have k, we then sample from the conditional distribution P(X = x | Z = z). As we already have k, this amounts to sampling from the Gaussian distribution {\mathcal N}(x ; \mu_k, \Sigma_k) with mean \mu_k and covariance matrix \Sigma_k.
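In numpy, this two-step procedure takes only a few lines – a minimal sketch, with the function name sample_mixture being my own invention:

import numpy as np

def sample_mixture(n, weights, means, covs):
  # step one: sample the latent index k according to the weights pi_k
  ks = np.random.choice(len(weights), size=n, p=weights)
  # step two: sample from the Gaussian distribution selected by k
  return np.array([np.random.multivariate_normal(means[k], covs[k]) for k in ks])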

(Figure: GaussianMixture – samples drawn from one- and two-dimensional Gaussian mixture distributions)

In the example above, we have first applied this procedure to a one-dimensional Gaussian mixture with K=2. The histogram shows the result of the sampling procedure, the solid lines are the probability density functions of the two individual Gaussian distributions, multiplied by the respective weight. On the right hand side, the result of sampling a Gaussian mixture in two dimensions is displayed (you can download the notebook used to produce this example here).

When we now try to fit this model to the sample data, we again have to calculate the likelihood function and try to maximize it. However, it turns out that the gradient of the likelihood is not so easy to calculate, and it might very well be that there is no closed form solution or that we obtain rather complicated expressions for the gradient. Fortunately, there is an alternative that works for very general models and does not require knowledge of the gradient – the EM algorithm. In the next post, I will present the algorithm in this general setup, before we apply it to our original problem and compare the results with the k-means algorithm.

References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38

Why you need statistics to understand neural networks

When I first tried to learn about neural networks, I did what probably most of us would do – I started to look for tutorials, blogs etc. on the web and was surprised by the vast amount of resources that I found. Almost every blog or webpage about neural networks has a section on training a simple neural network, maybe on the MNIST data set, using a framework like TensorFlow, Theano or MXNET. When you follow such a tutorial, a network is presented as a collection of units and weights. You see how the output of the network is calculated, and then an error function – sometimes least squares, sometimes something else – is presented. Often, a regularization term is added, and then you are told that the automatic gradient calculation features of the framework will do the gradient descent algorithm for you – you just have to decide on an optimizer, run the network and enjoy the results.

Sooner or later, however, you will maybe start to ask a few questions. Why that particular choice of the loss function? Where does the regularization term come from? What is a good initial value for the weights and why? Where does the sigmoid function come from? And many, many other questions…

If you then decide to dig deeper, using one of the many excellent textbooks or even trying to read some of the original research papers (and some are actually quite readable), you will very soon be confronted with terms like entropy, maximum likelihood, posterior distribution and Gaussian mixtures, and you will realize that the mathematics of neural networks has a strong overlap with mathematical statistics. But why? Why is that a good language to discuss neural networks, and why should you take the time to refresh your statistics knowledge if you really want to understand them? In this post, I will try to argue that statistical inference comes up very naturally when you study neural networks.

Many neural networks are designed to excel at classification tasks. As an example, suppose you wanted to design and train a neural network that, given data about an animal, classifies the animal as either a bird or some other animal (called a “non-bird” for convenience). So our starting point is a set modelling all possible objects that could be presented to the network. How exactly we model this set is not so important; more important is that in general, the network will not have access to all the data about the animal, but only to certain attributes of elements in the set, called features. So there could be a feature which we call X1 and which is defined as

X_1 = \text{the animal can fly}

taking values in \{0,1\}. Another data point the network could get is

X_2 = \text{length of animal in cm}

taking values in the reals, and so forth. More generally, we assume that on the set of all possible objects, we have certain functions Xi taking values in, say, the real numbers. Based on these numbers, the network will then try to decide whether a given animal is a bird or not. Thus we do not have to deal directly with our space of objects, but use the functions Xi as primary objects.

If the network had a chance to look at every possible animal, this would be easy, even though it would cost a lot of memory – it could simply remember all possible combinations of features and store the correct answer for each of them. In reality, however, this does not work. Instead, we have access to a small subset of the data – a sample – for which we can evaluate the Xi. Based on this subset, we then have to derive a model which gives the right answer in as many cases as possible. Thus we try to make a statement about the full space of things that could be presented to our network for classification, based on a small sample.

This is where probabilities come into play very naturally. We need to assume that our sample has been chosen randomly, but still we need to make assertions about the full set. This is exactly what inferential statistics is doing. The fact that our sample is chosen randomly turns our Xi into random variables. Similarly, the variable

Y = \text{is a bird}

taking values in \{0,1\} is a random variable, and we try to gain information on the distribution of Y across the full population based on its values on a given set of labelled samples, i.e. a set of samples where Y is known. Thus Y would represent the labels or targets in the language of neural networks. Applying the methods of statistical inference to this situation would typically start by choosing a statistical model and then using estimators or hypothesis testing to make deductions.

Apart from the fact that we have to derive information on the full population based on a sample, there is another reason why probabilities appear naturally in the theory of machine learning. In many cases, the available input – being a reduction of the full set of data – is not sufficient to classify the sample with full certainty. To see this, let us go back to our examples. How would you derive the property “bird” from the given data “can fly” and “length”? Not all animals that can fly are birds – and not all birds can fly. So we have to try to distinguish, for instance, a butterfly from a hummingbird based on the length. The smallest hummingbird – the bee hummingbird – is about 5 cm in length. The largest known butterfly – Queen Alexandra's birdwing – can be as long as 8 cm (both figures taken from Wikipedia). Thus our data is not sufficient to clearly distinguish butterflies and birds in all cases!

However, very small birds and very large butterflies have one thing in common – they are rare. So chances are that a given animal that can fly and is larger than 5 cm is actually a bird (yes, I know, there are bats…). In other words, if Y again denotes the variable which is 1 on birds and 0 on all other animals, we cannot in general hope that Y is a function of the Xi, but we can hope that the probability P(Y=1) to be a bird is a function of the values of the Xi. That is, using the language of conditional probabilities,

P(Y=1 | X = x) = f(x)

with some unknown function f. In a Bayesian interpretation of probability, the certainty with which we can say “this animal is a bird” is a function of the values xi of the observable variables Xi.

With these considerations, we now arrive at the following mathematical model of what a classification algorithm is about. We are given a probability space (\Omega, P) with a vector-valued random variable X. The attributes of a sample are described by the feature vector X, taking values in some subset of m-dimensional Euclidean space, where m is the number of different features. In our example, m=2, as we try to classify animals based on two properties. The result of the classification is described by a random variable Y taking – for the simple case of a binary classification problem – values in \{0,1\}. We then assume that

P(Y =1 | X=x) = f(x;w_0)

where f(\cdot;w) is a function parametrized by some parameter w that we call the weights of the model. The actual value w0 of w is unknown. Based on a sample for X and Y, we then try to fit the model, i.e. we try to find a value for w such that f(\cdot, w) models the actual conditional distribution of Y as well as possible. Once the fitting phase is completed, we can use the model to derive predictions about objects which are not in our initial sample set.
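To make this a bit more concrete, here is a tiny sketch of one possible choice for f – a logistic model fitted by gradient ascent on the log-likelihood. All names, the data layout and the learning rate are made up for illustration only:

import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lr=0.1, steps=1000):
  # fit the model P(Y=1 | X=x) = sigmoid(<w, x>) to samples X, y
  w = np.zeros(X.shape[1])
  for _ in range(steps):
    p = sigmoid(X @ w)                # current model probabilities
    w += lr * X.T @ (y - p) / len(y)  # gradient of the average log-likelihood
  return w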

This model sounds a bit abstract, but many feed-forward neural networks can be described with this or similar models. And we can now apply the full machinery of mathematical statistics – we can calculate cross entropies and maximum likelihood, we can analyse convergence and variance, we can apply the framework of Bayesian statistics and Monte Carlo methods. This is the reason why statistics is so essential when it comes to describing and analyzing neural networks. So on the next rainy Sunday afternoon, you might want to grab a steaming hot cup of coffee, head towards your armchair and spend some time with one of the many good expositions of this topic, like chapter IV in MacKay's book on machine learning, Bishop's “Pattern recognition and machine learning”, or chapter 3 of the deep learning book by Goodfellow, Bengio and Courville.

The Metropolis-Hastings algorithm

In this post, we will investigate the Metropolis-Hastings algorithm, which is still one of the most popular algorithms in the field of Markov chain Monte Carlo methods, even though its first appearance (see [1]) was in 1953, more than 60 years ago. It does, for instance, appear on the CiSE list of the top ten most important algorithms of the 20th century (I got this and the link from this post on WordPress).

Before we get into the algorithm, let us once more state the problem that it is trying to solve. Suppose you are given a probability distribution \pi on some state space X (most often this will be a real Euclidean space on which you can do floating point arithmetic). You might want to imagine the state space as describing possible states of a physical system, like spin configurations in a ferromagnetic medium similar to what we looked at in my post on the Ising model. The distribution \pi then describes the probability for the system to be in a specific state. You then have some quantity, given as a function f on the state space. Theoretically, this is a quantity that you can calculate for each individual state. In most applications, however, you will never be able to observe an individual state. Instead, you will observe an average, weighted by the probability of occurrence. In other words, you observe the expectation value

\langle f \rangle = \int_X f d\pi

of the quantity f. Thus to make a prediction that can be verified or falsified by an observation, you will have to calculate integrals of this type.

Now, in practice, this can be very hard. One issue is that in order to naively calculate the integral, you would have to traverse the entire state space, which is not feasible for most realistic problems as this tends to be a very high dimensional space. Closely related to this is a second problem. Remember, for instance, that a typical distribution like the Boltzmann distribution is given by

\pi(x) = \frac{1}{Z} e^{-\beta E(x)}

The term in the numerator is comparatively easy to calculate. However, the term in the denominator is the partition function, and is itself an integral over the state space! This makes even the calculation of \pi(x) for a single point in the state space intractable.

But there is hope – even though calculating the value of \pi at one point might be impossible, in a distribution like this, calculating ratios of probabilities is easy, as the partition function cancels out and we are left with the exponential of an energy difference! The Metropolis-Hastings algorithm leverages this and also solves our state space problem by using a Markov chain to approximate the integral. So the idea is to build a Markov chain Xt that converges and has \pi as an invariant distribution, so that we can approximate the integral by

\langle f \rangle = \int_X f d\pi \approx \frac{1}{N} \sum_{t=1}^N f(X_t)

for large values of N.
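In code, once the samples from the chain are available, this approximation is just an average – a minimal sketch, where the helper name mc_estimate is mine and discarding an initial part of the chain is a point we will come back to below:

import numpy as np

def mc_estimate(f, chain, burn_in=1000):
  # average f over the samples of the chain, dropping an initial burn-in phase
  return np.mean(f(np.asarray(chain[burn_in:])))

so that, for instance, mc_estimate(np.cos, chain) approximates the expectation value of \cos(x).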

But how do we construct a Markov chain that converges to a given distribution? The Metropolis-Hastings approach to this problem works as follows.

The first thing that we do is to choose a proposal density q on our state space X, i.e. a measurable function

q \colon X \times X \rightarrow [0,\infty)

such that for each x, \int q(x,y) dy = 1.

Then q defines a Markov chain, where the probability to transition into a measurable set A, starting from a point x, is given by the integral

Q(x,A) = \int_A q(x,y) \, dy

Of course this is not yet the Markov chain that we want – it has nothing to do with \pi, so there is no reason it should converge to \pi. To fix this, we now adjust the kernel to incorporate the behaviour of \pi. For that purpose, define

\alpha(x,y) =  \begin{cases} \min \{ 1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \} & \text{if } \pi(x) q(x,y) > 0 \\ 1 & \text{if } \pi(x) q(x,y) = 0  \end{cases}

This number is called the acceptance probability, and, as promised, it only contains ratios of probabilities, so that factors like the partition function cancel and do not have to be computed.

The Metropolis-Hastings algorithm now proceeds as follows. We start with some arbitrary point x0. When the chain has arrived at xn, we first draw a candidate y for the next location from the proposal distribution q(x_n, \cdot). We then calculate \alpha according to the formula above and accept the proposal with probability \alpha, i.e. we draw a random sample U from a uniform distribution and accept if U \leq \alpha. If the proposal is accepted, we set xn+1 = y, otherwise we set xn+1 = xn, i.e. we stay where we are.

Clearly, the xn are samples from a Markov chain, as the position at step n only depends on the position at step n-1. But it still appears a bit mysterious why this should work. To shed light on this, let us consider a case where the expressions above simplify a bit. So let us assume that the proposal density q is symmetric, i.e. that

q(x,y) = q(y, x)

This is the original Metropolis algorithm as proposed in [1]. If we also assume that \pi and q are nowhere zero, the acceptance probability simplifies to

\alpha(x,y) =  \min \{ 1, \frac{\pi(y)}{\pi(x)} \}

Thus if \pi(y) \geq \pi(x), we accept the proposal with probability one. This is very similar to a random search for a global maximum – we start at some point x, choose a candidate for a point with a higher value of \pi at random and proceed to this point. The major difference is that we also accept candidates with \pi(y) < \pi(x) with a non-zero probability. This allows the algorithm to escape from a local maximum. Intuitively, the algorithm will still try to spend more time in regions with large values of \pi, as we would expect from an attempt to sample from the distribution \pi.

(Figure: SymmetricMetropolisHastings – proposals moving up-hill and down-hill on the density \pi)

The image above illustrates this procedure. The red graph displays the distribution \pi. If our algorithm is currently at step xn, the purpose is to move “up-hill”, i.e. to the left in our example. If we draw a point like y from q, which already goes in the right direction, we will always accept this proposal and move to y. If, however, we draw a point like y’, at which \pi is smaller, we would still accept this point with a non-zero probability. Thus if we have reached a local maximum like the one on the right hand side of the diagram, there is still a chance that we can escape from there and move towards the real maximum to the left.

In this form, the algorithm is extremely easy to implement. All we need is a function propose that creates the next proposal, and a function p that calculates the value of the probability density \pi at some point. Then an implementation in Python is as follows.

import numpy as np

# proposal density: a normal distribution centered at the current location
# (this matches the example described below)
def propose(x):
  return np.random.normal(loc=x, scale=0.5)

# target density p, up to normalization - here a standard Cauchy distribution
def p(x):
  return 1.0 / (1.0 + x**2)

steps = 500000
chain = []
X = 0.0
chain.append(X)
for n in range(steps):
  Y = propose(X)         # draw a candidate from the proposal distribution
  U = np.random.uniform()
  alpha = p(Y) / p(X)    # acceptance probability for a symmetric proposal
  if U <= alpha:
    X = Y                # accept - otherwise we stay where we are
  chain.append(X)

In the diagram below, this algorithm has been applied to a Cauchy distribution with mode zero and scale one, using a normal distribution with mean x and standard deviation 0.5 as the proposal for the next location. The chain was calculated for 500,000 steps. The diagram in the upper part shows the values of the chain during the simulation.

Then the first 100,000 steps were discarded and considered as “burn-in” time for the chain to stabilize. Out of the remaining 400,000 sample points, points were chosen with a distance of 500 time steps to obtain a sample which is approximately independent and identically distributed. This is called subsampling and is typically not necessary for Monte Carlo integration (see [2], chapter 1, for a short discussion of the need for subsampling), but is done here for the sake of illustration. The resulting subsample is plotted as a histogram in the lower left corner of the diagram. The yellow line is the actual probability density.

(Figure: MetropolisCauchySample – chain values, autocorrelation and sampled distribution for the Cauchy example)

We see that after a few thousand steps, the chain converges, but continues to have spikes. However, the sampled distribution is very close to that of a sample generated by the standard Python method (which is to take the quotient of two independent samples from a standard normal distribution).

In the diagram at the bottom, I have displayed how the integrals of two functions (\sin(x) and \cos(x)), approximated using the partial sums, develop over time. We see that even though we still have huge spikes, the integral remains comparatively stable and converges already after a few thousand iterations. Even if we run the simulation only for 1000 steps, we already get close to the actual values of zero (for \sin(x), for symmetry reasons) and \approx 0.3678 (for \cos(x), obtained using the scipy.integrate.quad integration routine).

In the second diagram in the middle row, I have plotted the autocorrelation versus the lag, as an indicator for the failure of the sample points to be independent. Recall that for two samples X and Y, the Pearson correlation coefficient is the number

\frac{E\left[ (X-\bar{X})(Y-\bar{Y}) \right]}{\sigma_X \sigma_Y}

where \sigma_X and \sigma_Y are the standard deviations of X and Y. In our case, given a lag, i.e. a number l less than the length of the chain, we can form two samples, one consisting of the points X_0, X_1, X_2, \dots and the second one consisting of the points of the shifted series X_l, X_{l+1}, X_{l+2}, \dots. The autocorrelation with lag l is then defined to be the correlation coefficient between these two series. In the diagram, we can see how the autocorrelation depends on the lag. We see that for a large lag, the autocorrelation becomes small, supporting our intuition that the series and the shifted series become approximately independent. However, if we execute several simulation runs, we will also find that in some cases, the convergence of the autocorrelation is very slow, so care needs to be taken when trying to obtain a nearly independent sample from the chain.
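A minimal sketch of how the autocorrelation at a given lag could be computed in numpy (the helper name is mine, and the lag must be at least one):

import numpy as np

def autocorrelation(chain, lag):
  # Pearson correlation between the chain and the chain shifted by lag
  x = np.asarray(chain)
  return np.corrcoef(x[:-lag], x[lag:])[0, 1]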

In practice, the autocorrelation is probably not a good measure for the convergence of a Markov chain. It is important to keep in mind that obtaining an independent sample is not the point of the Markov chain – the real point is that even though the sample is autocorrelated, we can approximate expectation values fairly well. However, I have included the autocorrelation here for the sake of illustration.

This form of proposal distributions is not the only one that is commonly used. Another choice that appears often is called an independence sampler. Here the proposal distribution is chosen to be independent of the current location x of the chain. This gives us an algorithm that resembles the importance sampling method and also shares some of the difficulties associated with it – in my notes on Markov chain Monte Carlo methods, I have included a short discussion and a few examples. These notes also contain further references and a short discussion of why and when the Markov chain underlying a Metropolis-Hastings sampler converges.

Other variants of the algorithm work by updating – in a high-dimensional space – either only one variable at a time or entire blocks of variables that are known to be independent.

Finally, if we are dealing with a state space that can be split as a product X_1 \times X_2, we can use the conditional probability given either x1 or x2 as a proposal distribution. Thus, we first fix x2 and draw a new value for x1 from the conditional probability of x1 given the current value of x2. Then we move to this new coordinate, fix x1, draw from the conditional distribution of x2 given x1 and set the new value of x2 accordingly. It can be shown (see for example [5]) that the acceptance probability is one in this case. So we end up with the Gibbs sampling algorithm that we have already used in the previous post on Ising models.
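As a tiny illustration of this idea, here is a sketch of a Gibbs sampler for a standard bivariate normal distribution with correlation \rho, where both conditional distributions are known to be normal with mean \rho times the other coordinate and variance 1 - \rho^2:

import numpy as np

def gibbs(rho, steps=10000):
  # alternately draw x1 given x2 and x2 given x1
  x1, x2 = 0.0, 0.0
  s = np.sqrt(1.0 - rho**2)    # conditional standard deviation
  out = []
  for _ in range(steps):
    x1 = np.random.normal(rho * x2, s)
    x2 = np.random.normal(rho * x1, s)
    out.append((x1, x2))
  return np.array(out)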

Monte Carlo sampling methods are a broad field, and even though this has already been a long post, we have only scratched the surface. I invite you to consult some of the references below and / or my notes for more details. As always, you will also find the sample code on GitHub and might want to play with this to reproduce the examples above and see how different settings impact the result.

In a certain sense, this post is the last post in the series on restricted Boltzmann machines, as it provides (at least some of) the mathematical background behind the Gibbs sampling approach that we used there. Boltzmann machines are examples of stochastic neural networks that can be applied to unsupervised learning, i.e. they allow a model to learn from a sample distribution without the need for labeled data. In the next few posts on machine learning, I will take a closer look at some other algorithms that can be used for unsupervised learning.

References

1. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. Vol. 21, No. 6 (1953), pp. 1087-1092
2. S. Brooks, A. Gelman, G.L. Jones, X.L. Meng (ed.), Handbook of Markov chain Monte Carlo, Chapman Hall / CRC Press, Boca Raton 2011
3. W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, Vol. 57, No. 1 (1970), pp. 97-109
4. R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993
5. C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York 1999