Mastering large language models – Part XVII: reinforcement learning and PPO

A large part of the success of GPT-3.5 and GPT-4 is attributed to the fact that these models underwent, in addition to pre-training and supervised instruction fine-tuning, a third phase of learning called reinforcement learning from human feedback. There are many posts on this, but unfortunately most of them fall short of explaining how…More
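
As a quick reference (a sketch not quoted from the post), the clipped surrogate objective that PPO optimizes – and that the RLHF phase applies to the language model policy – looks as follows, with the advantage estimate and the clipping parameter as the key ingredients:

```latex
% Clipped surrogate objective of PPO (Schulman et al., 2017), shown here only as a reminder;
% \hat{A}_t is the advantage estimate and \epsilon the clipping parameter.
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```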

Mastering large language models – Part XIV: Huggingface transformers

In the previous post, we completed our journey towards coding and training a transformer-based language model from scratch. However, we have also seen that obtaining professional results requires an enormous amount of resources, i.e. training data and compute power. Luckily, there are some pre-trained transformer models that…More
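
As a minimal, hedged preview of what the post covers (the model name gpt2 and the prompt are just illustrative assumptions), loading a pre-trained causal language model with the Hugging Face transformers library and sampling a short continuation can look like this:

```python
# A minimal sketch: load a pre-trained GPT-2 checkpoint via the Hugging Face
# transformers library and sample a continuation for a prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any pre-trained causal LM from the hub works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode a prompt, let the model generate a few tokens, and decode the result
inputs = tokenizer("The transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```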

Mastering large language models – Part XIII: putting it all together

Today, we will apply what we have learned in the previous posts on BPE, attention and transformers – we will implement and train a simple decoder-only transformer model from scratch. First, let us discuss our model. The overall structure of the model is in fact quite similar to the model we have used when training…More
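
To give a rough idea of how the pieces fit together (an illustrative sketch under assumptions about hyperparameters and layer choices, not the post's exact code), a simple decoder-only model combines token and positional embeddings, a stack of transformer blocks with a causal mask, and a linear head over the vocabulary:

```python
# Rough sketch of a decoder-only language model: embeddings, masked transformer
# blocks, and a projection back onto the vocabulary. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class SimpleDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)
        # causal mask: -inf above the diagonal so tokens cannot attend to the future
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=idx.device),
                          diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                                  # logits over the vocabulary

model = SimpleDecoderOnlyLM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))              # (batch, seq_len, vocab_size)
```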

Mastering large language models – Part XII: byte-pair encoding

On our way to actually coding and training a transformer-based model, there is one last challenge that we have to master – encoding our input by splitting it into subwords in a meaningful way. Today, we will learn how to do this using an algorithm known as byte-pair encoding (BPE). Let us first quickly recall…More
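
To illustrate the core idea (a toy sketch, not the post's implementation), one BPE training step counts all adjacent symbol pairs in a small word-frequency table and merges the most frequent pair:

```python
# Toy illustration of a single BPE merge step: count adjacent symbol pairs weighted
# by word frequency, then merge the most frequent pair everywhere it occurs.
from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(corpus)    # ('w', 'e') in this toy corpus
corpus = merge_pair(corpus, pair)    # one merge step; BPE repeats this until the vocabulary is full
print(pair, corpus)
```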

Autonomous agents and LLMs: AutoGPT, Langchain and all that

Over the last couple of months, a usage pattern for large language models that leverages the model for decision making has become popular in the LLM community. Today, we will take a closer look at how this approach is implemented in two frameworks – Langchain and AutoGPT. One of the most common traditional use cases…More

Mastering large language models – Part XI: encoding positions

In our last post, we have seen that the attention mechanism is invariant to position, i.e. that reordering the words in a sentence yields the same output, which implies that position information is lost. Clearly position information does matter in NLP, and therefore transformer-based models apply positional embeddings in addition to the pure…More
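
As a reminder of one common remedy (a sketch based on the sinusoidal encoding from the original transformer paper; the dimensions are illustrative), positional information can be re-injected by adding a fixed encoding to the token embeddings:

```python
# Sketch of the classic sinusoidal positional encoding; each position gets a unique
# pattern of sine and cosine values that is simply added to the token embeddings.
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions
    return pe

# The encoding is added to the embeddings before the first transformer block
embeddings = torch.randn(16, 512)                   # (seq_len, d_model), hypothetical embeddings
embeddings = embeddings + sinusoidal_positional_encoding(16, 512)
```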

Mastering large language models – Part X: Transformer blocks

Today, we will start to look in greater detail at the transformer architecture. At the end of the day, a transformer is built out of individual layers, the so-called transformer blocks, stacked on top of each other, and our objective for today is to understand how these blocks are implemented. Learning about transformer architectures can…More
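
As a hedged sketch of the general shape of such a block (the pre-layer-norm variant and the hyperparameters are assumptions for illustration, not necessarily the post's exact choices), a transformer block combines multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection:

```python
# Sketch of a pre-layer-norm transformer block: self-attention sub-layer plus
# feed-forward sub-layer, each with layer normalization and a residual connection.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        # self-attention sub-layer with residual connection
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out
        # feed-forward sub-layer with residual connection
        x = x + self.ff(self.ln2(x))
        return x

block = TransformerBlock()
y = block(torch.randn(4, 16, 512))    # (batch, seq_len, d_model)
```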

Mastering large language models – Part IX: self-attention with PyTorch

In the previous post, we have discussed how attention can be applied to avoid bottlenecks in encoder-decoder architectures. In transformer-based models, attention appears in different flavours, the most important being what is called self-attention – the topic of today's post. Code included. Before getting into coding, let us first describe the attention mechanism presented in…More
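
For orientation (an illustrative sketch with made-up shapes and names, not the post's code), single-head scaled dot-product self-attention projects the input into queries, keys and values and forms a weighted sum of the values:

```python
# Sketch of single-head scaled dot-product self-attention over one sequence.
import math
import torch

def self_attention(x: torch.Tensor, Wq: torch.Tensor, Wk: torch.Tensor, Wv: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))   # (seq_len, seq_len) attention scores
    weights = torch.softmax(scores, dim=-1)                    # each row sums to one
    return weights @ V                                         # weighted sum of value vectors

seq_len, d_model, d_k = 8, 32, 16
x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)    # (seq_len, d_k)
```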