A large part of the success of GPT-3.5 and GPT-4 is attributed to the fact that these models underwent, in addition to pre-training and supervised instruction fine-tuning, a third phase of learning called reinforcement learning from human feedback. There are many posts on this, but unfortunately most of them fall short of explaining how…
Mastering large language models – Part XVI: instruction fine-tuning and FLAN-T5
In most of our previous posts, we have discussed and used transformer networks that have been trained on a large set of data using teacher forcing. These models are good at completing a sentence with the most likely next token, but are not optimized for following instructions. Today, we will look at a specific family…
Mastering large language models – Part XV: building a chat-bot with DialoGPT and Streamlit
Arguably, a large part of the success of models like GPT lies in the fact that they have been equipped with a chat frontend and proved able to hold a dialogue with a user that is perceived as comparable to a conversation with another human. Today, we will see how transformer based language…
Mastering large language models – Part XIV: Huggingface transformers
In the previous post, we have completed our journey to being able to code and train a transformer-based language model from scratch. However, we have also seen that in order to obtain professional results, we need an enormous amount of resources, i.e. training data and compute power. Luckily, there are some pre-trained transformer models that…
Mastering large language models – Part XIII: putting it all together
Today, we will apply what we have learned in the previous posts on BPE, attention and transformers – we will implement and train a simple decoder-only transformer model from scratch. First, let us discuss our model. The overall structure of the model is in fact quite similar to the model we have used when training…
Mastering large language models – Part XII: byte-pair encoding
On our way to actually coding and training a transformer-based model, there is one last challenge that we have to master – encoding our input by splitting it in a meaningful way into subwords. Today, we will learn how to do this using an algorithm known as byte-pair encoding (BPE). Let us first quickly recall…
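The core idea of BPE mentioned in this excerpt – repeatedly merging the most frequent pair of adjacent symbols into a new subword – can be sketched in a few lines. This is a minimal illustration on a toy corpus, not the code from the post; the corpus and function names are made up for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus.

    words maps a tuple of symbols (initially single characters)
    to the frequency of that word in the corpus.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of pair with the concatenated symbol."""
    merged = pair[0] + pair[1]
    new_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[tuple(out)] = freq
    return new_words

# Toy corpus: "low" x5, "lower" x2, "newest" x6
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
pair = most_frequent_pair(words)   # most frequent adjacent pair
words = merge_pair(words, pair)    # one BPE merge step
```

A real tokenizer simply repeats this merge step until the desired vocabulary size is reached, recording the merges so they can be replayed at encoding time.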
Autonomous agents and LLMs: AutoGPT, Langchain and all that
Over the last couple of months, a usage pattern for large language models that leverages the model for decision making has become popular in the LLM community. Today, we will take a closer look at how this approach is implemented in two frameworks – Langchain and AutoGPT. One of the most common traditional use cases…
Mastering large language models – Part XI: encoding positions
In our last post, we have seen that the attention mechanism is invariant to position, i.e. reordering the words in a sentence yields the same output, which implies that position information is lost. Clearly position information does matter in NLP, and therefore transformer based models apply positional embeddings in addition to the pure…
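The positional embeddings this excerpt refers to can take several forms; the sinusoidal variant from the original transformer paper is easy to sketch. This is an illustrative plain-Python version under that assumption, not the implementation from the post.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings.

    Returns a seq_len x d_model nested list: even dimensions use
    sine, odd dimensions cosine, with wavelengths forming a
    geometric progression from 2*pi up to 10000*2*pi.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
```

Because each position gets a distinct pattern, adding these vectors to the token embeddings restores the order information that pure attention discards.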
Mastering large language models – Part X: Transformer blocks
Today, we will start to look in greater detail at the transformer architecture. At the end of the day, a transformer is built out of individual layers, the so-called transformer blocks, stacked on top of each other, and our objective for today is to understand how these blocks are implemented. Learning about transformer architectures can…
Mastering large language models – Part IX: self-attention with PyTorch
In the previous post, we have discussed how attention can be applied to avoid bottlenecks in encoder-decoder architectures. In transformer-based models, attention appears in different flavours, the most important being what is called self-attention – the topic of today's post. Code included. Before getting into coding, let us first describe the attention mechanism presented in…
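The self-attention this excerpt introduces boils down to scaled dot-product attention: softmax(QKᵀ/√d_k)V. The post itself uses PyTorch; as a dependency-free sketch of the same computation on nested lists (the toy Q, K, V values are made up):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # output is the attention-weighted average of the values
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: two positions, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
```

In self-attention, Q, K and V are all linear projections of the same input sequence, so every position attends to every other position.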