Accessing your hard drive – the OS developer's moment of truth

When building your own operating system, the moment when you first write data to a real physical hard disk of a real PC is nothing less than thrilling – after all, a mistake at this point could mean happily overwriting random parts of your hard drive and wiping out important data on your test machine.

So working with real hard drives is nothing for the faint of heart – but of course, no operating system would be complete without that ability. In this post, we look at the basic concepts behind the communication between the operating system and the hard drive.

First, let us see how data is stored on a hard drive. The most basic conceivable hard drive consists of a thin disk, called a platter, on which the data is stored, and a device called a head that is able to read and write data. The disk spins and the head can be moved back and forth across the disk, so that it can be positioned at an arbitrary distance from the center of the disk.

Physically, the disk is organized in concentric tracks. When the head is positioned at a certain distance from the center of the disk, it can therefore access an individual track. The disk is further divided into angular sectors, as shown in the upper part of the diagram below (taken from this Wikipedia page). Finally, a block is the intersection of a sector with a track. So to identify a specific block, you could use the combination of a sector and the track on which the block is located.

Cylinder_Head_Sector

This simple design actually wastes some space, as you can only use one side of the disk. If you want to store data on both sides, you need two heads – one below and one above the disk. You can increase the capacity even further by building a stack of platters, with heads located between the platters so that each side of a platter has its own head reading from it and writing to it, as displayed in the lower part of the diagram.

In this setup, the union of the tracks located at corresponding positions on all platters is called a cylinder. If you know the cylinder and the head, you know the track. You can therefore specify the location of a block on the disk by the combination of a cylinder, a head and a sector. This addressing method is consequently called CHS addressing and was the first addressing mode used on early IBM PCs and compatible machines.

Later, the actual geometry of hard drives became more complicated, and addressing schemes that decouple the address from the actual geometry of the drive were introduced, most notably the so-called logical block addressing (LBA). In this addressing mode, which is the current standard, a hard disk is thought of as a sequence of blocks, numbered starting with zero. To access a block, you simply specify its number. The hard disk controller, i.e. the circuitry that actually operates the hard drive and sits between the CPU and the drive itself, converts this LBA number into the actual physical location of the data on the drive.
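To make the relation between the two addressing schemes a bit more tangible, here is a small sketch of the translation between CHS and LBA for a drive with a classic geometry. The geometry values (16 heads, 63 sectors per track) are made up for the purpose of illustration – a real driver would query the drive or the BIOS for them. Note that sectors are traditionally numbered starting at one, while cylinders and heads start at zero.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical drive geometry, for illustration only */
#define HEADS             16u   /* heads per cylinder */
#define SECTORS_PER_TRACK 63u   /* sectors per track  */

/* Convert a CHS address to an LBA block number.
   Sectors are 1-based, cylinders and heads are 0-based. */
static uint32_t chs_to_lba(uint32_t c, uint32_t h, uint32_t s) {
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
}

/* Convert an LBA block number back to a CHS address */
static void lba_to_chs(uint32_t lba, uint32_t *c, uint32_t *h, uint32_t *s) {
    *c = lba / (HEADS * SECTORS_PER_TRACK);
    *h = (lba / SECTORS_PER_TRACK) % HEADS;
    *s = (lba % SECTORS_PER_TRACK) + 1;
}

int main(void) {
    uint32_t c, h, s;
    lba_to_chs(2048, &c, &h, &s);
    printf("LBA 2048 -> C=%u H=%u S=%u -> LBA %u\n",
           c, h, s, chs_to_lba(c, h, s));
    return 0;
}
```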

So how do we actually access a block? The easiest way to do this is usually called programmed input/output (PIO). To explain this, we first need to recall that a CPU does not only access the system memory, but can also talk to devices using designated channels, so-called ports. If the CPU writes to a port, the data is sent to the device, and vice versa, data written by the device can be accessed by the CPU by reading from a port.

So to read from a hard drive, we could proceed as follows. First, we write a command to a specific port associated with the hard drive. This command tells the hard drive what to do – for instance to read – and contains the LBA number of the block we want to read. Then the hard drive controller performs the actual read operation and, once it is complete, makes the data available so that the CPU can fetch it by reading from a port.
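To give an impression of what this looks like on a classic PC, here is a heavily simplified sketch of a PIO read of a single sector from the first drive on the primary IDE channel, using the traditional port numbers 0x1F0 – 0x1F7 and the READ SECTORS command. Take this as an approximation of the real protocol rather than a usable driver – error handling, timeouts and proper device selection are all omitted.

```c
#include <stdint.h>

/* Minimal port I/O helpers (GCC inline assembly, x86) */
static inline void outb(uint16_t port, uint8_t val) {
    __asm__ volatile("outb %0, %1" : : "a"(val), "Nd"(port));
}
static inline uint8_t inb(uint16_t port) {
    uint8_t val;
    __asm__ volatile("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}
static inline uint16_t inw(uint16_t port) {
    uint16_t val;
    __asm__ volatile("inw %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* Read one 512-byte sector at the given LBA from the primary IDE channel */
void ata_pio_read_sector(uint32_t lba, uint16_t *buffer) {
    outb(0x1F6, 0xE0 | ((lba >> 24) & 0x0F)); /* drive 0, LBA mode, LBA bits 24-27 */
    outb(0x1F2, 1);                           /* sector count: 1                   */
    outb(0x1F3, lba & 0xFF);                  /* LBA bits 0-7                      */
    outb(0x1F4, (lba >> 8) & 0xFF);           /* LBA bits 8-15                     */
    outb(0x1F5, (lba >> 16) & 0xFF);          /* LBA bits 16-23                    */
    outb(0x1F7, 0x20);                        /* command: READ SECTORS             */

    /* Busy-wait until the device is no longer busy (BSY) and has data ready (DRQ) */
    while ((inb(0x1F7) & 0x88) != 0x08)
        ;

    /* Transfer 256 words = 512 bytes, one word at a time */
    for (int i = 0; i < 256; i++)
        buffer[i] = inw(0x1F0);
}
```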

The format of the commands and the exact interaction between the CPU and the device in this process were defined in a standard called ATA, also known as IDE or Parallel ATA (PATA). This specification described not only the interaction between the hard disk controller and the CPU, but also the low-level electronics, cables, connectors and so forth.

This method of reading and writing data is simple, but has a major disadvantage – it keeps the CPU busy. In the worst case, the CPU has to write the command, wait in a busy loop and then read the data one word at a time – and if you want to read several blocks, you have to do this over and over again. It is not surprising that alternatives were developed soon.

The first improvement we can make is to make the operation interrupt driven. In this version, the CPU sends a command and can then go off and do something else while the hard disk controller is working. Once the data has been read, the controller informs the CPU that data is available by raising an interrupt, and the CPU can go ahead and fetch the data. This is already more efficient, but still suffers from the fact that the data has to be read one word at a time.

Consequently, the next step in the development was what is called Direct Memory Access (DMA). With DMA, the hard disk controller can communicate directly with the memory controller and transfer the data read into a designated area of the system's main memory without any involvement of the CPU. Only once all the data has been read is an interrupt raised to inform the CPU that the data has been transferred and can be worked with. With this approach, large blocks of data can be read and written while the CPU is able to work on other threads.

In 2000, a serial version of the ATA standard called Serial ATA (SATA) was introduced. Due to the serial transmission technology, SATA allows much higher transfer rates between the hard drive and the motherboard and therefore a much higher I/O throughput, and it soon evolved into the de-facto standard, also for consumer PCs. A SATA device can still operate in a so-called legacy mode, where it is presented to the CPU as a PATA/IDE device. However, to take advantage of some new features of the SATA protocol, for instance a higher number of devices per controller, a new protocol for the communication between the CPU and the controller needs to be used, which is called AHCI. The AHCI protocol is significantly more complicated than the ATA protocol, but both protocols still follow the same logical steps to read and write data.

Let us now combine this understanding with what we have learned about interrupts and multitasking to see how the operating system would typically orchestrate a disk drive access, for instance to read data from the disk. The following diagram illustrates the process.

HardDriveRead

First, the operating system reserves an area of system memory for the DMA transfer. Then, the kernel prepares the command. The format of the command depends on the protocol used (AHCI or legacy IDE), but typically, the command encompasses the location of the data on the disk, the number of blocks to be read and the location of the DMA buffer. Then the command is sent to the hard drive.

At this point in time, the CPU is free to work on something else, so the kernel would typically put the current task to sleep and re-schedule so that a different task can start executing.

While this happens, the hard drive starts to execute the command. It reads the requested data and places it in the DMA buffer prepared by the OS. Then it raises an interrupt to indicate that the data has arrived in main memory.

Once the operating system receives the interrupt, it will stop the currently executing task and switch back to the task that requested the read. This task can now continue execution and work with the data.
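To make this flow a bit more concrete, here is a minimal sketch of how a blocking read inside the kernel might be structured. All names (dma_alloc, hd_send_read_command, the semaphore primitives) are hypothetical placeholders for kernel functions, not the actual ctOS API – the point is only to show how the request path and the interrupt handler interact.

```c
#include <stdint.h>

/* Hypothetical kernel primitives - placeholders, not the actual ctOS API */
struct semaphore { volatile int count; };
extern void *dma_alloc(uint32_t bytes);                 /* reserve a DMA buffer      */
extern void  hd_send_read_command(uint32_t lba,
                                  uint32_t blocks,
                                  void    *dma_buffer); /* program the controller    */
extern void  sem_init(struct semaphore *sem, int value);
extern void  sem_down(struct semaphore *sem);           /* may put the task to sleep */
extern void  sem_up(struct semaphore *sem);             /* wakes up a sleeping task  */

/* One outstanding request - a real driver would keep a table of these */
static struct {
    void            *buffer;
    struct semaphore completion;
} request;

/* Called by a task that wants to read 'blocks' blocks starting at 'lba' */
void hd_read(uint32_t lba, uint32_t blocks, uint8_t *dest) {
    request.buffer = dma_alloc(blocks * 512);
    sem_init(&request.completion, 0);

    hd_send_read_command(lba, blocks, request.buffer);

    /* Block the calling task until the interrupt handler signals completion;
       the scheduler will pick another task to run in the meantime */
    sem_down(&request.completion);

    /* The data is now in the DMA buffer - copy it to the caller's buffer */
    for (uint32_t i = 0; i < blocks * 512; i++)
        dest[i] = ((uint8_t *)request.buffer)[i];
}

/* Called from the hard disk interrupt handler once the transfer is complete */
void hd_irq_handler(void) {
    sem_up(&request.completion);   /* make the waiting task ready again */
}
```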

Of course, the real picture is more complicated. What the image above calls the “device”, for instance, is in reality a combination of the actual hard disk controller and components of your motherboard's chipset, like the AHCI controller or the DMA controller. In addition, several hard drives can operate in parallel, so the kernel needs a way to keep track of the currently executing requests and map incoming interrupts to the requests currently in flight. And, in an SMP environment with more than one CPU, semaphores or other locking mechanisms need to be in place to make sure that different requests do not collide. If you are interested in all those details, you might want to take a look at the documentation of the ctOS hard disk driver.

We now understand how a hard drive and an operating system kernel communicate to read or write sectors. However, an application does of course not see individual blocks of a hard disk – instead, the hard disk appears as a collection of files and directories. The translation between this logical view and the low-level view is done by the file system, at which we will look in the next post in this series.

The protected mode in the x86 architecture

Modern operating systems would not be possible without the ability of a CPU to execute code at different privilege levels. This feature became available for mainstream PCs in the early eighties, when Intel introduced its 80286 and 80386 CPUs, and was readily employed by operating systems like Windows 3.11 and, of course, Linux, which Linus Torvalds once called “a project to teach me about the 386”.

When an 80286 or 80386 CPU (and all later models) starts up, it initially operates in what is called real mode and behaves like the earlier 8086 – the CPU that, in its 8088 variant, powered the famous IBM PC XT. To enjoy the benefits of protected mode, the operating system has to initialize a few system tables and can then switch to protected mode. It is also possible, though a bit harder, to switch back to real mode – and some operating systems actually do this in order to use functions of the legacy BIOS, which is designed to be called from real mode.

One of the features that protected mode offers is virtual memory and paging, a technology that we have already discussed in a previous post. In addition, and maybe even more importantly, protected mode allows code to be executed at one of four privilege levels that are traditionally called rings.

Privilege levels

At any point in time, the CPU executes code at one of four rings (ring 0 to ring 3). The current ring determines which instructions are allowed and which instructions are forbidden, with ring 0 – sometimes called kernel mode or supervisor mode – being the most privileged level, and ring 3 – typically called user mode – being the least privileged level.

Rings

Some instructions can only be executed in ring 0. Examples are the instructions STI and CLI that start and stop interrupt processing (and can potentially bring everything to a grinding halt, as modern operating systems are interrupt driven, as we have seen in a previous post), or the instruction to reload the register CR3, which contains the base address of the page table directory and therefore allows a new set of page tables to be loaded. It is also possible to restrict access to hardware via I/O ports to certain rings, so that code executing in user mode can, for instance, not directly access the hard drive or other devices.

Technically, the current ring is determined by the content of a special register called the code segment register (CS) – most likely an old friend of yours if you have ever written assembly code for the 8086; we will get back to this point further below. As so often in the x86 architecture, this is an indirect mechanism. The CS register contains a pointer into a table called the global descriptor table (GDT), which holds the actual descriptions of the segments. Part of each entry is a two-bit field which specifies the ring. Thus if CS points, say, at the third entry in the GDT and those two bits in this entry contain 0b11, the CPU executes instructions at ring 3.
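As a small illustration, the two lowest bits of the value currently loaded into CS always match the ring the CPU is executing in, so kernel code can determine the current ring with a one-liner. This is a sketch using GCC inline assembly for 32-bit x86 and obviously only makes sense in a freestanding environment like a kernel.

```c
#include <stdint.h>

/* Return the ring (0-3) the CPU is currently executing in.
   The lowest two bits of the CS selector hold the current privilege level. */
static inline unsigned int current_ring(void) {
    uint16_t cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    return cs & 0x3;
}
```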

Switching the privilege level

Switching the privilege level requires a bit more work than just updating CS. The easiest way to force the CPU to pick up the new settings is to raise a software interrupt (technically, there is an alternative called a far jump that we will not discuss in detail). The beauty of this is that a piece of code executing in user mode can raise an interrupt to reach ring 0, but this interrupt will execute well-defined code, namely the interrupt service handler, which in turn can only be set up while executing in ring 0. So at startup, the operating system – executing at ring 0 – can set up the interrupt service handlers to point to specific routines in the kernel, and later, during normal operations, a program executing in user mode can at any time call into the kernel by raising a software interrupt, for instance in order to read from the hard drive. However, a user mode program can only execute the specified interrupt service handler and is not able to execute arbitrary code at ring 0. Thus user mode and kernel mode are separated, but the interrupt service handlers provide specific gates to switch back and forth.

To make this work, the operating system needs to let the CPU know where each interrupt service handler is located. In protected mode, this is the purpose of the interrupt descriptor table (IDT), which an operating system needs to prepare and which contains, for each of the theoretically possible 256 interrupts, the address of the interrupt service handler as well as additional information, for instance which code segment to use (typically a ring 0 segment, so that the interrupt executes in kernel mode).

My toy operating system ctOS, for instance, uses the interrupt 0x80 for system calls. A program that wants to make a system call puts a number identifying the system call into a specific register (EAX) and then raises interrupt 0x80. The interrupt service handler evaluates EAX and branches into the respective system call handler, see the ctOS system call handler documentation for more details.
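To illustrate the mechanism, this is roughly what such a raw system call could look like from user space. The system call number and the argument conventions used in the example are made up and do not necessarily match the actual ctOS system call table.

```c
#include <stdint.h>

/* Invoke a system call via software interrupt 0x80 (32-bit x86, GCC inline
   assembly). The system call number goes into EAX, the arguments into EBX,
   ECX and EDX, and the return value comes back in EAX. The numbering and
   argument conventions here are purely illustrative. */
static inline int32_t syscall3(uint32_t nr, uint32_t a1, uint32_t a2, uint32_t a3) {
    int32_t result;
    __asm__ volatile("int $0x80"
                     : "=a"(result)
                     : "a"(nr), "b"(a1), "c"(a2), "d"(a3)
                     : "memory");
    return result;
}

/* Hypothetical usage: if system call number 4 were 'write(fd, buf, len)' */
int32_t my_write(uint32_t fd, const void *buf, uint32_t len) {
    return syscall3(4, fd, (uint32_t)(uintptr_t)buf, len);
}
```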

Interrupts can not only be used to switch from user mode to kernel mode, but also for the way back. In fact, whenever the CPU returns from an interrupt service handler, it will return to the ring at which the execution was interrupted, using the stack to restore the CPU to its original state. In order to move from ring 0 to ring 3, we can therefore prepare a stack that looks like the stack the CPU would have arranged if an interrupt had occurred, and then execute the IRET instruction to simulate returning from an interrupt service handler.
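The stack frame that IRET expects when returning to a less privileged ring has a fixed layout. Sketched as a C structure for the 32-bit case, with the fields in the order in which IRET pops them off the stack, it looks roughly as follows – to enter ring 3, the kernel fills such a frame with user mode selectors, a user mode stack pointer and the desired entry point, points the stack pointer at it and executes IRET.

```c
#include <stdint.h>

/* Layout of the stack as IRET expects it when returning to ring 3
   (32-bit protected mode, fields in the order in which they are popped). */
struct iret_frame {
    uint32_t eip;      /* instruction pointer to continue at        */
    uint32_t cs;       /* user mode code segment selector (ring 3)  */
    uint32_t eflags;   /* flags register, typically with IF set     */
    uint32_t esp;      /* user mode stack pointer                   */
    uint32_t ss;       /* user mode stack segment selector (ring 3) */
};
```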

Matters get a bit more complicated by the fact that an interrupt service handler can in turn be interrupted by another interrupt, and that an operating system might want to implement kernel-level threads which run outside of an interrupt context, quite similar to threads in user mode. To differentiate between those situations, ctOS uses the concept of an execution level – a software-maintained number describing the current context.

Memory model in protected mode

Another major change that the protected mode introduced into the x86 architecture was a more complex, but also more flexible memory model. To understand this, let us see how the older 8086 CPU handled memory (which is the mechanism that all modern Intel CPUs still use when being in real mode).

The 8086 was a 16-bit CPU, meaning that its registers were 16 bits wide. Therefore, a command like "put 0x10 into the memory location referenced by register AX" could only address 2^16 = 65536 different memory locations. To be able to address more memory, the Intel engineers came up with a trick. When memory was accessed, the content of two registers was used. In the example above, this would be AX (16 bits) and the code segment register CS, which is 16 bits wide as well. To determine the actual memory location, the content of the CS register was shifted left by four bits and then added to the content of AX. Thus if, for instance, AX contained 0x200 and CS contained 0x100, the address would be

(0x100 << 4) + 0x200 = 0x1000 + 0x200 = 0x1200

This addressing mode is sometimes called segmentation, because we can think of it as dividing the memory space into (overlapping) segments of 64 kByte each, with the CS register selecting one segment.

Using segmentation, the 8086 CPU could address a bit more than 1 MB of memory and thus work around the limitations of a 16-bit CPU. With a 32-bit CPU, this is not really needed any more, but the 80386 nevertheless expanded on this idea and came up with a somewhat more complicated model.

In fact, in protected mode, there are three different types of addresses which can be used to describe the location of a specific byte in memory.

The first type of address is the logical address. Similar to real mode, a logical address consists of a segment selector (16 bits), which specifies the memory segment the byte is located in, and an offset (32 bits), which specifies the location of the byte within the segment.

A program running in user mode usually only sees the offset – this is, for instance, what appears if you dump the value of a pointer in a user mode C program. The segment selector is set by the operating system and usually not changed by a user mode program. So from the program's point of view, memory is entirely described by a 32-bit address and hence appears as a 4 GB virtual address space.

When accessing memory, the CPU uses the global descriptor table to convert this logical address into a linear address, which is simply a 32-bit value. This is similar to real mode, where segment and offset are combined into a linear 20-bit wide address, with the difference that the base address of the segment is not directly taken from the CS register – instead, the CS register only serves as a pointer into the GDT, which holds the actual base address.

Logical and linear addresses are still virtual addresses. To convert the linear address into a physical address, the CPU finally uses a page table directory and a page table, as explained in one of my last posts. When the CPU has just been brought into protected mode, paging is still disabled, so this final translation step is skipped and linear and physical addresses are the same. Note, however, that the translation between logical and linear addresses cannot be turned off and is always active. So setting up this translation mechanism, and in particular the GDT, is one of the basic initialization steps which need to be done when switching to protected mode.
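To give an impression of what the GDT actually contains, here is a sketch of the 8-byte layout of a GDT entry, together with a helper that builds the kind of "flat" 4 GB segment most operating systems use, so that segmentation effectively becomes a no-op and all the interesting work is left to paging. The bit layout follows the Intel manuals, but treat the helper as an illustration rather than a complete GDT setup – a real setup also needs a null descriptor and a load of the GDTR register via LGDT.

```c
#include <stdint.h>

/* One 8-byte entry in the global descriptor table (32-bit protected mode).
   Base and limit are scattered across several fields for historical reasons. */
struct __attribute__((packed)) gdt_entry {
    uint16_t limit_low;    /* limit bits 0-15                               */
    uint16_t base_low;     /* base bits 0-15                                */
    uint8_t  base_middle;  /* base bits 16-23                               */
    uint8_t  access;       /* present flag, ring (DPL), type                */
    uint8_t  granularity;  /* limit bits 16-19, flags (granularity, 32-bit) */
    uint8_t  base_high;    /* base bits 24-31                               */
};

/* Build a flat segment covering the full 4 GB address space.
   'ring' is the privilege level (0 or 3), 'code' selects code vs. data. */
static struct gdt_entry make_flat_segment(int ring, int code) {
    struct gdt_entry e;
    e.limit_low   = 0xFFFF;                  /* limit 0xFFFFF in 4 kB units...  */
    e.base_low    = 0;
    e.base_middle = 0;
    e.base_high   = 0;                       /* ...starting at base address 0   */
    e.access      = 0x92 | (ring << 5) | (code ? 0x08 : 0x00);
                                             /* present, DPL = ring, code/data  */
    e.granularity = 0xCF;                    /* 4 kB granularity, 32-bit segment */
    return e;
}
```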

Addressing

Support for hardware multi-tasking

Apart from privilege levels and virtual memory, there is even more that the protected mode can do for you. In fact, the protected mode offers hardware support for multi-tasking.

Recall from my post on multi-tasking that a task switch essentially comes down to storing the content of all registers in a safe place and restoring them later. In the approach presented in my post, this is done by manually placing the registers on a stack. However, the x86 protected mode offers an alternative – the task state segment (TSS). This is an area of memory managed by the CPU that fully describes the state of a task. An operating system can initiate a task switch, and the CPU will do all the work to reload the CPU state from the TSS.

However, this mechanism is rarely used. It has a reputation of being slower than software task switching (though I have never seen any benchmarks on this), it ties the operating system closely to the CPU architecture, and I am not aware of any modern operating system using this mechanism (Linux apparently used it in early versions of the kernel). Still, at least one TSS needs to be present in protected mode even if that mechanism is not used, as the CPU reads the kernel stack pointer from the TSS when an interrupt is raised while executing in ring 3 (see the corresponding section of the ctOS documentation for more details).

Long mode

Having followed me down the long and dusty road to an understanding of protected mode, you might be shocked to hear this: protected mode is legacy.

In fact, protected mode as we have discussed it is a 32-bit operating mode. For quite a few years now, however, almost every PC you can buy has been equipped with a 64-bit CPU based on the x86-64 architecture. This architecture is fully backwards compatible, i.e. you can still run a protected mode operating system on such a PC. However, this does not allow you to address more than 4 GB of memory, as pointers are limited to 32 bits.

In long mode – the native 64-bit mode of operation of these CPUs – the CPU uses 64-bit registers and 48-bit addressing, which is good enough for 256 TB of RAM.

Fortunately, if you understand protected mode, getting used to long mode is not too difficult. The basic concepts of the various tables that we have discussed (page tables, the IDT and the GDT) remain unchanged, but of course the layout of these tables changes to accommodate the additional address bits. The register set, however, changes considerably. The general purpose registers are extended to 64 bits and renamed (RAX instead of EAX etc.). In addition, a few new registers become available. There are eight additional integer registers (R8 – R15) – which are typically used to pass arguments to subroutines instead of using the stack for this purpose – and eight more SSE registers.

Apart from that, long mode is just another mode of the x86 CPU family. And this is not the end of the story – there is unreal mode, there is virtual 8086 mode, there is system management mode and there are new modes (“VMX root mode”) that support virtualization. If you want to learn more, I recommend the excellent OSDev Wiki – maybe the most valuable single OSDev site out there – and of course the Intel Software Developer manuals. Good luck!

How does multitasking really work?

The first PC on which I ran a copy of Linux back in the nineties of course had only one CPU, so it was clear to me that it could physically only execute one instruction at a time – but still, it magically created the impression of running several programs in parallel without visible delay. How could that ever work? Almost twenty years later, I finally found the answer.

In fact, trying to understand how multitasking and multithreading actually work was one of my main motivations when I developed the ctOS kernel a couple of years ago. In this post, I will try to explain the basic mechanism behind the magic.

The basic idea sounds simple enough. Suppose a task has been running for some time and the operating system decides that it is time to switch to a different task. The operating system kernel will then interrupt the current execution and save the current state of the CPU somewhere in memory. It then locates a different task that is ready to be run – a process which is called scheduling – and retrieves the saved CPU state representing this task from memory. Then it lets the CPU continue execution with that saved state. As this state is exactly the state of the CPU that was in place when the newly scheduled task was stopped, the CPU will continue to process this task as if it had never been interrupted.

Unfortunately, this simple explanation leaves a number of questions open. First, if a task is executing, it means that the CPU is busy running this task. How, then, can the operating system “do” something? Second, what is the “state of the CPU”? How can we store it, and how can we put the CPU back into a saved state? And then, how do we decide that “it is time to switch to a different task”? It turns out that the answers to these questions have to do with interrupts.

Task switching, interrupts and the stack

To explain this, let us for a moment assume a vastly simplified world in which none of the currently active tasks requires any I/O operations. Suppose task A is currently being processed, i.e. the CPU is executing the stream of instructions (the “program”) that makes up this task. At some point, we will see an interrupt firing. If there is no I/O, this is usually the timer interrupt, which fires periodically, say one thousand times per second.

If this happens, the CPU will stop the execution of task A and start to execute an interrupt service handler. Now the interrupt service handler has been set up by the operating system and points at code within the operating system kernel – so at this point, control is handed to the operating system! At the same time, we have learned that the CPU will store the value of some registers like EIP on the stack, and a smart interrupt handler will store the value of additional registers on the stack as well in order to be able to use and reconstruct their values later, and, right at its end, contain code to pop the register values off the stack again. So we might as well set up our interrupt service handler in such a way that it stores the value of all registers on the stack. This already provides two ingredients for the task switching mechanism – a way to make sure that the operating system gains control of the CPU at least once every millisecond, and a saved CPU state on the stack!

Now the operating system will usually maintain a list of tasks and scan that list according to certain criteria to find the task which should be executed next – we will look at this process in greater detail further below. Suppose that it determines a task – which we call B – that should run next. For the sake of simplicity, let us also assume that this task is not entirely new, but has already been running for some time (setting up a new task is a bit more complicated). Thus, when task B was interrupted for the last time, its state was also saved on a stack somewhere, and our operating system has hopefully stored a pointer to this stack.

What we now want to do is to exchange the stored state of task A with that of task B. The easiest way to do this is not to copy the stacks around, but to manipulate the stack pointer such that execution resumes using the stack of task B instead of the stack of task A. When we do this and reach the end of the interrupt service handler, the part of the handler that is responsible for popping all registers off the stack again, as well as the final IRET, will use the stack of task B. In other words, the CPU will pick up the state of task B instead of the state of task A. It will therefore continue execution with the register contents and the instruction pointer that task B had when it was interrupted – in other words, task B will continue to execute instead of task A, and we have switched from task A to task B.
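Expressed in heavily simplified (and partly hypothetical) code, the core of such an interrupt-driven task switch might look like the sketch below. The names and the split between C and assembly are made up – a real implementation, like the one in ctOS, does parts of this in assembly because it has to control exactly what is pushed onto and popped off the stack around the handler.

```c
#include <stdint.h>

/* Per-task bookkeeping: all we really need to remember between task switches
   is where the saved register state lives, i.e. the value of the stack
   pointer after all registers have been pushed. */
struct task {
    uint32_t saved_esp;
};

static struct task *current;          /* task that was running when the IRQ fired   */

extern struct task *schedule(void);   /* pick the next ready task (see the
                                         scheduling section below for a sketch)     */

/* Called from the assembly part of the timer interrupt handler, after all
   registers have been pushed onto the stack of the interrupted task.
   'esp' is the current stack pointer; the return value is the stack pointer
   the assembly stub should switch to before popping registers and doing IRET. */
uint32_t handle_timer_interrupt(uint32_t esp) {
    current->saved_esp = esp;         /* remember where task A's state lives */

    struct task *next = schedule();   /* find task B                         */
    current = next;

    return next->saved_esp;           /* pop task B's registers, IRET into B */
}
```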

TaskSwitch

In reality, there are a few subtle points that need to be worked out in more detail when you really want to build this. You will have to think about what the stack actually is – does every task have its own stack, how do you isolate the stacks from each other, and so forth. In addition, a task switch might involve a switch to a different process and thus to a different virtual address space, so at some point, addresses become invalid and cannot be accessed any more. The exact process that ctOS uses is described in the section on task switching in the interrupt handling documentation of ctOS.

We have explained the process of task switching using the timer interrupt as an example. However, this is not the only possible trigger for a task switch. There are of course other hardware interrupts that might trigger a task switch, but more importantly, many task switches do in fact occur during a system call! Assume for instance that a task wants to read data from the hard drive. It will then invoke a system call to do this, i.e. it will hand over control to the kernel voluntarily. The kernel will ask the hard drive to go and get the data, but of course this takes some time. Thus the kernel will in general not return to this task, but schedule a different task and execute a task switch. If (as ctOS does it) a system call is itself implemented as a software interrupt, the mechanism is exactly the same as for a task switch during the processing of a hardware interrupt (if the faster SYSENTER/SYSEXIT mechanism is used, the processing is a bit different, but the idea is the same).

Scheduling

What we have not explained so far is how the operating system actually selects a new task to be executed. The exact details differ between operating systems, but it is probably fair to say that there are in general two main ingredients – the time the task has been running since the last task switch and the state of the task.

First, there is the CPU time that a task has consumed since the last task switch, commonly measured by the quantum. The quantum is a number that is set to some initial value when the task gets control of the CPU. During each processing of the timer interrupt, the quantum of the currently active task is decreased by one. As the timer interrupt fires periodically, the quantum thus measures the time that the task has already consumed – the lower the remaining quantum, the longer the task has already been running. If a task has reached quantum zero, it might be a good idea to schedule a different task.

Second, there is the state of a task. Not every task is ready to be executed at any given point in time. A task might, for instance, be waiting for some event (user input, I/O, …) or might otherwise be blocked. Or a task could be in the status new, indicating that it is currently still being set up and not yet ready to run. Or it might have been stopped by a signal and be waiting for a signal to continue. And finally, a task can either be active, i.e. currently executing, or ready, i.e. waiting to be scheduled.

TaskStatus

During scheduling, only tasks in status ready will be considered as candidates for execution. One way to implement scheduling is to put all ready tasks onto a queue, called the ready queue. Scheduling then basically proceeds as follows.

  • Check whether the currently active task has used up its quantum
  • If not, decrease the quantum by one and return – no task switch will be done. Otherwise proceed with the next step
  • Set the state of the currently active task to “ready”, set its quantum back to its initial value and queue the task at the end of the ready queue
  • Select the first task in the ready queue, update its status to active and mark it as the target of the next task switch

When a task that was not ready but, for instance, blocked becomes ready, it is added at the end of the ready queue with an initial quantum.
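A minimal sketch of this round-robin step in C could look as follows. The task structure and the queue handling are hypothetical and deliberately simplistic – a real scheduler, like the one in ctOS, would additionally have to deal with locking, per-CPU data structures and priorities.

```c
#include <stddef.h>

#define INITIAL_QUANTUM 10   /* timer ticks a task may run before being preempted */

enum task_state { TASK_ACTIVE, TASK_READY, TASK_BLOCKED, TASK_NEW };

struct task {
    enum task_state state;
    int             quantum;
    struct task    *next;      /* link in the ready queue */
};

static struct task *ready_head, *ready_tail;   /* the ready queue        */
static struct task *active;                    /* currently running task */

/* Append a task to the tail of the ready queue */
static void enqueue_ready(struct task *t) {
    t->state = TASK_READY;
    t->next  = NULL;
    if (ready_tail) ready_tail->next = t; else ready_head = t;
    ready_tail = t;
}

/* Called on every timer tick; returns the task that should run next */
struct task *schedule(void) {
    /* If the active task still has quantum left, keep running it */
    if (--active->quantum > 0)
        return active;

    /* Quantum used up: reset it and put the task back at the end of the queue */
    active->quantum = INITIAL_QUANTUM;
    enqueue_ready(active);

    /* Pick the first task from the ready queue and make it active */
    struct task *next = ready_head;
    ready_head = next->next;
    if (!ready_head) ready_tail = NULL;
    next->state   = TASK_ACTIVE;
    next->quantum = INITIAL_QUANTUM;
    active = next;
    return active;
}
```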

Scheduler.png

There is a more advanced version of this simple algorithm called round-robin scheduling with dynamic priorities. In this model, there is more than one ready queue – in fact, there is a list of priorities, say 0 – 15, and one queue for each priority. If a new task to be scheduled needs to be determined, these queues are scanned for candidates, starting with the highest priority queue. The priority can be adjusted at run-time based on the current behaviour of the process – see the documentation of the ctOS scheduler for more details. This approach is still comparatively simple, and designing a highly optimized scheduler is an art that can make the difference between good and lousy performance. Currently, Linux uses a different scheduling mechanism called CFS, but scheduling remains a tricky task, especially in an SMP environment, and we will probably see additional work going into this in the future (see for instance this article for a more in-depth discussion).

Now multi-tasking is nice, but to really execute several programs at the same time has an additional dimension – memory. If you want to run different programs on one physical machine that might even belong to different users, you need to make sure that a program cannot access memory reserved for a different program, i.e. you need some level of memory space isolation. Fortunately, modern CPUs offer a mechanism called virtual memory that we can use for this purpose and that we will explain in the next post.

Get your kernel going – the boot process

Most careers in operating system development probably start with a seemingly simple task – produce a program that, at start time, takes full control of a computer and prepares for the execution of the actual operating system, i.e. boot the computer. It turns out, however, that – thanks to the complexity of modern x86 based hardware – this is easier said than done. In this post, we will look at the boot process in a bit more detail.

History of the boot process

To understand the boot process in a modern PC, it is actually useful to go back into the past a bit. The first home computer that I owned was an Amstrad CPC 464 with a Zilog Z80 CPU. Like any other CPU I am aware of, this CPU stores the memory location of the next instruction to be executed in a register, called – in this case – the program counter (PC). When you turn on the machine, the CPU starts in its reset state, with the PC set to its initial value 0x0000. Now, at this point in memory space, there was actually no RAM, but a ROM that contained the CPC firmware. Starting at address 0x0000, the so-called restart block was located, which then took over and initialized the system. And let the fun begin … (picture taken on the fabulous CPCBox, a browser-based CPC emulator).

cpc4641.png

This mechanism is straightforward and fast – when you turn on the machine, the operating system is available immediately and you can start to work with it. However, there are obvious disadvantages, the most important one being that updates are difficult once a machine is shipped, and even late in the manufacturing process.

The developers of the famous IBM PC XT decided to use a different approach. Instead of loading the full operating system from ROM, the ROM only contained a firmware called the BIOS – a small set of routines needed for some basic operations, like reading from the hard drive or a floppy disk, and a short startup routine. When the system comes up, the CPU again starts execution at some predefined point in memory, where the BIOS startup routine is located. This routine typically executes some self-tests and then loads the actual operating system from the hard drive or a floppy disk.

Loading the operating system actually happens in two phases. In the first phase, the BIOS loads the so-called boot loader. The boot loader is located at a fixed location on the hard drive, specifically in the first sector (called the master boot record (MBR)), and occupies a bit less than 512 bytes. This boot loader is brought into memory and then executed. It is then responsible for locating the actual operating system and loading it.
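The BIOS recognizes a bootable MBR by a simple convention: the sector is 512 bytes long, its last two bytes must contain the boot signature 0x55 0xAA, and the BIOS loads it to the fixed address 0x7C00 and jumps there. As a small illustration, here is a sketch that checks this signature on a disk image file – something you would run on your development machine, not inside your own OS.

```c
#include <stdio.h>
#include <stdint.h>

/* Check whether the first sector of a disk image carries the
   classic MBR boot signature 0x55 0xAA in its last two bytes. */
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <disk image>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t mbr[512];
    if (fread(mbr, 1, sizeof(mbr), f) != sizeof(mbr)) {
        fprintf(stderr, "could not read a full sector\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    if (mbr[510] == 0x55 && mbr[511] == 0xAA)
        printf("Boot signature found - this looks like a bootable MBR\n");
    else
        printf("No boot signature - the BIOS would not boot from this\n");
    return 0;
}
```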

So far, this is comparatively simple. Unfortunately, PC hardware then started to evolve, and with every step in the evolution, the boot process had to be adapted so that it could make use of new features but still be backwards compatible. For instance, when the x86 CPU was enhanced by the so-called protected mode with the 80286 and 80386, the BIOS would still operate in the older real mode to stay compatible with legacy operating systems. When an operating system wanted to make use of the new features of protected mode, either the boot loader or the early stages of the operating system kernel had to switch the CPU into protected mode. However, as the BIOS is 16-bit code, its routines can no longer be accessed once that switch has been completed; in particular, reading from a hard drive is no longer easily possible until the initialization of the operating system has reached a certain point. This makes for interesting chicken-and-egg problems: if the operating system that needs to be loaded exceeds a certain size, you will have to switch to protected mode to be able to utilize the full memory, but once this is done, it becomes much more complicated to access the hard drive to load the operating system files.

To solve this, several sophisticated boot mechanisms were implemented by various boot loaders. One approach was to compress the operating system image so that it could be loaded into memory while the BIOS was still accessible, then perform the switch to protected mode and decompress the image. Another approach is to boot the system in several stages. The boot code for stage 0 would, for instance, be stored in the MBR, which would then load stage 1. This stage 1 would contain sufficient functionality to read data from a standard file system, including a hard disk driver. It would then switch to protected mode and use this functionality to load stage 2, which would then start the actual operating system. The GRUB bootloader, for instance, uses a similar (but slightly different) mechanism.

Booting a modern PC

Roughly around 2005, the industry started to introduce UEFI as the next generation of standardized PC firmware, aiming to become the successor of the outdated BIOS. UEFI is much more powerful than the BIOS. Instead of loading one sector from a specified location on the hard drive, UEFI is able to read from a file system. Thus the initial operating system image can be an ordinary executable file, stored on a file system. The UEFI firmware will search certain standard directories for files with predefined names (like BOOTX64.EFI) and execute such a file if it is found. This file then executes in a UEFI environment in 32-bit protected mode or 64-bit long mode and has access to UEFI services to perform operations like input, output or hard drive access. These services can be used to load the operating system, before eventually control of the system is handed over to the operating system and the UEFI boot services must no longer be used.

However, the BIOS used to be more than just a set of routines – it also contained several configuration tables that provide basic information on the hardware relevant for the operating system, like the available graphics modes, the number and type of available CPUs, the available memory and reserved memory areas, and information on the configuration of the system interrupts. This information is now presented to an operating system as part of ACPI (the Advanced Configuration and Power Interface). ACPI is a complex collection of tables, some of which are static, and some of which are actually dynamic, i.e. contain byte code that needs to be run by a bytecode interpreter.

So far, the description has focused on the standard case for a home PC – booting from local storage like a hard drive, a USB stick or a CD-ROM. However, several methods exist to boot over a network as well. The PXE boot protocol, for example, is a mechanism to boot an operating system over a network. When PXE boot is used, either a boot loader or a sufficiently modern BIOS initializes the client's network card and establishes a connection to a server. The server runs a TFTP server (TFTP is a simplified version of the FTP protocol) and the client uses the TFTP protocol to transfer an operating system image over the network. This is the technology behind thin clients, i.e. clients that only contain a ROM-based firmware capable of executing a PXE-based boot process, but no local storage and no local copy of the operating system.

To get an idea of how the full boot process looks on a modern PC, let us take a look at the sequence of steps that are executed when my homebrew operating system ctOS starts up. Here is a screenshot of the system once the startup sequence is completed (taken using the QEMU emulator).

ctOS_Startup

The most important steps that actually execute in the background and produce this output are as follows.

  1. The hardware (or the emulator) loads the BIOS (SeaBIOS in this case)
  2. The BIOS locates the boot loader (GRUB2 in my case) on the hard drive and loads it in several stages
  3. GRUB2 starts to execute on the first CPU in the system, called the bootstrap CPU
  4. When GRUB2 hands over control to the operating system, the system is already in protected mode, however certain system tables are in a preliminary or incomplete state. Thus these tables are initialized first
  5. Then all components of the kernel are initialized. This includes device drivers, interrupts, the memory manager, the process manager and the TCP/IP stack
  6. During this stage, the system will also read the BIOS or ACPI configuration tables and detect and initialize other hardware components like the interrupt controller, the system clock, the keyboard, or attached PCI devices like hard drives and network cards
  7. Next, all the other CPUs in the system are initialized and put into protected mode
  8. Finally, the root process is started and the command line interface is located on the disk and started – at this point the boot process is complete and the user takes over control of the system

Of course, for a more advanced operating system, the actual boot process would do much more – loading a graphical user interface, for instance, or starting system daemons. However, technically, once we get to the point that a user space program can execute, most of the low-level subtleties are mastered and we can work in a stable environment.

This concludes our discussion of the boot process. With the next post in this series, I will start our trip through the various components of an operating system. Happy booting!

Why build an operating system from scratch?

What happens if you turn on a PC? How is an operating system able to run multiple tasks in parallel? What happens if you hit a key on your keyboard? And what actually is a process? If you have ever thought for more than a second about one of these things, then read on…

A few years back, actually in late 2010 or early 2011, I was at a point where my growing interest in computer science naturally led me to those questions. So I did what most of us would probably do – I started to search for information on this and tried to understand what I found.

One day, I hit upon a tutorial that explained the boot process of a PC, and actually contained a short instruction on how to write a very simple program that would act as a boot loader – you would copy this program to a floppy disk (yes, that still existed in some machines at this time) or a CD-ROM and boot from there, and – voila – instead of a Windows or Linux coming up, the system would run your program and print some deep wisdom like “Hello World!” onto the screen.

Now printing something is easy, but what if I wanted the user to be able to enter something and print a response? Obviously, this requires some sort of communication with the keyboard, and there is no operating system around that can help you with that, so I needed to learn about a thing called the keyboard controller. But wait, why could I only use a few kB of memory? I found that I needed to switch from real mode to protected mode, install interrupt handlers … and slowly, the attempt to just build something that would boot turned into more.

Over the next roughly 18 months, I continued to walk down this path, and eventually had built a small Unix operating system kernel. I then added a simple command line interface, graphics mode and a C library, and eventually got to a point where I was able to take simple programs originally written for Linux or other Unix-like operating systems and compile and run them on my new operating system, which I called ctOS. Here is a screenshot – my OS is running in a virtual machine on the left, and on the right you see a few measurements of the physical CPU load while running some multi-threading tests within ctOS.

ctOS_SMP_Test

As you might imagine, this required quite some work, and I eventually got to a point where I spent more time working on other things and further development came to an end.

Earlier this year, I was cleaning up a few things on my hard drive and – more or less by accident – hit upon my old code archive again. Being a curious person, I was wondering whether I would still be able to compile and run the system – and of course it failed badly when I tried. This was not really surprising – my development environment had changed from a 32-bit system to a 64-bit system, and the emulators that I used to run and test the OS had changed quite a bit as well, fixing defects that had actually hidden away some bugs in my OS. And of course the “average PC” had changed quite a lot since 2011. The BIOS is now a thing of the past, PCs use UEFI to boot and ACPI tables to store their configuration instead of BIOS tables, and you will hardly ever find a system with a floppy disk drive and only one CPU any more.

So I decided to invest some time to make my OS work again and to adapt it to the current generation of hardware. This included building a new development environment, fixing some defects that suddenly became apparent as the emulator had changed, making the system ACPI aware, removing all dependencies on the BIOS, and so forth. I also cleaned up the source code, migrated everything to a GitHub repository and added a few additional ports.

I have a realistic view on how much time I will probably have in the future to further develop and improve my OS. However, it can still serve a purpose – education. After all, this is what it was created for: helping me to understand some of the inner workings of an operating system.

In fact, my plan is to publish a series of blog posts on the principles and structures behind a Unix operating system, using ctOS to illustrate these concepts and linking to the documentation I have created over the years for the details. This will cover topics like the overall structure of an operating system, processes and the scheduler, interrupts, memory management, device drivers and the networking stack. So if you ever wanted to understand how an operating system really works, this is the time for you to learn it – stay tuned and join me again for the first part of the series. Until then, you might want to check out the ctOS source code on GitHub and the documentation that comes with it – if you want, you can even download some of the binary packages on the release page and play with them.

Networking basics – the layered networking model

Recently, I picked up an old project of mine – implementing a Unix-like operating system kernel from scratch. I will post more on this later, but one of the first things I stumbled across when browsing my old code and my old documentation was the networking stack. I used this as an opportunity to refresh my own understanding of networking basics, and reckoned it might help others if I posted my findings here. So I decided to write a short series of posts on the basics of networking – Ethernet, ARP, IP, TCP/UDP and all that stuff.

Before we start to get into details on the different networking technologies, let me first explain the idea of a layered networking architecture.

Ultimately, networking is about physical media (copper cables, for instance) connecting physical machines. Each of these machines will be connected to the network using a network adapter. These adapters will write messages into the network, i.e. create a certain sequence of electrical signals on the medium, and read other messages coming from the network.

However, the magic of a modern networking architecture is that the same physical network can be used to connect different machines with different operating systems using different networking protocols. When you use your web browser to connect to some page on the web – this one, for instance – your web browser uses a protocol called HTTP(S) for that purpose. From the web browser's point of view, this appears to be a direct connection to the web server. The underlying physical network can be quite different – it can be a combination of Ethernet or WLAN connecting your PC to a router, some technology specific to your ISP, or even a mobile network. The beauty of this is that the browser does not have to care. As so often in computer science, this becomes possible through an abstract model organizing networking capabilities into layers.

Several models for this layering exist, like the OSI model and the TCP/IP layered model defined in RFC1122. For the sake of simplicity, let us use the four-layer TCP/IP model as an example.

 

TCPIPLayers

 

The lowest layer in this model is called the link layer. Roughly speaking, this is the layer responsible for the actual physical connection between hosts. It covers things like the physical media connecting the machines and the mechanisms used to avoid collisions on the media, but also the addressing of network interfaces on this level.

One of the most commonly used link layer protocols is the Ethernet protocol. When the Ethernet protocol is used, hosts in the network are addressed using the so-called MAC address, which uniquely identifies a network card within the network. In the Ethernet protocol (and probably in most other protocols), the data is transmitted in small units called packets or frames. Each packet contains a header with some control data like source and destination address, the actual data, and maybe a checksum and an end-of-data marker.
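As a concrete example, the header of an Ethernet II frame is just 14 bytes long – two MAC addresses plus a type field that announces which higher-layer protocol the payload belongs to (for instance 0x0800 for IP and 0x0806 for ARP). A sketch of this header in C could look as follows; keep in mind that the type field travels in network byte order, i.e. most significant byte first.

```c
#include <stdint.h>

/* Header of an Ethernet II frame (14 bytes on the wire) */
struct __attribute__((packed)) eth_header {
    uint8_t  dest_mac[6];   /* MAC address of the destination interface */
    uint8_t  src_mac[6];    /* MAC address of the sending interface     */
    uint16_t ethertype;     /* payload type, e.g. 0x0800 = IP, 0x0806 = ARP,
                               stored in network byte order (big endian) */
};
/* The payload and a trailing CRC checksum follow the header on the wire */
```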

Now Ethernet is the most common, but by far not the only available link layer protocol. Another protocol which was quite popular at some point at the end of the last century is the Token Ring protocol, and in modern networks, part of the path between two stations could be bridged by a provider-specific technology or a mobile network. So we need a way to make hosts talk to each other that are not necessarily connected via a direct Ethernet link.

The approach taken by the layered TCP/IP model is as follows. Suppose you have – as in the diagram below – two sets of machines that are organized in networks. On the left hand side, we see an Ethernet network with three hosts that can talk to each other using Ethernet. On the right hand side, there is a smaller network that consists of two hosts, also connected with each other via Ethernet (this picture is a bit misleading, as in a real modern network, the topology would be different, but let us ignore this for a moment).

NetworksAndGateways

Both networks are connected by a third network, indicated by the dashed line. This network can use any other link layer technology. Each network contains a dedicated host, called the gateway, that connects the network to this third network.

Now suppose host A wants to send a network message to host B. Then, instead of directly using the link layer protocol of its network, i.e. Ethernet, it composes a network message according to a protocol that sits in the second layer of the TCP/IP networking model, called the internet layer, which uses the IP protocol. Similar to an Ethernet packet, this IP packet again contains a header with target and source address, the actual data and a checksum.

Then host A takes this message, puts it into an Ethernet packet and sends it to the gateway. The gateway will now extract the IP message from the Ethernet packet, put it into a message specific to the network connecting the two gateways and transmit this message.

When the message arrives at the gateway on the right hand side, this gateway will again extract the IP message, put it into an Ethernet message and send it via Ethernet to host B. So eventually, the unchanged IP message will reach host B, after traveling through several networks, piggybacked on messages specific to the network protocols used along the way. However, for applications running on hosts A and B, all this is irrelevant – they will only get to see IP messages and do not have to care about the details of the layers below.

In this way, the internet connects many different networks using various different technologies – you can access hosts on the internet from your mobile device using mobile link layer protocols, and still communicate with a host in a traditional data center using Ethernet or something else. In this sense, the internet is a network of networks, powered by the layer model.

But we are not yet done. The IP layer is nice – it allows us to send messages across the internet to other hosts, using the addresses specific to the IP layer (yes, this is the IP address). But it does, for instance, not yet guarantee that a message ever arrives, nor does it provide a way to distinguish between different communication endpoints (aka ports) on one host.

These items are the concern of the next layer, the transport layer. The best known example of a transport layer protocol is TCP. On top of IP, TCP offers features like stream-oriented processing, re-transmission of lost messages and ports as additional address components. For an application using TCP, a TCP connection appears a bit like a file into which bytes can be written and out of which bytes can be read, in a well-defined order and with guaranteed delivery.
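This file-like view is exactly what the socket API of a Unix-like operating system exposes to applications. As a small illustration, here is a sketch using the standard POSIX socket calls (with error handling kept to a minimum and a placeholder IP address) that opens a TCP connection to a web server, writes a request and reads part of the response – from the application's perspective, the connection behaves like a byte stream, and all the layers below remain invisible.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    /* Create a TCP socket and connect to a web server on port 80.
       The IP address below is only a placeholder for the example. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);
    inet_pton(AF_INET, "93.184.216.34", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Write bytes into the connection as if it were a file ... */
    const char *request = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    write(fd, request, strlen(request));

    /* ... and read the bytes coming back, in order and without gaps */
    char buffer[512];
    ssize_t n = read(fd, buffer, sizeof(buffer) - 1);
    if (n > 0) {
        buffer[n] = '\0';
        printf("%s\n", buffer);
    }
    close(fd);
    return 0;
}
```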

Finally, the last layer is the application layer. Common application layer protocols are HTTP (the basis of the world wide web), FTP (the file transfer protocol) and SMTP (the simple mail transfer protocol).

The transitions between the different layers work very much like the transition between the internet layer and the link layer. For instance, if an application uses TCP to send a message, the operating system will take the message (not exactly true, as TCP is byte-oriented and not message-oriented, but let us gracefully ignore this), add a TCP header, an IP header and finally an Ethernet header, and ask the network card to transmit the resulting frame on the Ethernet network. When the message is received by the target host, the operating system of that host strips off the various headers one by one and finally obtains the original data sent by the first host.
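A sketch of this encapsulation step might look as follows. The header structures are reduced to a bare minimum and their fields are left empty, so this is only meant to visualize how each layer prepends its own header in front of the data handed down from the layer above.

```c
#include <stdint.h>
#include <string.h>

/* Strongly simplified header structures - real headers have more fields */
struct __attribute__((packed)) tcp_hdr { uint16_t src_port, dst_port; };
struct __attribute__((packed)) ip_hdr  { uint32_t src_addr, dst_addr; };
struct __attribute__((packed)) eth_hdr { uint8_t dst_mac[6], src_mac[6]; uint16_t ethertype; };

/* Build an outgoing Ethernet frame by prepending one header per layer.
   Returns the total frame length; 'frame' must be large enough. */
uint32_t encapsulate(const uint8_t *payload, uint32_t len, uint8_t *frame) {
    uint32_t offset = 0;

    struct eth_hdr eth = {0};   /* filled in by the link layer */
    struct ip_hdr  ip  = {0};   /* filled in by the IP layer   */
    struct tcp_hdr tcp = {0};   /* filled in by the TCP layer  */

    memcpy(frame + offset, &eth, sizeof(eth)); offset += sizeof(eth);
    memcpy(frame + offset, &ip,  sizeof(ip));  offset += sizeof(ip);
    memcpy(frame + offset, &tcp, sizeof(tcp)); offset += sizeof(tcp);
    memcpy(frame + offset, payload, len);      offset += len;

    return offset;   /* Ethernet header | IP header | TCP header | data */
}
```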

The part of the operating system responsible for this mechanism is naturally itself organized in layers – there could, for instance, be an IP layer that receives data from higher layers without having to care whether the data represents a TCP message or a UDP message. This layer then adds an IP header and forwards the resulting message to another layer responsible for operating the Ethernet network device, and so forth. Due to this layered architecture, stacking different components on top of each other, the part of the operating system handling networking is often called the networking stack. The Linux kernel, for instance, is loosely organized in this way (see for instance this article).

This completes our short overview of the networking stack. In the next few posts, we will look at each layer one by one, getting our hands dirty again, i.e. we will create and inspect actual network messages and see the entire stack in action.