## Grover’s algorithm – unstructured search with a quantum computer

In the last post, we have looked at the Deutsch-Jozsa algorithm that is considered to be the first example of a quantum algorithm that is structurally more efficient than any classical algorithm can probably be. However, the problem solved by the algorithm is rather special. This does, of course, raise the question whether a similar speed-up can be achieved for problems that are more relevant to practical applications.

In this post, we will discuss an algorithm of this type – Grover’s algorithm. Even though the speed-up provided by this algorithm is rather limited (which is of a certain theoretical interest in its own right), the algorithm is interesting due to its very general nature. Roughly speaking, the algorithm is concerned with an unstructured search. We are given a set of N = 2n elements, labeled by the numbers 0 to 2n-1, exactly one of which having a property denoted by P. We can model this property as a binary valued function P on the set $\{0,N-1\}$ that is zero on all but one elements. The task is to locate the element x0 for which P(x0) is true.

## Grover’s algorithm

Grover’s algorithm presented in [1] proceeds as follows to locate this element. First, we again apply the Hadamard-Walsh operator W to the state $|0 \rangle$ of an n-qubit system to obtain a superposition of all basis states. Then, we iteratively apply the following sequence of operations.

1. Apply a conditional phase shift S, i.e. apply the unique unitary transformation that maps $|x \rangle$ to $(-1)^{f(x)} |x \rangle$.
2. Apply the unitary transformation D called diffusion that we will describe below

Finally, after a defined number of outcomes, we perform a measurement which will collaps the system into one of the states $|x \rangle$. We claim – and will see why this is true below – that for the right number of iterations, this value x will, with a high likelihood, be the solution to our problem, i.e. equal to x0.

Before we can proceed, we need to define the matrix D. This matrix is $\frac{2}{N} - 1$ along the diagonal, with N = 2n, and $\frac{2}{N}$ away from the diagonal. In terms of basis vectors, the mapping is given by

$|i \rangle \mapsto \frac{2}{N} (\sum_j |j \rangle) - |i \rangle$

Consequently, we see that

$D \sum_i a_i |i \rangle = \sum_i (2 \bar{a} - a_i) |i \rangle$

where $\bar{a}$ is the average across the amplitudes $a_i$. Thus geometrically, the operation D performs an inversion around the average. Grover shows that this operation can be written as minus a Hadamard-Walsh operation followed by the operation that flips the sign for $|0 \rangle$, followed by a second Hadamard-Walsh transformation.

For the sake of completeness, let us also briefly discuss the first transformation employed by the algorithm, the conditional phase shift. We have already seen a similar transformation while studying the Deutsch-Jozsa algorithm. In fact, we have shown in the respective blog post that the circuit displayed below (with the notation slightly changed)

performs the required operation

$|\psi \rangle = \sum_x a_x |x \rangle \mapsto |\psi' \rangle = \sum_x a_x (-1)^{P(x)} |x \rangle$

Let us now see how why Grover’s algorithm works. Instead of going through the careful analysis in [1], we will use bar charts to visualize the quantum states (exploiting that all involved matrices are actually real valued).

It is not difficult to simulate the transformation in a simple Python notebook, at least for small values of N. This script performs several iterations of the algorithm and prints the result. The diagrams below show the outcome of this test.

Let us go through the diagrams one by one. The first diagram shows the initial state of the algorithm. I have used 3 qubits, i.e. n = 3 and N = 8. The initial state, after applying the Hadamard-Walsh transform to the zero state, is displayed in the first line. As expected, all amplitudes are equal to 1 over the square root of eight, which is approximately 0.35, i.e. we have a balanced superposition of all states.

We now apply one iteration of the algorithm. First, we apply the conditional phase flip. The element we are looking for is in this case located at x = 2. Thus, the phase flip will leave all basis vectors unchanged except for $|2 \rangle$ and it will change the amplitude of this vector to – 0.35. This will change the average amplitude to a value slightly below 0.35. If we now perform the inversion around the average, the amplitudes of all basis vectors different from $|2 \rangle$ will actually decrease, whereas the amplitude of $|2 \rangle$ will increase. The result is displayed in the second line of the diagram.

Thus, what really happens in this case is an amplitude amplification – we increase the amplitude of one component of the superposition while decreasing all the others.

The next few lines show the result of repeating these two steps. We see that after the second iteration, almost all of the amplitude is concentrated on the vector $|2 \rangle$, which represents the solution we are looking for. If we now perform a measurement, the result will be 2 with a very high probability!

It is interesting to see that when we perform one more iteration, the difference between the amplitude of the solution and the amplitudes of all other components decreases again. Thus the correct choice for the number of iterations is critical to make the algorithm work. In the last line, we have plotted the difference between the amplitude of $|2 \rangle$ and the other amplitudes (more precisely, the ratio between the amplitude of $|2 \rangle$ and the second largest amplitude) on the y-axis for the different number of iterations on the x-axis. We see that the optimal number of iterations is significantly below 10 (actually five iterations give the best result in this case), and more iterations decrease the likelihood of getting the correct result out of the measurement again. In fact, a careful analysis carried out in [2] shows that for large values of N, the best number of iterations is given by $\frac{\pi}{4} \sqrt{N}$, and that doubling the number of iterations does in general lead to a less optimal result.

## Generalizations and amplitude amplification

In a later paper ([3]), Grover describes a more general setup which is helpful to understand the basic reason why the algorithm works – the amplitude amplification. In this paper, Grover argues that given any unitary transformation U and a target state (in our case, the state representing the solution to the search problem), the probability to meet the target state by applying U to a given initial state can be amplified by a sequence of operations very much to the one considered above. We will not go into details, but present a graphical representation of the algorithm.

So suppose that we are given an n-qubit quantum system and two basis vectors – the vector t representing the target state and an initial state s. In addition, assume we are given a unitary transformation U. The goal is to reach t from s by subsequently applying U itself and a small number of additional gates.

Grover considers the two-dimensional subspace spanned by the vectors s and U-1t. If, within this subspace, we ever manage to reach U-1t, then of course one more application of U will move us into the desired target state.

Now let us consider the transformation

$Q = - I_s U^{-1} I_t U$

where Ix denotes a conditional phase shift that flips the phase on the vector $|x \rangle$. Grover shows that this transformation does in fact leave our two-dimensional subspace invariant, as indicated in the diagram below.

He then proceeds to show that for sufficiently small values of the matrix element Uts, the action of Q on this subspace can approximately be described as a rotation by the angle $\frac{2}{|U_{ts}|}$. Applying the operation Q n times will then approximately result in a superposition of the form

$\cos (\frac{2n}{|U_{ts}|})|s \rangle + a U^{-1}|t \rangle$

Thus if we can make the first coefficient very small, i.e. if $\frac{4n}{|U_{ts}|}$ is close to a multiple of $\pi$, then one application of U will take our state to a state very close to t.

Let us link this description to the version of the Grover algorithm discussed above. In this version, the initial state s is the state $|0 \rangle$. The transformation U is the Hadamard-Walsh transformation W. The target state is $|x_0 \rangle$ where x0 is the solution to the search problem. Thus the operation It is the conditional phase shift that we have denoted by S earlier. In addition, Grover shows in [1] already that the diffusion operator D can be expressed as -W I0 W. Now suppose we apply the transformation Q n times to the initial state and then apply U = W once more. Then our state will be

$W Q \dots Q |0 \rangle$

This can be written as

$W (-I_0 W S W) (-I_0 W S W) \dots (-I_0 W S W) |0 \rangle$

Regrouping this and using the relation D = -W I0 W, we see that this is the same as

$- (W I_0 W) S (- W I_0 W) S \dots \dots (- W I_0 W) S W |0 \rangle = D S \dots D S W |0 \rangle$

Thus the algorithm can equally well be described as applying W once to obtain a balanced superposition and then applying the sequence DS n times, which is the formulation of the algorithm used above. As $|U_{ts} | = \frac{1}{\sqrt{N}}$ in this case, we also recover the result that the optimal number of iterations is $\frac{\pi}{4} \sqrt{N}$ for large N.

## Applications

Grover’s algorithm is highly relevant for theoretical reasons – it applies to a very generic problem and (see the discussions in [1] and [2]) is optimal, in the sense that it provides a quadratic speedup compared to the best classical algorithm that requires O(N) operations, and that this cannot be improved further. Thus Grover’s algorithm provides an interesting example for a problem where a quantum algorithm delivers a significant speedup, but no exponential speedup as we will see it later for Shor’s algorithm.

However, as discussed in detail in section 9.6 of [4], the relevance of the algorithm for practical applications is limited. First, the algorithm applies to an unstructured search, i.e. a search over unstructured data. In most practical applications, we deal with databases that have some sort of structure, and then more efficient search algorithms are known. Second, we have seen that the algorithm requires $O(\sqrt{N})$ applications of the transformation UP. Whether this is better than the classical algorithm does of course depend on the efficiency with which we can implement this with quantum gates. If applying UP requires O(N) operations, the advantage of the algorithm is lost. Thus the algorithm only provides a significant speedup if the operation UP can be implemented efficiently and there is no additional structure that a classical algorithm could exploit.

Examples of such problems are brute forces searches as they appear in some crypto-systems. Suppose for instance we are trying to break a message that is encrypted with a symmetric key K, and suppose that we know the first few characters of the original text. We could then try to use an unstructured search over the space of all keys to find a key which matches at least the few characters that we know.

In [5], a more detailed analysis of the complexity in terms of qubits and gates that a quantum computer would have to attack AES-256 is made, arriving at a size of a few thousand logical quantum bits. Given the current ambition level, this does not appear to be completely out of reach. It does, however, not render AES completely unsecure. In fact, as Grover’s algorithm essentially results in a quadratic speedup, a code with a key length of n bits in a pre-quantum world is essentially as secure as the same code with a key length of 2n in a post-quantum world, i.e. roughly speaking, doubling the key length compensates the advantage of quantum computing in this case. This is the reason why the NIST report on post-quantum cryptography still classifies AES as inherently secure assuming increased key sizes.

In addition, the feasibility of a quantum algorithm is not only determined by the number of qubits required, but also by other factors like the depth, i.e. the number of operations required, and the number of quantum gates – and for AES, the estimates in [5] are significant, for instance a depth of more than 2145 for AES-256, which is roughly 1043. Even if we assume a switching time of only 10-12 seconds, we still would require astronomical 1031 seconds, i.e. in the order of 1023 years, to run the algorithm.

Even a much less sophisticated analysis nicely demonstrates the problem behind these numbers – the number of iterations required. Suppose we are dealing with a key length of n bits. Then we know that the algorithm requires

$\frac{\pi}{4} \sqrt{2^n} \approx 0.8 \sqrt{2}^n$

iterations. Taking the decimal logarithm, we see that this is in the order of 100.15*n. Thus, for n = 256, we need in the order of 1038 iterations – a number that makes it obvious that AES-256 can still be considered secure for all practical purposes.

So overall, there is no reason to be overly concerned about serious attacks to AES with sufficiently large keys in the near future. For asymmetric keys, however, we will soon see that the situation is completely different – algorithms like RSA or Elliptic curve cryptography are once and for all broken as soon as large-scale usable quantum computer become reality. This is a consequence of Shor’s algorithm that we will study soon. But first, we need some more preliminaries that we will discuss in the next post, namely quantum Fourier transforms.

## References

1. L.K. Grover, A fast quantum mechanical algorithm for database search, Proceedings, 28th Annual ACM Symposium on the Theory of Computing (STOC), May 1996, pages 212-219, available as arXiv:quant-ph/9605043v3
[2] M. Boyer, G. Brassard, P. Høyer, A. Tapp, Tight bounds on quantum searching, arXiv:quant-ph/9605034
3. L.K. Grover, A framework for fast quantum mechanical algorithms, arXiv:quant-ph/9711043
4. E. Rieffel, W. Polak, Quantum computing – a gentle introduction, MIT Press
5. M. Grassl, B. Langenberg, M. Roetteler, R. Steinwandt, Applying Grover’s algorithm to AES: quantum resource estimates, arXiv:1512.04965

## Get your kernel going – the boot process

Most careers in operating system development probably start with a seemingly simple task – produce a program that, at start time, takes full control of a computer and prepares for the execution of the actual operating system, i.e. boot the computer. It turns out, however, that – thanks to the complexity of modern x86 based hardware – this is easier said than done. In this post, we will look at the boot process in a bit more detail.

## History of the boot process

To understand the boot process in a modern PC, it is actually useful to go back into the past a bit. The first home computer that I owned was an Amstrad CPC 464 with a Zilog Z80 CPU. Like any other CPU I am aware of, this CPU stores the memory location of the next instruction to be executed in a register, called – in this case – the program counter (PC). When you turn on that machine, the CPU is in its initial state, with the value of the PC being equal to its initial value 0x0000. Now, at this point in memory space, there was actually no RAM, but a ROM that contained the CPC firmware. Starting at address 0x0000, the so-called restart block was located which then took over and initialized the system. And let the fun begin …(picture taken on the fabulous CPCBox, a browser-based CPC emulator).

This mechanism is straightforward and fast – when you turn on the machine, the operating system is available immediately and you can start to work with it. However, there are obvious disadvantages, the most important one being that updates are difficult once a machine is shipped, and even late in the manufacturing process.

The developers of the famous IBM PC XT decided to use a different approach. Instead of loading the full operating system from ROM, the ROM did only contain a firmware called the BIOS, a small set of routines needed for some basic operations, like reading from the hard drive or a disk, and a short startup routine. When the system comes up, the CPU again starts execution at some predefined point in memory, where the BIOS startup routine is located. This routine would then typically execute some self-tests and then load the actual operating system from the hard drive or a floppy disk.

Loading the operating system does actually happen in two phases. In the first phase, the BIOS loads the so-called boot loader. The boot loader is located at a fixed location of the hard drive, specifically in the first sector (called the master boot record (MBR)), and occupies a bit less than 512 bytes. This boot loader is brought into memory and then executed. It is then responsible for locating the actual operating system and loading it.

So far, this is comparatively simple. Unfortunately, PC hardware then started to evolve, and with every step in the evolution, the boot process had to be adapted so that it could make use of new features, but still be backwards compatible. For instance, when the x86 CPU was enhanced by adding the so-called protected mode in 1985, the BIOS would still operate in the older real mode to stay compatible with legacy operating systems. When an operating system wanted to make use of the new features of the protected mode, either the boot loader or the early stages of the operating system kernel had to switch the CPU into protected mode. However, as the BIOS is still 16 bit code, its routines can no longer be accessed once that switch has been completed, in particular reading from a hard drive is no longer easily possible until the initialization of the operating system has reached a certain point. This makes for interesting chicken-and-egg problems: if the operating system that needs to be loaded exceeds a certain size, you will have to switch to protected mode to be able to utilize the full memory, but once this is done it becomes much more complicated to access the hard drive to load the operating system files.

To solve this, several sophisticated boot mechanisms were implemented by various boot loaders. One approach was to actually compress the operating system image so that it could be loaded into memory once the BIOS was still accessible, then perform the switch to protected mode and then decompress the image. Another approach is to boot the system in several stages. The boot code for stage 0 would, for instance, be stored in the MBR, which would then load a stage 1. This stage 1 would contain sufficient functionality to read data from a standard file system, including a hard disk driver. It would then switch to protected mode and use this functionality to load the stage 2, which would then start the actual operating system. The GRUB bootloader, for instance, uses a similar (but slightly different) mechanism.

## Booting a modern PC

Roughly around 2005, the industry started to introduce the UEFI as the next generation standardized PC firmware, aiming to become the successor of the outdated BIOS. The UEFI is much more powerful than the BIOS. Instead of loading one sector from a specified location on the hard drive, the UEFI is able to read from a file system. Thus the initial operating system image can be an ordinary executable file, stored on a file system. The UEFI will search certain standard directories for files with predefined names (like BOOTX64.EFI) and execute such a file if it is found. This file then executes in an UEFI environment in 32 bit protected mode or 64 bit long mode and has access to UEFI services to perform operations like input, output or hard drive access. These services can then be used to load the operating system, before eventually control of the system is handed over to the operating system and the UEFI services must no longer be used.

However, the BIOS used to be more than just a set of routines – it also contained several configuration tables that provide basic information on the hardware relevant for the operating system, like the available graphics modes, the number and type of available CPUs, the available memory and reserved memory areas and information on the configuration of the system interrupts. This information is now presented to an operating system as part of the ACPI (Advanced configuration and power interface). The ACPI is a complex collection of tables, some of which are static, some of which are actually dynamic, i.e. contain byte code that needs to be run by a bytecode interpreter.

So far, the description has focused on the standard case for a home PC – booting from a local storage like hard drive, an USB stick or a CD-ROM. However, several methods exist to boot over a network as well. The PXE boot protocol, for example, is a mechanism to boot an operating system over a network. When PXE boot is used, either a boot loader or a sufficiently modern BIOS initializes the clients network card and establishes a connection to server. The server runs a TFTP server (TFTP is a simplified version of the FTP protocol) and the client uses the TFTP protocol to transfer an operating system image over the network. This is the technology behind thin clients, i.e. clients that only contain a ROM based firmware capable of executing a PXE based boot process, but no local storage and no local copy of the operating system.

To get an idea how the full boot process looks like on a modern PC, let us take a look at the sequence of steps that are executed when my homebrew operating system ctOS starts up. Here is a screenshot of the system once the startup sequence is completed (taken using the QEMU emulator).

The most important steps that actually execute in the background and produce this output are as follows.

1. The hardware (or the emulator) loads the BIOS (SeaBIOS in this case)
2. The BIOS locates the boot loader (GRUB2 in my case) on the hard drive and loads it in several stages
3. GRUB2 starts to execute on the first CPU in the system, called the bootstrap CPU
4. When GRUB2 hands over control to the operating system, the system is already in protected mode, however certain system tables are in a preliminary or incomplete state. Thus these tables are initialized first
5. Then all components of the kernel are initialized. This includes device drivers, interrupts, the memory manager, the process manager and the TCP/IP stack.
6. During this stage, the system will also read the BIOS or ACPI configuration tables and detect and initialize other hardware components like the interrupt controller, the system clock, the keyboard, or attached PCI devices like hard drives and network cards

7. Next, all the other CPUs in the system are initialized and put into protected mode
8. Finally, the root process is started and the command line interface is located on the disk and started – at this point the boot process is complete and the user takes over control of the system

Of course for a more advanced operating system, the actual boot process would do much more – loading a graphical user interface, for instance, or starting system daemons. However, technically, once we get to the point that a user space program can execute, most of the low level subtleties are mastered and we can work in a stable environment.

This concludes our discussion of the boot process. With the next post in this series, I will start our trip through the various components of an operating system. Happy booting!

## Quantum algorithms – a first example

We have now seen how quantum gates implement unitary transformations on qubits. Intuitively, we should, as in the classical case, now be able to combine these quantum gates into quantum algorithms that perform more complex sequences of manipulations of a quantum state. In this post, we will look at a first example to get an idea how this works. The material in this section is well covered by standard textbooks, like [1] section 3.8 or [2], chapter 6 and chapter 7. In addition, the fundamental structure of quantum algorithm is discussed in the classical papers [3] and [4].

In general, a quantum algorithm can be broken down into several phases. First, we need to initialize the system in a known initial state. Then, we apply a series of transformations, which, as we know, can be described in terms of quantum gates. Finally, we perform a measurement to derive a result.

To illustrate this, let us assume that we are given a more or less arbitrary classical problem that we want to solve using a quantum computer. Formally, this could be a function

$f \colon {\{0,1\}}^n \rightarrow {\{0,1\}}^m$

mapping binary numbers of length n to binary numbers of length m. We might have an algorithm to calculate this function on a classical computer and are looking for a quantum version of this algorithm.

The first problem we need to solve is that, as we know, quantum algorithms are given by unitary transformations, in particular they are reversible. In the most general setting however, a function f like the one above is not reversible – even if n is equal to m, there is no guarantee that the function is invertible. Fortunately, that can easily be fixed. If we let $l = m + n$ and consider the function

$\{0,1\}^l \rightarrow \{0,1\}^l$

that maps a pair $(x,y)$ with $x \in \{0,1\}^n$ and $y \in \{0,1\}^m$ to $(x, y \oplus f(x))$, where $\oplus$ is the bitwise exlusive or, we obtain an invertible function (as $y = (y \oplus f(x)) \oplus f(x))$), and we can recover our old function $f$ if we let $y = 0$.

Now let us consider a quantum computer that has l qubits, so that its states are given by elements of a Hilbert space of dimension 2l. We can then identify the bit strings of length l, i.e. the elements of $\{0,1\}^l$, with the natural numbers between 0 and 2l-1 and these numbers in turn with elements of the standard basis – the bit string $0 \dots 11$ for instance would correspond to the element of the standard basis that we denote by $|3\rangle$. Thus, the function described above defines a permutation of the elements of the standard basis. Consequently, by the rules of linear algebra, there is exactly one unitary transformation Uf with the property that (with that identification)

$U_f(|x, y\rangle) = |x, y \oplus f(x)\rangle$

The transformation Uf is a unitary transformation and, as any unitary transformation, can therefore be decomposed into a sequence of elementary quantum gates. Let us now see how these quantum gates yield an algorithm to calculate f.

## Preparing the initial state

First, we need to place our quantum system in some well defined state. For our example, we will use the so-called Hadamard-Walsh transformation to do this. This transformation is defined as the n-fold tensor product of the Hadamard operator H with itself and often denoted by W. Let us see how this operator acts on the state $|00 \dots 0 \rangle$. First, we can write this as

$W |00 \dots 0 \rangle = (H \otimes \dots \otimes H) (|0\rangle \otimes \dots \otimes |0 \rangle)$

Applying the Hadamard operator to each of the n factors individually, we see that this is

$W |00 \dots 0 \rangle = \frac{1}{\sqrt{2^n}} (|0\rangle + |1\rangle) \otimes \dots \otimes (|0 \rangle + |1\rangle)$

Multiplying out this expression, we see that each possible combination of ones and zeros appears exactly once within the sum. In other words, each of the states $|x\rangle$ appears exactly once, where x ranges from 0 to 2n-1. Consequently, we see that

$W |00 \dots 0 \rangle = \frac{1}{\sqrt{2^n}} \sum_0^{2^{n-1}} |x\rangle$

In our case, the state space is a product of n qubits represented by the variable x and m qubits represented by the variable y. As initial state for our quantum algorithm, we will now choose the state that is obtained by applying the Hadamard-Walsh operator W to the first n qubits of the state $|0,0\rangle$ and the identity to the remaining m qubits. Thus, our initial state will be

$(W \otimes I) (|0, 0\rangle) = \frac{1}{\sqrt{2^n}} \sum_0^{2^{n-1}} |x, 0\rangle$

## Transformation and measurement

Having defined the initial state, let us now apply the transformation Uf. Using the fact that this transformation is of course linear and the formula for $U_f|x,y\rangle$ obtained above, we easily see that applying this transformation to our initial state yields the state

$U_f (W \otimes I) (|0, 0\rangle) = \frac{1}{\sqrt{2^n}} \sum_0^{2^{n-1}} |x, f(x)\rangle$

It is worth pausing for a moment to take a closer look at this expression. The right hand side of this expression is a superposition of states that encodes the value of f(x) for all possible values of x. Thus, if we can find a way to physically implement quantum devices that are able to

• Initialize an n+m qubit quantum system in a well-known state like $|0,0\rangle$
• realize the transformations $W \otimes I$ and $U_f$ by a set of quantum gates

we are able to create a state that somehow computes all values f(x) in only one pass of the quantum algorithm.

At the first glance, this sounds like magic. Regardless how big n is, we would be able to calculate all values of f(x) in only one pass of our quantum algorithm! In other words, we would calculate all values of f(x) fully in parallel. This (apparent) feature of quantum computers is often referred to be the term quantum parallelism.

However, there is a twist. Even if we are able to apply a quantum algorithm in order to put our system in a state as above encoding all values of f(x), we still need to be able to extract the actual values of f(x) from that state. Thus, we need to perform a measurement.

If, for instance, we are using a measurement device that corresponds to a projection onto the individual states $|x,y\rangle$, i.e. to the basis of our Hilbert space, then applying that measurement would yield a state

$|x, f(x)\rangle$

for some value of x. Unfortunately, we cannot tell upfront to which of these vectors we would project. In fact, all these vectors appear with equal amplitude and the probability to obtain each of these vectors is $\frac{1}{2^n}$. Thus to retrieve the value of f(x) for a given value of x, we would have a rather small probability to be successful with only one pass of the algorithm.

But the situation is even worse – once we have executed our measurement, the superposition created by our algorithm is destroyed and the system has collapsed to one of the eigenstates. Thus, if we wanted to perform a large number of measurements in order to obtain f(x) for a given value of x with high probability, we would have to execute our algorithm again for each of the measurements, resulting in a large number of passes. It appears that the advantage of quantum parallelism is lost again….

Fortunately, the situation is not as hopeless as one might think. In many cases, we are not actually interested in all values of f(x). Rather, we are interested in some property of the function f, maybe even in a property that has a simple yes-no answer and can therefore be represented by a single bit. If we find a way to adapt our measuring process by projecting to a different basis or by transforming our state further before measuring, we might be able to get our answer with one measurement only. In the next section, we will look at an algorithm which is in a certain sense the blueprint and inspiration of many other quantum algorithms exploiting this approach.

## The Deutsch-Jozsa algorithm

The Deutsch-Jozsa algorithm is a quantum algorithm that tackles the following problem. Suppose we are given a function

$f \colon \{0,1\}^n \rightarrow \{0,1\}$

We call such a function balanced if the number of times this function takes the value 0 is equal to the number of times this function takes the value 1 (which is obviously only possible if n is even which we will assume).

Now assume that we already know that either f is balanced or f is constant. We want to determine which of these two properties holds. With a classical algorithm, we would have to call f at most n/2 + 1 times, but might as well need this number of invocations for some functions f. In [4], Deutsch and Jozsa presented a quantum algorithm to solve this problem which requires only two applications of the unitary transformation Uf which is the equivalent of applying f in the quantum world, which is generally considered to be the first quantum algorithm that demonstrates a true advantage compared to a classical computation.

To describe the algorithm, we follow the general approach outlined above. We consider a quantum system with n+1 qubits and again denote a basis of the corresponding Hilbert space by $|x, y \rangle$ where x is a n-bit number and y is 0 or 1. We again obtain an initial state

$|\Phi_0 \rangle = (H \otimes \dots H \otimes I) |0,0 \rangle = \frac{1}{\sqrt{2^n}} \sum_x |x,0 \rangle$

by applying the Hadamard-Walsh transformation to the state $|0,0 \rangle$. The algorithm then proceeds by applying Uf resulting in the state

$U_f |\Phi_0 \rangle = \frac{1}{\sqrt{2^n}} \sum_x |x, f(x) \rangle$

Now an additional “smart” operation comes into play. First, we use an operator which is denoted by S in [4] and is defined by

$S|x,y\rangle = (-1)^y |x,y \rangle$

(this is a bit in conflict with our previous notation where we have used the symbol S for the phase shift operator, but let us again this for now to stay aligned with the notation in [4]). When we apply the operator S to our state, we obtain

$S U_f |\Phi_0 \rangle = \frac{1}{\sqrt{2^n}} \sum_x (-1)^{f(x)} |x, f(x) \rangle$

Finally, we apply the inverse of Uf, which is actually Uf itself, as a straightforward calculation on basis vectors using the definition of Uf shows. Thus we end up with the state

$|\psi \rangle = U_f S U_f |\Phi_0 \rangle = \frac{1}{\sqrt{2^n}} \sum_x (-1)^{f(x)} |x, 0 \rangle$

Let us now compute the overlap, i.e. scalar product, of this state with the state $|\Phi_0 \rangle$. As clearly

$\langle \Phi_0 | x,0 \rangle = \frac{1}{\sqrt{2^n}}$

we obtain

$\langle \Phi_0 | U_f S U_f |\Phi_0 \rangle = \frac{1}{2^n} \sum_x (-1)^{f(x)}$

Now suppose that we perform a measurement corresponding to the projection onto the subspace spanned by $|\Phi_0 \rangle$ and let us try to understand the possible outcomes of the measurement. If the function f is constant, then the above sum is simply $\pm 1$, depending on the sign of f. Thus our measurement will give us one with certainty, as our state is simply a scalar multiple of $|\Phi_0 \rangle$. If, however, f is not constant but balanced, the sum is zero. Thus we will obtain the value zero with probability one. Thus we can decide whether f is balanced or constant with only one measurement!

How would we actually perform such a measurement, assuming that we can measure the individual qubits of our quantum computer? To perform a measurement corresponding to the projection onto $|\Phi_0 \rangle$, a natural approach is to apply a transformation which takes $|\Phi_0 \rangle$ to one of the basis vectors $|x, y \rangle$, say to $|0,0 \rangle$. But this is exactly the Hadamard-Walsh transformation. As this transformation is unitary and hermitian, we have

$\langle \Phi_0 | U_f S U_f |\Phi_0 \rangle = \langle 0,0 | (W \otimes I) U_f S U_f |\Phi_0 \rangle$

and we see that

$\langle 0,0 | (W \otimes I) U_f S U_f |\Phi_0 \rangle = \frac{1}{2^n} \sum_x (-1)^{f(x)}$

We can now repeat our previous argument to link the structure of f to the outcome of a measurement. In fact, if f is constant, the scalar product is one and our final state will be a multiple of $|0,0 \rangle$, if f is balanced, the scalar product will be zero and our state has no overlap with $|0, 0 \rangle$.

Now measurement in the standard basis corresponds to successive measurement of all individual qubits. Thus in the first case, all bits will be measured as zero with certainty, and in the second case, at least one bit will be measures as one with certainty. Hence we arrive at the following form of the algorithm.

• Prepare the quantum system in the initial state $|0, 0 \rangle$
• Apply the Hadamard-Walsh transformation to the first n qubits
• Apply Uf, then apply the phase shift operator f and then apply Uf once more
• Finally apply the Hadamard-Walsh transformation once more to the first n qubits and measure all qubits
• If the measurement yields zero for all qubits, the function f is constant, otherwise the function f is balanced

This version of the algorithm, which is essentially the version presented in the original paper [4], uses two invocations of Uf. Later, a slighly more efficient version of the algorithm was developed ([5]) that only requires one application of Uf. From a theoretical point of view, it is less important whether we need one or two “calls” to Uf – the crucial point is that quantum parallelism makes the number of invocations constant, independent of the value of n. However, from a practical point of view, i.e. when it comes to an implementation on a real quantum computer, being able to reduce the number of involved qubits and gates is of course essential. We briefly sketch the required circuit which we will also see again when we discuss other algorithms.

Given a function f, this circuit will create the state $|\psi \rangle$ from the $| \Phi_0 \rangle$ with only one application of Uf. It uses an ancilla qubit that is initialized in the state $|1\rangle$. The graphical representation is as follows.

Let us quickly check that the circuit is doing what we want. It starts with the superposition across all states x that we denote by

$| \Phi \rangle = \frac{1}{\sqrt{2^n}} \sum_x |x\rangle$

Then we add an ancilla qubit initialized to $|1\rangle$ to which we apply the Hadamard transformation. Now our state is

$| \Phi \rangle \otimes H|1\rangle = \frac{1}{\sqrt{2^{n+1}}} \sum_x (|x, 0 \rangle - |x,1 \rangle)$

We now let Uf act on this state. If f(x) is zero for some value of x, this transformation maps $|x, 0 \rangle$ to itself and $|x, 1 \rangle$ to itself. If, however, f(x) is one, it maps $|x, 0 \rangle$ to $|x, 1 \rangle$ and $|x, 1 \rangle$ to $|x, 0 \rangle$. Thus we find that

$U_f \sum_x (|x, 0 \rangle - |x,1 \rangle) = \sum_x (-1)^{f(x)} ( |x, 0 \rangle - |x,1 \rangle)$

and consequently

$U_f | \Phi \rangle \otimes H|1\rangle = \frac{1}{\sqrt{2^{n+1}}} \sum_x (-1)^{f(x)} (|x, 0 \rangle - |x,1 \rangle)$

which is the same as

$\frac{1}{\sqrt{2^n}} (\sum_x (-1)^{f(x)} |x\rangle) \otimes H|1\rangle$

Now applying the Hadamard operator once more to the ancilla qubit followed by the negation operator X transforms the ancilla qubit to $|0 \rangle$ and we end up with the state $|\psi \rangle$ as claimed.

The Deutsch-Jozsa algorithm is widely acknowledged as the first real quantum algorithm that leverages the potential of a quantum computer and achieves a real improvement compared to classical hardware, while still yielding a deterministic result despite the stochastic nature of quantum systems. Of course, the problem solved by the algorithm is mainly of academical interest. In the next few posts, we will move on to study other algorithms that have been developed in the sequel that are more relevant for actual applications.

## References

1. G. Benenti, G. Casati, G. Strini, Principles of Quantum Computation and Information, Volume 1, World Scientific
2. E. Rieffel, W. Polak, Quantum computing – a gentle introduction, MIT Press
3. D. Deutsch, Quantum Theory, the Church-Turing principle and the Universal Quantum Computer, Proceedings of the Royal Society of London A. Vol. 400, No. 1818 (1985), pp. 97-117
4. D. Deutsch, R. Jozsa, Rapid Solution of Problems by Quantum Computation, Proceedings of the Royal Society of London A, Vol. 439, No. 1907 (1992), pp. 553-558
5. R. Cleve, A. Ekert, C. Macchiavello, M. Mosca, Quantum algorithms revisited, arXiv:quant-ph/9708016

## Why building an operating system from scratch?

What happens if you turn on a PC? How is an operating system able to run multiple tasks in parallel? What happens if you hit a key on your keyboard? And what actually is a process? If you have ever thought for more than a second about one of these things, then read on…

A few years back, actually in late 2010 or early 2011, I was at a point where my growing interest in computer science naturally lead me to those questions. So I did what most of us would probably do – I started to search for information on this and tried to understand what I got.

One day, I hit upon a tutorial that explained the boot process of a PC, and actually contained a short instruction on how to write a very simple program that would act as a boot loader – you would copy this program to a floppy disk (yes, that still existed in some machines at this time) or a CD-ROM and boot from there, and – voila – instead of a Windows or Linux coming up, the system would run your program and print some deep wisdom like “Hello World!” onto the screen.

Now printing something is easy, but what if I wanted the user to be able to enter something and print a response? Obviously, this requires some sort of communication with the keyboard and there is no operating system around that can help you with that, so I needed to learn about a thing called keyboard controller. But wait, why could I only use a few kB of memory? I found that I needed to switch from real mode to protected mode, install interrupt handlers … and slowly, the attempt to just build something that would boot turned into more.

Over the next roughly 18 months, I continued to walk down this path, and eventually had build a small Unix operating system kernel. I then added a simple command line interface, graphics mode and a C library and eventually got to a point where I was able to take simple programs originally written for Linux or other Unix-like operating systems, and compile and run them on my new operating system, which I did call ctOS. Here is a screenshot – my OS is running in a virtual machine on the left, and on the right you see a few measurements of the physical CPU load while running some multi-threading tests within ctOS.

As you might imagine, this required quite some work, and I eventually got to a point where I spend more time working on other things and further development came to an end.

Earlier this year, I was cleaning up a few things on my hard drive and – more or less by accident – hit upon my old code archive again. Being a curious person, I was wondering whether I would still be able to compile and run the system – and of course it badly failed when I tried it. This was not really surprising – my development environment had changed from a 32 bit system to a 64 bit system, and the emulators that I used to run and test the OS had changed quite a bit as well, fixing defects that did actually hide away some bugs in my OS. And of course the “average PC” had changed quite a lot since 2011. The BIOS is now a thing of the past, PCs use UEFI to boot and ACPI tables to store their configuration instead of BIOS tables, and you will hardly ever find a system with a floppy disk drive and only one CPU any more.

So I decided to invest some time to make my OS work again and to adapt it to the current generation of hardware. This included building a new development environment, fixing some defects that become suddenly apparent as the emulator had changed, making the system ACPI aware, removing all dependencies on the BIOS etc. I did also clean up the source code and migrated everything to a Github repository and added a few additional ports.

I have a realistic view on how much time I will probably have in the future to further develop and improve my OS. However, it can still serve a purpose – education. After all, this is what it was created for: helping me to understand some of the inner workings of an operating system.

In fact, my plan is to publish a series of blog posts on the principles and structures behind a Unix operating system, using ctOS to illustrate these concepts and linking to the documentation I have created over the years for the details. This will cover topics like the overall structure of an operating system, processes and the scheduler, interrupts, memory management, device driver and the networking stack. So if you ever wanted to understand how an operating system really works this is the time for you to learn it – stay tuned and join me again for the first part of the series. Until then, you might want to check out the ctOS source code on Github and the documentation that comes with it – if you want, you can even download some of the binary packages on the release page and play with it.

## Quantum gates

So far, we have looked at states of a quantum computer and expressed these states in terms of qubits. However, just having a static state is of very limited use – to perform actual computations, we of course have to change our state over time.

In quantum mechanics, state changes are described by unitary transformations on the Hilbert space describing our quantum system. In analogy with a classical computer, where signals traverse logical gates and are manipulated by these gates to perform calculations, these transformations are called quantum gates. However, please keep in mind that in most actual implementations of quantum computers, these gates are not circuits or building blocks in the classical sense through which the qubits somehow travel. Instead, the qubits are mostly at a fixed point in space, and a gate is applied to a quantum state by manipulating the system, using for instance laser beams or magnetic fields.

## One qubit quantum gates

Let us again start with the case of a single qubit, i.e. with a two-dimensional Hilbert space. With respect to the standard basis $\{ |0\rangle, |1\rangle \}$, a unitary transformation is of course given by a unitary 2×2 matrix.

As a first example, let us consider the matrix

$X = \begin{pmatrix} 0 & 1 \\ 1 & 0\end{pmatrix}$

Obviously, this operator exchanges the states $|0 \rangle$ and $|1 \rangle$ and is therefore sometimes called the bit-flip or negation. Its eigenvalues are 1 and -1, with a pair of orthogonal and normed eigenvectors given by

$|+\rangle = \frac{1}{\sqrt{2}} (|0\rangle + |1 \rangle)$

and

$|-\rangle = \frac{1}{\sqrt{2}} (|0\rangle - |1 \rangle)$

A second commonly used operator is the operator conventionally denoted by Z and given by

$Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$

Clearly, the eigenvalues of this operator are again +1 and -1, but this time, an eigenbasis is given by the standard basis. On a superposition, this operator changes the relative phase, so it is often called the phase change or phase flip operator. A reader with a background in quantum physics might recognize these two operators as being two of the three Pauli operators. They all have determinant -1, trace 0, are hermitian and unitary and their square is the identity operator.

The eigenstates of X form an orthogonal basis, and hence they can be mapped into the standard basis by applying a unitary transformation. The corresponding operator is the Hadamard operator given by

$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

As we can see, $H |0\rangle = |+ \rangle$ and $H |1 \rangle = |- \rangle$, and, as $H^2 = 1$, vice versa. Similarly, the phase operator

$S = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}$

that changes the relative phase to an imaginary value transforms between the eigenstates of X and the third Pauli operator. The two gates H and S are sometimes referred to as Clifford gates (they have the property that conjugation with them takes the set of Pauli matrices onto itself).

Similar to classical circuits, quantum gates and combinations of quantum gates are often represented graphically. A single quantum gate is represented by a box, with incoming qubits (i.e. the qubits on which the transformation acts) being indicated by incoming lines on the left and the result being represented by outgoing lines on the right.

In the example above, the first line shows two individual gates, the negation and the phase flip operator. The second line graphically represents the process of applying to a single qubit first the negation and then the phase flip. Of course, the resulting matrix is given by the matrix product

$ZX = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$

Note that the order of the matrix product is reversed, as we first apply X and then Z, reading the diagram from the left to the right which is the usual convention. We also note that this matrix is sometimes called Y, whereas other authors use Y to denote -i times this matrix. We stick to the convention to use Y for the product ZX given above.

## Multi-qubit gates – the CNOT and other controlled gates

Having discussed some of the fundamental quantum gates acting on one qubit, let us now turn again to multi-qubit systems. Of course, we can always create transformations on a multi-qubit system by acting on each qubit individually with a given unitary operator, i.e. by forming the tensor product of operators. Graphically, this would be represented as gates placed below each other, each acting only on one line, i.e. qubit. However, things get more interesting when we use gates that are actually combining multiple qubits to model some non-trivial logic.

As an example, let us consider a gate acting on two qubits that is commonly known as the CNOT gate. To express this gate as a matrix, we have to agree on an order of the basis of the tensor product, and is is natural to use the order $|0 \rangle, |1\rangle, |2\rangle, |3\rangle$ for this. In this basis, the CNOT gate is given by the following 4 x 4 matrix:

$CNOT = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$

To understand what this operator is actually doing, let us see how it acts on a basis. Clearly, it maps $|0 \rangle = |00 \rangle$ and $|1 \rangle = |0 1 \rangle$ to itself and swaps the vectors $|2 \rangle = |10 \rangle$ and $|3 \rangle = |11 \rangle$. Thus, expressed in terms of bits, it negates the second bit if the first bit is ON and keeps the second bit as it is if the first bit is OFF. Thus, the first bit can be thought of as a control bit that controls whether the second bit – the target bit – is toggled or not, and this is where the name comes from, CNOT being the abbreviation for “controlled NOT”. Graphically, the CNOT gate is represented by a small circle on the control bit and an encircled plus sign on the target bit (this is the convention used in [1], whereas some other authors, for instance [2], use a slightly different symbol).

An interesting property of the CNOT gate is that it takes separable states to entangled states. Let us for instance compute the action of the CNOT gate on the state $(|0 \rangle + |1 \rangle) |0 \rangle$. This state can as well be written as

$|00 \rangle + |10 \rangle = |0 \rangle + |2\rangle$

from which we can read off that

$CNOT( (|0 \rangle + |1 \rangle) |0 \rangle) = |00 \rangle + |11 \rangle$

which, up to a normalization factor, is an entangled Bell state. Thus we can use the CNOT gate to create entangled states. We also see that, other than the symbol does suggest, the operator does act on the control bit as well.

The notation of a controlled NOT gate can be extended to cover more general gates. Given any unitary one-qubit transformation U, we can in fact consider the two-qubit operator that applies the identify to the second qubit if the first qubit (the control bit) is $|0 \rangle$ and U otherwise. As a matrix, this is given by a 4 x 4 block matrix, with the identity matrix in the upper left corner and the matrix representing U in the lower right corner, and graphically, such a gate is represented as below.

To get used to the graphical notation, let us take a look at the following two-qubit circuit and try to understand what it does to the standard basis.

There are several approaches to figure out what this circuit is doing. First, we can look at its action on each of the basis vectors. Let us do this for the basis vector $|00 \rangle$. We read the circuit from the left to the right and the top to the bottom. Thus, the first part of the circuit acts with H on the first qubit and with the identity on the second qubit. It therefore maps the state

$|00 \rangle = |0 \rangle \otimes |0 \rangle$

to the state

$H|0 \rangle \otimes |0 \rangle = \frac{1}{\sqrt{2}} (|0 \rangle + |1 \rangle) \otimes |0 \rangle = \frac{1}{\sqrt{2}} (|00 \rangle + |10 \rangle)$

The second part of the circuit, the CNOT gate, then maps $|00 \rangle$ to itself and $|10 \rangle$ to $|11 \rangle$. Thus the entire circuit maps the vector $|00 \rangle$ to

$\frac{1}{\sqrt{2}} (|00 \rangle +|11 \rangle)$

which again is an entangled Bell state. We could now proceed like this for the other basis vectors. Alternatively, we can find the answer by multiplying out the matrices and will arrive at the same result.

## Universal quantum gates

In classical computing, a general circuit with n input bits and one output bit can be described by a function taking $\{0, \dots, 2^n - 1 \}$ to $\{0,1\}$, and it is well known that any such function can be expressed as a combination of gates from a small set of standard gates called universal gates. In fact, it is even true that one gate is sufficient – the NAND gate. It is natural to ask whether the same is true for quantum gates. More precisely, can we find a finite set of unitary transformations such that any n-qubit unitary transformation can be composed from them using a few standard operations like the tensor product?

Unfortunately, this does not work in the quantum world. The problem is that the group of unitary matrices is not discrete, but continuous, and thus we can never hope to be able to generate it fully using only a finite number of different gates. However, something very similar is true – every gate can be approximated by combining gates from a small, universal set.

This result is typically presented and derived in two steps (see for instance [1], chapter 3). First, one shows that every unitary transformation of an n-qubit system can be reduced to a combination of single qubit operations (tensored with the identity) and CNOT gates. This is still exact, i.e. no approximation takes place (for the mathematically inclined reader, this might ring a bell – in fact, there is a relation to Cartan decompositions of unitary matrices, see for instance [3] for a discussion).

Then, in a second step, one shows that every single qubit transformation can be approximated arbitrarily close by products of a finite universal set of gates – this is known as the Solovay-Kitaev theorem. It turns out (see for instance [4]) that a possible choice for such a universal set consists of the Hadamard gate H, the phase flip gate Z and only one additional gate

$\begin{pmatrix} 1 & 0 \\ 0 & e^{i\frac{\pi}{4}} \end{pmatrix}$

which is confusingly sometimes called the $\pi / 8$ phase gate.

Finally, we mention that a three-qubit gate that is often used to construct universal sets of quantum gates is the Toffoli gate. This is like a CNOT gate, except that it has two control bits and the target bit is only flipped if both control bits are ON. It is known that the set consisting of the Toffoli gate and the Hadamard gate only is already universal in a similar sense (see this paper by D. Aharonov for a short proof of this result that goes back to Y. Shi). This result is of a certain importance because one can also show that the classical version of the Toffoli gate alone is sufficient to implement (a reversible version of) any classical circuit. Thus we do not only see that any computation that can be done by a classical circuit can also be done by a quantum circuit, but also that without the Hadamard gate, quantum computing would not be more powerful than classical computing.

This closes our short discussion of elementary quantum gates. In the next few posts, we will start to combine quantum gates into quantum algorithms to perform actual computations. Stay tuned!

## References

1. G. Benenti, G. Casati, G. Strini, Principles of Quantum Computation and Information, Volume 1, World Scientific
2. E. Rieffel, W. Polak, Quantum computing – a gentle introduction, MIT Press
3. N. Khaneja, S.J. Glaser, Cartan Decomposition of SU(2 n ), Constructive Controllability of Spin Systems and Universal Quantum Computing, arXiv:quant-ph/0010100v1
4. P. Boykin, T. Mor, M. Pulver, V. Roychowdhury, F. Vatan, A new universal and fault-tolerant quantum basis
5. D. Aharonov,A Simple Proof that Toffoli and Hadamard are Quantum Universal, arXiv:quant-ph/0301040