Shor’s quantum factoring algorithm

Until the 1990s, quantum computing seemed to be an interesting theoretical possibility, but it was far from clear whether it could be useful for tackling computationally hard problems of practical relevance. This changed dramatically in 1994, when the mathematician P. Shor announced a quantum algorithm that could efficiently solve one of the most intriguing problems in applied mathematics – factoring large numbers into their constituent primes – which, for instance, can be used to break commonly used public key cryptography schemes like RSA.

Shor’s algorithm is significantly more complicated than the quantum algorithms that we have studied so far, so we start with a short overview and then look at the individual pieces in more detail.

Given a number M that we want to factorize, the first part of Shor’s algorithm is to find a number x which has no common divisor with M, so that x is a unit modulo M. In practice, we can just guess some x and compute the greatest common divisor gcd(x,M) – if this is not one, we have found a factor of M and are done; if it is one, we have the number x that we need. This step can be done efficiently on a classical computer and does not require a quantum computer.

The next part of the algorithm now uses a quantum algorithm to determine the period of x. The period is the smallest non-zero number r such that

x^r \equiv 1 \mod M

The core of this step is the quantum algorithm that we will study below. However, the quantum algorithm does not return the number r directly – it returns a number s which is close to a multiple of \frac{N}{r}, where N is a power of two. Getting r out of this information is again a classical step that uses the theory of continued fractions. The number N that appears here is N = 2^n, where n is the number of qubits that the quantum part requires, and it needs to be chosen such that M^2 can be represented with n bits.

Finally, the third part of the algorithm uses the period r to find a factor of M, which is again done classically using elementary number theory. Thus the overall layout of the algorithm is as follows.

  • Find the smallest number n (the number of qubits that we will need) such that M^2  \leq N = 2^n and a number x < M such that gcd(x,M) = 1
  • Use the quantum part of the algorithm to find a number s which is approximately an integer multiple of N / r
  • Use the theory of continued fractions to extract the period r from this information
  • Use the period to find a factor of M

We will now look at each of these steps in turn (yes, this is going to be a bit of a lengthy post). To make this more tangible, we use a real example and assume that we wanted to factor the number M = 21. This is of course a toy example, but it allows us to simulate and visualize the procedure efficiently on a classical computer.

Determine n and x

The first step is easy. First, we need to determine the number n of qubits that we need. As mentioned above, this is simply the bit length of M^2. In our case, M^2 = 441, and the next power of two is 512 = 2^9, so we need n = 9 qubits.

The next step is to find the number x. This can easily be done by just randomly picking some x and checking that it has no common prime factor with M. In our example, let us choose x = 11 (which happens to be a prime number, but this is purely accidental and not needed). It is important to make this choice randomly, as the algorithm might fail in some rare instances and we then need to start over – which only makes sense if we do not pick the same x again on the second trial.
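
To make this classical pre-processing concrete, here is a minimal sketch in Python – it is not taken from my script Shor.py, and the function name is made up for this post:

from math import gcd
from random import randrange

def choose_parameters(M):
    # smallest n such that M^2 <= 2^n: for M = 21 this gives M^2 = 441 and n = 9
    n = (M * M - 1).bit_length()
    while True:
        x = randrange(2, M)
        d = gcd(x, M)
        if d > 1:
            # we could stop right here - d is already a nontrivial factor of M
            print("lucky guess, nontrivial factor found:", d)
            continue
        return n, x                    # x is a unit modulo M, ready for the quantum part

print(choose_parameters(21))           # for instance (9, 11)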

The quantum part of the algorithm

Now the quantum part of the algorithm starts. We want to calculate the period of x = 11, i.e. the smallest number r such that x^r - 1 is a multiple of M = 21.

Of course, as our numbers are small, we could easily calculate the period classically by taking successive powers of 11 and reducing modulo 21, and this would quickly tell us that the period is 6. This, however, is no longer feasible with larger numbers, and this is where our quantum algorithm comes into play.
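
Just to illustrate what we are after (and not as part of the actual algorithm), the classical brute-force calculation of the period looks like this in Python:

M, x = 21, 11
r, power = 1, x % M
while power != 1:                  # keep multiplying by x until we are back at 1 modulo M
    power = (power * x) % M
    r += 1
print(r)                           # prints 6 - the period of 11 modulo 21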

The algorithm uses two quantum registers. The first register has n qubits; the second can have fewer – in fact any number of qubits will do as long as we can store all numbers up to M in it, i.e. the bit length of M suffices. Initially, we bring the system into the superposition

\frac{1}{\sqrt{N}} \sum_k |k \rangle |0 \rangle

which we can for instance do by starting with the state with all qubits being zero and then applying the Hadamard-Walsh transformation to the first register.

Next, we consider the function f that maps a number k to x^k modulo M. As for every classical function, we can again find a quantum circuit U_f that represents this function on the level of qubits and apply it to our state to obtain the state

\frac{1}{\sqrt{N}} \sum_k |k \rangle |x^k \mod M \rangle

In his original paper [2], Shor calls this part the modular exponentiation and shows that this is actually the part of the quantum algorithm where most gates are needed (not the quantum Fourier transform).

This state already has some periodicity built into it, as x^k modulo M is periodic with period r. If we could read off all the amplitudes, we could easily determine r. However, every measurement destroys the quantum state and we have to start again, so this approach would not be very efficient. So again, the measurement is an issue.

Now, Shor’s idea is to solve the measurement issue by first applying (the inverse of) a quantum Fourier transform to the first register and then measuring the first register. (We apply the inverse of the quantum Fourier transform, while other sources state that the algorithm uses the quantum Fourier transform itself – but this is just a matter of convention as to which transformation you call the Fourier transform.) The outcome s of this measurement will then give us the period!

To get an idea why this is true, let us look at a simpler case. Assume that, before applying the quantum Fourier transform, we measure the value of the second register. Let us call this value y. Then, we can write y as a power of x modulo M. Let k_0 be the smallest exponent such that

x^{k_0} = y

Then, due to the periodicity, all values of k such that x^k = y modulo M are given by

k = k_0 + jr

Here the index j needs to be chosen such that k_0 + jr is still smaller than N. Let A denote the number of possible choices for j. Then, after the measurement, our state will have collapsed to

\frac{1}{\sqrt{A}} \sum_{j=0}^{A-1} |k_0 + jr \rangle  |y \rangle

Let us now apply the inverse of the quantum Fourier transform to this state. The result will be the state

\frac{1}{\sqrt{AN}} \sum_{j=0}^{A-1} \sum_{s=0}^{N-1} \eta^{(k_0 + jr)s} |s \rangle

Now let us measure the first register. From the expression above, we can read off the probability P(s) to measure a certain value of s – we just have to add up the squares of all amplitudes with this value of s. This gives us

P(s) = \frac{1}{AN} \big| \sum_{j=0}^{A-1}  \eta^{jrs} \big|^2

This looks complicated, but in fact this is again a geometric series with coefficient q = \eta^{rs}. To see how the value of the series depends on s, let us assume for a moment that the period divides N (which is very unlikely in practice as N is a power of two, but let us assume this anyway just for the sake of argument), i.e. that N = r u with the frequency u being an integer. Thus, if s is a multiple of u, the coefficient q is equal to one (as \eta^N = 1) and the geometric series sums up to A, giving probability A / N to measure this value. If, however, s is not a multiple of u, the value of the geometric series is

\frac{1 - q^A}{1 - q}

But in our case, A is of course simply equal to u, and therefore q^A is equal to one. Thus the sum – and with it the probability – is zero! We find – note the similarity to our analysis of the Fourier transform of a periodic sequence – that P(s) is sharply peaked at multiples of u = \frac{N}{r} !

We were able to derive this result using a few simplifications – an additional measurement and the assumption that the frequency is an integer. However, a careful analysis, as carried out by Shor in [2], shows that these assumptions are not needed. In fact, one can show (if you want to see all the nitty-gritty details, you could look at Shor’s paper or at my notes on GitHub, which are based on an argument that I first saw in Preskill’s lecture notes) that with reasonably high probability, the result s of the measurement will be such that

\big| \{sr\}_N \big| \leq \frac{r}{2}

where \{sr\}_N denotes the residual of sr modulo N. Intuitively, this means that with high probability, the residual is very small, i.e. rs is close to a multiple of N, i.e. s is close to a multiple of N / r. In other words, it shows that in fact, P(s) has peaks at multiples of N / r.

The diagram below plots the probability distribution P(s) for our example, i.e. N = 512 and r = 6 (this plot has been generated using the demo Shor.py available in my GitHub account, which uses the numpy package to simulate a run of Shor’s algorithm on a classical computer).

ShorSampleOutput

As expected, we see sharp peaks, located roughly at multiples of 512 / 6 = 85.33. So when we measure the first register, the value s will be close to a multiple of 512 / 6 with a very high probability.
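
If you want to reproduce a distribution like this yourself, the following short numpy sketch computes P(s) directly from the expressions above – for each possible value y of the second register, it collects all k with x^k = y modulo M and Fourier transforms the corresponding amplitudes (the names and structure are mine, this is not a copy of Shor.py):

import numpy as np

M, x, n = 21, 11, 9
N = 2 ** n                                        # N = 512
f = np.array([pow(x, k, M) for k in range(N)])    # values of the second register

P = np.zeros(N)
for y in np.unique(f):
    indicator = (f == y).astype(float)            # which k are mapped to this value y
    P += np.abs(np.fft.fft(indicator)) ** 2       # |sum over k with f(k) = y of eta^(-ks)|^2
P /= N ** 2                                       # the overall factor 1/N, squared

print(np.argsort(P)[-6:])                         # the six most likely values of s - all
                                                  # close to multiples of 512 / 6 = 85.33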

So the quantum algorithm can be summarized as follows.

  • Prepare a superposition \frac{1}{\sqrt{N}} \sum_k |k \rangle |x^k \mod M \rangle
  • Apply the (inverse of the) quantum Fourier transform to this state
  • Measure the value of the first register and call the result s

When running the simulation during which the diagram above was created, I did in fact get a measurement at s = 427 which is very close to 5*512 / 6.

Extracting the period

So having our measurement s = 427 in our hands, how can we use this to determine the period r? We know from the considerations above that s is close to a multiple of N / r, i.e. we know that there is an integer d such that

\big| sr - dN \big| \leq \frac{r}{2}

which we can rewrite as

\big| \frac{s}{N} - \frac{d}{r} \big| \leq \frac{1}{2N} < \frac{1}{M^2}

Thus we are given two rational numbers – s / N and d / r – which we know to be very close to each other. We have the first number s / N in our hands and want to determine the second number. We also know that the denominator r of the second number is smaller than M. Is this sufficient to determine d and r?

Surprisingly, the answer is “yes”. We will not go into details at this point and gloss over some of the number theory, but refer the reader to the classical reference [1] or to my notes for more details. The first piece of good news is that two different fractions with denominators less than M need to be at least 1 / M^2 apart, so the number d / r is unique. The situation is indicated in the diagram below.

ShorContinuedFraction

But how to find it? Luckily, the theory of continued fractions comes to the rescue. If you are not familiar with continued fractions, you can find out more in the appendix of my notes or on the very good Wikipedia page on this. Here, we will just go through the general procedure using our example at hand.

First, we write

\cfrac{427}{512} = 0 + \cfrac{1}{\cfrac{512}{427}}  = 0 + \cfrac{1}{1 + \cfrac{85}{427}}

We can do the same with 85 / 427, i.e. we can write

\cfrac{85}{427} = \cfrac{1}{\cfrac{427}{85}} =  \cfrac{1}{5 + \cfrac{2}{85}}

which will give us the decomposition

\cfrac{427}{512} = 0 + \cfrac{1}{1 + \cfrac{1}{5 + \cfrac{2}{85}}}

Driving this one step further, we finally obtain

\cfrac{427}{512} = 0 + \cfrac{1}{1 + \cfrac{1}{5 + \cfrac{1}{42 + \cfrac{1}{2}}}}

This is called the continued fraction expansion of the rational number 427 / 512. More generally, for every sequence [a_0 ; a_1, a_2, \dots ], we can form the continued fraction

a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \dots}}

given by that sequence of coefficients, which is obviously a rational number. One can show that in fact every rational number has a representation as a continued fraction, and our calculation has shown that

\cfrac{427}{512} = [0; 1,5,42,2]

This sequence has five coefficients. Now given a number m, we can of course truncate the sequence and keep only the coefficients up to index m. For instance, for m = 3, this would give us the sequence

[0; 1,5,42]

The rational number represented by this sequence is

0 + \cfrac{1}{1 + \cfrac{1}{5 + \cfrac{1}{42}}}  = \cfrac{211}{253}

and is called the m-th convergent of the original continued fraction. We have such a convergent for every m, and thus get a (finite) series of rational numbers with the last one being the original number.

One can now show that given any rational number x, the series of m-th convergents of its continued fraction expansion has the following properties.

  • The convergents are in their lowest terms
  • With increasing m, the difference between x and the m-th convergent gets smaller and smaller, i.e. the convergents form an approximation of x that gets better and better
  • The denominators of the convergents are increasing

So the convergents can be used to approximate the rational number x by fractions with smaller denominators – and this is exactly what we need: we wish to approximate the rational number s / N by a fraction d / r with a smaller denominator, and that denominator r is then the period. Thus we need to look at the convergents of 427 / 512. These can easily be calculated and turn out to be

0, 1, \cfrac{5}{6}, \cfrac{211}{253}, \cfrac{427}{512}

The last convergent whose denominator is still smaller than M = 21 is 5 / 6, and thus we obtain r = 6. This is the period that we are looking for!
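
As a quick sanity check, the measured value s = 427, the period r = 6 and the multiple d = 5 do indeed satisfy the bound that came out of the quantum part of the algorithm:

\big| sr - dN \big| = \big| 427 \cdot 6 - 5 \cdot 512 \big| = \big| 2562 - 2560 \big| = 2 \leq \frac{r}{2} = 3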

So in general, the recipe to get from the measured value s to r is to calculate the convergents of the rational number s / N and pick the denominator of the last convergent that has a denominator less than M. Again, if you want to see the exact algorithm, you can take a look at my script Shor.py.
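
If you prefer code to formulas, here is one way to carry out this recipe in Python – it computes the convergents of s / N via the standard recurrence for their numerators and denominators and returns the last denominator below M (a sketch in the spirit of, but not identical to, Shor.py):

def period_from_measurement(s, N, M):
    # recover a candidate period r from the measured value s, assuming s is close
    # to a multiple of N / r
    num, den = s, N
    h_prev, h = 0, 1                       # numerators of the convergents
    k_prev, k = 1, 0                       # denominators of the convergents
    r = 1
    while den:
        a, rem = divmod(num, den)          # next continued fraction coefficient
        h_prev, h = h, a * h + h_prev      # standard convergent recurrence
        k_prev, k = k, a * k + k_prev
        if k >= M:                         # stop once the denominator reaches M
            break
        r = k                              # last denominator still smaller than M
        num, den = den, rem
    return r

print(period_from_measurement(427, 512, 21))   # prints 6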

Find the factor

We are almost done. We have run the quantum algorithm to obtain an approximate multiple of N / r. We have then applied the theory of continued fractions to derive the period r of x from this measurement. The last step – which is again a purely classical step – is now to use this to find a factor of M. This is in fact comparatively easy.

Recall that – by definition of the period – we get one if we raise x to the power of r and then reduce modulo M. In other words, x^r minus one is a multiple of M. Now assume that we are lucky and the period r is even. Then

(x^{\frac{r}{2}} - 1)(x^{\frac{r}{2}} + 1) = (x^r - 1) \equiv 0 \mod M

With a bit of elementary number theory, one can now show that this implies that the greatest common divisor gcd(x^{r/2} - 1, M) is a nontrivial factor of M (unless, in fact, x^{r/2} is minus one modulo M, in which case the argument fails). So to get from the period to a potential factor of M, we simply calculate this greatest common divisor and check whether it is a nontrivial divisor of M!

Let us do this for our case. Our period is r = 6. With x = 11, we have x^3 = 1331, which is 8 modulo M. Thus

\text{gcd}(x^{\frac{r}{2}} - 1 \mod M, M) =  \text{gcd}(7,21) = 7

which is the factor of M = 21 that we were looking for.
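
In Python, this last classical step is essentially a one-liner based on the built-in gcd (a sketch, assuming we already have x, r and M from the previous steps):

from math import gcd

M, x, r = 21, 11, 6
assert r % 2 == 0                        # the argument only works for an even period
y = pow(x, r // 2, M)                    # 11^3 = 1331, which is 8 modulo 21
if y != M - 1:                           # i.e. x^(r/2) is not congruent to -1 modulo M
    print(gcd(y - 1, M), gcd(y + 1, M))  # prints 7 3 - both are factors of 21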

Performance of the algorithm

In our derivation, we have ignored a few special cases which can make the algorithm fail. For instance, suppose we had not measured s = 427, but s = 341 after applying the Fourier transform. Then the corresponding approximation to 341 / 512 would have been 4 / 6. However, the continued fraction algorithm always produces a result that is in its lowest terms, i.e. it would give us not 4 / 6, but 2 / 3. Looking at this, we would infer that r = 3, which is not the correct result.

There are a few other things that can go wrong. For instance, we could find a period r which is odd, so that our step to derive a factor of M from r does not work, or we might measure an unlikely value of s.

In all these cases, we need to start over and repeat the algorithm. Fortunately, Shor could show that the probability of success is bounded from below. This bound decreases with larger values of M, but it decreases so slowly that the expected number of trials that we need grows at most logarithmically and does not destroy the overall performance of the algorithm.

Taking all these considerations into account and deriving bounds for the number of gates required to perform the quantum part of the algorithm, Shor was able to show that the number of steps to obtain a result grows at most polynomially with the number of bits that the number M has. This is obviously much better than the best classical algorithm that requires a bit less than exponential time to factor M. Thus, assuming that we are able to build a working quantum computer with the required number of gates and qubits, this algorithm would be able to factorize large numbers exponentially faster than any known classical algorithm.

Shor’s algorithm provides an example of a problem that is in the class NP on a classical computer (but believed not to be in P), and in the class BQP on a quantum computer – the class of all problems that can be solved on a quantum computer in polynomial time with a bounded probability of error. However, even though factorization is generally believed not to be in P, i.e. not doable in polynomial time on classical hardware, there is no proof of that. And, even more importantly, it has not been proved that factorization is NP-complete. Thus, Shor’s algorithm does not render every problem in NP solvable in polynomial time on a quantum computer. It does, however, imply that all public key cryptography systems like RSA that rely on the assumption that large numbers are difficult to factor become inherently insecure once a large scale reliable quantum computer becomes available.

References

1. G.H. Hardy, E.M. Wright, An introduction to the theory of numbers, Oxford University Press, Oxford, 1975
2. P. Shor, Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer, SIAM J. Comput. Vol. 26 No. 5 (1997), pp. 1484–1509, available as arXiv:quant-ph/9508027v2

How does multitasking really work?

The first PC on which I was running a copy of Linux back in the nineties of course had only one CPU, so it was clear to me that it could physically only execute one instruction at a time – but still, it magically created the impression of running several programs in parallel without visible delay. How could that ever work? Almost twenty years later, I finally found the answer.

In fact, trying to understand how multitasking and multithreading actually work was one of my main motivations when I developed the ctOS kernel a couple of years ago. In this post, I will try to explain the basic mechanism behind the magic.

The basic idea sounds simple enough. Suppose a task has been running for some time and the operating system decides that it is time to switch to a different task. Then the operating system kernel will interrupt the current execution and save the current state of the CPU somewhere in memory. It then locates a different task that is ready to be run – a process which is called scheduling – and retrieves the saved CPU state representing this task from memory. Then, it will let the CPU continue execution with that saved state. As this state is exactly the state of the CPU that was in place when the newly scheduled task was stopped, the CPU will continue to process this task as if it had never been interrupted.

Unfortunately, this simple explanation leaves a number of questions open. First, if a task is executing, it means that the CPU is busy running this task. How, then, can the operating system “do” something? Second, what is the “state of the CPU”? How can we store it and how can we put the CPU back into a saved state? And then, how do we decide that “it is time to switch to a different task”? It turns out that the answers to these questions have to do with interrupts.

Task switching, interrupts and the stack

To explain this, let us for a moment assume a vastly simplified world where all tasks currently active do not require I/O operations. Suppose task A is currently being processed, i.e. the CPU is executing the stream of instructions (the “program”) that make up this task. At some point, we will see an interrupt firing. If there is no I/O, this is usually the timer interrupt that fires periodically, say one thousand times per second.

If this happens, the CPU will stop the execution of task A and start to execute an interrupt service handler. Now the interrupt service handler has been set up by the operating system and points at code within the operating system kernel – so at this point, control is given to the operating system! At the same time, we have learned that the CPU will store the value of some registers like EIP on the stack. A smart interrupt handler will store the values of additional registers on the stack as well, so that it can use them and reconstruct them later, and will, right at its end, contain code to pop the register values off the stack again. So we could just as well set up our ISR in such a way that it does in fact store the values of all registers on the stack. This already provides two ingredients for the task switching mechanism – a way to make sure that the operating system gains control of the CPU at least once a millisecond and a saved CPU state on the stack!

Now the operating system will usually maintain a list of tasks and scan that list according to certain criteria to find a task which should be executed next – we will look at this process in greater detail further below. Suppose that it determines a task which we call B which should be run next. For the sake of simplicity, let us also assume that this task is not entirely new, but has already been running for some time (setting up a new task is a bit more complicated). Thus, when task B was interrupted for the last time, its state was also saved on the stack somewhere, and our operating system has hopefully stored a pointer to this stack.

What we now want to do is to exchange the stored state of task A with that of B. The easiest way to do this is not to copy around the stacks, but to manipulate the stack pointer such that execution resumes using the stack of task B instead of that of task A! When we do this and reach the end of the interrupt service handler, the part of the interrupt service handler that is responsible for popping all registers off the stack again as well as the final IRET will use the stack of task B. In other words, the CPU will pick up the state of task B instead of the state of task A. It will therefore continue execution with the state of the registers and the instruction pointer that task B had when it was interrupted – in other words, task B will continue to execute instead of task A, and we have switched from task A to task B.

TaskSwitch

In reality, there are a few subtle points that need to be worked out in more detail when you really want to build this. You will have to think about what the stack actually is – does every task have its own stack, how do you isolate them, and so forth. In addition, a task switch might involve a switch to a different process and thus to a different virtual address space, so at some point, addresses become invalid and cannot be accessed any more. The exact process that ctOS uses is described in the section on task switching in the interrupt handling documentation of ctOS.

We have explained the process of task switching using the timer interrupt as an example. However, this is not the only possible trigger for a task switch. There are of course other hardware interrupts that might trigger a task switch, but more importantly, many task switches do in fact occur during a system call! Assume for instance that a task wants to read data from the hard drive. It will then invoke a system call to do this, i.e. it will hand over control to the kernel voluntarily. The kernel will ask the hard drive to go and get the data, but of course this takes some time. Thus the kernel will in general not return to this task, but schedule a different task and execute a task switch. If (as ctOS does) a system call is itself implemented as a software interrupt, the mechanism is exactly the same as if a task switch had been done during the processing of a hardware interrupt (if the faster SYSENTER/SYSEXIT mechanism is used, the processing is a bit different, but the idea is the same).

Scheduling

What we have not explained so far is how the operating system actually selects a new task to be executed. The exact details are different for each operating system, but it is probably fair to say that there are in general two main ingredients – the time the task has been running since the last task switch and the state of the task.

First, there is the CPU time that a task has consumed since the last task switch, commonly measured by the quantum. The quantum is a number that is set to some initial value when the task gets control of the CPU. During each processing of the timer interrupt, the quantum of the currently active task is decreased by one. As the timer interrupt fires periodically, the quantum thus measures the time that the task has already consumed – the lower the remaining quantum, the longer the task did already run. If a task has reached quantum zero, it might be a good idea to schedule a different task.

Second, there is the state of a task. Not every task is ready to be executed at any given point in time. A task might for instance be waiting for some event (user input, I/O, …) or might otherwise be blocked. Or a task could be in the status new, indicating that it is currently still being set up and not yet ready to run. Or it might have been stopped by a signal and be waiting for the signal to continue. And finally, a task can either be active, i.e. currently executing, or ready, i.e. waiting to be scheduled.

TaskStatus

During scheduling, only tasks in status ready will be considered as candidates for execution. One way to implement scheduling is to put all ready tasks onto a queue, called the ready queue. Scheduling then basically proceeds as follows.

  • Check whether the currently active task has used up its quantum
  • If no, decrease the quantum by one and return – no task switch will be done. Otherwise proceed with the next step
  • Set the state of the currently active task to “ready”, set its quantum back to its initial value and queue the task at the end of the ready queue
  • Select the first task in the ready queue. Update its status to active and mark it as target of the next task switch

When a task that was not ready but, for instance, blocked, becomes ready, it is added at the end of the ready queue with an initial quantum.
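
To make the algorithm a bit more tangible, here is a toy version of this round-robin scheme in Python – this has of course nothing to do with the actual ctOS implementation, it simply mirrors the steps listed above:

from collections import deque

QUANTUM = 5                              # initial quantum, measured in timer ticks

class Task:
    def __init__(self, name):
        self.name = name
        self.quantum = QUANTUM

ready_queue = deque(Task(n) for n in ["A", "B", "C"])
current = ready_queue.popleft()          # the task currently owning the CPU

def timer_tick():
    # called once per timer interrupt: decide whether to switch tasks
    global current
    if current.quantum > 0:
        current.quantum -= 1             # quantum not used up yet - no task switch
        return
    current.quantum = QUANTUM            # reset the quantum ...
    ready_queue.append(current)          # ... requeue the task at the end ...
    current = ready_queue.popleft()      # ... and activate the first ready task

for _ in range(18):
    timer_tick()
    print(current.name, end=" ")         # tasks A, B and C take turns on the CPU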

Scheduler.png

There is a more advanced version of this simple algorithm called round-robin scheduling with dynamic priorities. In this model, there is more than one ready queue – in fact, there is a list of priorities, say 0 – 15, and there is one queue for each priority. If a new task to be scheduled needs to be determined, these queues are scanned for candidates, starting with the highest priority queue. The priority can be adjusted at run-time based on the current behaviour of the process – see the documentation of the ctOS scheduler for more details. This approach is still comparatively simple, and designing a highly optimized scheduler is an art that can make the difference between good and lousy performance. Currently, Linux uses a different scheduling mechanism called CFS, but scheduling remains a tricky task, especially in an SMP environment, and we will probably see additional work going into this in the future (see for instance this article for a more in-depth discussion).

Now multi-tasking is nice, but really executing several programs at the same time has an additional dimension – memory. If you want to run different programs – possibly belonging to different users – on one physical machine, you need to make sure that a program cannot access memory reserved for a different program, i.e. you need some level of memory isolation. Fortunately, modern CPUs offer a mechanism called virtual memory that we can use for this purpose and that we will explain in the next post.

The quantum Fourier Transform

We are getting closer to the most spectacular early quantum algorithm – Shor’s algorithm for factoring large composite numbers which can be used to break the most widely used public key cryptography systems. But before we can tackle this algorithm, there is one more thing that we need to understand – the quantum Fourier transform.

The discrete Fourier transform

Let us leave the quantum world for a few moments and take a look at a classical area in mathematics – the discrete Fourier transform. To motivate the definition to come, let us for a moment assume that we are observing a time dependent signal, i.e. a function f(t) depending on the time t. Let us also assume that we are interested in processing this signal further, using a digital computer. This will force us to reduce the full signal to a finite number of values, i.e. to a set of values f(0), f(1), …, f(N-1) at N discrete points in time.

Now assume that we wanted to calculate the classical Fourier transform of this function, i.e. the function \hat{f} given by (the exact formula depends on a few conventions)

\hat{f}(x) = \int_{-\infty}^{\infty} f(t) e^{-i x t} dt

If we now replace the function f by its discrete version, i.e. by a step function which takes the value f(t) on an interval of length one around each of the discrete points, this formula turns into

\hat{f}(x) = \sum_{t=0}^{N-1} f(t) e^{-i x t}

Now we could code the discrete values f(0), f(1), … as a sequence x_k of numbers, and we could apply the same process to the Fourier transform and replace it by a sequence X_k of numbers, representing again the values at the discrete points 0, 1, …. The above formula then tells us that these numbers would be given by

X_k = \sum_{j=0}^{N-1} x_j e^{-i j k}

We could now look at this formula as giving us an abstract mapping from the set of sequences of N numbers to itself, given by (after putting a factor 2 \pi / N into the exponent)

X_k = \sum_{j=0}^{N-1} x_j e^{- \frac{2\pi i}{N} j k}

This transformation is called the discrete Fourier transform.
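
As a quick cross-check of this definition, we can write down the transformation matrix in a few lines of numpy and compare it with the library implementation, which uses the same sign convention and normalization as the formula above (this little snippet is just an illustration, not part of any of the scripts mentioned in this post):

import numpy as np

N = 8
jj, kk = np.meshgrid(np.arange(N), np.arange(N))
F = np.exp(-2j * np.pi * jj * kk / N)         # matrix with entries e^(-2 pi i jk / N)
x = np.random.rand(N) + 1j * np.random.rand(N)

X = F @ x                                      # X_k = sum_j x_j e^(-2 pi i jk / N)
print(np.allclose(X, np.fft.fft(x)))           # True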

There is a different way to look at this which is useful to derive some of the properties of the discrete Fourier transform. For every k, we can build the N-dimensional complex vector u^k with components u^k_n by setting

u^k_n = e^{\frac{2\pi i}{N} k n}

These vectors are mutually orthogonal with respect to the standard hermitian product. In fact, the product of any two of these vectors is given by

\langle u^k, u^{k'} \rangle = \sum_{j=0}^{N-1} e^{\frac{2\pi i}{N}j(k-k')}

If k = k' , this is clearly N. Otherwise, we can write this as

\langle u^k, u^{k'} \rangle = \sum_{j=0}^{N-1} q^j

with

q=e^{\frac{2\pi i}{N}(k-k')}

If q is not equal to one, i.e. if k is not equal to k’, then, according to the usual formula for a geometric series, this is equal to

\frac{1-q^{N}}{1-q}

However, this is zero, because q^N = 1 due to the periodicity of the exponential function.

Using this orthogonal basis, we can write the formula for the discrete Fourier transform simply as the hermitian product

X_k = \langle x, u^k \rangle

In particular, the discrete Fourier transform can be seen as the linear transformation that maps the k-th unit vector onto the vector (u^k)^* . By adding a factor 1 / \sqrt{N} , this can be turned into a unitary transformation, i.e. in a certain sense the discrete Fourier transform is simply a rotation. This relation can also be used to easily derive a formula for the inverse of the discrete Fourier transform. In fact, if we expand the vector x into the basis \{u^k\}_k , we obtain

x = \frac{1}{N} \sum_k \langle x, u^k \rangle u^k

and therefore

x_k = \langle x, e_k \rangle  = \frac{1}{N} \sum_j \langle x, u^j \rangle \langle u^j, e_k \rangle = \frac{1}{N} \sum_j X_j e^{\frac{2\pi i}{N} jk}

Fourier transforms of a periodic sequence

For what follows, let us adapt and simplify our notation a bit. First, we add a factor 1 / \sqrt{N} to the formula for the Fourier coefficient, so that the Fourier transform is unitary. Second, we use the symbol \eta to denote the N-th root of unity, i.e.

\eta = e^{\frac{2\pi i }{N}}

With this notation, the formula for the Fourier coefficient is then

X_k = \frac{1}{\sqrt{N}} \sum_j x_j \eta^{-kj}

and the inverse is given by

x_k = \frac{1}{\sqrt{N}} \sum_j X_j \eta^{kj}

Let us now study a special case that is highly relevant for the application to the factoring of large numbers – the Fourier transform of a periodic sequence. Thus, suppose there is a sequence x_k and a number r called the period such that

x_{rs + t} = x_t

for all values of s and t that lead to valid indices. Thus, roughly speaking, after r indices, the values of the sequence repeat themselves. It is also common to reserve the term period for the smallest number with this property. We call the number u = \frac{N}{r} the frequency of the sequence.

Let us now assume that the frequency is a natural number, i.e. that the period divides N, and let us try to understand what this means for the Fourier transform. Using the periodicity, we can write the coefficients as follows.

X_k = \frac{1}{\sqrt{N}} \sum_j x_j \eta^{-jk} = \frac{1}{\sqrt{N}} \sum_{s,t} x_{sr + t}\eta^{-(sr+t)k} = \frac{1}{\sqrt{N}} (\sum_s \eta^{-srk})(\sum_t x_t \eta^{-tk})

where s runs from 0 to u-1 and t from 0 to r-1. We can now write the first of the sums as follows.

\sum_{s=0}^{u-1} \eta^{-srk} =  \sum_{s=0}^{u-1} e^{-\frac{2\pi i}{u}s k}

Now this is again a geometric series with q = e^{-\frac{2\pi i}{u}k} ! We can therefore conclude as above that this is zero unless q is one, i.e. unless k is a multiple of the frequency u. Thus we have shown that if the sequence x_k is periodic with integral frequency u, then all Fourier coefficients X_k for which k is not a multiple of the frequency are zero.

In the later applications, this fact will be applied in a situation where we know that a sequence is periodic, but the period is unknown. We will then perform a Fourier transformation and inspect the coefficients X_k. We can then conclude that the unknown frequency u must be a common divisor of all those indices k for which X_k is different from zero. This is exactly true if the period divides N, and we will see later that in important cases, it is still approximately true if the period does not divide N.
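
Again, this is easy to verify numerically. In the following numpy sketch we build a sequence whose period divides N and check that only the coefficients at multiples of the frequency survive (numpy's fft differs from our convention only by the missing factor 1 / \sqrt{N}, which does not affect where the nonzero coefficients sit):

import numpy as np

N, r = 512, 8                       # the period r divides N, so u = N / r is an integer
u = N // r
x = np.tile(np.random.rand(r), u)   # a sequence x_k with period r

X = np.fft.fft(x)
peaks = np.where(np.abs(X) > 1e-8)[0]
print(peaks)                        # only multiples of u = 64 appear
print(np.all(peaks % u == 0))       # True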

A quantum algorithm to compute the discrete Fourier transform

Let us now carry over our considerations to the world of quantum computing. In a quantum computer with n qubits, the Hilbert space of states is N-dimensional with N = 2^n, and is spanned by the basis |k \rangle . Every vector can be described by the sequence of its coefficients when expanding it into this basis. Consequently, the discrete Fourier transform defines a unitary transformation on the Hilbert space via the mapping

\sum_k x_k |k \rangle \mapsto \sum_k X_k |k \rangle

Now this is a unitary transformation and, like any such transformation, it can be implemented by a quantum circuit. We will not go into details on this, see for instance [1], section 7.8 or section 5.1 of [2] (but be careful, these authors use a different sign convention for the Fourier transform). The important point, however, is that a quantum Fourier transform for N = 2^n can be realized with O(n^2) quantum gates.

In contrast to this, the best known classical algorithms for computing the discrete Fourier transform are commonly known as fast Fourier transform and require O(Nn) steps. Thus it seems that we have again found a quantum algorithm which is substantially faster than its classical counterpart.

Unfortunately, this is not really true, as we have not yet defined our measurement procedure. In fact, if we measure the result of applying a quantum Fourier transform, we destroy the superposition. In addition, the coefficients in which we are interested are the amplitudes of the possible outcomes, and there is no obvious way to measure them – even if we perform the transformation several times and measure over and over again, we only obtain approximations to the probabilities, which are the squared absolute values of the amplitudes, so we do not obtain the phase information. Therefore, it is far from clear whether a quantum algorithm can help to compute the Fourier transform more efficiently than it is possible with a classical algorithm.

However, most applications of the Fourier transform in quantum algorithms are indirect, using the transform as a sort of amplification step to focus amplitudes on interesting states. In the next post, we will look at Shor’s algorithm, which exploits the periodicity properties of the Fourier transform to factor large composite numbers.

References

1. E. Rieffel, W. Polak, Quantum computing: a gentle introduction, MIT Press
2. M.A. Nielsen, I.L. Chuang, Quantum computation and quantum information, Cambridge University Press

Interrupts – the heartbeat of a Unix kernel

Modern operating systems are mostly event driven – network cards receive packets, users hit keys or mouse buttons, built-in timers create events or data arrives from a hard drive. To be able to process these events, a CPU needs a mechanism to stop whatever it is currently doing and run some code designed to deal with the event – and this is what is called an interrupt.

How do interrupts work?

Formally speaking, we could define an interrupt as an event that instructs the CPU to stop the current flow of processing, jump to a defined location in memory, execute the instructions there and later, when that stream of instructions is complete, resume processing at the point where it left off.

To understand how this works, let us briefly recall some basic concepts in CPU design. First, remember that a CPU has – apart from other units – a certain set of registers, i.e. a set of easily accessible, but small pieces of memory that the CPU uses to store intermediate results of a calculation. To be a bit more specific, let us assume that we talk about an x86 CPU in protected mode. Then there are registers called EAX, EBX, ECX and EDX (and many others) that can hold 4 bytes, i.e. 32 bits, each. Instructions like the ADD command can refer to these registers and, for instance, add the content of two registers and store the result in a third register.

In addition to these general purpose registers, there are a few other special registers. One of them is the register called EIP which is the instruction pointer – it points to the memory location of the next instruction to be executed. And there is the stack pointer ESP, which points to the top of the stack, i.e. into an area of memory which a program can use to temporarily store data. When a program pushes data on the stack, the stack pointer is decremented, i.e. the stack grows downwards, and the data is written to the memory location to which the stack pointer then refers. If a program pops data from the stack, the data is taken from the memory location to which the stack pointer currently refers, and the stack pointer is incremented again.

x86RegisterSet

When instructions are executed one by one, the instruction pointer is simply increased after the processing of one instruction has been completed to point to the next instruction. However, when a program uses the CALL instruction to call a subroutine, the CPU has to continue execution not at the next instruction, but at the start of the subroutine. When the subroutine completes, it needs to resume processing at the next instruction after the CALL instruction. So it has to save this address – called the return address – somewhere.

As there is only a small set of registers available and we might want to do nested calls, the CPU cannot use a register for this purpose, so it uses the stack. Thus, if we call a subroutine, the return address is pushed onto the stack. When the subroutine has completed, a special instruction called RET is placed in the code. This instruction tells the CPU to get the return address from the stack and resume execution at this point.

By now, you might be asking yourself why I am telling you all this. The answer is simple – interrupts work in a very similar way. In fact, when an interrupt is raised (we will see further below what this actually means), the CPU places the current value of the EIP register on the stack. It then jumps to a predefined point in memory and processes the instructions there, called the interrupt service handler (ISR). Once the ISR completes, it needs to execute a command called IRET. This command will instruct the CPU to take the stored value of EIP from the stack and reload it. Thus, the CPU will resume execution of the current program after the instruction during which the interrupt was generated.

This simple description glossed over a few details. First, interrupts are numbered, i.e. there is not only one interrupt, but in fact there are 256 so-called interrupt vectors. When an interrupt is raised, the CPU uses the interrupt vector to look up the address of the interrupt service handler in a table called the interrupt descriptor table (IDT). Thus an operating system needs to set up interrupt service handlers for all these 256 possible vectors and populate the IDT accordingly.

Second, the CPU does not only store the value of EIP on the stack, but some additional information like an error code. If you are interested in the details, you can find a description of the precise stack layout in the ctOS documentation.

And finally, an interrupt might in addition cause a change of privileges – we will learn more about this when we discuss the protected mode in one of the future posts.

Note that the CPU does not automatically place all other registers like EAX on the stack, so the interrupt service handler needs to make sure that these registers are restored to their original state before the ISR returns, otherwise the running program will continue execution with a corrupted register set (but an operating system can use this to deliberately change the values of the registers of a running program).
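
To summarize the mechanism, here is a deliberately simplified toy model in Python – purely conceptual, and of course nothing like how a real CPU or the ctOS kernel implements this:

stack = []                                   # the stack of the interrupted program
idt = {                                      # toy interrupt descriptor table
    0x20: lambda: print("timer interrupt handled"),
    0x21: lambda: print("keyboard interrupt handled"),
}                                            # a real x86 IDT has 256 entries

eip = 0x1000                                 # pretend the CPU is executing code here

def raise_interrupt(vector):
    global eip
    stack.append(eip)                        # the CPU saves the return address (and more)
    idt[vector]()                            # ... and jumps to the interrupt service handler
    eip = stack.pop()                        # IRET: restore EIP and resume the program

raise_interrupt(0x20)
print(hex(eip))                              # 0x1000 - execution continues where it left off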

Sources of interrupts and interrupt routing

Now where do interrupts really come from? Basically, there are three sources of interrupts.

First, there are hardware interrupts. These are interrupts that are caused by hardware external to the CPU. An example could be a keyboard that uses an interrupt to signal that the user has pressed a key, or a timer that is programmed to generate specific interrupts periodically. We will see below how these interrupts are used in an operating system kernel.

Second, there are exceptions and faults, i.e. interrupts that are raised by the CPU itself to signal some sort of error condition, like the attempt to divide by zero or a general protection fault. And third, there are software interrupts in the narrow sense, which are raised by a program on purpose using the INT instruction.

The exact mechanism how a hardware interrupt travels from a peripheral device to the CPU is one of the most complicated and bloated areas of the PC architecture. In most cases, there is an additional piece of hardware sitting between a device and the CPU that is responsible for handling this communication and is called the interrupt controller. The first IBM compatible PCs had an interrupt controller called the PIC that was able to handle eight different hardware interrupts. However, with an increasing number of devices, this soon turned out to be not sufficient, so a second interrupt controller was cascaded with the first one to be able to manage sixteen interrupts. This itself was made a bit more complicated than it sounds by the fact that at this time, two bus systems were around – the new PCI bus and the old ISA bus – and that the interrupt routing could be programmed by the BIOS using the so-called programmable interrupt router (PIR). So we already had a bit of a complex setup, as indicated in the diagram below.

Then the first PCs started to have more than one CPU, and a more flexible mechanism was invented – the APIC. Unfortunately, the old interrupt controller was still around and could be used, making for a very complex configuration mechanism. And finally, newer devices use a mechanism called MSI that avoids an interrupt controller almost completely. If you want to read up the full story, again the ctOS documentation has all the nitty-gritty details.

Where are interrupts used?

I started this post with the bold claim that interrupts are the heartbeat of a modern operating system kernel – so let me try to justify this statement.

First, let us think about how a CPU would typically communicate with the hardware. Take a very simple example – user input. Of course, in a dialogue driven programming style, things are simple. You would print something to the screen and then wait for user input. More specifically, you would periodically, maybe every few microseconds, ask the keyboard controller for any new data that has arrived.

Such a design is simple, but has a major disadvantage: while you are actively waiting (polling in the language of computer scientists) for input, you consume CPU time and the CPU cannot do anything else. This is not a major problem for a home computer operating system of the early eighties, but turns into a problem when you want to support multi-tasking or even multi-user capabilities.

Interrupts offer a way out. Instead of sitting in a loop and polling for new data every few thousand iterations, the CPU can hand over control to another task which is currently not expecting user input and can let the keyboard controller fire an interrupt if the user presses (or releases) a key. The interrupt service handler can then inspect the input and either put it into a buffer for later processing or initiate a task switch back to the program waiting for the input (we will look in more detail into the exact mechanics of a task switch in a later post).

A similar mechanism can be used for other input/output related processing. If, for instance, we want to read data from the hard disk, we can apply a procedure called direct memory access (DMA). With DMA, the CPU would ask the hard disk controller to load one or more sectors from the disk and then continue to work on other tasks. The controller would – in the background – read the data from the disk and transfer it into a designated area of the system’s main memory. Once the transfer is complete, it would inform the CPU by raising an interrupt. The CPU could then, again in an interrupt service handler, get the data and process it further. Similar mechanisms apply to networking – incoming packets can trigger interrupts, or an interrupt could signal to the CPU that the network controller is ready to send out data.

But interrupts are not only useful for I/O related tasks – they are also a major ingredient for the implementation of multi-tasking. In fact, the very first multi-tasking systems used for home PCs were based on a design called cooperative multi-tasking. With that design, a task would actively return control to the operating system after executing for some time to allow other processes to run. You might want to compare this to a group of people on a hot summer day sharing a bottle of water. When everybody cooperates and passes on the bottle after a few sips, this works fine. But things go wrong if someone – for whatever reason – just keeps on drinking until the bottle is empty. Translated back to the world of operating systems, this design allows a single process to consume the entire CPU time, so that no other process can execute any more, for instance because it is stuck in an infinite loop due to a programming error – the machine will appear to hang. Those of you who have seen Windows 3.11, which used this design, know what I am talking about.

A more advanced design can be realized using interrupts. A modern PC does for instance have a timer that creates interrupts at a defined frequency, say 1000 Hz, i.e. one interrupt every millisecond. If the interrupt service routine triggered by the timer performs a task switch, this makes sure that even if a process hangs, control will be returned to the operating system and potentially to a different process after at most one millisecond. This design is called preemptive multitasking because the task switch is not actively initiated by the currently running process, but the process is interrupted (preempted) by the ISR and the operating system.

And there are even more applications that are worth being mentioned – you can use interrupts for debugging, several CPU cores can use special interrupts to communicate with each other and a thermal sensor can use an interrupt to let the CPU know that a certain temperature has been exceeded. In fact, in an operating system kernel, a lot of processing is done in an interrupt service handler, like dealing with hardware interrupts, scheduling a new task and handling signals – see the overview of the interrupt processing in ctOS below to get an idea.

There is one more point that we have ignored so far – interrupts can be used to switch back between kernel space and user space, or between ring 3 and ring 0 of a CPU. However, explaining this requires a basic understanding of the protected mode, so we will leave this for a later post.