Joel Laity

Book review: Vaxxers

2021-08-21T00:00:00+00:00

Intro

In January 2020 I was on a holiday in China with my boyfriend. When we had boarded the plane to Shanghai a week earlier Coronavirus was barely on our radar. Things changed very quickly. We planned to stay for 14 days but by the tenth day into our vacation all non-essential stores were shut, buying masks required hours of searching, and the Chinese government sent a text informing everyone in our area that a major highway 2 hours from our hotel would be closed “indefinitely”. We decided to leave on the evening of Tuesday 28 January, and booked flights for Wednesday morning, 3 days before our original departure.

You know shit has hit the fan when even Starbucks is closed.

If we had waited for our original departure flight, which left on Saturday, we would have been at the airport in Shanghai when the news broke that the Australian government would not let non-Australians flying from China back into the country. As a New Zealander, I would not be allowed back into the country where I had worked and lived for the past 12 months.

By the end of March, Sydney was in lockdown. And almost all news sources were warning that developing a vaccine would take many years. This seemed reasonable to me at the time; new vaccines usually took at least ten years to develop and roll out. The fastest vaccine ever developed was the mumps vaccine in the 1960s and that took 4 years. Little did I know that at that time Oxford University scientists had already analyzed the virus genome, modified it so it was suitable for a vaccine, and were now in the process of growing enough vaccine-ready material to be able to conduct a clinical trial. The development of the first batch of the AstraZeneca vaccine was nearly complete.

Summary of the book

Vaxxers is a book that chronicles the effort to develop the AstraZeneca vaccine. But unlike most such books, which are written by a journalist based on interviews, Vaxxers was written by two scientists who led the effort to create the vaccine, Sarah Gilbert and Catherine Green.

Their main motivation for writing the book was to make vaccine development seem less mysterious and humanize the scientists who developed the vaccine. Their hope was that this would allay fears about receiving a shot, so I was a bit concerned their PR instincts would take over and the book would be rather bland. In some respects this was true: many sections of the book were targeted at people who were skeptical of vaccines in general. But they also went into great detail explaining what it was like to be in their shoes, racing to develop the vaccine. What were they actually doing day-to-day? How do you go from a coronavirus genome sequence on a computer to a working vaccine in a few months? What does it feel like to actually do that work when you know that every decision you make has enormous consequences for all of humanity but you are also in a race against the worst pandemic in living memory, so decisions have to be made quickly? I could feel that both Gilbert and Green were proud and excited to share with the reader exactly how they made the vaccine and what their day-to-day lives felt like during that time.

The vaccine technology

Before I read the book I thought that the AZ vaccine was made using old, traditional methods, in contrast to the mRNA vaccines which were a new, general purpose technology.

It turns out this was not true. The AZ vaccine is made using a “platform technology”: you can create a vaccine for any virus using the same technology (whether or not it is effective is a different story). The basic idea is to get the genome of the COVID-19 virus, identify the part you want (the spike-protein part), make a few minor modifications (on a computer) and then send that data to a commercial lab. It only takes around two weeks (!!) for them to manufacture and send back a test tube with 100 billion real strands of DNA. You then combine that DNA with a non-replicating version of a chimpanzee adenovirus and voilà! You have the active ingredient in the vaccine. It then took another few months for the Oxford scientists to produce enough of it to conduct clinical trials.

Ebola

The Oxford scientists actually used this exact technology to create a vaccine for Ebola in 2014. Unfortunately, the vaccine never got to Phase III clinical trials. Ironically the concern was not that the vaccine would be unsafe, the concern was that in order to conduct a randomised controlled trial you need to give a placebo to half the participants, and by this time Ebola was so bad that it was argued (by the WHO I think?) it would be unethical to give people a placebo. The Oxford scientists were unimpressed with these arguments:

Anyone taking part in the trial would be closely monitored and, if infected, would receive care as early as possible, thus improving their chances further. And, the longer discussions about ethical trial design went on, the longer no one was receiving a vaccine that gave them any chance of protection at all.

Instead they used a “ring vaccination study” where the participants for the clinical trial are chosen in a different way and everyone in the clinical trial gets the vaccine, but some people in the trial get it later than others:

So instead, another type of trial design was eventually decided upon: a ring vaccination study with delayed vaccination in half of the rings. In order to use this type of trial design it is necessary to first identify someone who has been infected with Ebola. Then all of that person’s contacts are identified, and the contacts of their contacts, and the limits of the geographical area where they can be found are defined. That group of people forms a ‘ring’ around the initial case. Many rings are identified in this way, and each ring is randomly assigned to receive either immediate vaccination or delayed vaccination.

I don’t really see how this avoids the ethical concerns. Everybody who is not in the clinical trial is still given nothing and the more complicated trial design combined with the back and forth trying to nail down an acceptable trial design just delayed the vaccine even more.

Whilst all of the discussions about how to conduct the phase III trials had been going on, so had the Ebola crisis. It was frustrating to see the process slow down dramatically when a vaccine was so desperately needed…. by April 2015 when the ring vaccination study started – a full year after the outbreak became widespread – the case numbers were low and falling.

By the time they got approval for the trial Ebola cases were low and there was only enough Ebola going around to test one vaccine. A vaccine developed by a different group was chosen and the Adenovirus-based vaccine was never tested for efficacy.

The Oxford scientists thought this delay was unacceptable. WHO had a different take.

Why did it take so long to test for efficacy? It was four months from the outcome of the phase II trials until the start of the phase III study. I made this point at a conference where I had been invited to speak about our vaccine trial, and received a rather angry response from a WHO representative who insisted that everything had been done as fast as possible. But the fault does not lie with individuals not doing their job properly in the thick of things. The problem was a lack of preparation. The fact was that the delays meant not only that it took longer to contain the deadly Ebola virus, but also that only one vaccine ended up being tested for efficacy – a vaccine that required very low temperature storage, making it difficult and expensive to use in hot countries.

The line “the fault does not lie with individuals but … lack of preparation” seems like a cop-out to me. I understand the impulse not to blame individuals, and more preparation for Ebola-like events is justified. But the problem here was that WHO’s “ethical concerns” don’t really make sense. WHO’s behavior is better explained by thinking of them as a vetocracy rather than an organization primarily concerned about ethics.

Developing a COVID vaccine

Anyway, the Oxford researchers had this vaccine technology but it had never been subjected to efficacy trials. Then the coronavirus came. Within 48 hours of the COVID-19 genome being released the Oxford group had figured out which part they needed, modified it slightly, and sent the sequence to a commercial company, ThermoFisher, to be manufactured. Remarkably, it only takes around a fortnight for these companies to turn DNA sequences stored in a computer into a test tube with around 100 billion strands of real DNA. While waiting for the DNA to come back from ThermoFisher the Oxford scientists made preparations for how they were going to manufacture the vaccine. The lab available to them had already been committed to other projects. Thankfully they decided to say (paraphrasing) “fuck it - we’ll figure out finances later”:

My other concern was financial. The CBF [the lab they worked in] is run like a small business within the university, and in order to operate it has to cover its costs of around £1.5 million a year by charging its clients – researchers like Sarah and Tess – who in turn have to apply for grants to fund their research. The projects that Sarah was asking me to delay or deprioritise were already agreed, and their funding was secure right through to manufacture. It would be a big risk to the CBF’s operation to drop those in favour of this new project, when it wasn’t at all clear where the money might come from to pay for any of it.

They tried two different methods to manufacture the vaccine in parallel. A rapid method which was less likely to work and a slower but more reliable method.

The rapid method was originally developed to help fight cancer. You get some mutated DNA from a tumor, create a vaccine, give it to a cancer patient and their immune system will learn to recognize the mutated DNA and attack it. A cancer patient’s own immune system could learn to attack the tumour. This immune response sometimes happens without pharmaceutical intervention so the mechanism could plausibly work, but you need a vaccine personalized for every cancer patient’s exact tumour mutation. The only way this would be economically viable is with very fast vaccine manufacture.

In both methods the basic process is the same. They inserted some of the adenovirus/COVID-19 DNA into human cells. Remarkably, these cells all originate from the kidney of a single fetus that was aborted in the Netherlands in the 1970s. These human cells have been sitting in labs replicating for the past 50 years.

These human cells start producing the virus that would become the main ingredient in the vaccines. It’s not clear to me exactly why the virus can replicate in these cells but cannot replicate when injected into my arm, but I think it’s because they do a special procedure to get it inside the human cells.

They then carefully help these human cells replicate until they have enough of the virus. In the end they made 300 ml of fluid with this virus. That cup of fluid was “destined to seed the manufacture of every dose of the Oxford vaccine ever produced.” (Imagine holding that in your hand knowing that if you break it your clumsiness could be measured in millions of lives.)

Sadly the rapid method did not work. Fortunately, the traditional one did. This was one example of a broader theme in the book where they tried many different things in parallel, fully expecting to have some “wasted” effort.

To move quickly, we would still perform all the same tests as usual, we just wouldn’t wait for the results before moving on to the next part of the process. If the starting material failed any of its tests, we would have to throw out anything we had made from it. But that risk – a risk of wasted time and effort and serious money, but not of quality – was one we were prepared to take.

The next few weeks were some of the most hectic and surreal of my life. I was running in parallel half a dozen stages of vaccine development that would usually happen over years and in sequence.

Because we had never used this method before, we also made preparations with lots of different conditions: different ratios of adenovirus DNA to spike protein DNA, different ratios of cells to DNA, and so on.

Thank God they appreciated early on just how important speed was.

Partnering with AstraZeneca

The Oxford group then had to partner with a company to scale up the manufacturing. Surprisingly, the scientists themselves did not have input into selecting the company.

But Andy’s email was the first time I heard mention that it would be AstraZeneca, and it was something of a surprise. I knew they were big in cancer medicines and they were obviously a name in the pharma world, but they did not have a particular reputation for vaccine manufacture. We felt a bit disconnected: as though decisions that would really affect our working lives (by this point all of us were working on this project all of the time and it had completely taken over every waking and sleeping hour) were being taken at the highest level of the university with no consultation with those of us who actually knew how to make this vaccine.

Working with a large multinational was the exact opposite of what they were used to: a close-knit group of scientists who all understood what everyone else was working on.

AstraZeneca is an enormous entity, with multiple teams across the UK and the US with quite specialised roles, whereas everyone at our end was involved in and knew about everything. Also, they had no experience of manufacturing viral vectors, so the technical aspects of producing viral vectors, and the quality tests needed for these kinds of products, were all new to them. It was frustrating to have to keep repeating ourselves to slightly different combinations of AstraZeneca people.

But the Oxford scientists did come to appreciate their corporate partners. Companies have certain strengths that tend to complement research scientists.

I remember being in a meeting very early on, probably in May, when someone at AstraZeneca confidently used the phrase ‘billions of doses’. That’s a real shock to the system when a really big day for you is manually putting 500 doses into vials. They were prepared to throw everything at it straightaway, rather than waiting for results from clinical trials before they fully invested.

Companies also just have way more money to spend. (Provided the benefit/cost ratio is high enough).

The vaccine was still in Italy, the trials were in the UK – and there were no commercial flights operating between the two. We were stuck. … It turns out chartering a private jet costs around £20,000, normally well beyond the budget of a small academic clinical trial. But by this time we had the might of a global pharma company behind us. All those meetings with our AstraZeneca colleagues were starting to come good: we got permission to proceed. The jet arrived in London the next day with no passengers, just a large box of dry ice and 500 precious vials for next-day distribution across the UK. The trial must go on.

Clinical trials and regulatory agencies

The clinical trials were similar to clinical trials in normal times, except that the gaps between the phase I, II and III trials were much smaller and all the data was processed much faster. Recruiting volunteers was also much easier.

The recruitment of volunteers to clinical trials is often quite a challenge…. [But] Within hours of announcing that we were recruiting volunteers for trials, we had thousands of applications.

The scientists praise the MHRA (the UK health regulator) in the book. They said the MHRA was quite cooperative and was willing to weigh the harms of using a slightly different process than usual against the benefits of manufacturing a vaccine faster. For example, there were some issues with the dosing in the Phase I trials and the MHRA was reasonably flexible. I take what the authors say with a pinch of salt because the MHRA acts as a gatekeeper for all their work. So it is probably unwise for them to heavily criticize the MHRA and sour their relationship with the very same people who they need to cooperate with for the rest of their careers. On the other hand, anecdotes like this are good evidence the praise is genuine:

The biggest risk (albeit a very small one) was that the Covid-19 vaccine might get contaminated with a bit of the previously manufactured product. We had a very sensitive and specific test for this valuable product, so we knew we would be able to test our final vaccine to check if there had been any contamination. Proceeding without fumigation would save at least three weeks. We drew up a formal risk assessment and submitted it to the Medicines and Healthcare products Regulatory Agency (MHRA, the body responsible for approving every step of our vaccine development process and ultimately for deciding whether to allow it to be used), who agreed our approach. (This was the first of a very large number of communications we would have with the MHRA over the coming months. That relationship, and the MHRA’s proactive approach, is a critical part of this story.)

Compare this to the FDA (the US health regulator). The FDA has been criticized heavily throughout the pandemic for being too biased in favor of inaction and inflexibly following pre-pandemic processes. This anecdote was telling:

The FDA approach was more process-driven, whereas the MHRA’s approach was more interactive, and more focused on gathering the evidence needed to assess the risks and answer the scientific questions. By way of illustration, many years previously we had been asked to collaborate with a US group working on malaria vaccine development. We had already completed a phase I clinical trial on a vaccine. It had been well tolerated, but – as happens a lot in vaccine development – the immune response was not as high as we had hoped, and we were not planning to proceed any further with it. But we did still have some of the batch left and the US group wanted to do a trial using our vaccine in combination with another one to see if that might improve the immune response. The issue we came up against was that although we had completed a clinical trial successfully in the UK, the FDA required toxicology studies to have been completed in two different species whereas in the UK we only have to complete a toxicology study in one species. We had done that and proceeded to human trials, and shown no safety concerns. On a call with the FDA, we explained that we had safety data from mice, and also from humans, which are a species after all, so would that work for them? The answer was no. They needed toxicology studies from another animal species – a rat or a rabbit. The problem was that if we did a toxicology study in rats or rabbits, it would use up the limited amount of vaccine that was remaining, and we wouldn’t then be able to do the clinical trial.

…

We were, however, unable to come to an agreement so the clinical trial was never carried out.

It was hard to read passages like the one above. The TGA (the Australian health regulator) has somehow been even slower than the FDA. As a result, I’m writing this during a lockdown in Australia where (as of Aug 21) our vaccination rate is lower than that of 36 of the 38 OECD countries. These excerpts from an article by Steven Hamilton and Richard Holden give a summary for those of you who are unfamiliar with Australia:

At the end of 2020, as vaccines were rolling out en masse in the Northern Hemisphere, the TGA [Therapeutic Goods Administration, AT] flatly refused to issue the emergency authorisations other regulators did. As a result, the TGA didn’t approve the Pfizer vaccine until January 25, more than six weeks after the US Food and Drug Administration (FDA), itself not exactly the poster child of expeditiousness.

Similarly, the TGA didn’t approve the AstraZeneca vaccine until February 16, almost seven weeks after the UK.

In case you’re wondering “what difference does six weeks make?“, think again. Were our rollout six weeks faster, the current Sydney outbreak would likely never have exploded, saving many lives and livelihoods. In the face of an exponentially spreading virus that has become twice as infectious, six weeks is an eternity. And, indeed, nothing has changed. The TGA approved the Moderna vaccine this week, eight months after the FDA.

It approved looser cold storage requirements for the Pfizer vaccine, which would allow the vaccine to be more widely distributed and reduce wastage, on April 8, six weeks after the FDA. And it approved the Pfizer vaccine for use by 12 to 15-year-olds on July 23, more than 10 weeks after the FDA.

Where’s the approval of the mix-and-match vaccine regimen, used to great effect in Canada, where AstraZeneca is combined with Pfizer to expand supply and increase efficacy? Where’s the guidance for those who’ve received two doses of AstraZeneca that they’ll be able to receive a Pfizer booster later?

But the slow, insular, and excessively cautious advice of our medical regulatory complex, which comprehensively failed to grasp the massive consequences of delay and inaction, must be right at the top of that list.

Conclusion

The book has great anecdotes from the scientists and delves into lots of other topics in detail:

The media and political response to the vaccine, e.g. Macron said the vaccine “seems quasi-ineffective on people older than 65” based on a news article that claimed the vaccine was “only 8% effective” for old people. This was completely made up.
The infamous dosing problem in the clinical trial.
Security against anti-vax protestors.
How much time scientists spend securing funding. Gilbert writes “Actually, raising funds had been my main activity for years”.
Just how stressful this was for everyone involved.

Unfortunately there’s not that much discussion of the blood clot concerns because that was a relatively recent development. (Personally, I think those concerns are overblown and have been vaccinated with AZ myself.)

The take-home message of the book was that we need to be prepared to make vaccines for a pandemic (duh) and we need to be prepared to make them quickly. In particular, the authors suggest working on annual flu vaccine development with the same emphasis on speed that we would expect in a pandemic. This can be used to validate that we have the capability to rapidly create a vaccine and scale up manufacturing before the next pandemic hits us.

For example, might it be cost-effective, given how much flu costs the economy, to work on flu vaccine development with as much urgency as we applied to the Covid vaccine? It would require more funding upfront, and the acceptance that not everything that was tried would work, but it might be the way to make some real progress rather than continuing to limp along as we have in the past, with small projects and no joined-up approach.

Most of the delay when creating the vaccine was logistical: securing funding, getting regulatory approval, conducting clinical trials and collaborating with multinational companies. An end-to-end test of rapid vaccine production, using the flu as a test case, is a great way to make sure every part of the pipeline can move quickly during a future pandemic. Besides, the normal flu is pretty deadly so we should arguably be putting a lot more money into fighting it anyway.

Buy the book here.

libc++’s implementation of std::string

2020-01-31T00:00:00+00:00

I. Introduction

libc++ is the LLVM project’s implementation of the C++ standard library. libc++’s implementation of std::string is a fascinating case study of how to optimize container classes. Unfortunately, the source code is very hard to read because it is extremely:

Optimized. Even for relatively niche use-cases.
General. The std::string class is a specialization of basic_string. basic_string can accept a custom character type and custom allocator.
Portable. This leads to #ifdef macros everywhere.
Resilient. Every non-public identifier is prefixed with underscores to prevent name clashes with other code. This is necessary even for local variables since macros defined by the user of the library could modify the library’s header file.
Undocumented. There are very few comments in the <string> header. I assume this is because library vendors would prefer it if users did not rely on internal implementation details of their classes, and not documenting internal helper functions is a desperate effort to mitigate Hyrum’s law.

This post examines the implementation of libc++’s std::string. To keep it simple I will assume you are using a modern compiler and a modern x86 processor¹. Keep in mind that the way objects are laid out in memory is very specific to the compiler, CPU architecture and standard library used; everything I describe below is an implementation detail and not defined by the C++ standard.

II. Data layout

std::string has two modes: long string and short string. It uses a union to reuse the same bytes for both modes. Short string mode is an optimization which makes it possible to store up to 22 characters without heap allocation.

Long string mode

The long string mode is a pretty standard string implementation. There are three members:

size_t __cap_ - The amount of space in the underlying character buffer. If the string grows enough that the length of the string (including the null terminator) exceeds __cap_ then the buffer must be reallocated. __cap_ is an unsigned 64 bit integer. The least significant bit of __cap_ is used as a flag, see the discussion below.
size_t __size_ - The size of the current string, not including the null terminator. This is also an unsigned 64 bit integer.
char* __data_ - A pointer to the underlying buffer where the characters of the string are stored. This is 64 bits wide.

Since each member is 8 bytes, sizeof(std::string) == 24.

std::string uses the least significant bit of __cap_ to distinguish whether it is in long string mode or short string mode. If the least significant bit is set to 1, then it is in long string mode. If it is set to zero, then it is in short string mode. It is possible to use the least significant bit in this way because the size of the buffer is guaranteed by the implementation to always be an even number - so the true value for the capacity always has a 0 in the least significant bit. The method std::string::capacity() has an implementation that is equivalent to this (the real code looks quite different):

size_t capacity() {
  if (__cap_ & 1) { // Long string mode.
    // buffer_size holds the true size of the underlying buffer pointed
    // to by __data_. The size of the buffer is always an even number. The
    // least significant bit of __cap_ is cleared since it is just used
    // as a flag to indicate that we are in long string mode.
    size_t buffer_size = __cap_ & ~1ul;
    // Subtract 1 because the null terminator takes up one spot in the
    // character buffer.
    return buffer_size - 1; 
  }

  // <Handle short string mode.>
}
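
The flag trick can also be looked at in isolation. The following is a minimal sketch, not libc++ code: the helper names pack_long_cap, is_long and true_capacity are made up for illustration, and it assumes a 64-bit size type and buffer sizes that are always even.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating the flag bit; these are not libc++
// identifiers. Assumes buffer sizes are always even, so bit 0 is free.

// Store an even buffer size together with the "long mode" flag in bit 0.
constexpr std::uint64_t pack_long_cap(std::uint64_t buffer_size) {
  return buffer_size | 1u;
}

// True if the packed word says the string is in long mode.
constexpr bool is_long(std::uint64_t packed) {
  return (packed & 1u) != 0;
}

// Recover the usable capacity: clear the flag bit, then subtract one
// slot for the null terminator.
constexpr std::uint64_t true_capacity(std::uint64_t packed) {
  return (packed & ~std::uint64_t{1}) - 1;
}

static_assert(is_long(pack_long_cap(32)), "flag bit is set");
static_assert(true_capacity(pack_long_cap(32)) == 31,
              "a 32-byte buffer holds 31 characters plus the terminator");
```

Because the buffer size is even, setting and clearing bit 0 never loses information about the real capacity.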

Short string mode

The short string mode uses the same 24 bytes to mean something completely different. There are two members:

unsigned char __size_ - The size of the string, left-shifted by one (__size_ == (true_size << 1)). The true size of the string is left-shifted by one because the least significant bit of the first byte is used as a flag. The least significant bit must be set to 0 in short string mode.
char __data_[23] - A buffer to hold the characters of the string.

__size_ stores the size of the string left shifted by 1, so the method std::string::size() has an implementation equivalent to this:

size_t size() {
  if ((__size_ & 1u) == 0) {  // Short string mode. Parentheses matter: == binds tighter than &.
    return __size_ >> 1;
  }
  // <Handle long string mode.>
}

Because we are assuming the target architecture is little-endian, the least significant bit of __cap_ is in the same position as the least significant bit of __size_.
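
A small sketch of both ideas, the shifted size and the overlapping flag bit. The helper names here are hypothetical, and the last function only returns true on a little-endian machine, which is the assumption made throughout this post.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers; not libc++ identifiers.

// Short mode stores the size shifted left by one, leaving bit 0 as 0.
constexpr unsigned char encode_short_size(unsigned n) {
  return static_cast<unsigned char>(n << 1);
}

constexpr unsigned decode_short_size(unsigned char stored) {
  return static_cast<unsigned>(stored) >> 1;
}

// Returns the lowest-addressed byte of a 64-bit word.
inline unsigned char first_byte(std::uint64_t word) {
  unsigned char b;
  std::memcpy(&b, &word, 1);
  return b;
}

// On a little-endian machine the lowest-addressed byte holds the least
// significant bits, so the long-mode flag in a capacity word is visible
// in the very byte that short mode uses for its size.
inline bool flag_bits_overlap() {
  std::uint64_t long_cap = 32u | 1u;  // even buffer size plus the flag
  return (first_byte(long_cap) & 1u) == 1u;
}

static_assert((encode_short_size(22) & 1u) == 0, "short mode keeps bit 0 clear");
```

On a big-endian machine first_byte would return the most significant byte instead, which is why the layout has to change there.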

III. Implementation

To see how the libc++ implementation achieves the data layout described above, I’m going to copy and paste real code snippets from libc++ and add comments.

Long mode is reasonably straightforward; it’s implemented like this:

// size_type and pointer are type aliases.
struct __long {
  size_type __cap_;
  size_type __size_;
  pointer __data_;
};

Short mode looks like this:

static const size_type __short_mask = 0x01;
static const size_type __long_mask = 0x1ul;

enum {
  __min_cap = (sizeof(__long) - 1) / sizeof(value_type) > 2
                  ? (sizeof(__long) - 1) / sizeof(value_type)
                  : 2
};

struct __short {
  union {
    unsigned char __size_;
    value_type __lx;
  };
  value_type __data_[__min_cap];
};

According to this Reddit comment, __lx is needed to ensure any padding goes after __size_, but has no other purpose (I don’t fully understand why this forces the padding to go after __size_ 🤷‍♂). __min_cap is 23 on the platforms we are considering (64-bit).

So the first byte of __short is occupied by __size_, and the next 23 are occupied by the __data_ array.
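
To see where 23 comes from, the __min_cap expression can be evaluated for a few character types. This is a sketch under assumptions: LongRep is a stand-in for __long with three 8-byte fields (the 64-bit layout assumed in this post), and min_cap is a hypothetical helper that mirrors the enum above.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for __long on a 64-bit platform: three 8-byte fields.
struct LongRep {
  std::uint64_t cap;
  std::uint64_t size;
  std::uint64_t data;  // models the 8-byte pointer
};

// Mirrors the __min_cap expression: one byte is reserved for the size,
// the rest is divided into characters, with a floor of 2.
template <class CharT>
constexpr std::size_t min_cap() {
  return (sizeof(LongRep) - 1) / sizeof(CharT) > 2
             ? (sizeof(LongRep) - 1) / sizeof(CharT)
             : 2;
}

static_assert(min_cap<char>() == 23, "23 one-byte characters");
static_assert(min_cap<char16_t>() == 11, "11 two-byte characters");
static_assert(min_cap<char32_t>() == 5, "5 four-byte characters");
```

One of those slots is always taken by the null terminator, which is why a std::string in short mode holds at most 22 characters.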

The string is then represented like this:

// __ulx is only used to calculate __n_words.
union __ulx {
  __long __lx;
  __short __lxx;
};

enum { __n_words = sizeof(__ulx) / sizeof(size_type) };

struct __raw {
  size_type __words[__n_words];
};

struct __rep {
  union {
    __long __l;
    __short __s;
    __raw __r;
  };
};

The __rep struct represents the string. It is a union of __long and __short as expected.
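
The sizes involved can be checked with a simplified model of these types. This is a sketch, not the libc++ definitions: the names LongRep, ShortRep, RawRep and Rep are invented here, char is used as the character type, and fixed-width integers stand in for size_type and pointer on a 64-bit platform.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for __long, __short, __raw and __rep.
struct LongRep {
  std::uint64_t cap;
  std::uint64_t size;
  std::uint64_t data;  // models the 8-byte pointer
};

struct ShortRep {
  unsigned char size;
  char data[23];
};

// Same role as __n_words: how many size_type-sized words cover the union.
constexpr std::size_t kWords = sizeof(LongRep) / sizeof(std::uint64_t);

struct RawRep {
  std::uint64_t words[kWords];
};

union Rep {
  LongRep l;
  ShortRep s;
  RawRep r;
};

static_assert(sizeof(ShortRep) == sizeof(LongRep), "both modes use 24 bytes");
static_assert(sizeof(Rep) == 24, "the union is no bigger than either mode");
```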

The __raw struct just wraps an array of three size_type words (24 bytes in total), which allows some of the methods to treat the string as a sequence of words without having to care about whether the string is in long or short mode. For example, after a string is moved-from it is zeroed out, and the __zero() method is implemented like this:

void __zero() {
  size_type (&__a)[__n_words] = __r_.first().__r.__words;
  for (unsigned __i = 0; __i < __n_words; ++__i)
    __a[__i] = 0;
}

Finally, the only member variable in std::string is declared like this:

// allocator_type is the allocator defined by the user of basic_string
__compressed_pair<__rep, allocator_type> __r_;

__compressed_pair behaves like std::pair, except it has an optimization where if one of the template arguments of the pair is an empty class then that class will not contribute to the size of the pair. std::pair is larger than it needs to be, for example:

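#include <iostream>
#include <memory>  // Assumed to provide std::__compressed_pair in the libc++ version discussed here.
#include <utility>

struct E {};

int main() {
  std::pair<int, E> p;
  std::cout << sizeof(int) << std::endl;  // Outputs 4.
  std::cout << sizeof(E) << std::endl;  // Outputs 1.
  std::cout << sizeof(p) << std::endl;  // Outputs 8.
  // __compressed_pair is a libc++ internal, so this line only compiles with libc++.
  std::cout << sizeof(std::__compressed_pair<int, E>) << std::endl;  // Outputs 4.
}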

The reason E uses any space in the example above is for language-technical reasons: every object must have a unique memory address. (This will change in C++20, see here and here.) std::pair stores the objects next to each other in memory, and padding means that the E struct in the example above contributes 4 bytes to the pair.

__compressed_pair will not use any extra space if allocator_type is empty.
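
One common way to get this effect is the empty base optimization: an empty class used as a base class does not need storage of its own. The sketch below shows the idea only; it is not how libc++ writes __compressed_pair, the name TinyCompressedPair is made up, and it assumes the second type is a non-final class.

```cpp
#include <type_traits>

struct Empty {};

// Simplified pair that stores `Second` as a base class instead of a
// member. Only valid when Second is a class type that is not final.
template <class First, class Second>
struct TinyCompressedPair : private Second {
  First first_value;

  First& first() { return first_value; }
  Second& second() { return *this; }  // the base subobject is the "second"
};

// For comparison: the same two types stored as ordinary members.
struct PlainPair {
  int first;
  Empty second;
};

static_assert(std::is_empty<Empty>::value, "Empty has no data members");
static_assert(sizeof(TinyCompressedPair<int, Empty>) == sizeof(int),
              "the empty base adds no bytes");
static_assert(sizeof(PlainPair) > sizeof(int),
              "an empty member still takes space");
```

With the default std::allocator, which is an empty class, this is what keeps sizeof(std::string) at 24 rather than 32.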

And that’s all there is to it! The implementation of std::string looks like this (with #ifdefs removed):

template <class _CharT, class _Traits, class _Allocator>
class _LIBCPP_TEMPLATE_VIS basic_string : private __basic_string_common<true> {
  // <Code omitted.>

private:
  struct __long {
    size_type __cap_;
    size_type __size_;
    pointer __data_;
  };

  static const size_type __short_mask = 0x01;
  static const size_type __long_mask = 0x1ul;

  enum {
    __min_cap = (sizeof(__long) - 1) / sizeof(value_type) > 2
                    ? (sizeof(__long) - 1) / sizeof(value_type)
                    : 2
  };

  struct __short {
    union {
      unsigned char __size_;
      value_type __lx;
    };
    value_type __data_[__min_cap];
  };

  union __ulx {
    __long __lx;
    __short __lxx;
  };

  enum { __n_words = sizeof(__ulx) / sizeof(size_type) };

  struct __raw {
    size_type __words[__n_words];
  };

  struct __rep {
    union {
      __long __l;
      __short __s;
      __raw __r;
    };
  };

  __compressed_pair<__rep, allocator_type> __r_;

public:
  // <Code omitted.>
};

// In another file:
typedef basic_string<char> string;

Here is the full source on GitHub if you want to take a look.

Comment on Hacker News

1 In particular, I will assume that (1) you are using the standard ABI layout, (2) your computer is 64-bit and little endian and (3) the char type is signed and CHAR_BIT is 8. (There may be something else I missed. In practice I’m just assuming the layout on your machine is the same as on my machine.) ↩

How linking works

2020-01-25T00:00:00+00:00

I. Introduction

C++ programs must be compiled and linked before they can be executed. Compilation takes each human-readable .cc file as input and produces a machine-readable .o file as output. Since a .cc file can use a function defined in another file, linking is necessary to match up the call sites of a function with its definition and produce the final executable.

It is not always obvious that linking is a separate step from compilation because command line tools like g++ do both the compilation and linking in one go.

II. Separate compilation and linking example

We’ll use this simple program as a running example.

To produce an executable file from the source code above, type:

$ g++ square.cc main.cc  # Compile and link square.cc and main.cc.

If you want to separately compile and then link you can type:

$ g++ -c square.cc  # Compile square.cc to machine code.
$ g++ -c main.cc    # Compile main.cc to machine code.
$ g++ square.o main.o  # Link square.o and main.o.

The -c flag tells g++ to compile the file without linking. When you pass .o files to g++ it will link them together.

g++ -c square.cc takes the source code in square.cc, converts it to machine code that can be executed by your computer, and puts that machine code in the square.o file along with some bookkeeping information.

The square function is declared in main.cc but it is not defined in main.cc.

A declaration looks like int square(int x);. It tells the compiler the type of the return value and the types of the arguments of the function. This allows the compiler to type check the function call square(3) without having to know how square is implemented¹. Typically declarations will be in a header file.

A definition contains the actual body of the function. In our example, the square function is defined in square.cc.

After main.cc is compiled, the main.o file contains the machine code for the main function and some metadata which records that the square function is declared, but not defined, in main.o.

In the final step, g++ square.o main.o links the object files into an executable program by matching up the function declaration in main.o with the function defined in square.o.

III. Linking with system libraries

Most libraries have many source files, and therefore many .o files. When a library is distributed, its object files are typically bundled into a single archive file with the .a extension, so that users can link against one file instead of many.

When you install a C library on Linux, the headers for the library are typically placed in /usr/local/include and the .a file in /usr/local/lib.

The compiler will automatically look in /usr/local/include for headers. (Again, this only applies to Linux; type g++ -E -Wp,-v - to see the full include path.)

To tell the compiler to link with a .a in /usr/local/lib, pass the flag -l<name of library> to the compiler (type g++ -print-search-dirs to see the full link path).

For example, I recently installed FFTW on my computer. It added libfftw3.a to my /usr/local/lib directory. To link with this library I type:

$ g++ main.cc -lfftw3

Note that the lib prefix and the .a extension are omitted, there is no space between -l and fftw3, and the -lfftw3 flag comes after the source file which uses it.

IV. Inspecting object files

When researching this blog post, I found it really helpful to actually inspect the object files produced by the compiler. objdump can do this.

First, add some global variables to square.cc to make it more interesting.
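A sketch of what square.cc might look like after this change, inferred from the symbol names, sizes and contents in the objdump output below. The exact definitions are assumptions that are consistent with that output.

```cpp
// square.cc (sketch)

// A const global has internal linkage in C++, which matches the local
// symbol _ZL8greeting (14 bytes, including the terminating '\0') in .rodata.
const char greeting[] = "Hello, world!";

// A mutable, initialized global, which matches the 4-byte symbol x in .data.
int x = 3;

// Matches the symbol _Z6squarei in .text.
int square(int x) { return x * x; }
```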

To use objdump, type:

$ g++ -c square.cc
$ objdump --disassemble --full-contents --all-headers --section=.text --section=.rodata --section=.data square.o

The --disassemble flag shows the assembly for our functions; --full-contents shows the contents of each section of the object file in both hex and ASCII; --all-headers shows the symbol table and sections; and --section=.text --section=.rodata --section=.data filters the results to only include the functions we defined, global read-only data and global data.

The output is:

square.o:     file format elf64-x86-64
square.o
architecture: i386:x86-64, flags 0x00000011:
HAS_RELOC, HAS_SYMS
start address 0x0000000000000000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000010  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000004  0000000000000000  0000000000000000  00000050  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       0000000e  0000000000000000  0000000000000000  00000058  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
SYMBOL TABLE:
0000000000000000 l    d  .text	0000000000000000 .text
0000000000000000 l    d  .data	0000000000000000 .data
0000000000000000 l    d  .rodata	0000000000000000 .rodata
0000000000000000 l     O .rodata	000000000000000e _ZL8greeting
0000000000000000 g     O .data	0000000000000004 x
0000000000000000 g     F .text	0000000000000010 _Z6squarei


Contents of section .text:
 0000 554889e5 897dfc8b 45fc0faf 45fc5dc3  UH...}..E...E.].
Contents of section .data:
 0000 03000000                             ....            
Contents of section .rodata:
 0000 48656c6c 6f2c2077 6f726c64 2100      Hello, world!.  

Disassembly of section .text:

0000000000000000 <_Z6squarei>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 45 fc             	mov    -0x4(%rbp),%eax
   a:	0f af 45 fc          	imul   -0x4(%rbp),%eax
   e:	5d                   	pop    %rbp
   f:	c3                   	retq   

Disassembly of section .data:

0000000000000000 <x>:
   0:	03 00 00 00                                         ....

Disassembly of section .rodata:

0000000000000000 <_ZL8greeting>:
   0:	48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00           Hello, world!.

The first line

square.o:     file format elf64-x86-64

tells us that the file is in the Executable and Linkable Format (ELF). This is the default object code format on Linux. On macOS it is Mach-O, and on Windows object files use COFF while executables use PE (PE32+ for 64-bit).

The object file is organized into sections. Metadata about the sections is displayed in a table.

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000010  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000004  0000000000000000  0000000000000000  00000050  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       0000000e  0000000000000000  0000000000000000  00000058  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

The symbol table contains metadata about every global variable and function. For example, the entry _Z6squarei is the entry for the square function. The compiler transforms the names of functions in a process called name mangling to encode type information (and potentially other data such as which namespace the function was declared in) into the function name. This ensures that even if we declare two different functions with the same name, such as int square(int x) and double square(double x), every entry in the symbol table will have a unique name. You can add the flag --demangle to objdump to make the names more human-readable.

SYMBOL TABLE:
0000000000000000 l    d  .text	0000000000000000 .text
0000000000000000 l    d  .data	0000000000000000 .data
0000000000000000 l    d  .rodata	0000000000000000 .rodata
0000000000000000 l     O .rodata	000000000000000e _ZL8greeting
0000000000000000 g     O .data	0000000000000004 x
0000000000000000 g     F .text	0000000000000010 _Z6squarei

The contents of each of the sections are shown in hexadecimal and ASCII. The .text section contains the machine code for the square function, the .data section contains our global variable x (which has value 3) and the .rodata section (read-only data) has our greeting variable (which has value “Hello, world!”).

Contents of section .text:
 0000 554889e5 897dfc8b 45fc0faf 45fc5dc3  UH...}..E...E.].
Contents of section .data:
 0000 03000000                             ....            
Contents of section .rodata:
 0000 48656c6c 6f2c2077 6f726c64 2100      Hello, world!.  

The assembly code for the square function is shown at the end of the output.

Disassembly of section .text:

0000000000000000 <_Z6squarei>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 45 fc             	mov    -0x4(%rbp),%eax
   a:	0f af 45 fc          	imul   -0x4(%rbp),%eax
   e:	5d                   	pop    %rbp
   f:	c3                   	retq   

V.

Many statically typed languages such as C, Rust and Swift follow the same model as C++: separate compilation of source files followed by linking. This means you can call e.g. C functions from Swift by compiling the human-readable source files into object files and linking them together. It’s useful to know about linking if you want to interop between C and more modern languages.

Even if you just stick to C++, some of the darker corners of the language and the more inscrutable error messages from the compiler are due to C++’s linking model. Having a good mental model of linking and compilation can save hours of debugging.

P.S. To keep things simple I didn't discuss link-time optimization at all in this post. For a great overview of link-time optimization and its implementation in LLVM, see Teresa Johnson's talk ThinLTO: Scalable and Incremental Link-Time Optimization. (This is one of my favorite CppCon talks ever!)

Comment on Hacker News

1 Before C99, C did not require a function to be declared before it was called; C++ always has. A C program calling a function that has not been declared will not compile with a C++ compiler. ↩

Discrete Fourier analysis notes

2019-03-02T00:00:00+00:00

These are my notes on discrete Fourier analysis. It’s basically just an expanded version of the first chapter of my master’s thesis.

Network flow notes

2019-03-02T00:00:00+00:00

These are my notes on network flow. The max-flow min-cut theorem is proved at the end.

Checkmate, undefined behavior

2019-02-28T00:00:00+00:00

Undefined behavior is the bane of C and C++ programmers. The compiler can choose to do whatever it wants if a program has undefined behavior. This is normally not a good thing, but I recently wrote some code with undefined behavior and amazingly the compiler chose to do exactly what I had intended, not what I told it to do.

I have spent the last week working on a chess engine in C++. Most chess engines take advantage of the convenient coincidence that the number of squares on a chess board, 64, is the same as the word size on modern processors. So, you can do things like store the location of all the white pawns with a single 64 bit integer: you just set the i-th bit to 1 if there is a white pawn on the i-th square. This technique allows you to do neat tricks, such as move all pieces up one square by left shifting the integer by 8.

I wrote a simple utility function that takes the name of the square as a string and returns the corresponding 64 bit integer. Chess players use a simple naming convention for the squares on a chessboard: the rows are labeled 1-8 and the columns are labeled a-h, so the square in the bottom left hand corner is the a1 square.

Here is (roughly) how I implemented my string to 64 bit integer function. Can you see what’s wrong with it?

// At the top of the file.
constexpr int board_size = 8;

// algebraic_square would be one of "a1", "a2", ..., "h7", "h8".
uint64_t str_to_square(std::string_view algebraic_square) {
  const char column = algebraic_square[0];
  const char row = algebraic_square[1];
  const int column_index = column - 'a';
  const int row_index = row - 1;
  return uint64_t(1) << ((row_index + 1) * board_size - column_index - 1);
}

I forgot to put quotes around the 1 in the line const int row_index = row - 1;! Instead of subtracting the character '1', I subtracted the integer 1. Since the ASCII encoding of the character '1' is 49, the row_index is always off by 48.

This bug disturbed me, not because bugs like this are so unusual, but because none of my tests caught this and I only discovered the bug when I was tidying up some of the surrounding code. I was left shifting a 64 bit integer by at least 384 every time I called this function and yet it seemingly caused none of my tests to fail. After some investigation I concluded that for every single square on the chess board my code gave the right answer. This was unexpected to say the least.

I was already aware that left shifting off the end of a signed integer is undefined behavior, but I thought that left shifting off the end of an unsigned integer was perfectly well defined: the most significant bits just get discarded. From cppreference.com:

For unsigned a, the value of a << b is the value of a * 2^b, reduced modulo 2^N where N is the number of bits in the return type (that is, bitwise left shift is performed and the bits that get shifted out of the destination type are discarded).

According to cppreference, my function should simply push the single set bit uint64_t(1) off the end and return 0 every time. Since str_to_square clearly wasn’t doing this, my next step was to run my program with the UndefinedBehaviorSanitizer. I got the following warning.

runtime error: shift exponent 384 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')

Which confirmed that I was indeed invoking undefined behavior.

After consulting the C++ standard (something I had been trying to avoid doing) I still did not understand. Paragraph 5.8.2 says:

5.8.2 The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^E2 is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.

This paragraph only mentions undefined behavior for signed integers, but I was using unsigned integers so it shouldn’t affect me.

I was just about to give up. It was getting late, and although it was a remarkable coincidence that forgetting the quote marks didn’t affect the behavior of my program, I had already fixed the bug. Then I noticed the paragraph above 5.8.2:

5.8.1. The shift operators << and >> group left-to-right. … The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

I finally had my answer! It is undefined behavior to shift a 64 bit integer by 64 or greater.

All bets are off once your program has undefined behavior, but it was remarkable that my program was seemingly doing what I intended it to do, rather than what I had actually told it to do. I thought that left shifting by more than the “length in bits of the promoted left operand” would result in zero, but instead I was getting the correct answer each time.

To see what was going on I copied and pasted my function into compiler explorer, turned optimizations up to -O3 so the output was less noisy, and got:

str_to_square(std::basic_string_view<char, std::char_traits<char> >): # @str_to_square(std::basic_string_view<char, std::char_traits<char> >)
        movzx   eax, byte ptr [rsi]
        movzx   ecx, byte ptr [rsi + 1]
        mov     edx, 96
        sub     edx, eax
        lea     ecx, [rdx + 8*rcx]
        mov     eax, 1
        shl     rax, cl
        ret

The left shift is being done by the shl instruction. Helpfully, if you right click on an assembly instruction in compiler explorer it points you to the documentation for that instruction, which said:

The destination operand can be a register or a memory location. The count operand can be an immediate value or the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used).

Masking the count to 6 bits is the same as reducing it modulo 64. By coincidence, ((row - 1) + 1) * board_size is congruent to the correct value (row - '1' + 1) * board_size modulo 64, because (('1' - 1) * board_size) % 64 == 0.

The undefined behavior gods must have been smiling down on me.

Principal component analysis: pictures, code and proofs

2018-10-18T00:00:00+00:00

The code used to generate the plots for this post can be found here.

I.

Principal component analysis is a form of feature engineering that reduces the number of dimensions needed to represent your data. If a neural network has fewer inputs then there are fewer weights to train, which makes it easier and faster to train the model.

The data above is two dimensional, but it is “almost” one dimensional in the sense that every point is close to a line.

The first step in principal component analysis is to center the data. Given the list of 2d points, \(x_1, x_2, \dots , x_n \in \mathbb{R}^2\) we first center the data by calculating the mean \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\) and replacing each \(x_i\) with \(x_i - \overline{x}\). Now the data looks like this.

We then put the data in a matrix \(X = \begin{pmatrix} | & | & & | \\ x_1 & x_2 &\cdots & x_n \\ | & | & & |\end{pmatrix}.\) And calculate the eigenvectors and eigenvalues of the covariance matrix \(\frac{1}{n-1}XX^\top\).

The eigenvectors tell us the direction of the data. The first eigenvector in the picture above has the same slope as the data and the second eigenvector is perpendicular to the first. Now let’s scale each of the eigenvectors by its corresponding eigenvalue ¹.

And draw an ellipse around the eigenvectors.

The eigenvalues tell us how spread out the data is in the direction of that particular eigenvector. Thus we can reduce the dimension of the data by projecting onto the line given by the eigenvector with the largest eigenvalue.

The data is now one dimensional since it fits on a single line. Each point has not moved too far from its original spot, so these new points still represent the data well.

In two dimensions this is the same as projecting onto the line of best fit, but this technique generalizes. If your data is \(n\)-dimensional then PCA lets you find the best \(m\)-dimensional subspace to project the data down onto; you just project your data onto the subspace spanned by the \(m\) eigenvectors with the largest eigenvalues. If \(m \ll n\) this can compress your data a lot, and PCA guarantees that this \(m\)-dimensional subspace is optimal, in the sense that it minimizes the mean squared error between the original data points and the projected data points.

II.

The data in the plots above was generated using a random number generator. Let’s try PCA on a real dataset.

We will use the MNIST dataset, which is a collection of grayscale, 28x28 images of handwritten digits. To simplify the analysis we will discard the images of 2 through 9 and only look at images of 0 and 1. Below are some examples of the images from MNIST.

To process the images we will:

Flatten each image into a \(784 = 28\times 28\) dimensional vector.
Use PCA to project each 784-dimensional vector to a 2-dimensional vector.
Plot the 2 dimensional vectors, with images of ‘0’ in red and images of ‘1’ in blue.

The result looks like this.

You can see that the zeros are clustered to the left, and the ones are clustered to the right. We could create a reasonable classifier by drawing a vertical line at \(x = - 250\), and all we did was linearly project the raw pixels down to a two dimensional subspace!

We can project onto any number of dimensions. Here is the three dimensional projection.

III.

It’s not obvious why the eigenvalues and eigenvectors of the covariance matrix have all these useful properties. There are proofs at the end of the post, but they’re not particularly enlightening. Thankfully there’s a more intuitive way of thinking about it.

Continuing with the MNIST example, let \(p_1\) be the vector where the \(i\)-th entry is the first pixel in the \(i\)-th image. Similarly let \(p_2, p_3, \dots , p_{784}\) be the vectors consisting of the 2nd, 3rd, … , 784th pixels across all images. Then

\[XX^\top = \begin{pmatrix} \langle p_1, p_1 \rangle & \langle p_1, p_2 \rangle & \cdots & \langle p_1, p_{784} \rangle \\ \langle p_2, p_1 \rangle & \langle p_2, p_2 \rangle & \cdots& \langle p_2, p_{784} \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle p_{784}, p_1 \rangle & \langle p_{784}, p_2 \rangle & \cdots & \langle p_{784}, p_{784} \rangle \\ \end{pmatrix}.\]

This matrix can be diagonalized as \(XX^\top = UDU^{-1}\) where \(U\) is a change of basis matrix and \(D = \operatorname{diag}(\lambda_1, \cdots , \lambda_{784})\) is diagonal. We can view the change of basis as creating new features \(p_1', p_2', \dots , p_{784}'\) from the original pixels. And the diagonal matrix is the covariance matrix for these new features.

Since \(\langle p_i', p_j' \rangle = 0\) for \(i \neq j\), the new features are uncorrelated, and the variance of \(p_i'\) is (up to the \(\frac{1}{n-1}\) factor) \(\langle p_i', p_i' \rangle = \lambda_i\).
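To spell out why the diagonal matrix plays the role of the covariance matrix of the new features, write \(X' = U^\top X\) for the data expressed in the new basis (the symbol \(X'\) is introduced here only for this calculation). Since \(XX^\top\) is symmetric, \(U\) can be chosen to be orthogonal, so \(U^{-1} = U^\top\) and

\[ X'X'^\top = (U^\top X)(U^\top X)^\top = U^\top XX^\top U = U^\top (UDU^\top) U = D. \]

The rows of \(X'\) are the new features, so the \((i, j)\) entry of \(X'X'^\top\) is \(\langle p_i', p_j' \rangle\). Comparing with \(D\) gives zero off the diagonal and \(\lambda_i\) on it.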

So given a vector of pixels \(x\), we can convert \(x\) into a vector of new features \(x'\) by applying a change of basis. The eigenvalues \(\lambda_i\) are then the variances of the new features. It seems reasonable that the features with the largest variance are the most important, while the features with the smallest variance can be discarded.

IV.

Now that we have some intuition, the preceding discussion can be formalized into a theorem.

Theorem: Let \(x_1, \dots , x_n \in \mathbb{R}^d\) be a sequence of data points. Let

\[X = \begin{pmatrix} | & | & & | \\ x_1 & x_2 &\cdots & x_n \\ | & | & & |\end{pmatrix}\]

be the \(d \times n\) matrix where each column is a data point. Let \(W = XX^\top\) (the \(\frac{1}{n-1}\) factor from before does not affect the eigenvectors or the relative order of the eigenvalues). Then \(W\) is symmetric and positive semidefinite, and hence has eigenvectors \(u_1, \dots , u_d\) which form an orthonormal basis for \(\mathbb{R}^d\). Let \(\lambda_1, \dots , \lambda_d\) be the corresponding eigenvalues and without loss of generality assume \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\). The projection error for \(x_i\) onto a subspace \(V \subset \mathbb{R}^d\) is defined as \(\|x_i - P_Vx_i\|_2^2\) where \(P_V:\mathbb{R}^d \to \mathbb{R}^d\) is the projection-onto-\(V\) operator. Then for any positive integer \(m < d\) the subspace \(U_m := \operatorname{span}\{u_1, \dots , u_m\}\) minimizes the sum of the projection errors. In symbols,

\[\sum_{i=1}^n \|x_i - P_{U_m}x_i\|_2^2 = \min_{\substack{V \subset \mathbb{R}^d \\ \operatorname{dim}V = m}} \sum_{i=1}^n \|x_i - P_Vx_i\|_2^2.\]

Proof:

Fix \(m < d\) and let \(V \subset \mathbb{R}^d\) be an \(m\)-dimensional subspace. Define the \(d \times n\) error matrix \[ E = \begin{pmatrix} | & | & & |\\ x_1 - P_Vx_1 & x_2 - P_Vx_2 & \cdots & x_n - P_Vx_n\\ | & | & & | \\ \end{pmatrix} = X - P_VX. \] We want to minimize \[ \sum_{i=1}^n \|x_i - P_Vx_i\|_2^2 = \|E\|_F^2 \] where \(\|\cdot \|_F\) is the Frobenius norm. We now rewrite the error using matrix algebra \[ \begin{align}\newcommand{\tr}{\mathrm{tr}} \|E \|_F^2 &= \| X- P_VX\|_F^2 \\ &=\tr\left(( X- P_VX)( X- P_VX)^\top\right) & (\|A \|_F^2 = \tr(A^\top A)) \\ &=\tr\left(( X- P_VX)( X^\top - X^\top P_V^\top)\right) \\ &=\tr\left(XX^\top - XX^\top P_V^\top - P_VXX^\top + P_VXX^\top P_V^\top \right) \\ &=\tr\left(W- W P_V^\top - P_VW + P_VW P_V^\top \right) & (W = XX^\top)\\ &=\tr\left(W- W P_V - P_VW + P_VW P_V \right) & (P_V = P_V^\top )\\ &=\tr(W)- \tr(W P_V) - \tr(P_VW) + \tr(P_VW P_V ) \\ &=\tr(W)- \tr(P_VW ) - \tr(P_VW) + \tr(P_VW) & (\tr(AB) = \tr(BA) \text{ and } P_V^2 = P_V)\\ &=\tr(W)- \tr(P_VW). \end{align} \]

The quantity \(\mathrm{tr}(W)\) is a constant, so minimizing \(\|E \|_F^2\) is the same as maximizing \(\tr(P_VW)\). Let \(\{v_1, \dots , v_m\} \subset \mathbb{R}^d\) be an orthonormal basis for \(V\). Then \[ P_V = \sum_{i = 1}^m v_iv_i^\top \] so \[ \begin{align}\newcommand{\tr}{\mathrm{tr}} \tr(P_VW) &= \tr\left(\sum_{i = 1}^m v_iv_i^\top W \right) \\ &= \sum_{i=1}^m \tr\left(v_iv_i^\top W\right) \\ &= \sum_{i=1}^m \tr(v_i^\top W v_i) & (\tr(AB) = \tr(BA)). \end{align} \]

Let \[ U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 &\cdots & u_d \\ | & | & & |\end{pmatrix} \] where the \(u_i \in \mathbb{R}^d\) are the eigenvectors of \(W\) as stated in the theorem. The matrix \(U\) diagonalizes \(W\) so \(W = UDU^{-1} = UDU^\top\) where \[D = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_d \end{pmatrix}. \] Now \[ \begin{align}\newcommand{\tr}{\mathrm{tr}} \tr(P_VW) &= \sum_{i=1}^m \tr(v_i^\top W v_i) \\ &= \sum_{i=1}^m \tr(v_i^\top UDU^\top v_i) \\ &= \sum_{i=1}^m \tr((U^\top v_i)^\top D (U^\top v_i)). \end{align} \]

If \(v_i = u_i\) for all \(1 \leq i \leq m\) then \[U^\top v_i = U^\top u_i = (0, \dots, 0, 1, 0, \dots, 0)^\top\] is the \(i\)-th standard basis vector. Thus \[ \begin{align}\newcommand{\tr}{\mathrm{tr}} \tr(P_VW) &= \sum_{i=1}^m \tr((U^\top v_i)^\top D (U^\top v_i)) \\ &= \sum_{i=1}^m \lambda_i. \end{align} \]

Therefore it suffices to show that \(\mathrm{tr}(P_VW) \leq \sum_{i=1}^m \lambda_i\) for all \(m\)-dimensional subspaces \(V\).

We will show this is true in the case \(m = 2\), i.e. \(\mathrm{tr}(P_VW) \leq \lambda_1 + \lambda_2\) when \(V\) is 2 dimensional. The case \(m > 2\) uses the same argument but it is more notationally heavy. Let \(\alpha = U^\top v_1 \in \mathbb{R}^d\) and \(\beta =U^\top v_2 \in \mathbb{R}^d\). Note that since \(U\) is unitary \(\|\alpha\|_2^2 = \|\beta\|_2^2 = 1\) and \(\langle \alpha, \beta \rangle = 0\).

The first step is to show that \(\alpha_i^2 + \beta_i^2 \leq 1\) for all \(i\). Let \(e_i = (0, \dots , 0, 1, 0, \dots , 0)\) be the \(i\)-th standard basis vector. Since \(\alpha\) and \(\beta\) are orthogonal and have length 1, the projection of \(e_i\) onto \(\operatorname{span}\{\alpha, \beta \}\) is given by \[\hat{e}_i = \langle e_i, \alpha \rangle \alpha + \langle e_i, \beta \rangle \beta = \alpha_i \alpha + \beta_i \beta .\] Then \[ \alpha_i^2 + \beta_i^2 = \|\hat{e}_i\|_2^2 \leq \|e_i\|_2^2 = 1 \] since a projected vector always has length less than or equal to the original vector.

The second step is to observe that \(\sum_{i=1}^d (\alpha_i^2 + \beta_i^2) = \|\alpha \|_2^2 + \|\beta \|_2^2 = 2\).

Finally, we want to maximize \[ \mathrm{tr}(P_VW) = \sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2) \] and we know that \[\alpha_i^2 + \beta_i^2 \leq 1 \text{ and } \sum_{i=1}^d(\alpha_i^2 + \beta_i^2) = 2 .\]

The eigenvalues of a positive semidefinite matrix are nonnegative, and we have sorted them in decreasing order, so the sum \(\sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2)\) is maximized when the first and second coefficients are as large as possible, i.e. when \(\alpha_1^2 + \beta_1^2 = \alpha_2^2 + \beta_2^2 = 1\). But then the second condition implies that \(\alpha_i^2 + \beta_i^2 = 0\) for \(i > 2\). Thus \[ \mathrm{tr}(P_VW) = \sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2) \leq \lambda_1 + \lambda_2. \] \(\square\)

We also need to prove that the size of the eigenvalue is proportional to the variance in the direction of the corresponding eigenvector.

Theorem: As in the previous theorem let \(X = \begin{pmatrix}x_1 & x_2 & \cdots & x_n \end{pmatrix}\) be the data matrix, \(W = XX^\top\) the covariance matrix, \(u_1, \dots , u_d\) the eigenvectors of \(W\) and \(\lambda_1, \dots , \lambda_d\) the eigenvalues. Let \(P_{u_i}: \mathbb{R}^d \to \mathbb{R}^d\) be the projection operator onto the subspace \(\mathrm{span}\{u_i\}\). Then \[ \sum_{j=1}^n\|P_{u_i}x_j\|_2^2 = \lambda_i. \]

Proof:

The working is similar to the previous proof so I'll omit some steps. \[ \begin{align} \sum_{j=1}^n\|P_{u_i}x_j\|_2^2 &= \|u_iu_i^\top X\|_F^2 \\ &= \mathrm{tr}((u_iu_i^\top X)(u_iu_i^\top X)^\top) \\ &= \mathrm{tr}(u_i^\top W u_i) \\ &= \mathrm{tr}((U^\top u_i)^\top D (U^\top u_i)) \\ &= \lambda_i . \end{align} \]

Comment on Hacker News.

1 I actually scaled by two times the square root of the eigenvalue. The eigenvalue tells you the variance and I wanted the standard deviation. I multiplied by two so that the ellipse would capture most of the data. ↩