Discrete Fourier analysis notes (2019-03-02)

<p><a href="/assets/notes/discrete_fourier_analysis.pdf">These</a> are my notes on discrete Fourier analysis. It’s basically just an expanded version of the first chapter of my master’s thesis.</p>

Network flow notes (2019-03-02)

<p><a href="/assets/notes/network_flow.pdf">These</a> are my notes on network flow. The max-flow min-cut theorem is proved at the end.</p>

Checkmate, undefined behavior (2019-02-28)

<p>Undefined behavior is the bane of C and C++ programmers. The compiler can choose to do whatever it wants if a program has undefined behavior. This is normally not a good thing, but I recently wrote some code with undefined behavior, and amazingly the compiler chose to do exactly what I had intended, not what I told it to do.</p>
<p>I have spent the last week working on a <a href="https://github.com/joelypoley/pawn_grabber">chess engine</a> in C++. Most chess engines take advantage of the convenient coincidence that the number of squares on a chess board, 64, is the same as the word size on modern processors. So you can do things like store the locations of all the white pawns in a single 64-bit integer: you just set the i-th bit to 1 if there is a white pawn on the i-th square. This technique allows you to do neat tricks, such as moving every piece up one square by left shifting the integer by 8.</p>
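<p>Here is the pawn-shifting trick in a few lines of Python (the square numbering and the starting-rank mask are assumptions for the sketch; Python integers are arbitrary precision, so the 64-bit wraparound is applied by hand):</p>

```python
# One bit per square: bit i is set iff a white pawn sits on square i.
# (Hypothetical numbering: bits 8..15 are the pawns' starting rank.)
white_pawns = 0x000000000000FF00

# "Move every pawn up one square" is a single left shift by 8.
# Mask to 64 bits by hand to mimic a C++ uint64_t.
moved = (white_pawns << 8) & 0xFFFFFFFFFFFFFFFF
print(hex(moved))  # 0xff0000
```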
<p>I wrote a simple utility function that takes the name of a square as a string and returns the corresponding 64-bit integer. Chess players use a simple naming convention for the squares on a chessboard: the rows are labeled 1-8 and the columns are labeled a-h, so the square in the bottom-left corner is the a1 square.</p>
<p><img src="/assets/algebraic_notation.png" alt="chessboard" /></p>
<p>Here is (roughly) how I implemented my string to 64-bit integer function. Can you see what’s wrong with it?</p>
<figure class="highlight"><pre><code class="language-c++" data-lang="c++"><span class="c1">// At the top of the file.
</span><span class="k">constexpr</span> <span class="kt">int</span> <span class="n">board_size</span> <span class="o">=</span> <span class="mi">8</span><span class="p">;</span>
<span class="c1">// algebraic_square would be one of "a1", "a2", ..., "h7", "h8".
</span><span class="kt">uint64_t</span> <span class="nf">str_to_square</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string_view</span> <span class="n">algebraic_square</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">char</span> <span class="n">column</span> <span class="o">=</span> <span class="n">algebraic_square</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="k">const</span> <span class="kt">char</span> <span class="n">row</span> <span class="o">=</span> <span class="n">algebraic_square</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">column_index</span> <span class="o">=</span> <span class="n">column</span> <span class="o">-</span> <span class="sc">'a'</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">row_index</span> <span class="o">=</span> <span class="n">row</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">return</span> <span class="kt">uint64_t</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="p">((</span><span class="n">row_index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">board_size</span> <span class="o">-</span> <span class="n">column_index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>I forgot to put quotes around the <code class="highlighter-rouge">1</code> in the line <code class="highlighter-rouge">const int row_index = row - 1;</code>! Instead of subtracting the character <code class="highlighter-rouge">'1'</code>, I subtracted the integer <code class="highlighter-rouge">1</code>. Since the ASCII encoding of the character <code class="highlighter-rouge">'1'</code> is 49, the <code class="highlighter-rouge">row_index</code> is always off by 48.</p>
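<p>The size of the offset is easy to check (a quick Python sketch of the arithmetic):</p>

```python
# '1' has ASCII code 49, so subtracting the integer 1 instead of the
# character '1' leaves the computed row index too large by 48.
offset = ord('1') - 1
print(offset)  # 48
```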
<p>This bug disturbed me, not because bugs like this are so unusual, but because none of my tests caught it, and I only discovered the bug when I was tidying up some of the surrounding code. I was left shifting a 64-bit integer by at least 384 every time I called this function, and yet it seemingly caused none of my tests to fail. After some investigation I concluded that for <em>every</em> single square on the chess board my code gave the right answer. This was unexpected, to say the least.</p>
<p>I was already aware that left shifting off the end of a <em>signed</em> integer is undefined behavior, but I thought that left shifting off the end of an unsigned integer was perfectly well defined: the most significant bits just get discarded. From <a href="https://en.cppreference.com/w/">cppreference.com</a>:</p>
<blockquote>
<p>For unsigned a, the value of a << b is the value of a * 2<sup>b</sup>, reduced modulo 2<sup>N</sup> where N is the number of bits in the return type (that is, bitwise left shift is performed and the bits that get shifted out of the destination type are discarded).</p>
</blockquote>
<p>According to cppreference, my function should simply push the single set bit <code class="highlighter-rouge">uint64_t(1)</code> off the end and return 0 every time. Since <code class="highlighter-rouge">str_to_square</code> clearly wasn’t doing this, my next step was to run my program with the <a href="https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html">UndefinedBehaviorSanitizer</a>. I got the following warning.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>runtime error: shift exponent 384 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')
</code></pre></div></div>
<p>This confirmed that I was indeed invoking undefined behavior.</p>
<p>After consulting the <a href="http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2014/n4296.pdf">C++ standard</a> (something I had been trying to avoid doing) I still did not understand. Paragraph 5.8.2 says:</p>
<blockquote>
<p>5.8.2 The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2<sup>E2</sup>, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2<sup>E2</sup> is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.</p>
</blockquote>
<p>This paragraph only mentions undefined behavior for signed integers, but I was using unsigned integers, so it should not have applied to me.</p>
<p>I was just about to give up. It was getting late, and although it was a remarkable coincidence that forgetting the quote marks didn’t affect the behavior of my program, I had already fixed the bug. Then I noticed the paragraph above 5.8.2:</p>
<blockquote>
<p>5.8.1. The shift operators << and >> group left-to-right. … The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.</p>
</blockquote>
<p>I finally had my answer! It is undefined behavior to shift a 64-bit integer by 64 or greater.</p>
<p>All bets are off once your program has undefined behavior, but it was remarkable that my program was seemingly doing what I intended it to do, rather than what I had actually told it to do. I thought that left shifting by more than the “length in bits of the promoted left operand” would result in zero, but instead I was getting the correct answer each time.</p>
<p>To see what was going on I copy and pasted my function into <a href="https://godbolt.org/z/z1Vobs">compiler explorer</a>, turned optimizations up to <code class="highlighter-rouge">-O3</code> so the output was less noisy, and got:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>str_to_square(std::basic_string_view<char, std::char_traits<char> >): # @str_to_square(std::basic_string_view<char, std::char_traits<char> >)
movzx eax, byte ptr [rsi]
movzx ecx, byte ptr [rsi + 1]
mov edx, 96
sub edx, eax
lea ecx, [rdx + 8*rcx]
mov eax, 1
shl rax, cl
ret
</code></pre></div></div>
<p>The left shift is being done by the <code class="highlighter-rouge">shl</code> instruction. Helpfully, if you right click on an assembly instruction in compiler explorer it points you to the documentation for that instruction, which said:</p>
<blockquote>
<p>The destination operand can be a register or a memory location. The count operand can be an immediate value or the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used).</p>
</blockquote>
<p>Masking to 6 bits is the same as reducing modulo 64, and by coincidence <code class="highlighter-rouge">((row - 1) + 1) * board_size</code> is the same as the correct value <code class="highlighter-rouge">(row - '1' + 1) * board_size</code> modulo 64 (because <code class="highlighter-rouge">(('1' - 1) * board_size) % 64 == 0</code>).</p>
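<p>The coincidence can be checked for the whole board with a short sketch (a Python re-creation of the shift arithmetic, not the engine code itself):</p>

```python
BOARD_SIZE = 8

def shift_amount(square, buggy=False):
    """Recompute str_to_square's shift amount, with or without the bug."""
    column_index = ord(square[0]) - ord('a')
    row_index = ord(square[1]) - (1 if buggy else ord('1'))
    return (row_index + 1) * BOARD_SIZE - column_index - 1

for column in 'abcdefgh':
    for row in '12345678':
        square = column + row
        # The buggy shift exceeds the correct one by exactly 48 * 8 = 384,
        # so after the processor's mod-64 masking the two always agree.
        assert shift_amount(square, buggy=True) - shift_amount(square) == 384
        assert shift_amount(square, buggy=True) % 64 == shift_amount(square)
```

Every one of the 64 squares passes, which is why none of my tests failed.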
<p>The undefined behavior gods must have been smiling down on me.</p>

Principal component analysis: pictures, code and proofs (2018-10-18)

<p><small>The code used to generate the plots for this post can be found <a href="https://github.com/joelypoley/joelypoley.github.io/blob/master/assets/notebooks/pca.ipynb">here</a>.</small></p>
<h2 id="i">I.</h2>
<p>Principal component analysis is a form of <a href="https://en.wikipedia.org/wiki/Feature_engineering">feature engineering</a> that reduces the number of dimensions needed to represent your data. If a neural network has fewer inputs then there are fewer weights to train, which makes the model easier and faster to train.</p>
<p><img src="/assets/plot1.png" alt="scatter plot" /></p>
<p>The data above is two dimensional, but it is “almost” one dimensional in the sense that every point is close to a line.</p>
<p><img src="/assets/plot2.png" alt="scatter plot" /></p>
<p>The first step in principal component analysis is to center the data. Given the list of 2d points, \(x_1, x_2, \dots , x_n \in \mathbb{R}^2\) we first center the data by calculating the mean \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\) and replacing each \(x_i\) with \(x_i - \overline{x}\). Now the data looks like this.</p>
<p><img src="/assets/plot3.png" alt="scatter plot" /></p>
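<p>In code, the centering step might look like this (a numpy sketch with stand-in random data, since the post’s exact dataset isn’t given):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 2-d data: 100 points stored as the columns of a 2 x 100 matrix,
# deliberately shifted away from the origin.
x = rng.normal(size=(2, 100)) + np.array([[3.0], [-1.0]])

x_bar = x.mean(axis=1, keepdims=True)   # the mean of the points
x_centered = x - x_bar                  # replace each x_i with x_i - x_bar

# After centering, the mean of each coordinate is (numerically) zero.
print(np.allclose(x_centered.mean(axis=1), 0))  # True
```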
<p>We then put the data in a matrix
<script type="math/tex">% <![CDATA[
X = \begin{pmatrix}
| & | & & | \\
x_1 & x_2 &\cdots & x_n \\
| & | & & |\end{pmatrix} %]]></script>
and calculate the eigenvectors and eigenvalues of the <em>covariance matrix</em> <script type="math/tex">\frac{1}{n-1}XX^\top</script>.</p>
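<p>Sketched in numpy (the nearly one-dimensional stand-in data, scattered around the line \(y = x/2\), is an assumption in place of the post’s data):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# "Almost one-dimensional" data along the line y = x / 2, plus a little noise.
t = rng.normal(size=n)
X = np.vstack([t, t / 2 + 0.05 * rng.normal(size=n)])
X = X - X.mean(axis=1, keepdims=True)            # center first

cov = (X @ X.T) / (n - 1)                        # the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalue order

# The eigenvector with the largest eigenvalue points along the data,
# so its slope should be close to 1/2 for this data.
principal = eigenvectors[:, -1]
print(principal[1] / principal[0])
```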
<p><img src="/assets/plot4.png" alt="scatter plot" /></p>
<p>The eigenvectors tell us the <em>direction</em> of the data. The first eigenvector in the picture above has the same slope as the data and the second eigenvector is perpendicular to the first. Now let’s scale each of the eigenvectors by its corresponding eigenvalue <sup id="a1"><a href="#f1">1</a></sup>.</p>
<p><img src="/assets/plot5.png" alt="scatter plot" /></p>
<p>And draw an ellipse around the eigenvectors.</p>
<p><img src="/assets/plot6.png" alt="scatter plot" /></p>
<p>The eigenvalues tell us how spread out the data is in the direction of the corresponding eigenvector. Thus we can reduce the dimension of the data by projecting onto the line spanned by the eigenvector with the largest eigenvalue.</p>
<p><img src="/assets/plot7.png" alt="scatter plot" /></p>
<p>The data is now one dimensional since it fits on a single line. Each point has not moved too far from its original spot, so these new points still represent the data well.</p>
<p>In two dimensions this is the same as projecting onto the line of best fit, but this technique generalizes. If your data is \(n\)-dimensional then PCA lets you find the best \(m\)-dimensional subspace to project the data down onto; you just project your data onto the subspace spanned by the \(m\) eigenvectors with the largest eigenvalues. If \(m \ll n\) this can compress your data a lot, and PCA guarantees that this \(m\)-dimensional subspace is optimal, in the sense that it minimizes the mean squared error between the original data points and the projected data points.</p>
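<p>The whole procedure fits in a few lines of numpy. A minimal sketch (the function name <code class="highlighter-rouge">pca_project</code> is mine, not from the notebook):</p>

```python
import numpy as np

def pca_project(X, m):
    """Project the columns of the d x n data matrix X onto the span of the
    m eigenvectors of the (centered) covariance matrix with the largest
    eigenvalues, returning the m x n matrix of new coordinates."""
    Xc = X - X.mean(axis=1, keepdims=True)
    _, U = np.linalg.eigh(Xc @ Xc.T)   # eigh returns ascending eigenvalues
    top = U[:, ::-1][:, :m]            # the m leading eigenvectors
    return top.T @ Xc
```

<p>For the two-dimensional example above, <code class="highlighter-rouge">pca_project(X, 1)</code> gives each point’s coordinate along the line of best fit.</p>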
<h2 id="ii">II.</h2>
<p>The data in the plots above was generated using a random number generator. Let’s try PCA on a real dataset.</p>
<p>We will use the MNIST dataset, which is a collection of grayscale, 28x28 images of handwritten digits. To simplify the analysis we will discard the images of 2, 3, 4, 5, 6, 7, 8, 9 and only look at images of 0 and 1. Below are some examples of the images from MNIST.</p>
<p><img src="/assets/plot8.png" alt="scatter plot" /></p>
<p>To process the images we will:</p>
<ul>
<li>Flatten each image into a <script type="math/tex">784 = 28\times 28</script> dimensional vector.</li>
<li>Use PCA to project each 784-dimensional vector to a 2-dimensional vector.</li>
<li>Plot the 2 dimensional vectors, with images of ‘0’ in red and images of ‘1’ in blue.</li>
</ul>
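<p>The steps above can be sketched as follows (with random arrays standing in for the MNIST images, since the loading code is beside the point):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MNIST: 100 fake 28x28 "images" (real loading code omitted).
images = rng.random((100, 28, 28))

X = images.reshape(100, -1).T            # flatten: 784 x 100 data matrix
Xc = X - X.mean(axis=1, keepdims=True)   # center
_, U = np.linalg.eigh(Xc @ Xc.T)         # eigenvectors, ascending eigenvalues
coords = U[:, ::-1][:, :2].T @ Xc        # project: one 2-d point per image
print(coords.shape)  # (2, 100)
```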
<p>The result looks like this.</p>
<p><img src="/assets/plot9.png" alt="scatter plot" /></p>
<p>You can see that the zeros are clustered to the left, and the ones are clustered to the right. We could create a reasonable classifier by drawing a vertical line at \(x = - 250\), and all we did was linearly project the raw pixels down to a two dimensional subspace!</p>
<p>We can project onto any number of dimensions. Here is the three dimensional projection.</p>
<p><img src="/assets/plot11.png" alt="scatter plot" /></p>
<h2 id="iii">III.</h2>
<p>It’s not obvious why the eigenvalues and eigenvectors of the covariance matrix have all these useful properties. There are proofs at the end of the post, but they’re not particularly enlightening. Thankfully there’s a more intuitive way of thinking about it.</p>
<p>Continuing with the MNIST example, let \(p_1\) be the vector where the \(i\)-th entry is the first pixel in the \(i\)-th image. Similarly let \(p_2, p_3, \dots , p_{784}\) be the vectors consisting of the 2nd, 3rd, … , 784th pixels across all images. Then</p>
<script type="math/tex; mode=display">% <![CDATA[
XX^\top =
\begin{pmatrix}
\langle p_1, p_1 \rangle & \langle p_1, p_2 \rangle & \cdots & \langle p_1, p_{784} \rangle \\
\langle p_2, p_1 \rangle & \langle p_2, p_2 \rangle & \cdots& \langle p_2, p_{784} \rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle p_{784}, p_1 \rangle & \langle p_{784}, p_2 \rangle & \cdots & \langle p_{784}, p_{784} \rangle \\
\end{pmatrix}. %]]></script>
<p>This matrix can be diagonalized \(XX^\top = UDU^{-1}\) where \(U\) is a change of basis matrix and \(D = \operatorname{diag}(\lambda_1, \cdots , \lambda_{784})\) is diagonal.
We can view the change of basis as creating new features \(p_1’, p_2’, \dots , p_{784}’\) from the original pixels. And the diagonal matrix is the covariance matrix for these new features.</p>
<p>Since \(\langle p_i’, p_j’ \rangle = 0\) for \(i \neq j\) the new features are uncorrelated, and the variance of \(p_i’\) is \(\langle p_i’, p_i’ \rangle = \lambda_i\).</p>
<p>So given a vector of pixels \(x\), we can convert \(x\) into a vector of new features \(x’\) by applying a change of basis. The eigenvalues \(\lambda_i\) are then the variances of the new features, so it seems reasonable that the features with the largest variance are the most important, while the features with the smallest variance can be discarded.</p>
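<p>This can be spot-checked numerically (a numpy sketch with synthetic data):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)

W = X @ X.T
w, U = np.linalg.eigh(W)

# Change of basis: the rows of U^T X are the new features p_i'.
Xp = U.T @ X

# Their Gram matrix is diagonal: the new features are uncorrelated,
# and the i-th one has <p_i', p_i'> = lambda_i.
print(np.allclose(Xp @ Xp.T, np.diag(w)))  # True
```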
<h2 id="iv">IV.</h2>
<p>Now that we have some intuition, the preceding discussion can be formalized into a theorem.</p>
<p><strong>Theorem:</strong> Let \(x_1, \dots , x_n \in \mathbb{R}^d\) be a sequence of data points.
Let</p>
<script type="math/tex; mode=display">% <![CDATA[
X = \begin{pmatrix}
| & | & & | \\
x_1 & x_2 &\cdots & x_n \\
| & | & & |\end{pmatrix} %]]></script>
<p>be the \(d \times n\) matrix where each column is a data point.
Let \(W = XX^\top\) (the \(\frac{1}{n-1}\) factor from before does not affect the eigenvectors or the relative order of the eigenvalues).
Then \(W\) is <a href="https://en.wikipedia.org/wiki/Positive-definite_matrix#Positive_semidefinite">positive semidefinite</a> and hence has eigenvectors \(u_1, \dots , u_d\) which form an <a href="https://en.wikipedia.org/wiki/Orthonormal_basis">orthonormal basis</a> for \(\mathbb{R}^d\).
Let \(\lambda_1, \dots , \lambda_d\) be the corresponding eigenvalues and without loss of generality assume \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\).
The <em>projection error</em> for \(x_i\) onto a subspace \(V \subset \mathbb{R}^d\) is defined as <script type="math/tex">\|x_i - P_Vx_i\|_2^2</script> where \(P_V:\mathbb{R}^d \to \mathbb{R}^d\) is the projection-onto-\(V\) operator.
Then for any positive integer \(m < d\) the subspace <script type="math/tex">U_m := \operatorname{span}\{u_1, \dots , u_m\}</script> minimizes the sum of the projection errors. In symbols,</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \|x_i - P_{U_m}x_i\|_2^2 = \min_{\substack{V \subset \mathbb{R}^d \\ \operatorname{dim}V = m}} \sum_{i=1}^n \|x_i - P_Vx_i\|_2^2.</script>
<p><em>Proof:</em></p>
<p>Fix \(m < d\) and let \(V \subset \mathbb{R}^d\) be an \(m\)-dimensional subspace. Define the \(d \times n\) error matrix
\[
E =
\begin{pmatrix}
| & | & & |\\
x_1 - P_Vx_1 & x_2 - P_Vx_2 & \cdots & x_n - P_Vx_n\\
| & | & & | \\
\end{pmatrix}
= X - P_VX.
\]
We want to minimize
\[
\sum_{i=1}^n \|x_i - P_Vx_i\|_2^2 = \|E\|_F^2
\]
where \(\|\cdot \|_F\) is the <a href="https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm">Frobenius norm</a>.
We now rewrite the error using matrix algebra
\[
\begin{align}\newcommand{\tr}{\mathrm{tr}}
\|E \|_F^2
&= \| X- P_VX\|_F^2 \\
&=\tr\left(( X- P_VX)( X- P_VX)^\top\right) & (\|A \|_F^2 = \tr(A^\top A)) \\
&=\tr\left(( X- P_VX)( X^\top - X^\top P_V^\top)\right) \\
&=\tr\left(XX^\top - XX^\top P_V^\top - P_VXX^\top + P_VXX^\top P_V^\top \right) \\
&=\tr\left(W- W P_V^\top - P_VW + P_VW P_V^\top \right) & (W = XX^\top)\\
&=\tr\left(W- W P_V - P_VW + P_VW P_V \right) & (P_V = P_V^\top )\\
&=\tr(W)- \tr(W P_V) - \tr(P_VW) + \tr(P_VW P_V ) \\
&=\tr(W)- \tr(P_VW ) - \tr(P_VW) + \tr(P_VW) & (\tr(AB) = \tr(BA) \text{ and } P_V^2 = P_V)\\
&=\tr(W)- \tr(P_VW).
\end{align}
\]</p>
<p>The quantity \(\mathrm{tr}(W)\) is a constant, so minimizing \(\|E \|_F^2\) is the same as maximizing \(\tr(P_VW)\). Let \(\{v_1, \dots , v_m\} \subset \mathbb{R}^d\) be an orthonormal basis for \(V\). Then
\[
P_V = \sum_{i = 1}^m v_iv_i^\top
\]
so
\[
\begin{align}\newcommand{\tr}{\mathrm{tr}}
\tr(P_VW)
&= \tr\left(\sum_{i = 1}^m v_iv_i^\top W \right) \\
&= \sum_{i=1}^m \tr\left(v_iv_i^\top W\right) \\
&= \sum_{i=1}^m \tr(v_i^\top W v_i) & (\tr(AB) = \tr(BA)).
\end{align}
\]</p>
<p>Let
\[
U = \begin{pmatrix}
| & | & & | \\
u_1 & u_2 &\cdots & u_d \\
| & | & & |\end{pmatrix}
\]
where the \(u_i \in \mathbb{R}^d\) are the eigenvectors of \(W\) as stated in the theorem. The matrix \(U\) diagonalizes \(W\) so \(W = UDU^{-1} = UDU^\top\) where
\[D = \begin{pmatrix}
\lambda_1 & 0 & \dots & 0 \\
0 & \lambda_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \lambda_d
\end{pmatrix}.
\]
Now
\[
\begin{align}\newcommand{\tr}{\mathrm{tr}}
\tr(P_VW)
&= \sum_{i=1}^m \tr(v_i^\top W v_i) \\
&= \sum_{i=1}^m \tr(v_i^\top UDU^\top v_i) \\
&= \sum_{i=1}^m \tr((U^\top v_i)^\top D (U^\top v_i)).
\end{align}
\]</p>
<p>If \(v_i = u_i\) for all \(1 \leq i \leq m\) then
\[U^\top v_i = U^\top u_i = (0, \dots, 0, 1, 0, \dots, 0)^\top\]
is the \(i\)-th standard basis vector. Thus
\[
\begin{align}\newcommand{\tr}{\mathrm{tr}}
\tr(P_VW)
&= \sum_{i=1}^m \tr((U^\top v_i)^\top D (U^\top v_i)) \\
&= \sum_{i=1}^m \lambda_i
\end{align}
\]</p>
<p>Therefore it suffices to show that \(\mathrm{tr}(P_VW) \leq \sum_{i=1}^m \lambda_i\) for all \(m\)-dimensional subspaces \(V\).</p>
<p>We will show this is true in the case \(m = 2\), i.e. \(\mathrm{tr}(P_VW) \leq \lambda_1 + \lambda_2\) when \(V\) is 2-dimensional. The case \(m > 2\) uses the same argument but is notationally heavier. Let \(\alpha = U^\top v_1 \in \mathbb{R}^d\) and \(\beta =U^\top v_2 \in \mathbb{R}^d\). Note that since \(U\) is orthogonal, \(\|\alpha\|_2^2 = \|\beta\|_2^2 = 1\) and \(\langle \alpha, \beta \rangle = 0\).</p>
<p>The first step is to show that \(\alpha_i^2 + \beta_i^2 \leq 1\) for all \(i\). Let \(e_i = (0, \dots , 0, 1, 0, \dots , 0)\) be the \(i\)-th standard basis vector. Since \(\alpha\) and \(\beta\) are orthogonal and have length 1, the projection of \(e_i\) onto \(\operatorname{span}\{\alpha, \beta \}\) is given by
\[\hat{e}_i = \langle e_i, \alpha \rangle \alpha + \langle e_i, \beta \rangle \beta = \alpha_i \alpha + \beta_i \beta .\]
Then
\[ \alpha_i^2 + \beta_i^2 = \|\hat{e_i}\|_2^2 \leq \|e_i\|_2^2 = 1 \]
since a projected vector always has length less than or equal to the original vector.</p>
<p>The second step is to observe that \(\sum_{i=1}^d (\alpha_i^2 + \beta_i^2) = \|\alpha \|_2^2 + \|\beta \|_2^2 = 2\).</p>
<p>Finally, we want to maximize
\[ \mathrm{tr}(P_VW) = \sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2) \]
and we know that
\[\alpha_i^2 + \beta_i^2 \leq 1 \text{ and } \sum_{i=1}^d(\alpha_i^2 + \beta_i^2) = 2 .\]</p>
<p>The eigenvalues of a positive semidefinite matrix are nonnegative, so the sum \(\sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2)\) is maximized when the first and second coefficients are as large as possible, i.e. when \(\alpha_1^2 + \beta_1^2 = \alpha_2^2 + \beta_2^2 = 1\). But then the second condition implies that \(\alpha_i^2 + \beta_i^2 = 0\) for \(i > 2\). Thus
\[ \mathrm{tr}(P_VW) = \sum_{i=1}^d \lambda_i(\alpha_i^2 + \beta_i^2) \leq \lambda_1 + \lambda_2. \]
\(\square\)</p>
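<p>As a sanity check, we can compare the subspace spanned by the leading eigenvectors against random \(m\)-dimensional subspaces (a numpy sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 5, 80, 2
X = rng.normal(size=(d, n))

W = X @ X.T
w, U = np.linalg.eigh(W)                 # ascending eigenvalues
top = U[:, ::-1][:, :m]                  # the m leading eigenvectors
best_err = ((X - top @ top.T @ X) ** 2).sum()

# No other m-dimensional subspace should give a smaller total
# projection error.
for _ in range(100):
    Q, _ = np.linalg.qr(rng.normal(size=(d, m)))  # random orthonormal basis
    err = ((X - Q @ Q.T @ X) ** 2).sum()
    assert best_err <= err + 1e-9
```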
<p>We also need to prove that the size of the eigenvalue is proportional to the variance in the direction of the corresponding eigenvector.</p>
<p><strong>Theorem:</strong> As in the previous theorem let \(X = \begin{pmatrix}x_1 & x_2 & \cdots & x_n \end{pmatrix}\) be the data matrix, \(W = XX^\top\) the covariance matrix, \(u_1, \dots , u_d\) the eigenvectors of \(W\) and \(\lambda_1, \dots , \lambda_d\) the eigenvalues. Let \(P_{u_i}: \mathbb{R}^d \to \mathbb{R}^d\) be the projection operator onto the subspace \(\mathrm{span}\{u_i\}\). Then \[
\sum_{j=1}^n\|P_{u_i}x_j\|_2^2 = \lambda_i .
\] </p>
<p><em>Proof:</em> </p>
<p>
The working is similar to the previous proof so I'll omit some steps.
\[
\begin{align}
\sum_{j=1}^n\|P_{u_i}x_j\|_2^2
&= \|u_iu_i^\top X\|_F^2 \\
&= \mathrm{tr}((u_iu_i^\top X)(u_iu_i^\top X)^\top) \\
&= \mathrm{tr}(u_i^\top W u_i) \\
&= \mathrm{tr}((U^\top u_i)^\top D (U^\top u_i)) \\
&= \lambda_i .
\end{align}
\]
</p>
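<p>This theorem is also easy to spot-check numerically (numpy sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))
W = X @ X.T
w, U = np.linalg.eigh(W)

# The summed squared projections of the data onto each eigenvector
# equal the corresponding eigenvalue.
for i in range(4):
    u = U[:, i:i + 1]                    # u_i as a column vector
    projections = u @ u.T @ X            # P_{u_i} applied to every x_j
    assert np.isclose((projections ** 2).sum(), w[i])
```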
<p><a href="https://news.ycombinator.com/item?id=19356584">Comment on Hacker News.</a></p>
<hr />
<p><b id="f1">1</b> I actually scaled by two times the square root of the eigenvalue. The eigenvalue tells you the variance and I wanted the standard deviation. I multiplied by two so that the ellipse would capture most of the data. <a href="#a1">↩</a></p>