In the last video, we determined the coordinates of the optimal projection with respect to the orthonormal basis that spans our principal subspace. Before we go on and determine the optimal basis vectors, let's rephrase our loss function first. This will make it much easier to find our basis vectors. I have copied over the results that we have so far: the description of our projected data point, our loss function, the partial derivative of our loss function with respect to our projected data point, and the optimal coordinates that we found in the last video.
For this, let's have a closer look at the difference vector between our original data point and our projected data point. Xn tilde is given by equation A, which is the sum over j = 1 to M of beta jn times bj. If we now use the results for our optimal beta jn parameters from here, we get the sum over j = 1 to M of Xn transpose times bj, times bj, where we used equation D. Now, we rewrite this in the following way. Xn transpose times bj is just a scalar, a dot product in this particular case. Dot products are symmetric, so we can swap the order, and we can also move the scalar over here. So each term becomes bj times bj transpose times Xn, and the whole expression we can write as the sum over j = 1 to M of bj times bj transpose, all multiplied by Xn, where we moved the Xn out of the sum. And if we look at this, this is a projection matrix. So, this means that Xn tilde is the orthogonal projection of Xn onto the subspace spanned by the M basis vectors bj, where j = 1 to M.
Similarly, because b1 to bD form an orthonormal basis of the whole D-dimensional space, we can write Xn as the sum over j = 1 to M of bj times bj transpose times Xn, plus a term that runs from M + 1 to D of bj times bj transpose times Xn. So, we write Xn as a projection onto the principal subspace plus a projection onto the orthogonal complement. And this second term is the one that is missing over here. That's the reason why Xn tilde is only an approximation to Xn.
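
To make the projection-matrix view concrete, here is a minimal NumPy sketch. The dimensions, the random seed, and the names `B_principal`, `B_complement`, `P` and `P_perp` are illustrative assumptions, not part of the lecture, and the orthonormal basis is just taken from a QR decomposition of a random matrix rather than from PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2  # assumed sizes: data dimension D, principal subspace dimension M

# Any orthonormal basis b_1, ..., b_D of R^D will do for this check;
# here we take the columns of Q from a QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
B_principal = Q[:, :M]    # b_1, ..., b_M as columns
B_complement = Q[:, M:]   # b_{M+1}, ..., b_D as columns

x = rng.normal(size=D)    # one data point x_n

# Route 1: optimal coordinates beta_jn = x_n^T b_j, then sum_j beta_jn * b_j.
beta = B_principal.T @ x
x_tilde_from_coords = B_principal @ beta

# Route 2: apply the projection matrix sum_j b_j b_j^T directly to x_n.
P = B_principal @ B_principal.T
x_tilde = P @ x

# Projection onto the orthogonal complement.
P_perp = B_complement @ B_complement.T

assert np.allclose(x_tilde, x_tilde_from_coords)  # both routes agree
assert np.allclose(P @ P, P)                      # projecting twice changes nothing
assert np.allclose(P + P_perp, np.eye(D))         # the two projections sum to the identity
assert np.allclose(x, P @ x + P_perp @ x)         # x_n = principal part + complement part
```

The last assertion is the decomposition from the transcript: dropping the `P_perp @ x` term is exactly what turns Xn into its approximation Xn tilde.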
So if we now look at the difference vector between Xn tilde and Xn, what remains is exactly this term. So Xn minus Xn tilde is the sum over j = M + 1 to D of bj times bj transpose times Xn. Now we can look at this displacement vector, the difference between Xn and its projection, and we can see that the displacement vector lies exclusively in the subspace that we ignore. That means the orthogonal complement to the principal subspace.
Let's look at an example in two dimensions. We have a data set in two dimensions represented by these dots, and now we are interested in projecting them onto the U1 subspace. When we do this and then look at the difference vector between the original data and the projected data, we get these vertical lines. That means they have no x component, no variation in x. They only have a component that lives in the subspace U2, which is the orthogonal complement to U1, the subspace that we projected onto.
So, with this illustration, let's quickly rewrite this in a slightly different way. I'm going to write the displacement vector as the sum over j = M + 1 to D of bj transpose Xn, times bj, and we're going to call this equation E. We looked at the displacement vector between Xn and its orthogonal projection onto the principal subspace, Xn tilde. And now we're going to use this to reformulate our loss function. From equation B, we get that our loss function is 1 over N times the sum over n = 1 to N of the squared norm of Xn minus Xn tilde. So, this is the average squared reconstruction error, and now we're going to use equation E for the displacement vector here. So we rewrite this as 1 over N times the sum over n = 1 to capital N, and inside that squared norm we use this expression here. So we get the squared norm of the sum over j = M + 1 to D of bj transpose times Xn, times bj.
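
The claim that the displacement vector lives only in the orthogonal complement, and the form of the loss that follows from equation E, can be checked numerically. This is a sketch on synthetic data; the data set, the sizes, and names such as `B_perp` and `loss_direct` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 200, 4, 2  # assumed sizes: N points in D dimensions, M-dimensional subspace

# Synthetic, centred data set; each row of X is one data point x_n.
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)

# An arbitrary orthonormal basis, split into principal part and complement.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
B, B_perp = Q[:, :M], Q[:, M:]

X_tilde = X @ B @ B.T   # row n is the projection of x_n onto the principal subspace
diff = X - X_tilde      # row n is the displacement vector x_n - x_n tilde

# The displacement has no component along b_1, ..., b_M.
assert np.allclose(diff @ B, 0.0)

# Equation E: the displacement equals sum_{j>M} (b_j^T x_n) b_j.
assert np.allclose(diff, X @ B_perp @ B_perp.T)

# Average squared reconstruction error, computed two ways.
loss_direct = np.mean(np.sum(diff ** 2, axis=1))
# Orthonormality of the b_j collapses the squared norm to a sum of squared coordinates.
loss_coords = np.mean(np.sum((X @ B_perp) ** 2, axis=1))
assert np.isclose(loss_direct, loss_coords)
```

The second way of computing the loss never forms Xn tilde at all; it only needs the coordinates of each data point along the ignored basis vectors.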
And now we're going to use the fact that the bj form an orthonormal basis, and this will greatly simplify this expression. We get 1 over N times the sum over n = 1 to capital N of the sum over j = M + 1 to D of bj transpose times Xn, squared. And now we're going to multiply this out explicitly, and we get 1 over N times the sum over n of the sum over j of bj transpose times Xn times Xn transpose times bj. So, this part is now identical to this part.
And now I'm going to rearrange the sums. I'm going to move the sum over j outside. So I'll have the sum over j = M + 1 to D of bj transpose, which is independent of n, times 1 over N times the sum over n = 1 to N of Xn times Xn transpose, and the bj from here is still missing, so times bj. So I'm going to bracket it now in this way. And what we can see now is that, if we look very carefully, we can identify this expression as the data covariance matrix S, because we assumed we have centred data. So the mean of the data is zero.
This means we can now rewrite our loss function using the data covariance matrix, and we get that our loss is the sum over j = M + 1 to D of bj transpose times S times bj. We can also use a slightly different interpretation by rearranging a few terms and using the trace operator. So, we can also write this as the trace of the sum over j = M + 1 to D of bj times bj transpose, times S, and we can interpret this matrix as a projection matrix. This projection matrix takes our data covariance matrix and projects it onto the orthogonal complement of the principal subspace. That means we can reformulate the loss function as the variance of the data projected onto the subspace that we ignore. Therefore, minimising this loss is equivalent to minimising the variance of the data that lies in the subspace that is orthogonal to the principal subspace. In other words, we are interested in retaining as much variance after projection as possible.
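
As a sanity check on the covariance form of the loss, the following sketch compares the average squared reconstruction error with the sum of bj transpose S bj and with the trace expression. The synthetic data, sizes and variable names are illustrative assumptions, and the basis is again an arbitrary orthonormal one rather than the optimal PCA basis.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 500, 6, 3  # assumed sizes for illustration

# Synthetic data, centred so that S = (1/N) sum_n x_n x_n^T is the covariance matrix.
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)
S = X.T @ X / N

# Arbitrary orthonormal basis, split into principal part and complement.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
B, B_perp = Q[:, :M], Q[:, M:]

# Loss as average squared reconstruction error.
X_tilde = X @ B @ B.T
loss_reconstruction = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# Loss as sum over j = M+1..D of b_j^T S b_j (B_perp.T iterates over the ignored b_j).
loss_quadratic = sum(b @ S @ b for b in B_perp.T)

# Loss as trace of (projection onto the complement) times S.
loss_trace = np.trace(B_perp @ B_perp.T @ S)

assert np.isclose(loss_reconstruction, loss_quadratic)
assert np.isclose(loss_reconstruction, loss_trace)

# Variance retained in the principal subspace plus variance lost equals the total variance.
retained = np.trace(B @ B.T @ S)
assert np.isclose(retained + loss_trace, np.trace(S))
```

The final assertion is the numerical version of the closing remark: since the total variance `np.trace(S)` is fixed by the data, making the loss small is the same as making the retained variance large.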
The reformulation of the average squared reconstruction error in terms of the data covariance matrix gives us an easy way to find the basis vectors of the principal subspace, which we'll do in the next video.