In this video, we will introduce the setting of PCA and the high level idea.

Assume we have a dataset X in RD consisting of n vectors.

So, X is a dataset and we have n vectors X_1 to

X_n where the X_i are D-dimensional vectors.

Our objective is to find a low dimensional representation

of the data that is as similar to X as possible.

Before we start, let's briefly review three important concepts.

The first one is that every vector in RD can be

represented as a linear combination of the basis vectors.

So, let's write this down.

So, we write X_n can be written as the sum of

i equals to one to D of beta i_n times b_i.

In the following we will assume that the b_i are an orthonormal bases of RD.

If we assume that we use the dot programs in

a product and b_1 to b_D are an orthonormal basis,

we can also write the beta i_n as X_n transpose times b_i,

which means we can interpret beta i_n to

b the orthogonal projection of X_n onto

the one dimensional subspace spanned by the ith basis vector.

The third property is that if we have an orthonormal basis b_1

to b_m of RD and we define

b to be the matrix that consists of these orthonormal basis vectors.

Then the projection of X onto the subspace,

we can write as X tilde is B times B transpose times X.

That means X tilde is the orthogonal projection of X

onto the subspace spanned by the M basis vectors.

And B transpose times X are

the coordinates of X tilde with respect of the basis vectors collected in the matrix B.

This is also called the code,

so coordinates or code.

Now, let's have a look at PCA.

The key idea in PCA is to find

a lower dimensional representation X_n tilde of

X_n that can be expressed using fewer basis vectors,

let's say M. We assume the data is centered,

that means the dataset has mean zero.

And we also assume that b_1 to b_D are an orthonormal bases of RD.

Generally, we can write any X_n tilde in the following way.

X_n tilde can be written as a sum i equals one to M of

beta i_n times b_i plus

the sum of i equals M plus one to D of beta i_n times b_i.

This entire thing is still living in RD.

So, we took our general way of writing any vector in RD which comes from property one,

and we split the sum in property one into two sums.

One is living in an M-dimensional subspace and the other one

is living in a D minus M-dimensional subspace which

is an orthogonal complement to this particular subspace.

In PCA, we ignore the second term,

so we get rid of this part.

And then we call the subspace that is spanned by

the basis vectors b_1 to b_M the principal subspace.

So b_1 to

b_M span the principal subspace.

Although X_n tilde is still a D-dimensional vector,

it lives in an M-dimensional subspace of RD and only M coordinates;

beta n_1 to beta n_M are necessary to represent it.

So, these ones are the coordinate of this, X_n tilde vector.

The betas of n also called the code of the coordinates of

tilde X_n with respect to the basis vectors b_1 to b_M.

And the setting now is as follows.

Assuming we have data X_1 to X_n,

we want to find parameters beta i_n and orthonormal basis vectors b_i,

such at the average squared reconstruction era is minimised.

And we can write the average squared reconstruction error as follows.

We can write J,

that's the average squared reconstruction error,

is going to be one over N times the sum n equals one to N,

and then we write X_n minus X_n tilde squared.

Let's have a look at an example.

We have data living in two dimensions and now we want to

find a good one dimensional subspace such that the

squared or average squared reconstruction error of

the original data points and their corresponding projection is minimised.

Here I'm plotting the original dataset with

their corresponding projections onto

one dimensional subspaces and I'm cycling through a couple of options of

subspaces and you can see that some of these projections are

significantly more informative than others and in PCA we are going to find the best one.

Our approach is to compute the partial derivatives of J with respect to the parameters.

The parameters are the beta i_n and the b_i's.

We set the partial derivatives of J with respect to

these parameters to zero and solve for the optimal parameters.

But one observation we can already make.

And that observation is that the parameters only

enter this loss function through X_n tilde.

This means that in order to get our partial derivatives,

we need to apply the chain rule.

So, we can write d_J by d either beta i_n or

b_i can be written as d_J by

d_X_n tilde times d_X_n tilde by d either beta i_n or b_i.

And the first part we can already compute and we get