0:00

[MUSIC]

The final topic of this module on Eigenproblems, as well as the final topic

of this course as a whole, will focus on an algorithm called PageRank.

This algorithm was famously published by and

named after Google founder Larry Page and colleagues in 1998.

It was used by Google to help decide in which order to display websites returned from a search.

The central assumption underpinning PageRank is that the importance of a website is related to its links to and from other websites, and somehow eigen theory comes up.

This bubble diagram represents a model mini Internet,

where each bubble is a webpage and each arrow from A, B, C,

and D represents a link on that webpage which takes you to one of the others.

We're trying to build an expression that tells us, based on this network structure,

which of these webpages is most relevant to the person who made the search.

As such, we're going to use the concept of Procrastinating Pat

who is an imaginary person who goes on the Internet and

just randomly clicks links to avoid doing their work.

By mapping all the possible links, we can build a model to estimate the amount of

time we would expect Pat to spend on each webpage.

We can describe the links present on page A as a vector, where each row is either

a one or a zero based on whether there is a link to the corresponding page.

And then normalise the vector by the total number of links, so that it describes a probability distribution for that page.

For example, the vector of links from page A will be (0, 1, 1, 1), because page A has links to sites B, C, and D, but it doesn't have a link to itself.

Also, because there are three links on this page in total, we would normalise by a factor of a third, so the total click probability sums to one.

So we can write L_A = (0, 1/3, 1/3, 1/3).

Following the same logic, the link vectors in the next two sites are shown here.

And finally, for page D, we can write L_D: D is connected to B and C, but not A, two sites in total, so L_D = (0, 1/2, 1/2, 0).

We can now build our link matrix L by using each of our link vectors as a column, which you can see will form a square matrix.
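
As a sketch, this matrix can be assembled column by column in code. The lecture only spells out L_A and L_D; the vectors used below for L_B and L_C are assumptions, chosen to be consistent with the structure described later (only B links to A, and B can be reached from A or D):

```python
import numpy as np

# Each column j holds the outgoing-link probabilities of page j.
# L_A and L_D are given in the lecture; L_B and L_C are assumed
# (B links to A and D; C links only to D).
L_A = np.array([0, 1/3, 1/3, 1/3])   # A links to B, C, D
L_B = np.array([1/2, 0, 0, 1/2])     # assumed: B links to A and D
L_C = np.array([0, 0, 0, 1])         # assumed: C links only to D
L_D = np.array([0, 1/2, 1/2, 0])     # D links to B and C

# Stack the link vectors as columns to form the square link matrix.
L = np.column_stack([L_A, L_B, L_C, L_D])
print(L)

# Every column sums to 1: each column is a probability distribution.
print(L.sum(axis=0))
```

Note that the rows of this matrix are the inward links mentioned above: row A picks out every page that links to A.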

What we're trying to represent here with our matrix L

is the probability of ending up on each of the pages.

For example, the only way to get to A is by being at B.

So you then need to know the probability of being at B,

which you could've got to from either A or D.

As you can see, this problem is self-referential,

as the ranks on all the pages depend on all the others.

Although we built our matrix from columns of outward links, we can see that

the rows describe inward links normalized with respect to their page of origin.

3:08

We can now write an expression which summarises the approach.

We're going to use the vector r to store the rank of all webpages.

To calculate the rank of page A,

you need to know three things about all other pages on the Internet.

What's your rank?

Do you link to page A?

And how many outgoing links do you have in total?

The following expression combines these three pieces of information for

webpage A only.

So r_A is going to equal the sum from j = 1 to n of L_{A,j} times r_j, where n is the total number of webpages, L_{A,j} is the entry of the link matrix relevant to webpage A and location j, and r_j is the rank at location j.

So this is going to scroll through each of our webpages.

Which means that the rank of A is the sum of the ranks of all the pages which link

to it, weighted by their specific link probability taken from matrix L.
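
This per-page sum is easy to check numerically. A minimal sketch with pages ordered A, B, C, D; the columns of L for pages B and C are not spelled out in the lecture, so the values used here are assumptions consistent with the network described:

```python
import numpy as np

# Link matrix, pages ordered A, B, C, D; each column j holds the
# outgoing-link probabilities of page j. Columns for B and C are
# assumptions consistent with the network described in the lecture.
L = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 1, 0  ]])

r = np.full(4, 1/4)   # start with every rank equal

# r_A = sum over j of L[A, j] * r[j]
r_A = sum(L[0, j] * r[j] for j in range(4))
print(r_A)                           # 0.125 with these numbers

# Doing this sum for every page at once is a matrix multiplication.
print(np.isclose(r_A, (L @ r)[0]))   # True
```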

Now we want to be able to write this expression for

all pages and solve them simultaneously.

Thinking back to our linear algebra, we can rewrite the above

expression applied to all webpages as a simple matrix multiplication.

So r = Lr.

Clearly, we start off not knowing r.

So we simply assume that all the ranks are equal and normalise them by the total number of webpages in our analysis, which in this case is 4.

So r = (1/4, 1/4, 1/4, 1/4).

4:58

Applying this expression repeatedly means that we are solving this problem

iteratively.

Each time we do this, we update the values in r until, eventually, r stops changing.

So now r really does equal Lr.

Thinking back to the previous videos in this module, this implies that r is now

an eigenvector of matrix L, with an eigenvalue of 1.

At this point, you might well be thinking,

if we want to multiply r by L many times, perhaps this will be best

tackled by applying the diagonalization method that we saw in the last video.

But don't forget, this would require us to already know all of the eigenvectors,

which is what we're trying to find in the first place.

So now that we have an equation, and

hopefully some idea of where it came from, we can ask our computer

to iteratively apply it until it converges to find our rank vector.
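
The iteration itself is only a few lines. This sketch reuses the same assumed link matrix (the columns for B and C are not given explicitly in the lecture); with those assumptions it settles to ranks of about 12%, 24%, 24% and 40%:

```python
import numpy as np

L = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 1, 0  ]])  # columns for B and C are assumed

r = np.full(4, 1/4)   # start with all ranks equal

# Repeatedly apply r <- L r until r stops changing.
for _ in range(1000):
    r_next = L @ r
    if np.linalg.norm(r_next - r) < 1e-12:
        break
    r = r_next

print(r)   # converges to roughly (0.12, 0.24, 0.24, 0.40)
```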

You can see that although it takes about ten iterations for the numbers to settle

down, the order is already established after the first iteration.

However, this is just an artifact of our system being so tiny.

So now we have our result, which says that as Procrastinating Pat randomly clicks around our network, we'd expect them to spend about 40% of their time on page D, but only about 12% of their time on page A, with 24% on each of pages B and C.

We now have our ranking, with D at the top and A at the bottom, and B and

C equal in the middle.

As it turns out, although many approaches for efficiently calculating eigenvectors have been developed over the years, repeatedly multiplying a randomly selected initial guess vector by a matrix, which is called the power method, is still very effective for the PageRank problem, for two reasons.

Firstly, although the power method will clearly only give you one eigenvector, when we know that there will be n of them for an n-webpage system, it turns out that because of the way we've structured our link matrix, the vector it gives you will always be the one that you're looking for, with an eigenvalue of 1.

Secondly, although this is not true for our four-webpage mini Internet, when looking at the real Internet you can imagine that almost every entry in the link matrix will be zero, i.e., most pages don't connect to most other pages.

This is referred to as a sparse matrix.

And algorithms exist such that these multiplications can be performed very efficiently.
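
A sparse implementation never stores the matrix at all: it keeps only each page's list of outgoing links and pushes rank along them, so one iteration costs time proportional to the number of links rather than to n squared. A minimal sketch on the mini network (the link lists for B and C are assumptions):

```python
# Sparse PageRank step: store only each page's outgoing links, never
# the full matrix. Link lists for B and C are assumptions.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["B", "C"],
}

def sparse_step(r):
    """One application of r <- L r using adjacency lists: each page
    passes its rank, split evenly, along its outgoing links."""
    r_next = {page: 0.0 for page in r}
    for page, links in out_links.items():
        share = r[page] / len(links)   # L's normalisation by out-degree
        for target in links:
            r_next[target] += share
    return r_next

r = {page: 1/4 for page in out_links}  # equal starting ranks
for _ in range(300):
    r = sparse_step(r)

print(r)   # roughly {'A': 0.12, 'B': 0.24, 'C': 0.24, 'D': 0.40}
```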

7:28

The damping factor, d, adds an additional term to our iterative formula: r^(i+1) = d L r^(i) + (1 - d)/n.

d is something between 0 and 1.

And you can think of it as 1 minus the probability with which Procrastinating Pat suddenly, randomly types in a web address, rather than clicking on a link on their current page.

The effect that this has on the actual calculation is about finding a compromise

between speed and stability of the iterative convergence process.
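
The damped version of the iteration is a one-line change. In this sketch d = 0.85 is an assumed value, and the link matrix columns for B and C are again assumptions:

```python
import numpy as np

L = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 1, 0  ]])  # columns for B and C are assumed
n = 4
d = 0.85   # an assumed damping value between 0 and 1

r = np.full(n, 1/n)
for _ in range(1000):
    # Damped update: with probability d follow a link, otherwise
    # jump to any of the n pages uniformly at random.
    r_next = d * (L @ r) + (1 - d) / n
    if np.linalg.norm(r_next - r) < 1e-12:
        break
    r = r_next

print(r)        # D still ranks highest, A lowest
print(r.sum())  # the ranks still sum to 1
```

The (1 - d)/n term is a single scalar added to every entry, which is why the ranks stay a valid probability distribution at every step.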

There are over one billion websites on the Internet today, compared with just a few million when the PageRank algorithm was first published in 1998.

And so the methods for search and ranking have had to evolve to maximize efficiency,

although the core concept has remained unchanged for many years.