0:00

Welcome to Lecture three of Networks, Friends, Money, and Bytes.

In today's lecture, we will continue to talk about Google.

But this time, how Google rank webpages. In the last lecture, we talked about how

Google auctions the advertising slots, usually on the, the right panel of the

search results. And today, we'll talk about how Google

ranks web pages displayed in the main panel of the Search Result page.

And we will see that, the algorithm that Google uses to rank these web pages, can

be viewed as the solution to a gigantic system of linear equations.

So, while in this lecture, we'll start using some Linear Algebra and matrix

notation to accelerate to the presentation, you should know that at the

heart of all these notation, it is just solving linear equations.

Now, the idea that links can be embedded in text dates back to at least the middle

of last century. But then in the 1990s, with the

introduction of the Web in '89, the browser in 1990, and the web portal in

1994, this vision of linking text grew to a gigantic scale.

Now, we can view these web pages, as a network.

They form a network of information represented as web pages.

So, each node is a web page. And a link is what's called a hyperlink.

It connects the text in one web page to that of another web page and you go from

this page to the other by clicking on that link.

As we all know, that webpage A points to webpage B doesn't mean that B points back

to A so we have to provide a sense of direction to this link.

We call this a directional link. And the resulting graph is called a

directed graph. How big is this graph?

It is very big. Nobody knows for sure but it is estimated

that N, the number of webpages out there, is somewhere between 40 to 60 billion.

At the same time, this graph is also very sparse meaning that there aren't that many

links compared to the number of no's out there.

Okay. Many of these webpages have very few links

pointing in or links pointing out. So, we're talking about a huge and a

sparse graph. It's a very special kind of graph.

3:10

And the idea is that we will be able to leverage the pattern of connectivity in

this graph to calculate what we call the importance score of each node.

And in this case, each node is a web page, and importance score will be used to rank

that page and what we want to do is to say that let's turn a seemingly circular logic

that says, important nodes pointing to, say you, you being a web page, means that

you are also important. This argument almost sounds like a cyclic

argument. But, in fact, we can use it to derive a

recursive definition to define, define an equilibrium.

Now, this equilibrium is not the equilibrium of a convergence of an

algorithm, like in power control, or the equilibrium of strategic thinking as in

gang theory. This is a fixed point equation equilibrium

as we will see in the rest of this lecture, alright?

So, coming back to this question. How can we quantify the relative

importance of different nodes in a graph? Or, more specifically in this case,

quantify the importance of each web page. We will see later in Lecture eight, quite

a few different ways to quantify the notion of importance of nodes in a graph.

In today's lecture, we'll look at a specific one used to, by Google.

Now, there are quite a few intuitions you can think about this.

One is to say, well, you want to quantify node importance, right?

So, maybe, I'll just count how many links does this node have.

Well, since this is a directograph, where each directed link, I'll just count how

many links point into this node. In this case, there are three links

pointing into this node. So, we call this the in-degree.

5:35

And the out-degree is one because there's just one node, one link pointing out of

this node. And the total degree of this node is four.

Okay. So, this is a simple intuition as we will

see that it's sort of getting on the right track, but it lacks many elements for it

to be a useful metric. Now, we will also see in the discussion of

important scores of nodes, a recurring theme in this course.

And that recurring theme says, that we never will talk about a network whether

it's a network of people, a network of webpages, a network of wireless devices.

There actually are two different elements in it.

One concerns the topology, that is the pattern of connectivity, and that's

usually represented visually by graphs and algebraically, by matrices.

We will see within today's lecture, three matrices to describe an increasing order

of complexity, the graph that we are dealing with.

But that's only half of the story. The other half is the functionality that

you would like to study that's running on top of this network.

That's what you do on the graph. In today's case, what you do is, can be

viewed as a special case of navigation or search.

Later in Lecture nine, we will talk six degrees of separation.

That's also a kind of navigation and search.

In Lecture thirteen, we'll talk about routing on the internet.

That's another special case here. So, in order to fully understand a

network, we need to know not just properties of the topology, but also

viewed models of functionalities. And, in fact, in Modules three and four of

today's lecture, we will see some very crude models of the functionality of

search and navigation that is implicitly assumed behind Google's ranking algorithm.

Alright, now, back to our intuition. That intuition says, let's count the

number of links. We call that Try number zero, that's the

very basic intuition. And we will soon see that is hardly a very

useful one, when it comes to ranking web pages.

Here's a little smarter idea that says, what we can do is to say, each node has a

certain score. Give it a number, give it a positive real

number. We're going to denote that number as pi

sub node index. In this case, say, node one.

So, it's got a number denoted as pi sub one.

And another node, say node two, also has a number, pi sub two, okay?

And here's another one, node three with a number pi sub three.

And our job for the rest of this lecture is to compute these pis, what we called

importance scores, we can collect them into a vector, pi one, pi 2...

Up to pi N. This is a very long vector and by

convention, it's a column vector and we call this vector pi, okay?

In the textbook or in written media printed media, you can't use bold face.

It's hard to draw bold face on the screen so we're just going to use this arrow to

represent that vector. So, pi vector, that's what we want to

compute. So, Try one says, that look, you got these

pis, okay. Maybe you can just add up the pis that

points to you. For example, if node one and node two both

point to node three, then I can say that node 3's importance is the sum of pi one

and pi two, okay. Simple enough.

This is slightly better than just counting degrees cuz I now allow each node to have

different important score, okay? And, of course, in general, first node and

second node, they may have many other connections coming in and going out.

10:28

So, this is a reasonable choice, okay? However, what it ignores is that some

node, like this node one here and two, they are pointing to many other nodes, not

just pi of node number three. So, think of a professor who writes

reference letters for her students. And if this professor keeps writing that

this is the very best student I have ever had in my 80 years of teaching at this

university, and so on. If every student should rise in that way,

and should rise like, say 100 letters every semester, then soon people will

realize that, hey, this professor's reference letters should be discounted.

And that's the idea here, is that you've got to normalize by how you spread your

important score. If you point to too many web pages, then

each of those web pages that you point to should not receive all your importance.

12:40

So, that's try number two now, okay? The ideas that we get to normalize by the

strata of importance. It turns out that by this point, we're

sort of close to what Google does. In the next of three modules, we will have

to quantify this intuition and then, provide two more enhancements about what

are getting there. The idea is that, look at all the

importance scores, add them up across the links pointing to you, but make sure you

normalize by the outgoing number of links. Now, you may wonder, if I have a node

here, number two with important square pi two, I can write pi two as some

combination of the important scores normalized from the incoming neighbors.

And each of these, say, this involves pi one, can also be written as the summation

of normalized important scores from its incoming neighbors.

So, the question is, on the left-hand side, I will have pi one, pi two, and all

the pis. On the right-hand side, I also have all

these pis, divided by their outgoing number of neighbors.

So, I've got pis both on the left-hand side and the right-hand side of this, this

linear equation system. Can they actually make sense collectively?

Is there is a set of consistent scores pi? Is there a vector pi that, can satisfy all

these linear equations simultaneously? If so, then we have found a particular

meaningful and self-consistent way to assign importance scores.

If not, then we're in trouble. So, the question now is how can we modify

this basic intuition in our try number two now, in order to guarantee both the

existence of such a set of consistent scores and an efficient way to compute

these scores? But before we go into the general space,

let's look at a simple, very small example.

There are only four nodes and six directed links.

Nodes one and two points to three, three points to four, and then four points to

back to one, two, and three, okay? Now, what is the intuition of the

importance score vector pi, which should be a vector of length four here?

If we assume that such a pi consistent set of scores actually exist.

Here's one particular view. We can think of nodes three and four as

collectively acting like a supernode, okay?

Let's call this supernode three+4. And here's node one.

Here's node two. One points to supernode three + five, so

does two, and a supernode points back, okay?

So now, intuitively, you look at the supernode and say, this supernode got two

links pointing inwards. But each of these nodes, one and two, only

has one length pointing inward. So, intuitively, you'll think that nodes

one and two should each, has a smaller important score than the supernode three +

four. And furthermore, by symmetry of this graph

here, nodes one and two should have the same importance score.

And then, the supernode three + four. Need to distribute its score in some way

between nodes three and four. So, at least we know intuitively that pi

one should equal to pi two, and each of them should be smaller than pi3 + four,

the supernode score, okay? So, intuitively that's what we should.

Anticipate. Now, we can write down the math a little

further and calculate in this small example very easily the result.

Now, first of all you'll also notice that, in this node of three pointing to node

four, it says, that basically all the important score of pi three that we get

goes into node four. And node four has only got this one

incoming link. So, pi four should equal to pi three, or

pi three divided by one, cause there's only one outgoing link from node three.

And plus what? There's no other terms.

There's only one link pointing inwards to node four.

So, we can just say pi four equals pi three, and pi one equals pi two.

Now, once we know that, the job is very simple now.

Let's, say, pi one equals pi two, this number is x.

Pi three equals pi four, this number is y, okay?

Now, we know that there is one link from node four pointing to node one, right?

19:19

So now, we got two equations. One says, x = y / three.

The other says, by normalization 2x + 2y should equal to, say, a constant one.

Now, we got two variables in two linear equations.

In this case, we got a solution. The solution is x = one / eight and, and y

= three / eight, which satisfies our intuitive understanding in the last slide.

Now, therefore, what we're going to output is that we will say, nodes three and four

pi for position number one. Three, and four, and nodes one and two

tied for position number, I guess three, okay?

Cuz positions one and two are pi between nodes three and four.

So, you can't display three and then four, and then one and then two, for example, on

the Search Result page by Google. Now, is this the right answer or not?

And that's actually a tricky, almost philosophical question.

What is a right answer? In some sense, the right answer depends on

how does each user find that Google search results useful or not as useful.

Now, because Google is such a dominant force in the search space today, it's

actually hard to imagine what other search orders might look like so it's not easy to

actually falsify whatever we might conjecture about that.

And it is just very hard to measure or quantify subjective view on the usefulness

of the search results, okay? So, what we are doing here is really a

proxy of the real problem, which is make each individual users find the search

results as useful as possible. And furthermore, as you can see, that we

carried out a whole bunch of computation. Well, in this case, it's not that much,

two variables and two linear equations. But the actual numbers of the important

scores are not displayed in the search result page.

All you see is the rank order. You do not see the scale behind the order.

So, order is displayed. Scale, the numerical scale is hidden from

users' view, okay? So, in that sense, scale is really a means

to the ends. So, we don't know, for example, that this

node is three times more important than that node.