0:00
So the way that Google chose to explain the page rank concept was in terms of
what they call The Random Surfer. And The Random Surfer as it says surfs
the web at random and goes to random pages.
But there's a specific way that he does it in order to quantify the importance
score of each of these pages, which we'll look at.
But just conceptually right now, just to see what we mean.
and this graph on the right hand side, we have eight nodes.
And so suppose we're starting off at page A, right, and when we're on page A, we're
facing the decision. We have to click on one of the hyperlinks
on page A, right. So we can either go to B, we can go to E,
or we can go to D. So we choose one randomly.
We have a 1 3rd chance of each, because we're choosing randomly, and there's
three of them. And that brings us right down to page D.
And so, then, now we're on page D and then, again, we have to choose randomly
again. So we choose, well, we say there's a
third of a chance of each of these 1, 2, 3 and so we go over to page eight.
And then, now we're on page E. And then now, once were on e, then we say
maybe the person is bored, right? So maybe the person decided that he was
bored, and he didn't like what he was reading, and so he goes and enters a
random URL into his browser, www.something.com.
And so, now this action is just a random Not surfing, because he's not really
surfing and clicking on links, but is just a random entry of a URL.
So he's going to just end up somewhere else in the connectivity.
So he ends up on F. And so, this one actually that he just
made now did not follow the connectivity of the webgraph, but it's still an action
that we take into account within the random surfer.
And so -- and then, when he's on F again, he chooses one of the two hyperlinks, and
he decides to go to G and so on. So just to summarize, with the Random
Surfer, we define two possible actions. The first is to pick a hyperlink randomly
from the current page, right, so this is how we want to go next.
You can pick a hyperlink randomly from the current page, and that's this whole 1
3rd, 1 3rd, 1 3rd, business we were talking about, were you just choose one
where you go next, and the second is to enter a URL into the browser randomly.
And those are two actions. This one follows the connectivity, and
this one doesn't follow. And so, you can see already how This kind
of takes into account the importance of the lengths here because the lengths
which are coming from nodes, which themselves are more important are going
to get visited more often and going to be used more often.
And so the question that pagerank asks is what is the chance that a page is going
to be selected at any time? So, if we keep surfing over and over and
over and over and over again, how many fractions of the total number of searches
is each page going to get? Which ones are going to be visited more
frequently and we say those are the more important.
So what is the chance a page is selected at any time?
And that's the idea behind PageRank. So again now, the ones that are going to
get visited the more often are the ones that have links that are more likely to
be chosen. And the ones that have links that are
more likely to be chosen are in terms going to be from nodes that are visited
more often. So you can see sort of the logic that
keeps circling here. That important is dependent upon the
importance of the links, which is dependent upon the importance of the
nodes, which is then dependent upon the importantance of the links, and so on and
it gets kind of complicated. So we keep going back.
And, but, we'll see how we can quantify that pretty simply, just to see what the
importance scores mean and how the represented.
So, now let's talk a little more about the importance score.
Let's ignore the idea and the random surfing of getting bored for now, okay.
And we'll return back to our initial example with four nodes which we just ran
an in-degree version of importance on before.
And so Each node, we're going to say, has it's own importance score.
Okay. So, they have their importance score and
we can make, use lower letters for that. So, for big W we'll say the importance
score is w, or little w. For X, we'll say the importance score is
little x. For Y, we'll say it's little y.
And for Z, we'll say it's little z. And so now we have to figure out what are
the chances that each of these links are going to be visited.
So it's very simple, is that if we're on node W, okay, we only have two choices,
okay? If we ignore the idea of getting bored
and surfing randomly, okay for R and W we're going to follow the connectivity of
the web graph. I'm going to make that assumption.
So, we have to choose one of the two links.
We either choose this link or we choose this link.
Each of them is going to clearly have one half of a chance because there's two.
Right? So, it's like flipping a coin.
Just choose one of the two. [NOISE] Similarly for if we're already on
Node y for instance, y only has one outgoing link.
So, it's only one choice that you have to go to z next.
So, this is actually a chance of one. But now as we said we want to figure out
how important the links are and that's going to depend upon how important the
node is itself. Okay.
So, we in a sense need each of these nodes to be spreading their importance
score to this link. So, it should make sense then that we
should multiply this by w and say well this link is one half w.
This other link right here is also one half w.
And in that way then we can look at this much easier and say, okay when, when
we're at W. W.
starts with an importance score of W. Then this link is half of W, one-half W.
This other link is another half of W, because there's two of them.
So W gets divided by 2, and we say it spreads its score to those two links.
And similarly, for Y there's only one link right here, so all of Y's importance
gets put off on this one link. Okay.
And so the reason is first we're splitting half and half on each of the
links, and second is that we have to wait by the importance in the first place,
because that's how important the node is itself.
So, the more important node W is, then you see the higher the importance of each
of these links will be. So if W is an very important node then
that will mean a link from W will be Very important, you see, that's why there's
this w factor in here, which take care of that, of weighing the importance of the
nodes on the links themselves. So as we said, each node has its own
importance score and this spreads an equal amount of importance to each
outgoing link. And so we can also say that that's
related to the out-degree of a node. So we talked before about the in-degree,
which was the number of incoming links, and we used that as a measure of
importance. But now we're actually using the
out-degree as well and we're saying that basically we're dividing in each length
Gets a fraction of the out degree, right, because if the out degree is three then
each length gets a third, if the out degree is four then each length gets a
fourth of the important square of the node and so on, so it's just a way of
looking at it. This out degree of W right here is two,
so each of these gets one half. The out degree of y is one, so each of
these gets one... The out-degree of x is 2.
So each of these would get 1/2 x. 1/2 x.
We can right those equations. Now the out-degree of z is 3.
Cause z gets divided three times. So each of these is 1/3.
We have 1/3 z here, 1/3 z here, and 1/3 z here.
Now notice that if you add What's outgoing, you will get this importance
score in the beginning, and that's, again, the idea of spreading.
This is one-thirdsy, two-thirdsy, onesy, if we add all those up.
One half plus one half is one. One half plus one half is one.
One is just equal to one. 'Kay so that's what that whole
conservation principle, that we're only spreading the amount of importance.
Each node is only spreading the amount of importance that it has to begin with.