0:00

So, Google searches output, and what you see is really a composite of the

relevance and importance score. And they both do play a part in the

output that you see, and the order in there of.

But the importance score specifically determines the top results.

That determines the first page or so of results that you see, so the, the idea is

that for the first ten or twenty results, they all have to clearly be very

relevant, so all have to contain keywords.

And the question is amongst those that contain keywords, how do we figure out

which of them are the most most important.

Now, Google's ranking algorithm, we should point out, is called PageRank.

And it's a clever pun on Larry Page, was one of the founders of Google, so his

last name there. And so an interesting Cornell study in

2008, just to show you why we really only need to look at the first pages, so

that's why we focus on the important score.

Here in this lecture is they, they, they searched typical nouns and they tried to

see how many, they did a study on the number of clicks that each search results

gets on average. So what fraction of the clicks, if you're

someone searching how many, how many, times are you going to click on the first

one? How many times are you going to click on

the second on? And so on, so in all, if we took the

proportion of people who click on the first link, the proportion of people who

click on the second link, and so on, how does that decrease?

And so the first link gets something like 56% of all the clicks, which is really

remarkable. And it drops off pretty quickly after

that, so we have 56%, 13%, 10%, that's up to almost 80% just from the first three

results. So, the first three results right here,

they saw see almost 80% of all of the clicks.

And the first page itself, is something like, 98%.

So, if we took the first page, that was about 98% and I believe I have those

statistics here as well. So, the first three links receive about

80% and the entire first page actually receives greater than 98%.

And so, these first few links in the entire page are really determined, so

they're ordering amongst themselves. Once you have those relevant pages, those

are determined by the importance score itself.

And that's really what's going to play an impact here, so we will be focusing on

the importance score in this analysis. So now, let's move to understanding how

the importance score is Is used to rank these pages.

So, the first thing we have to take a look at, is the idea of a Webgraph, and

what that is. And we're going to see, how this can give

the connectivity overview of the webpages and how they're interconnected with one

another. So, let's start with some terminology.

A Webpage, we've been using that term, but it's really, it's a document on the

web. So it's a document, and you can go there,

and you can access the page, and do whatever you want to do on that page.

A Hyperlink, is a connection from a webpage to a piece of data.

So, on the webpages themselves, you see some external links here, that you can

click on, and when you click on them, that will bring you somewhere else.

Now typically, it will bring you to another webpage, so for our purposes here

and what we're seeing, a hyperlink is a connection from a webpage to another

webpage. So, we have webpages, and then we have

Hyperlinks connecting Webpages. All right.

So now, before we talk about a webgraph, and you could probably already understand

what a webgraph is. It's really just the Webpages and how

Hyperlinks connect them together and we just draw that visually.

But, so, a graph is a huge concept mathematically and we, we will not go

into much detail on what graph theory is, but simple terminology here will suffice.

And when we say graph route, we're referring to some like XY chart plot.

That's, that's not what we're referring to something, a structure, a mathematical

structure, that has nodes which are also known as bird seeds, and those

represented typically as circles. So, we have these nodes, and we have link

or edges, so there's, those are kind of interchangeable, it's either vertices and

edges or notes and links. And the nodes or circles and the links

connect the nodes together. So, we may have links like this.

So, this is node A, this is node B, and this is node C, that I have a link

between A and B, I have a link between B and C, and so on.

I, you know, I may have a link also between A and C but maybe not, and it

will depend upon how the graph is connected.

And so now, we can take this. And well, we can apply that to a

Webgraph, which is just a graph of Webpages, so in this sense, now, the

nodes themselves become webpages, right. So, we have webpages as our nodes, which

we will still draw as circles but webpages become our nodes.

So, suppose we have just three webpages that we're looking at here, and the links

become the hyperlinks, so the hyperlinks are connected to them.

So now, one thing that we have to make note of is that, here we drew the links

without any arrows or anything indicating a direction.

Now, Graphs can have directed links, in which case, it's really only implying a

one-way connectivity. So, A may connect to B, but may not

connect to A, and C may connect to A, but A may not connect to C, and C may connect

to B, for instance, B may not connect to C.

So, when we're dealing with a graph of the Internet or a graph of the webpages,

these links are directed. Because hyperlinks only go one way, they

don't, just because A points to B doesn't mean B is going to have a hyperlink back

to A. So, if this is page A, maybe page A has a

hyperlink to page B, and then page C has a hyperlink to page B and so on,

something like that. So, a simple example of that would be, if

you are going, if you have a website and your favorite artist also has a website,

right. So this, maybe this is your artist's

website. This is your favorite artist right here,

and this is you, and this is your website right here.

Okay, so you may have a link on your page that goes to your artist's website, and

there's a lot of other people that probably have links that go to this

website as well. But it doesn't mean that he is going to

take a link and put it back to your page for every single person that puts a link

to his page. So it's not, they're not bidirectional,

and otherwise, there would be a lot more people that were famous, because they

would just feed off of that, be able to gain a lot of fame.

That obviously wouldn't work too well. But, so now, directed, the way that we

indicate a directed graph is by having the links have arrows.

And as we said, a webgraph does have direct links.

Another thing about a webgraph is that it's very sparse.

By sparse, I mean that most webpages aren't connected to one another directly.

So for every web page, a web page has a very few number of neighbors, which we

call them, so If we look at a node A here, A B is A's neighbor because A is

connect to B, and so on. And you can get to him in just one path

jump. the number of neighbor's in each node has

is way smaller than the total number of nodes.

So I mean, we have something like 60 billion nodes in the internet, right.

So there's like 60 billion webpages, but the Wikipedia page, which is on the

higher end of the number of hyperlinks you see, may have like a hundred links

out to different pages. So 100 versus 60 billion is much, much

smaller, so by sparse, we mean that there's not a lot of lengths compared to

the number of nodes, and the internet itself, as we said, has 60 billion nodes.

So that, that gives use kind of the issue here, because there's no way we could

ever draw the graph of the Internet and then do like a calculation on that.

Because that would just take us, take us, you know, our entire lifetime and we'd

still never finish. So, we're going to have to use examples

here, that are of much smaller scale with four even five nodes maybe, and look at

those. But really the ideas of the same, it's

just on a much larger scale scales. The ideas are portable and they scale up

very well. And Google has a lot of algorithms in

place which scale up to very high data sets and very large data sets.