0:00

Let's answer the first question first, about Structural small worlds.

Let's use symbol L to denote the median of shortest path between node pairs.

So, if I got node pair i,j and the shortest path distance between them is dij

and then I look at all the node pairs and the corresponding shortest path distance.

Histogram it and then I get L as the median.

And I want L to be small. And of course, L will be small if there

are only a few nodes, So really, I want to say L is small

relative to the number of nodes, say n, And, more precisely, I want L to grow like

a slow function of a number of nodes. How slow?

For example, logarithmic function, Then even for exponential growth after

taking the log, it become just linear. So what we now want to do is to reverse

engineer and we'll comment on this, the utility and pitfalls of reverse

engineering, topology or functionality of a network towards the end of the lecture.

Want to reverse engineer and build an explanatory model to explain the

observation like what has do. To explain that observation this time

about the social network that L is often a slow function the number of nodes.

But whatever explanatory of model we have, it can be just a tree growing like this

twenty to the power of six, cuz we know that violates another observation

empirically made in social network, That is there's a lot of transitivity

quantified for example by a clustering coefficient of a graph quantifying the

degree of existence of triad closures. And we're going to use C to denote this

metric clustering coefficient. I just want to highlight C clustering

coefficient of a graph has nothing to do with the clustered density p that we

talked about in the last lecture, In contagion model.

Okay? Clustering coefficient of a graph is not

the cluster, the density of a subset of nodes.

So what is clustering coefficient? Remember, we're trying to quantify the

following observation that in many social networks, if a, b, c forms a connected

triple like this, then there's a good chance that it also closes the triangle.

Okay, that's why it's called triad closure,

So, we want to look at the number of triangles versus the number of connected

triples. And, we want this metric to be normalized,

so it always lies between zero and one for all graphs.

In particular, if it is a triangle, then we want the C to be one.

In order to accomplish this normalization we're going to divide the number of

connect triples by three. Why? Because of different ways we count,

connected triples versus the count of number of triangles.

If there is a, A link here, then the triangle exists, we

call that triangle a, B, c. We count it only once.

But when we count a number of connected triples, we actually have to count a, b

and b, c, two links coexist, then there's this connected triple whether there's a

link a, c or not. If there is one, we'll also have a

triangle. If not, we don't.

But, to connect a triple, this is one connected triple.

And then, we can also write a, c and c, b, two links that is another connector

triple. And b, a link and a, c link, that's

another connector triple. So, just because when we count connected

triples, we count basically three times, so we divide it by three to take into

account that over-counting. And therefore now, c, d will be guaranteed

to lie within zero and one. For just a triangle, it will be one.

For just a connected triple, which is a line here, it will be zero.

Now the question for a bigger general graph,

What will C look like? We want C to be big, and L to be small for

our explanatory model. So, here's another graph, slightly bigger,

four nodes and four links. Let's see what is the clustering

coefficient for tri closuring this graph. Well, for this graph, this number of c is

actually easy to compute. Alright, C equals the number of triangle

over the number of connected triples over three.

How many triangles are there? Whether there's only one triangle, okay

one, two, three nodes, triangle. What about connector triples here?.

Well, there are let's see, how many connect triples.

Two, one, three, that's a connected triple.

Okay? One, two, three, that's another connector

triple. Okay? one, three, two, that's another

connector triple. One, two, four, That's another connector

triple. And three, two, four, that's another

connector triple. So, there are actually five connectoe

triples. Of course, some of them we basically

counted more than once, So, we have the normalization divided by

three. And that means the clustering coefficient

of this graph is three-fifths, which is a pretty big number considering that C must

lie between zero and one. Okay?

7:26

And let's look at, if you indeed construct such a graph what would be the l and c.

Now l is actually small as we can intuitively see okay, because between any

node pairs, there is a chance of p that they are directly connected, but the

clustering coefficient C is also small. Okay? When we look at c now we're talking

about expected c and expected l because this is a probabilistic object.

And without going to the details there, sufficed us to show the intuition that c

for our, plus our random graph is the following.

Small c over n minus one. N

Is the number of nodes. Small C is the expected degree of the

node. Why is that?

Why is this the expression for the expected clustering coefficient big C for

a random graph? Well, because the chance, the probability

that two nodes are connected is C over n minus one in a random graph.

Okay? This is how many connections you have and this is how many nodes are out

there other than yourself and, therefore, the ratio is the expected number of the

probability of the two nodes been connected.

Independent of whether these two nodes are already indirectly connected by another

node or not, doesn't matter. That's not how the graph was constructed. So the

chance that these two nodes are connected is exactly just c over n minus one.

So, suppose we look at a graph, let's say 100 million people, and each has 1,000

friends. That's a lot already, much bigger than

Dunbar's number. Still, the clustering coefficient is only

on the order of ten minus three, which is way too small.

We know C will be much bigger than that. Orders of magnitude bigger than that in

the real social network. So, random graph can explain small l, but

not big C. What about the extreme end on the other

hand? A regular graph, let's say, a regular ring

graph. What's a regular ring graph?

Let's draw a ring here. Okay?

It's just a line with two ends connected. And, let's say, there are nodes living

here. Okay? In this case, there are eight nodes,

so n equals eight. Just like random graph is parametrized by

number p, that's the probability of a link appearing or not.

A regular ring graph is parametrized by a number of c, a small c.

Now this small c is not exactly what we just used expected degree.

Here is the actual deterministic degree c, here.

Okay. That's the number of neighbors each node has,

It's the degree of each node. Given this is a regular graph, you can

rotate however you want and it still looks the same.

This is basically the degree for all the nodes.

10:53

Now, in this particular ring graph c is two because for each node there are two

neighbors. The degree is two for everyone, I can add

a few more. Say, let's make a c and even number.

Okay? I can also connect to this node with two hop neighbors now.

So there's a direct link going out over there now.

Okay? Similarly,

I'll add all these. Now in this case c is no longer two, c is

four, because there's one more missing, because every node has actually four

neighbours. I pick this node, one neighbour, two

neighbour, three neighbour, four neighbour.

Okay, don't worry too much about exactly the way the lines are drawn.

You can easily count that, one, two, three, four.

This node has four neighbours and its symmetric is regular, so every node has

four neighbours, degree four and eight nodes.

11:58

Of course, you can make c even bigger now. If we insist the c to be even then, you

can make c to be six and eventually six be eight and that would also be called full

mesh network and every node is connected to every other node in fact in this case

because of even number of nodes here, you can even do that.

The largest and even c can get is just the six.

Okay? Well, so in this case, what would be c and

what would be l? First of all, c is now much better.

Hm, okay? Let's compute the clustering coefficient

here, c. Number of triangles and number of

connected triples. Turns out number of connected triples is

easy to compute. If you just look at each node, because

it's symmetric, We can just count those centered around

each individual node. The number of connected triples is easy to

choose, it's count is just c choose two. You've got c neighbors and you choose two

of them then you form a connected triple. It might be a triangle, might not be a

triangle, But still, a number of connected triples

centered around a node is just c over c choose two.

Of course, don't forget you have to divide by three for normalization.

Now, what about the number of triangles centered around a given node?

Let's think about that. Well, first to form a triangle, you need

to go up, okay, one step, then need to go up another step, and then you have to come

back, Otherwise, you don't close it and you

don't have a triangle. So you have to two go steps out and then

be able to come back in one shot. So there, how many such choices there are?

Well, in order to be able to come back in one hop, you can only go out c over two,

that far. Farther than that you won't be able to

come back. And out of those, you can choose two

because you'll be making two steps. So the number of triangles is c over two

choose two and the number of connect triple is c over two.

What is this expression? It is one-half, c over two, times c over

two minus one, divided by one-half, c, c minus one, over three, Which simplifies to

three times c minus two over four times c minus one.

This is the clustering coefficient quantifying triad closure, and closure, on

a regular ring graph. You can see that a number of nodes doesn't

come into play here, because this is entirely symmetric.

So only c matters the deterministic number of the degree here for each node matters.

Well, for sanity check, let's see what happens if c is two.

Well, we know if c is two, we're just talking about a ring.

There is no triangle. There's only connected triple.

Indeed when c is two, this becomes zero as you can see.

15:18

Okay? So triad closure, doesn't exist,

Clustering coefficient big c is zero. What if c is four now,

Like this graph? Actually, the way we draw this, it's hard

to see, intuitively, what should a clustering coefficient be.

There's another way to draw the same graph,

Which is, you can easily verify for yourself like this.

These are the nodes. .

And now, you can clearly see the clustering coefficient should be half,

Because had I also joined these dotted lines, which don't actually exist here, I

would have actually tri closure all the tri closures. But they don't exist,

Only these parts exist. So, intuitively this way I've joined this

same picture, shows that clustering coefficients should be half.

And indeed, if you'll plug in c equals four here, what you'll get is that

clustering coefficient is one over two. Okay.

Now, what if c becomes real big like this small c. Okay. When this little c is like

infinite, then what would happen? Well then, in that case this simplifies us to

three over four. In other words regular ring graph, the

clustering coefficient can only go up to three quarters, but that's already very,

very big. Compared to say ten to the minus three for

our Poisson random graph, this is way bigger.

And, indeed, most likely it would be, at least c, at least four and therefore,

clustering coefficient is between one-half and three-fourths.

This is a small dynamic range, but it is of the right magnitude.

Good. We got c checked off, okay?

We got large c on the regular ring graph, Which is not too surprising, cuz with

enough C, you quickly start closing the triangles.

Now the question is, what about these l's? That's where the problem is with a regular

ring graph to explain small row.. It doesn't have a very small l.

Indeed, if you want to go from this node to that node, you actually have to at

least hop through here and here. But remember, this is only enabled with

eight nodes. When you have many more nodes ends big on

the regular ring, then no matter how many c you add for quite a large range of

degrees, you still need to hop slowly, locally.

You do not have a direct long range link anywhere in this graph.

This graph only got short range links. Hm, okay?

So, with only short range links, you cannot possibly have a small l.