0:08

And again, we're looking at sets of models now that allow us to deal with link dependencies, and allow different types of sub-graphs to be formed.

And we want to capture those features, and so we want to see how we might estimate them statistically, okay.

So again, just to remind you of the form of the model now: links form, triangles form.

You could allow, say, links between blue nodes, links between yellow nodes, links between blues and yellows, and triangles involving the different kinds.

So, there could be different probabilities for each one of these things.

But each different type of sub-graph pops up with some probability.

These things might intersect and overlap.

We observe a resulting network, and then we're trying to infer what these probabilities were. So, are triangles really being formed independently of links, and so forth? We're trying to estimate those probabilities.

Okay, so there are sort of two different approaches that one might use to do this. One: in sparse graphs, these incidental interactions are going to be rare, and direct estimation turns out to be valid and consistent. There's also, in the paper with [UNKNOWN], an algorithm that corrects for small numbers, or gives estimates, for non-sparse graphs.

There are algorithms that you can use to actually just go through and try to estimate directly how many links there would have been, how many triangles there would have been, and figure out how many of those were generated incidentally, in terms of overlapping. And then you can just do that directly.

I'm not going to go through the algorithm; I'll refer you to the paper. But let's just talk about this sparseness.

So the idea here is that incidentals

are generated by combinations of subgraphs that overlap.

Sparsity is basically a condition (I'll refer you to the paper for its full glory) that says different types of sub-graphs are rare enough that they're not going to interact, so they can't grow too quickly.

And it relies on conditions where we calculate explicitly the combinatorics of different possible things generating each other, and then bound those. And we end up with a fairly tight bound.

But in terms of an intuitive example, if we just go with our links and triangles, then sparsity holds as long as the probability of a link is smaller than one over the square root of n, and the probability of triangles is correspondingly small.

3:01

And in that case, you know, the number of incidentals is not anywhere near the number of nodes out there, or even root n. And so root n can be a fairly good bound here.

So again, that limits the number of incidentals.

And if we went through, for instance, on this network, and wanted to get estimates of what those probabilities were, we could start by trying to estimate triangles.

Okay, well, if you go through and count triangles, there were actually nine that were originally generated directly, and then one incidental. If we didn't see how they were generated, we would estimate that there were actually ten generated on these 42 nodes. And so what we're trying to figure out is the relative likelihood of triangles forming.

Well, how many possible triangles are there? There are n nodes, 42 here, and we're choosing triples, so you can do 42 choose three. If I did my calculation correctly, it's 11,480.

And so what's our estimate?

Our estimate is going to be the ten that we observed out of the 11,480.

And so what we end up with is that about one in a thousand of the triangles that could be there are there.

So that gives us an estimate of the probabilities of triangles. Okay? And in fact, if we were able to actually see what nature really did, the estimate should have been nine instead of ten, okay? So the idea with sparsity is that we're not off by far in terms of what our estimate was.

4:32

Estimation of links.

What we can do is now look at how many links there are not in triangles.

Okay, so we've got different numbers of links

that were formed not as part of triangles.

You can go through and count those up.

You're going to end up with 23, if I counted correctly.

So you end up with 23 not in triangles.

And here, how many links could there have been? Well, how many pairs of nodes are there? There's n choose 2, 861 in this case, minus the 28 that appeared in the triangles, and so we don't count those. So there are 833 possible links that aren't in triangles. We saw 23 of them, so what's the estimate there? 23 over 833, about a 3% probability of any link forming.

Okay?

So that would be

an estimate for this.
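Just to check the arithmetic, here is a quick sketch in Python; the counts of 10 triangles, 23 links, and 28 pairs used up by triangles are the ones read off the example network in the lecture:

```python
from math import comb

n = 42  # nodes in the example network

# Triangles: 10 observed (9 direct + 1 incidental) out of all possible triples.
possible_triangles = comb(n, 3)        # 42 choose 3 = 11,480
p_triangle = 10 / possible_triangles   # about one in a thousand

# Links: 23 observed outside of triangles; 28 pairs are used up by triangles.
possible_links = comb(n, 2) - 28       # 861 - 28 = 833
p_link = 23 / possible_links           # about a 3% chance per pair

print(possible_triangles, p_triangle)
print(possible_links, p_link)
```

The same two-step count (observed subgraphs over possible subgraphs, excluding pairs already covered) is all the estimator does.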

So we ended up estimating the number of triangles, the relative frequency of those things, and the relative frequency of links. And that gives us estimates of what those parameters are. And then we can test whether they're different from zero, or whether they fall in some range.

We can test for homophily in this if we had different

colored nodes and so forth, so we can enrich that, okay?

So what's the theorem in the paper with [INAUDIBLE], again?

So take a sequence of sparse SUGMs. Sparsity then has to be defined very carefully, in a way that says there are certain bounds on the sizes of the relative likelihoods of the true parameters compared to the number of nodes.

Then take the empirical frequency counts that we get by just looking at whatever the statistic is: figure out how many you actually observed and how many there could have been. The theorem is that when you do this, in this very simple way, the ratio of the estimate to the true parameter will go to one. Okay, so we're going to get consistent estimation;

6:28

we're going to zero in on the true parameters.

And in particular, you can also show that a normalized difference of these things is going to be a zero-mean normal random variable. So in fact, you also know something about the errors; there's a version of a central limit theorem that holds here.

So these things are consistent, and they have nice normal distributions in terms of the differences between your actual estimator and the true parameters. And if you look at what this D matrix looks like, it gives you the rates of convergence; all the details are there in the paper.
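In symbols, the theorem being described is roughly the following (a sketch of the statement; the precise sparsity conditions and the exact form of the normalizing matrix are in the paper):

```latex
% Estimator: observed count of subgraphs of type j over possible count
\hat{p}_j \;=\; \frac{S_j(g)}{\overline{S}_j}

% Consistency under sparsity: the ratio to the true parameter goes to one
\frac{\hat{p}_j}{p_j} \;\longrightarrow\; 1

% Central limit theorem, with D the normalizing (rate-of-convergence) matrix
D^{-1/2}\left(\hat{p} - p\right) \;\longrightarrow\; \mathcal{N}(0, I)
```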

But the idea here is that these things will be very easy to estimate. You're basically just doing binomial counts: how many links were there? How many could there have been? How many triangles are there? How many could there have been? So, really simple estimation techniques.

And yet it gives us the same kinds of information that

we can then back out of an exponential random graph family.

And it gives us accurate estimation with trivially easy techniques.

Okay?

So let's try and see why we might be interested in these things.

So, what's the need for these kinds of models? Why don't we just do straight block models? Why bring triangles into the picture, why bring something else?

And what we're going to do is go back to the Indian village data, estimate a model, and then use it to generate networks, and in particular try to see how well the models that we end up fitting end up recreating the networks. Okay?

So we just saw that there are easy ways to estimate these models, including things like triangles.

Is it really necessary to include the triangles? Suppose instead we worked with a block model, and didn't bring triangles into the picture. Okay?

So what we're going to do is estimate the SUGM based on covariates, allowing for triangles. We can then compare that to what would have happened if we just did a block model

8:17

accounting for covariates. Does it do better? You know, how well do they do at recreating the actual networks? And does the inclusion of triangles really help us in some substantial way?

Okay, so how are we going to do this? We're going to do a very simple version of categorizing things: we're going to have the nodes either be the same or different.

So we'll allow probabilities of links between two similar nodes or two different nodes. And we'll say that they're in the same class if they have the same caste,

8:49

and if the GPS distance between their homes is less than the median distance between homes, okay?

So going into a village, we look at two households and ask: okay, are they linked or not? Are they of the same caste? Is the GPS distance greater or less than the median distance?

So when we're looking at two households, we'll say they're similar if they have the same caste and are less than the median distance apart, and otherwise we'll say they're different. Okay, so if they're either of different castes or at a greater distance than the median, then we'll put them in the different category.

So we're just going to make it a really simple model where we keep track of pairs of nodes and allow for two probabilities: a probability for nodes being similar and a probability for nodes being different. And similar here means they're similar on both dimensions, caste and GPS location. Okay?
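As a sketch, that similar/different classification is just a conjunction of the two observables; the field name `caste` and the distance function here are stand-ins for whatever the actual data provides:

```python
from statistics import median

def classify_pairs(households, distance):
    """Label each pair of households 'similar' or 'different'.

    households: list of dicts with a 'caste' entry (hypothetical field name).
    distance: function giving the GPS distance between two households.
    """
    m = len(households)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    # Median pairwise distance is the cutoff for "close".
    med = median(distance(households[i], households[j]) for i, j in pairs)
    labels = {}
    for i, j in pairs:
        same_caste = households[i]["caste"] == households[j]["caste"]
        close = distance(households[i], households[j]) < med
        # Similar only if alike on BOTH dimensions; otherwise different.
        labels[(i, j)] = "similar" if (same_caste and close) else "different"
    return labels
```

For example, with three households where two share a caste and sit close together, only that pair comes out "similar".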

So now what we can do is fit a block model, where we have two different probabilities: the probability of a link if the two nodes are similar to each other, and the probability if they are different.

And then we also fit a subgraph generation model. Now what we're going to add in is triangles, and we'll allow triangles to have two different probabilities: the probability of a triangle for people that are all similar, and the probability of a triangle if some of the people involved are different from each other.

Okay?

So we'll fit the block model and fit this sub-graph generation model. Both of these are very easy to fit here, right? The block model is a special case of a SUGM where we just look at links. So we can just count up links, count up triangles, and count up whether they're the same or different. So we're going to have four different counts, and that will give us estimates of all these things, okay?

And the block model just looks at links, ignoring whether they're in triangles or not. The subgraph generation model keeps track of triangles separately from links, and is estimated that way.

Okay, so that's the basic estimation technique. Step one: we estimate the block model, the probability of a link if you're the same or different. For the sub-graph generation model, we'll do the same thing, but we're going to add in triangle counts.
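As a sketch of what those four counts look like in code, here is the naive direct estimator that is valid under sparsity (not the paper's corrected algorithm); the data structures here are assumptions for illustration:

```python
from itertools import combinations

def sugm_counts(n, edges, similar):
    """Four-count SUGM estimator sketch: links not in triangles, and triangles,
    each split by whether all the nodes involved are 'similar'.

    n: number of nodes; edges: set of frozenset({i, j}) pairs;
    similar: function similar(i, j) -> bool (an assumed pair labeling).
    """
    adj = {i: set() for i in range(n)}
    for e in edges:
        i, j = tuple(e)
        adj[i].add(j); adj[j].add(i)

    # Observed triangles, split by type; remember which pairs they cover.
    tri_obs = {"similar": 0, "different": 0}
    tri_pos = {"similar": 0, "different": 0}
    covered = set()
    for i, j, k in combinations(range(n), 3):
        kind = "similar" if (similar(i, j) and similar(i, k) and similar(j, k)) else "different"
        tri_pos[kind] += 1
        if j in adj[i] and k in adj[i] and k in adj[j]:
            tri_obs[kind] += 1
            covered |= {frozenset((i, j)), frozenset((i, k)), frozenset((j, k))}

    # Observed links outside triangles, and possible pairs outside triangles.
    link_obs = {"similar": 0, "different": 0}
    link_pos = {"similar": 0, "different": 0}
    for i, j in combinations(range(n), 2):
        if frozenset((i, j)) in covered:
            continue
        kind = "similar" if similar(i, j) else "different"
        link_pos[kind] += 1
        if j in adj[i]:
            link_obs[kind] += 1

    # Each estimate is just observed count over possible count.
    est = {}
    for t in ("similar", "different"):
        if tri_pos[t] > 0:
            est[("triangle", t)] = tri_obs[t] / tri_pos[t]
        if link_pos[t] > 0:
            est[("link", t)] = link_obs[t] / link_pos[t]
    return est
```

On a toy graph with one triangle and one isolated link among five all-similar nodes, this returns a triangle frequency of 1 out of 10 possible triples and a link frequency of 1 out of 7 remaining pairs.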

And then once we have these, the nice thing about these

kinds of models is then we can generate back networks very easily.

So how do we generate a network?

Well, once we have these probabilities, we can just take this set of nodes, pick pairs, flip coins, put in links with the similar or different probability depending on whether the pair is similar or different, and then generate a network.

For the SUGM, what we can do is randomly pick triangles and randomly pick links, put them in with these probabilities, and then we generate networks.

Okay?
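A minimal sketch of that generation step, with a single link probability and a single triangle probability (extending it to similar/different types just means using type-dependent probabilities for each pair or triple):

```python
import random
from itertools import combinations

def generate_sugm(n, p_link, p_triangle, rng=None):
    """Generate one network from a two-parameter SUGM: each triple forms a
    triangle with probability p_triangle, each pair forms a link with
    probability p_link, and the network is the union of what was placed."""
    rng = rng or random.Random()
    edges = set()
    for i, j, k in combinations(range(n), 3):   # place triangles
        if rng.random() < p_triangle:
            edges |= {frozenset((i, j)), frozenset((i, k)), frozenset((j, k))}
    for pair in combinations(range(n), 2):      # place links
        if rng.random() < p_link:
            edges.add(frozenset(pair))
    return edges

# e.g., regenerate networks from the estimates in the lecture's example:
g = generate_sugm(42, 23 / 833, 10 / 11480, random.Random(0))
```

Setting p_triangle to zero recovers the plain block-model generation of independent links.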

So we randomly generate these networks and then we try

to see whether or not these networks recreate the actual, original

observations. Okay, so here is what we get.

So here's the data.

12:27

The SUGMs don't actually do quite as well on the estimates of the average degree, because here they're also putting in triangles, and so the number of links is slightly off. But the number of triangles and the number of unsupported links actually match much better. So in particular, it's doing better at matching the number of triangles and unsupported links, not surprisingly, because it's counting those things directly, and that's the way we're doing the estimation.

12:57

The sub-graph generation models are not doing as well on isolates. And so this last column here: you can also do a subgraph generation model where you allow for isolates. And if you do that, then you end up doing better on isolates, while still doing as well on links and unsupported links.

So, okay, so far nothing too surprising. What's interesting then is, let's go down here and look at a series of other features of the graph that we weren't trying to fit directly, okay?

So, clustering, how well do we do at clustering?

Well the data actually has about a 10% clustering rate.

The block model, not surprisingly, where all links are independent, doesn't pick that up.

And so, one explanation people often give is, okay, well, you can get clustering in block models, because if people are in the same group, then they're more likely to have a higher density of links, and so you're going to pick that up. And here, you do see that people are much more likely to form links if they're of the same caste and close in GPS distance. But that still doesn't come close to generating the clustering you need.

Whereas once you put in the triangles, then you come back and you hit this.

And actually if you pull out the isolates, you hit it right on the money.

14:22

You can look at the first eigenvalue of the adjacency matrix; that's how rapidly things expand. Again, you know, the sub-graph models are going to do better than just the block model, and in particular, once you get the isolates in, you're coming pretty close.

Second eigenvalue of the stochasticized matrix: this is a measure of homophily.

Actually, once you throw the isolates into the sub-graph generation model, this moves away a little bit, but it's still fairly good in terms of that.
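The three fit statistics just discussed (clustering, first eigenvalue of the adjacency matrix, second eigenvalue of the stochasticized matrix) are easy to compute; here's a sketch with numpy, assuming a symmetric 0/1 adjacency matrix with no self-loops (note that definitions of clustering vary slightly; this is the overall transitivity ratio):

```python
import numpy as np

def fit_diagnostics(A):
    """Return (clustering, first eigenvalue of A, second eigenvalue of the
    row-stochasticized matrix) for a symmetric 0/1 adjacency matrix A."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    # Clustering: closed triples over connected triples (transitivity).
    triangles = np.trace(A @ A @ A) / 6.0
    connected_triples = (deg * (deg - 1)).sum() / 2.0
    clustering = 3.0 * triangles / connected_triples if connected_triples else 0.0
    # First eigenvalue of A measures expansion (A symmetric, so eigvalsh).
    eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]
    # Row-stochasticize A; give degree-zero nodes a uniform row.
    T = np.where(deg[:, None] > 0, A / np.maximum(deg[:, None], 1), 1.0 / n)
    eig_T = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    # Second eigenvalue of T measures segregation/homophily.
    return clustering, eig_A[0], eig_T[1]
```

Running this on the data network and on each simulated network gives the comparison in the tables.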

15:15

The sub-graph generation model is this blue line and the block model is the red one here, and what you begin to see is that even though the sub-graph generation model didn't get the average degree, it actually matched the distribution of degrees much more accurately than the block model did. So it missed the first moment, but it was better at catching the whole spread, and at matching the tails of this distribution, than the block model was. So it actually did better in some dimensions of degree than the other model.

16:35

And then you can also look at block models with links only, or allowing for different types of covariates and so forth. And basically the block models are still going to miss the clustering, no matter whether you're incorporating different covariates, or multiple covariates. You know, similarly with the eigenvalues: you get different pictures, and the block models end up not doing as well.

When you look at the second eigenvalue, actually these are incredibly segregated villages, so the second eigenvalue is quite high. And in fact, the SUGMs pick up that they're quite high; the block models end up having a full spread on these things, even though the actual data are all very close to one in terms of the level.

So here are the actual data numbers; they're all down here, and the sub-graph generation models are picking that up while the block models aren't.

So what do we have? We have better fits on these different things. You can look at the fraction of nodes in the giant component: neither of these does that well, but if you throw in the isolates, then you actually pick things up a lot better.

Okay, so what do we get? Dependencies are really important to pick up in social networks. Why? Well, that's the whole nature of "social". I mean, what does social mean? It means that we're interacting, generally, in groups of more than two at a time. And that's going to generate dependencies.

We need tractable models to capture and test these.

Exponential random graph models are a nice form; it's a rich family with a long history in statistics, but they're not always accurately estimable.

If we take a step back from these and either work in a statistics space or just work with generating these things directly, we get easy and consistent estimation; these things are nice to work with. And so there are ways of dealing with these things directly, and then we can go ahead and test things.

We'll see some tests of that coming up: in lecture 4.9 and others, we'll begin to see how we can test different theories using these kinds of models, and do that explicitly.

Okay, so.

18:45

Statistical models give us a medium for going through and estimating these things.

Often we're going to need to put these models in context.

So, you know, where do these dependencies come from?

Are they friends of friends?

Is there some social enforcement? So on top of these models, we do need some theory, some social theory or economic theory, to guide our thinking. You know, we don't want to just randomly put down different types of counts into our exponential random graph models; we want to have some idea of exactly what we should be testing for.

And again, when we get to lecture 4.9, we'll start with some particular theories, and we can go ahead and test those directly using these kinds of models.

Strengths of random network models: we can generate large networks with well-identified properties. We can mimic real networks, at least in some characteristics, depending on the kinds of models we put down.

So the beauty of these models is that now we can tie specific properties to specific processes, whether it's sub-graph generation, preferential attachment, meeting friends of friends, or the small-worlds processes we looked at. And we get some idea of why these different kinds of things are happening in terms of the process that's there.

And one of the weaknesses of these models is that we're still missing the why. Why this particular model and not some other one? There we need to put some social theory or economic theory behind it. And that would also allow us to say something about relevance. So, you know, are we really worried about this? Why do we worry about segregation? Because it might cause inequality, or might lead to certain kinds of outcomes. So we have to have some idea of what welfare is involved.

20:20

One other thing is that a lot of the stylized facts we've talked about, small worlds, power laws, clustering and so forth, are established often by just saying, okay, well, here's another network that happens to have this. The "laws" are basically something that we've seen in a series of different observations, but we need something that can systematically begin to look across and say, you know, is something really statistically there? Is it there consistently? Do we see it across different types of observations?

Well, we can have some theory, and luckily now we have models that are beginning to address this kind of thing, allowing us to answer these kinds of questions.