0:08

And again, we're looking at sets of models now that allow us to deal with link dependencies and allow different types of sub-graphs to form. We want to capture those features, and so we want to see how we might estimate them statistically, okay. So again, just to remind you of the form of the model: links form, triangles form. You could allow links between blue nodes, links between yellow nodes, links between blues and yellows, triangles involving the different kinds. So there could be a different probability for each one of these things, but each type of sub-graph pops up with some probability. These things might intersect and overlap. We observe a resulting network, and then we're trying to infer what those probabilities were. So, are triangles really being formed independently of links, and so forth. So we are trying to estimate those probabilities.

Okay, so there are sort of two different approaches one might use to do this. One: in sparse graphs, these incidental interactions are going to be rare, and direct estimation turns out to be valid and consistent. There's also, in the paper with [UNKNOWN], an algorithm we provide that corrects for small numbers, or gives estimates for non-sparse graphs. There are algorithms you can use to actually just go through and try to estimate directly: how many links would there have been? How many triangles would there have been? And try to figure out how many of those were generated incidentally, in terms of overlaps. And then you can just do that directly. I'm not going to go through the algorithm; I'll refer you to the paper. But let's just talk about this sparseness.

So the idea here is that incidentals are generated by combinations of subgraphs that overlap. Sparsity is basically a condition, and I'll refer you to the paper for its full glory. It basically says that the different types of sub-graphs are rare enough that they're not going to interact, so they can't grow too quickly. It relies on conditions where we go through and calculate explicitly the combinatorics of different possible things generating each other, and then bound those. And we end up with a fairly tight bound.

But in terms of an intuitive example, if we just go with our links and triangles, then it's enough that the probability of a link and the probability of a triangle shrink quickly enough with n; the paper gives the exact rates.

Â 3:01

And in that case, the number of incidentals is not anywhere near the number of nodes out there, not even root n, and so root n can be a fairly good bound here. So again, that limits the number of incidentals. If we went through, for instance, on this network and wanted to get estimates of what those probabilities were, what we could do is start by trying to estimate triangles.

Okay, well, if you go through and count triangles, there were actually nine that were originally generated directly, and then one incidental. If we didn't see how they were generated, we would estimate that there were actually ten generated on these 42 nodes. And so we're trying to figure out the relative likelihood of triangles forming. Well, how many possible triangles are there? There are n different nodes, 42 here, and we're choosing triples, so you can do 42 choose 3. If I did my calculation correctly, that's 11,480. And so what's our estimate? Our estimate is going to be the ten that we observed out of the 11,480. So what we end up with is that about one in a thousand of the triangles that could be there are there. That gives us an estimate of the probability of triangles, okay? And in fact, if we had been able to see what nature really did, the estimate should have been based on nine instead of ten, okay? So that's the idea with sparsity: we're not off by far in terms of what our estimate was.
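The arithmetic in that triangle example can be checked directly; this is just a sketch of the count from the example above (variable names are my own):

```python
from math import comb

n = 42               # nodes in the example network
observed = 10        # 9 directly generated triangles + 1 incidental
possible = comb(n, 3)    # number of possible triples of nodes: 11480

# Frequency estimate of the triangle probability: roughly 1 in 1000.
p_triangle_hat = observed / possible
```
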

Â 4:32

Estimation of links: what we can do now is look at how many links there are that are not in triangles. Okay, so we've got some number of links that were formed not as part of triangles. You can go through and count those up, and you're going to end up with 23, if I counted correctly. So 23 links not in triangles. And here, how many links could there have been? Well, how many pairs of nodes are there? There's n choose 2, which is 861 in this case, minus the 28 pairs that appeared in the triangles, which we don't count. So there are 833 possible links that aren't in triangles. We saw 23 of them, so what's the estimate there? 23 over 833, about a 3% probability of any given link forming. Okay? So that would be an estimate for this.
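The link estimate works the same way, again just checking the arithmetic from the example above:

```python
from math import comb

n = 42
pairs = comb(n, 2)          # 861 possible pairs of nodes
in_triangles = 28           # pairs already covered by triangles
possible_links = pairs - in_triangles   # 833 possible non-triangle links
observed_links = 23         # links observed outside triangles

# Frequency estimate of the link probability: about 3%.
p_link_hat = observed_links / possible_links
```
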

So we ended up estimating the relative frequency of triangles and the relative frequency of links, and that gives us estimates of those parameters. And then we can test whether they're different from zero, or whether they fall in some range. We could test for homophily if we had different colored nodes, and so forth, so we can enrich this, okay?

So what's the theorem in the paper with [INAUDIBLE]? Take a sequence of sparse SUGMs, where sparsity has to be defined very carefully, in a way that puts certain bounds on the sizes of the relative likelihoods of the true parameters compared to the number of nodes. Then form the empirical frequency counts: for whatever statistic you're looking at, figure out how many you actually observed and how many there could have been. The theorem is that when you do this, in this very simple way, the ratio of the estimate to the true parameter will go to one. Okay, so we're going to get consistent estimation:

Â 6:28

we're going to zero in on the true parameters. And in particular, you can also show that a normalized difference of these things is a zero-mean normal random variable. So in fact you also know something about the errors: there's a version of a central limit theorem that holds here. So these estimators are consistent, and the differences between your actual estimator and the true parameters have nice normal distributions. And if you look at what the D matrix looks like, it gives you the rates of convergence; all the details are there in the paper. But the idea here is that these things are very easy to estimate.
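In rough notation (my paraphrase, not the paper's exact statement): with $S_j(g)$ the observed count of subgraphs of type $j$ in the network $g$, and $\bar{S}_j(n)$ the number of possible ones on $n$ nodes, the frequency estimator and the two results just described look like

```latex
\hat{p}_j = \frac{S_j(g)}{\bar{S}_j(n)}, \qquad
\frac{\hat{p}_j}{p_j} \xrightarrow{\;P\;} 1, \qquad
D_n\,(\hat{p} - p) \xrightarrow{\;d\;} \mathcal{N}(0, \Sigma),
```

where $D_n$ is the normalizing rate matrix and $\Sigma$ a covariance matrix; the exact sparsity conditions and rates are in the paper.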

You're basically just doing binomial counts: how many links were there, and how many could there have been? How many triangles are there, and how many could there have been? So these are really simple estimation techniques. And yet they give us the same kinds of information that we would otherwise try to back out of an exponential random graph family, and they give us accurate estimation with trivially easy techniques. Okay? So let's try to see why we might be interested in these things. What's the need for these kinds of models?

Why don't we just do straight block models? Why bring triangles into the picture, why bring in something else? What we're going to do is go back to the Indian village data: estimate a model, then use it to generate networks, and in particular see how well the models we end up fitting recreate the networks. Okay? So we just saw that there are easy ways to estimate these models while including things like triangles. But is it really necessary to include the triangles? Suppose instead we worked with a block model and didn't bring triangles into the picture. Okay? So what we're going to do is estimate the SUGM based on covariates, allowing for triangles. We can then compare that to what would have happened if we just did a block model,

Â 8:17

accounting for covariates. Does it do better? How well do they each do at recreating the actual networks, and does the inclusion of triangles really help us in some substantial way? Okay, so how are we going to do this? We're going to do a very simple version of categorizing things: nodes will either be the same or different. So we'll allow different probabilities for links between two similar nodes and between two different nodes. And we'll say that two households are in the same class if they have the same caste

Â 8:49

and if the GPS distance between their homes is less than the median distance between homes, okay? So, going into a village and looking at two households, we ask: are they linked or not? Are they of the same caste? Is the GPS distance between them greater or less than the median? So when we're looking at two households, we'll say they're similar if they have the same caste and are at less than the median distance between homes; otherwise we'll say they're different. So if they're either of different castes or at a distance greater than the median, we put them in the different category. We're just making it a really simple model where we allow for two probabilities: one for pairs of nodes that are similar and one for pairs that are different. And similar here means they're alike on both dimensions, caste and GPS location. Okay?

So now what we can do is fit a block model. We allow two different probabilities: the probability of a link if the two nodes are in the same category, similar to each other, and the probability if they're different. And then we also fit a subgraph generation model, where what we add in is triangles, and we allow triangles to have two different probabilities as well: a probability for triangles where all the people are similar, and a probability for triangles where some of the people involved are different from each other. Okay? So we'll fit the block model and fit this sub-graph generation model, and both are very easy to fit here, right? The block model is a special case of a SUGM where we just look at links. So we can just count up links, count up triangles, and count up whether they're same or different. We'll have four different counts, and that gives us estimates of all these parameters, okay? The block model just looks at links, ignoring whether they're in triangles or not; the subgraph generation model keeps track of triangles separately from links and is estimated that way. Okay, so that's the basic estimation technique.
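As a toy sketch of those four counts (the tiny example network and the names here are hypothetical, just to illustrate the counting; the real estimation runs on the village data):

```python
from itertools import combinations

# Hypothetical toy network: "similar" pairs share caste and are close by GPS.
nodes = range(6)
similar = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4)]}
edges = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4), (2, 3)]}

def is_triangle(a, b, c):
    return all(frozenset(p) in edges for p in [(a, b), (a, c), (b, c)])

# Count triangles, split by whether all three members are pairwise similar.
tri_same = tri_diff = 0
tri_edges = set()            # pairs covered by some triangle
for a, b, c in combinations(nodes, 3):
    if is_triangle(a, b, c):
        pairs = {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))}
        if pairs <= similar:
            tri_same += 1
        else:
            tri_diff += 1
        tri_edges |= pairs

# Count links outside triangles, split same/different.
link_same = sum(1 for e in edges - tri_edges if e in similar)
link_diff = len(edges - tri_edges) - link_same
```

Dividing each count by the corresponding number of possible same/different pairs and triples gives the four probability estimates; the block model uses only the link counts.
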

So, step one, we estimate the block model: the probability of a link if the pair is the same, and the probability if they're different. For the subgraph generation model we do the same thing, but we add in the triangle counts. And then once we have these, the nice thing about these kinds of models is that we can generate networks back out very easily.

So how do we generate a network? Well, once we have these probabilities, we can just take the set of nodes, pick pairs, flip coins, and put in links with the same or different probability depending on whether the pair is similar or different, and thereby generate a network. For the SUGM, what we can do is randomly pick triangles and randomly pick links, put them in with these probabilities, and then we generate networks.
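A minimal sketch of that generation step, assuming a dictionary-of-probabilities interface of my own choosing (function name and arguments are hypothetical):

```python
import random
from itertools import combinations

def generate_sugm(n, similar, p_link, p_tri):
    """Drop in triangles, then links, each with a 'same' or 'diff'
    probability depending on the pair/triple; the union is the network."""
    edges = set()
    for a, b, c in combinations(range(n), 3):
        pairs = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        kind = 'same' if all(p in similar for p in pairs) else 'diff'
        if random.random() < p_tri[kind]:
            edges.update(pairs)      # place the whole triangle at once
    for a, b in combinations(range(n), 2):
        kind = 'same' if frozenset((a, b)) in similar else 'diff'
        if random.random() < p_link[kind]:
            edges.add(frozenset((a, b)))
    return edges
```

With the estimated probabilities plugged in, repeatedly calling this gives simulated networks whose features can be compared against the data; the block-model version is the same call with the triangle probabilities set to zero.
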

Okay? So we randomly generate these networks and then try to see whether or not they recreate the actual, original observations. Okay, so here's what we get. So here's the data.

Â So here's the data.

Â 12:27

The SUGMs don't actually do quite as well on the estimates of average degree, because here they're also putting in triangles, and so the number of links is slightly off. But the number of triangles and the number of unsupported links actually match much better. In particular, it's doing better at matching the number of triangles and unsupported links, not surprisingly, because it's counting those things directly; that's the way we're doing the estimation.

Â 12:57

The subgraph generation models are not doing as well on isolates. This last column here is a subgraph generation model where you also allow for isolates, and if you do that, then you end up doing better on isolates, while still doing just as well on links and unsupported links. So, okay, so far nothing too surprising. What's interesting is to go down here and look at a series of other features of the graph that we weren't trying to fit directly, okay?

So, clustering: how well do we do at clustering? Well, the data actually has about a 10% clustering rate. The block model, not surprisingly, where all links are independent, doesn't pick that up. One explanation people often offer is that you can get clustering in block models, because if people are in the same group, they're more likely to have a higher density of links, and so you'll pick some of that up. And here you do see that people are much more likely to form links if they're of the same caste and close in GPS distance. But that still doesn't come close to generating the clustering you need. Whereas once you put in the triangles, you come back and hit this, and if you pull out the isolates, you hit it right on the money.

Â 14:22

You can look at the first eigenvalue of the adjacency matrix, which captures how rapidly things expand. Again, the sub-graph models are going to do better than just the block model, and in particular, once you add the isolates, you come pretty close. The second eigenvalue of the stochasticized matrix is a measure of homophily. Actually, once you throw the isolates into the sub-graph generation model, this moves away a little bit, but it's still fairly good in terms of that.

Â 15:15

The subgraph generation model is the blue line and the block model is the red one here, and what you begin to see is that even though the subgraph generation model didn't get the average degree exactly right, it actually matched the distribution of degrees much more accurately than the block model did. So it missed the first moment, but it was better at catching the whole spread and at matching the tails of the distribution. So it actually did better on some dimensions of degree than the other model.

Â 16:35

And then you can also run the block models looking at links only, or allowing for different types of covariates and so forth. And basically the block models are still going to miss the clustering, no matter whether you incorporate different covariates or multiple covariates. Similarly for the first eigenvalue: you get different pictures, and the block models end up not doing as well. When you look at the second eigenvalue, these are actually incredibly segregated villages, so the second eigenvalue is quite high. And in fact the SUGMs pick up that it's quite high, while the block models end up with a full spread on these things, even though the actual data are all very close to one in terms of the level. So here are the actual data numbers, all down here, and the sub-graph generation models are picking that up while the block models aren't. So we have better fits on these different things. You can also look at the fraction of nodes in the giant component: neither of these does that well, but if you throw in the isolates, you pick things up a lot better.

Okay, so what do we get? Dependencies are really important to pick up in social networks. Why? Well, that's the whole nature of social. What does social mean? It means that we're interacting, generally, in groups of more than two at a time, and that's going to generate dependencies. We need tractable models to capture and test these. Exponential random graph models are a nice form, a rich family with a long tradition and history in statistics, but they're not always accurately estimable. If we take a step back from these, and either work in a statistics space or work with generating the sub-graphs directly, we get easy and consistent estimation; these things are nice to work with. So there are ways of dealing with these dependencies directly, and then we can go ahead and test things. We'll see some tests of that in the lectures coming up: in lecture 4.9 and others we'll begin to see how we can test different theories using these kinds of models, and do that explicitly. Okay, so.

Â 18:45

Statistical models give us a medium for going through and estimating these things. Often we're going to need to put these models in context. So, where do these dependencies come from? Are they friends of friends? Is there some social enforcement? So on top of these models, we do need some theory, some social theory or economic theory, to guide our thinking. We don't want to just randomly put different types of counts into our exponential random graph models; we want to have some idea of exactly what we should be testing for. And again, when we get to lecture 4.9, we'll start with some particular theories, and we can go ahead and test those directly using these kinds of models.

Strengths of random network models: we can generate large networks with well-identified properties, and we can mimic real networks, at least in some characteristics, depending on the kinds of models we put down. So the beauty of these models is that we can now tie specific properties to specific processes, whether that's sub-graph generation, preferential attachment, meeting friends of friends, or the small-world processes we looked at. And we get some idea of why these different kinds of things are happening in terms of the process that's there. One of the weaknesses of these models is that we're still missing the why. Why this particular model and not some other one? There we need to put in some social theory or economic theory behind it. That would also allow us to say something about relevance. So, are we really worried about this? We often worry about these things for a reason. Why do we worry about segregation? Because it might cause inequality, or might lead to certain kinds of outcomes. So we have to have some idea of what welfare is involved.

Â 20:20

One other thing: a lot of the stylized facts we've talked about, small worlds, power laws, clustering, and so forth, are often established just by pointing out another network that happens to have the property. The laws are basically regularities we've seen in a series of different observations. But we need something that can systematically look across networks and ask: is something really statistically there? Do we see it consistently across different types of observations? Well, we can bring in some theory, and luckily we now have models that are beginning to address this kind of thing and allow us to answer these kinds of questions.
