0:23

And multiplied by some parameters.

And in order to make this a probability,

then we have to sum across what this would look like for all possible graphs.

The power of such models is that we can put pretty much anything we want into those statistics.

So if you wanted to keep track of links and triangles, you could make these based on covariates.

So, as we talked about in the block models, we could have links among blue nodes, links among green nodes, and links between blue and green nodes.

Triangles involving blue, green, and yellow nodes, etc.

So you could have all kinds of different statistics and keep track of those.
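To make this concrete, here is a small Python sketch of my own (not code from the lecture) that brute-forces an exponential random graph model with two illustrative statistics, links and triangles: each graph gets weight exp(beta1 * links + beta2 * triangles), and the probability comes from dividing by the sum of those weights over all possible graphs.

```python
from itertools import combinations, product
import math

def ergm_probabilities(n, beta_links, beta_triangles):
    # Brute-force the ERGM: weight each graph by exp(beta . statistics),
    # then normalize by the sum over ALL graphs (only feasible for tiny n,
    # since there are 2^(n(n-1)/2) possible graphs).
    pairs = list(combinations(range(n), 2))
    weights = {}
    for flags in product([0, 1], repeat=len(pairs)):
        edges = {p for p, on in zip(pairs, flags) if on}
        links = len(edges)
        triangles = sum(
            1 for a, b, c in combinations(range(n), 3)
            if {(a, b), (a, c), (b, c)} <= edges
        )
        weights[flags] = math.exp(beta_links * links + beta_triangles * triangles)
    z = sum(weights.values())  # the normalizing sum across all possible graphs
    return {g: w / z for g, w in weights.items()}

# 3 nodes -> 3 possible links -> 8 possible graphs.
probs = ergm_probabilities(3, beta_links=-1.0, beta_triangles=2.0)
```

With these illustrative parameters, the negative link parameter penalizes dense graphs even though the triangle parameter rewards closure, which previews the interpretation issues discussed below: the statistics overlap, since every triangle also contains links.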

And the idea, then, is that using this kind of model would allow us to test to see whether certain things are going to be significant or not.

So are the parameters non-zero, are they positive, are they negative, what do they look like?

That gives us some idea of how these different things matter in the formation of a graph.

And the difficulty, the challenge, in these things is: how are we going to estimate them?

So let me just start with an example of an estimation of one of these things.

So if you remember our Florentine families example: there's a paper by Robins, Pattison, Kalish, and Lusher in 2007.

They're fitting an exponential random graph model to Padgett and Ansell's Florentine data.

And in particular, they're going to look at the business ties between the 16 major families.

And what they're going to do is try and see whether, and to what extent, this depends on links, two-stars (two individuals connected to a given individual), three-stars, and triangles, okay?

So these are the different things: we can count up how many of each one of these different shapes is present in a network, and see to what extent each one matters in the determination of a graph.

So we've got the probability of a graph depending on each one of these things.

And then we can ask, what do these different parameters look like?

So what do the parameters look like on these?
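As a sketch of how one might count up those shapes, here is an illustrative Python function of my own (not code from the paper) that computes links, two-stars, three-stars, and triangles for an undirected network given as an edge list:

```python
import math
from itertools import combinations

def subgraph_counts(n, edge_list):
    # Count the four statistics used in the fit described above:
    # links, two-stars, three-stars, and triangles.
    edges = {frozenset(e) for e in edge_list}
    degree = [sum(1 for e in edges if i in e) for i in range(n)]
    links = len(edges)
    # A k-star centered at node i is a choice of k of its neighbors,
    # so each node contributes C(degree_i, k) k-stars.
    two_stars = sum(math.comb(d, 2) for d in degree)
    three_stars = sum(math.comb(d, 3) for d in degree)
    triangles = sum(
        1 for a, b, c in combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges
    )
    return links, two_stars, three_stars, triangles

# A triangle on nodes 0,1,2 plus a pendant link to node 3.
print(subgraph_counts(4, [(0, 1), (0, 2), (1, 2), (2, 3)]))  # (4, 5, 1, 1)
```

Note how the counts overlap: the pendant link raises node 2's degree, creating extra two-stars and a three-star, which is exactly the interpretation difficulty raised later in this segment.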

Â 2:24

Okay, so this is the network, in this case it's a very small set of nodes,

small set of links.

So we see a few triangles, we see some links, we see some two-stars, three-stars, and so forth.

And we can go ahead and estimate this.

And here you can use programs; we'll do one of these in a problem set.

You can use statnet, for instance, which is a package in R.

You stick in your dataset and end up just directly trying to fit an exponential random graph model.

It asks which kinds of things you want to throw in, and then it will spit out an answer.

So in this case, the software will go ahead and estimate things, and we'll talk about how that's done in a moment.

And what does it give us?

On links, we get a parameter of -4.27, with a standard error of about 1.13.

So this is statistically significantly negative.

So that's a network that doesn't like links.

If we go back to this, it's a fairly sparse network, so networks that have fewer links are going to be more probable under this model, with the negative parameter there.

Networks that have lots of links would be less probable.

So this is a fairly sparse network, and it's not surprising that it gets a negative parameter here.

It likes two-stars, with a parameter of 1.09.

It dislikes three-stars, at -0.67, and it likes triangles, at 1.32, okay?

So we end up with some different parameters.

Now, exactly how we interpret those is a little complicated.

It's complicated by two things. One, we're not quite sure about the combinations of these different statistics: links are going to be inside two-stars, inside three-stars, and so forth.

Two, we haven't given ourselves any theory of exactly why the business dealings should have one type of structure or another.

So coming at this with a social theory would then allow us to make sense of these, and you can go to the original article for more discussion of that.

But what I want to do is talk a little bit about how these kinds of things are estimated, and then what we should take away from those estimates.

So what's the challenge in estimating an exponential random graph model?

Â 4:30

So we put down our formulation, we've got a bunch of statistics, and we want to see if they matter, and to what extent they matter positively or negatively.

And in order to make this a probability, we had to sum over all the possible graphs of what this could be, okay?

So the idea is, how likely graph g is depends on comparing it to how likely all the other graphs would be, in order to figure out why this one got picked out and not any other one, okay?

So in order to estimate this, if we want to do what's known as maximum likelihood estimation, what we would want to do is pick the parameters that make the given graph that we saw as probable as possible, okay?

So which parameters would make this probability as high as possible?
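For intuition, here is a hypothetical Python sketch of that maximum likelihood idea on the simplest possible ERGM, with links as the only statistic. In this special case the denominator can still be computed exactly, by grouping graphs by their link count, so a simple grid search finds the maximizing parameter:

```python
import math

def log_likelihood(n, observed_links, beta):
    # Exact log-likelihood for a links-only ERGM on n nodes: every graph
    # with k links gets weight exp(beta * k), and there are C(m, k) such
    # graphs, where m is the number of possible links.
    m = math.comb(n, 2)
    z = sum(math.comb(m, k) * math.exp(beta * k) for k in range(m + 1))
    return beta * observed_links - math.log(z)

# Grid-search the MLE for a 4-node network in which we observed 2 links.
grid = [b / 100 for b in range(-300, 301)]
beta_hat = max(grid, key=lambda b: log_likelihood(4, 2, b))
# For this model the MLE equates expected and observed links:
# 6 * sigmoid(beta) = 2, i.e. beta = log((2/6)/(4/6)), about -0.69.
```

The point of the sketch is that this closed-form grouping only works because links enter independently here; once triangles or stars are in the model, no such shortcut exists, which is exactly the difficulty described next.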

Well, the difficulty is that in order to figure out which parameters maximize this function, we have to be able to do this sum in the denominator, right?

And that's going to be tricky.

Why is it going to be tricky?

There are a lot of graphs to sum over.

And so a technique, instead of actually doing an explicit calculation here, would be to try and estimate this sum instead.

And the standard technique for estimating this is using MCMC,

Â 5:47

Markov chain Monte Carlo techniques for estimation.

And these techniques were popularized by Tom Snijders and Mark Handcock, who made some breakthroughs that allowed them to be applied to exponential random graph models, and that's what underlies the software.

So instead of summing over all g primes, what we'll do is sample some g primes and then try to extrapolate and estimate that denominator based on that.

The program's a little more complicated, but essentially what it's doing is walking the space, trying to pick out different g primes, and then based on that it can get a relative likelihood.

And then it figures out, by searching across different parameters, which parameters give the highest likelihood of seeing the data we actually saw.
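A minimal sketch of that sampling idea, assuming a Glauber-style chain that resamples one link at a time conditional on the rest of the graph (the statistic and parameter values here are illustrative, not from the lecture):

```python
import math
import random
from itertools import combinations

def glauber_sample(n, beta, stats, steps, seed=0):
    # Glauber dynamics (a form of Gibbs sampling): repeatedly pick a pair of
    # nodes and resample that link conditional on all the other links, so
    # that the chain's stationary distribution is the ERGM whose sufficient
    # statistics are given by `stats(edges)`.
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    edges = set()
    for _ in range(steps):
        pair = rng.choice(pairs)
        s_on = stats(edges | {pair})
        s_off = stats(edges - {pair})
        # Log-odds that this link is present, given the rest of the graph.
        diff = sum(b * (a - c) for b, a, c in zip(beta, s_on, s_off))
        if rng.random() < 1.0 / (1.0 + math.exp(-diff)):
            edges.add(pair)
        else:
            edges.discard(pair)
    return edges

# Links-only model: a strongly negative parameter yields a very sparse graph.
sparse = glauber_sample(10, beta=(-10.0,), stats=lambda e: (len(e),), steps=2000)
```

Walking the chain long enough gives samples that can stand in for the denominator, which is the idea behind the software; the theorem discussed below is about how long "long enough" can be.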

So what's the difficulty with this?

Well, remember, if we had just 30 nodes in our graph, there are 2 to the 435th possible networks.

It's estimated that there are fewer than 2 to the 258th atoms in the universe, so we're way above the number of atoms in the universe.
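The arithmetic here is easy to verify: with 30 nodes there are C(30, 2) = 435 possible links, each either present or absent, hence 2 to the 435th networks, which dwarfs the atoms-in-the-universe bound quoted above.

```python
import math

n = 30
pairs = math.comb(n, 2)            # 435 possible links among 30 nodes
num_graphs = 2 ** pairs            # each link is either present or absent
atoms_bound = 2 ** 258             # the rough atoms-in-the-universe bound above
ratio = num_graphs // atoms_bound  # 2**177: the space exceeds the bound by this factor
```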

So it's going to be impossible for us to actually look at that whole space; we can only ever pick a small subset of it.

So if we're just randomly picking some g primes for the denominator, the chance that we're actually picking a representative sample is tiny.

Unless we have an algorithm that really walks the space very well, given how many different networks there are, this is going to cause difficulty, okay?

And in particular, there is a recent theorem by Bhamidi, Bresler, and Sly; there are related theorems elsewhere in the statistics literature.

What it shows is that if you've got a dense enough exponential random graph model, then with this type of MCMC technique, and in particular using what is known as Glauber dynamics, which is a form of Gibbs sampling, the estimates are actually going to become accurate

Â 7:41

only if you're able to look at an exponential number of graphs, unless you have approximately independent links, okay?

So that's a mouthful, but what it's saying is that if you want to be able to do this with fewer than an exponential number of graphs, the network's going to have to be one that has almost independent links.

And of course, the reason we went to exponential random graph models is because we didn't want to have independent links.

We wanted to allow for triangles, and we wanted to allow for different kinds of stars to be more likely than one would see at random.

And so the exponential random graph models that are really interesting are going to be ones where we do have dependencies.

And those are going to be ones which, according to this theorem, are not going to be estimable in less than exponential time, okay?

Now, this applies for dense enough graphs, but we can also do some simulations to see whether there are going to be problems with sparser ones.

Let's turn to a simulation next.

So this is actually a simulation that comes out of work I have been doing, and here what we're doing is looking at the probability of a graph.

And let's keep things simple: we're going to look at links, triangles, and isolates.

And part of the reason we're going to look at isolates is that when you look at graphs, you often find more isolated nodes than you would find at random.

And so, in order to really estimate things well, this is something you often have to take into account.

Â 9:07

So let's look at 50 nodes.

And we're going to fit a very simple model: an exponential random graph model where we keep track of how many isolated nodes there are, how many links there are, and how many triangles, okay?

So, a very simple model.

And what we're going to do is actually draw a bunch of graphs.

So basically, we're going to feed into the program a bunch of networks that have roughly 20 isolated nodes.

Â 9:34

So each node is isolated with a probability of about 0.4.

We'll feed in 10 triangles, so we'll put 10 triangles down at random, and 45 links down at random.

So we're going to draw links so that on average you get 45 links, and we'll draw triangles so that on average you get 10 triangles.

And we drew these by picking binomials, so we'll just do this at random.

And then 20 isolated nodes, okay?

So we'll start by picking 20 isolated nodes; on what remains we'll put down a series of triangles, and then on what remains we'll put down a series of links, okay?
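A Python sketch of that generation recipe, as I read it from the description above (my own reconstruction; the exact randomization in the original simulations may differ):

```python
import random
from itertools import combinations

def draw_planted_graph(n=50, p_isolate=0.4, n_triangles=10, n_links=45, seed=0):
    # Reconstruction of the recipe described above: isolate each node with
    # probability ~0.4 (about 20 isolates on average for n = 50), plant
    # triangles at random among the remaining nodes, then scatter links
    # among what remains.
    rng = random.Random(seed)
    isolates = {i for i in range(n) if rng.random() < p_isolate}
    active = [i for i in range(n) if i not in isolates]
    edges = set()
    for _ in range(n_triangles):
        a, b, c = rng.sample(active, 3)
        edges |= {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))}
    candidates = [frozenset(p) for p in combinations(active, 2)]
    for _ in range(n_links):
        edges.add(candidates[rng.randrange(len(candidates))])
    return isolates, edges

isolates, edges = draw_planted_graph(seed=1)
```

Repeating this draw 1,000 times, as the lecture describes, produces a family of graphs that are very similar to one another, which is what gets fed to the estimation routine.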

We do this 1,000 times, so we're drawing 1,000 different graphs that are very similar to each other.

And each time, we'll feed it into the program and then ask it to estimate this function, okay?

Â 10:18

So we'll give it a bunch of different graphs that all have roughly 20 isolated nodes, 10 triangles, and 45 links, and see what comes back.

So here are the link parameter estimates.

It often gives us estimates in the range somewhere between 0 and 4, but it sometimes even gives estimates in the negative 3 range, so it's popping all over in terms of the link parameter.

And when you look at the estimates that it's giving back to us, in terms of what the standard errors are, it thinks that it's picking things very accurately.

So part of the way that it picks standard errors is by looking back at the graphs that it wandered through and trying to figure out what the variation in possible parameters could be, given the variation that you could see across different possible graphs, and the noise that would be present if the parameters were in the vicinity that you're looking at.

Okay, so it's using a bootstrap method to basically estimate these confidence intervals.

And when we look at this, it thinks that it's very accurate: the median left and right confidence intervals are right on top of each other.

So it would tell you that it's very sure the parameter should be in a certain range, even though it's popping all over the place.

If we look at the isolates parameter, it has much less variation in it.

So it's always coming out somewhere in the range between 15 and 19, basically.

Â 11:52

Again, it overestimates how accurate it's being, because if the confidence intervals were accurate, they should be capturing the variation we're getting here.

And finally, when we look at the triangle parameter, it ranges somewhere between 0 and about 5 and a half.

And again, sometimes it's very close to 0, sometimes it's close to 5 and a half, so it's popping all over the place.

So it's giving us unstable parameter estimates, and it's coming up with inaccurate standard error estimates.

And one thing we can do then is recreate the network.

So take the parameters that we get out: the exponential random graph model gives us a set of parameters each time we estimated.

So we fed in networks with 20 isolates, 10 triangles, and 45 links, and it spits out some parameters.

And then you can actually allow the program to randomly draw a network using those parameters.

And hopefully, if you randomly draw a network with those parameters, you ought to get one back that has fairly similar characteristics to what you started with.

So when we look, the isolates parameter was pretty tight, so it should, hopefully, give us back something with the right number of isolates.

Â 13:17

But the link parameter was all over the place, the triangle parameter was all over

Â the place, maybe it's not going to be so good at doing that.

Â Well, indeed, if you recreate the number of isolates, most of the time it

Â gives you something in the range of around 20 isolates, which is good.

Â So it's recreating the isolates fairly well.

Â Occasionally, it gets stuck near 50, so it gives you an empty graph back.

Â So isolates are doing okay.

When it does triangles, it does miserably.

So remember, we fed it graphs that pretty much have, on average, 10 triangles, somewhere between 5 and 15 most of the time, so very few triangles.

And when it recreates them, occasionally it gets 10 triangles, but most of the time it's somewhere between 1,000 and 3,500.

So it's packing these networks full of triangles, even though it shouldn't be.

When we do links, something similar is happening.

The actual number of links it should be generating is about 45, and yet it either gets very close to 0, or somewhere between 150 and 500.

So it's having problems estimating these parameters.

And then when you try to recreate and draw random networks from it, the networks that it gives back to us look nothing like the ones we fed in, okay?

And that shouldn't be surprising, given the theorem by Bhamidi et al. that said there are going to be difficulties in actually wandering through and estimating accurately what that denominator is in the exponential random graph model.

So there are some fundamental difficulties here.

So MCMC techniques can be inaccurate here.

And so we're facing this problem of how we compute parameters.

And there's also a question about, even if we could do the computations, are we confident that, for a reasonable class of models, if we had large numbers of nodes, these parameters would be accurate?

So there's a computational issue, and then there's also just an open question of how consistent these estimators are.

Â 15:18

And it's also important for us to be able to generate networks back.

So suppose we want to ask counterfactual questions: what would happen if we had a program to encourage certain kinds of ties to form?

Say we want to subsidize research and development, so we put in subsidies to try and create research parks.

That's going to create certain kinds of ties in a research and development network.

Â