0:23

And multiplied by some parameters.

And in order to make this a probability,

we have to sum across what this would look like for all possible graphs.
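The formulation being described has the standard ERGM form (a sketch; the symbols $s_k$ for the statistics and $\beta_k$ for the parameters are my labels, not notation shown in the lecture):

```latex
P(g) \;=\; \frac{\exp\!\Big(\sum_k \beta_k\, s_k(g)\Big)}{\sum_{g'} \exp\!\Big(\sum_k \beta_k\, s_k(g')\Big)}
```

where the sum in the denominator runs over all possible graphs $g'$ on the same set of nodes.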

The power of such models is that

we can put pretty much anything we want into those statistics.

So if you wanted to keep track of links and

triangles, you could make these based on covariates.

So as we talked about in the block models, we could have links among blue nodes.

And links among green nodes, links between blue and green nodes.

Triangles involving blue, green, and yellow nodes, etc.

So you could have all kinds of different statistics and keep track of those.

And the idea, then, is that using this kind of model would

allow us to test to see whether certain things are going to be significant or not.

So are the parameters non-zero, are they positive, are they negative,

what do they look like?

That gives us some idea of how these different things matter in

the formation of a graph.

And the difficulty, the challenge,

in these things is: how are we going to estimate them?

And so

let me just start with an example of an estimation of one of these things.

So if you remember our Florentine families example.

So there's a paper by Robins, Pattison, Kalish, and Lusher in 2007.

They're fitting an exponential random graph model to Padgett and

Ansell's Florentine data.

And in particular,

they're going to look at the business ties between the major 16 families.

And what they're going to do is try to see whether,

and to what extent, this depends on links.

Two-stars, so two individuals connected to a given individual; so we've got links,

two-stars, three-stars, and triangles, okay?

So these are the different things, we can count up

how many of each one of these different shapes is present in a network, and

see to what extent each one matters in the determination of a graph.
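Counting these shapes is straightforward. Here is a minimal Python sketch on a toy edge list (not the Florentine data): a k-star at a node is a choice of k of its neighbors, and a triangle is a mutually linked triple.

```python
from itertools import combinations
from math import comb

# A small undirected toy graph as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Build adjacency sets.
nodes = {u for e in edges for u in e}
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Count the four statistics the lecture lists.
n_links = len(edges)
n_two_stars = sum(comb(len(adj[v]), 2) for v in nodes)    # pairs of neighbors at a node
n_three_stars = sum(comb(len(adj[v]), 3) for v in nodes)  # triples of neighbors at a node
n_triangles = sum(1 for a, b, c in combinations(sorted(nodes), 3)
                  if b in adj[a] and c in adj[a] and c in adj[b])

print(n_links, n_two_stars, n_three_stars, n_triangles)   # prints: 5 6 1 1
```

These counts are exactly the statistics that get multiplied by the parameters in the model.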

So we've got the probability of a graph depending on each one of these things.

And then we can ask, what do these different parameters look like?

So what do the parameters look like on these?

2:24

Okay, so this is the network, in this case it's a very small set of nodes,

small set of links.

So we see a few triangles, we see some links,

we see some two-stars, three-stars, and so forth.

And we can go ahead and estimate this.

And here you can use programs; we'll do one of these in a problem set.

You can use statnet, for instance, which is a program package in R.

And you stick in your dataset and

end up just directly trying to fit an exponential random graph model.

It asks which kinds of things you want to throw in.

And then it will spit out an answer.

So in this case, the software will go ahead and estimate things,

and we'll talk about how it's done in a moment.

And what does it give us?

It gives us, on links, a parameter of -4.27, with a standard error of about 1.13.

So, this is statistically significantly negative.

So that's a network that doesn't like links.

So if we go back to this, it's a fairly sparse network, so networks that have

fewer links are going to be more probable under this model,

with the negative parameter there.

Networks that have lots of links would be less probable.

So this is a fairly sparse network, so

it's not surprising that it gets a negative parameter here.
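One way to read the magnitude (a back-of-the-envelope reading, not a calculation from the lecture): holding all other statistics fixed, adding one link multiplies the probability of the graph by $e^{\beta_{\text{links}}}$, so

```latex
\frac{P(g \cup \{ij\})}{P(g)} \;=\; e^{\beta_{\text{links}}} \;=\; e^{-4.27} \;\approx\; 0.014
```

ignoring any two-stars, three-stars, or triangles the extra link would also create.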

It likes two-stars: 1.09.

It dislikes three-stars: -0.67. And it likes triangles: 1.32, okay?

So we end up with some different parameters.

Now exactly how we interpret those is a little complicated.

It's complicated by two facts. One,

we're not quite sure how these different things combine.

So links are going to be inside two-stars, inside three-stars, and so forth.

Two, we haven't given ourselves any theory of exactly

why the business dealings should have one type of structure or another.

So going at this with a social theory would then allow us to make sense of

these, and you can go to the original article for more discussion of that.

But what I want to do is talk a little bit about how these kinds of things

are estimated and then what we should take away from those estimates.

So what's the challenge in estimating an exponential random graph model?

4:30

So we put down our formulation, we've got a bunch of statistics, we want to see if

they matter and to what extent they matter positively or negatively.

And in order to make this a probability,

we had to sum over all the possible graphs of what this could be, okay?

So the idea is, how likely graph g is depends on comparing it to how likely all

the other graphs would be in order to figure out why this one got picked out and

not any other one, okay?

So in order to go ahead and estimate this to figure out, for instance,

if we want to do what's known as maximum likelihood estimation,

what we would want to do is let's pick the parameters that actually make the given

graph that we saw as probable as possible, okay?

So which parameters would make this probability as high as possible, okay?

Well, the difficulty is, in order to figure out which parameters maximize this

function, we have to be able to do this sum in the denominator, right?

And that's going to be tricky.
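In symbols (a sketch; $s_k(g)$ is the $k$-th statistic of the observed graph $g$, and $\beta_k$ its parameter): maximum likelihood picks the $\beta$ maximizing the log-likelihood

```latex
\ell(\beta) \;=\; \sum_k \beta_k\, s_k(g) \;-\; \log \sum_{g'} \exp\!\Big(\sum_k \beta_k\, s_k(g')\Big)
```

and it's the second term, the log of the sum over all possible graphs, that's the problem.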

Why is it going to be tricky?

There are a lot of graphs to sum over.

And so, a technique, instead of actually doing an explicit calculation here,

would be to try to estimate this instead.

And we'll estimate this. The standard technique is using MCMC,

5:47

Markov chain Monte Carlo techniques for estimation.

And this was popularized by Tom Snijders and

Mark Handcock, who made some breakthroughs that allowed this to be applied to

exponential random graph models, and that's what underlies the software.

So instead of summing over all g primes, what we'll do is we'll sample some g

primes and then try and extrapolate and estimate that denominator based on that.

The program's a little more complicated, but

essentially what it's doing is walking the space and trying to

pick out different g primes and then based on that it can get a relative likelihood.

And then figure out, by searching across different parameters, which parameters

give the highest likelihood of seeing the data we actually saw.
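A minimal sketch of how such a sampler can walk the space of graphs (this is Glauber dynamics for a toy link-and-triangle model; the 20-node size and parameter values are illustrative, not the actual software's internals):

```python
import random
from math import exp

def glauber_step(adj, nodes, beta_link, beta_tri):
    """Resample one randomly chosen pair (i, j), holding the rest fixed."""
    i, j = random.sample(nodes, 2)
    # Change in statistics if the link ij were present vs. absent:
    d_links = 1
    d_tris = len(adj[i] & adj[j])  # triangles the link would close
    # Conditional probability the link is on, given everything else.
    w = exp(beta_link * d_links + beta_tri * d_tris)
    if random.random() < w / (1.0 + w):
        adj[i].add(j); adj[j].add(i)
    else:
        adj[i].discard(j); adj[j].discard(i)

random.seed(0)
n = 20
nodes = list(range(n))
adj = {v: set() for v in nodes}          # start from the empty graph
for _ in range(20000):
    glauber_step(adj, nodes, beta_link=-2.0, beta_tri=0.2)
n_links = sum(len(neigh) for neigh in adj.values()) // 2
```

Averaging statistics over many graphs sampled this way is what lets the software approximate the intractable denominator and then search over parameters.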

So what's the difficulty of this?

Well, remember, if we had just 30 nodes in our graph,

there are 2 to the 435th possible networks.

It's estimated that there are fewer than 2 to the 258th atoms in the universe.

So we're way above the number of atoms in the universe.

So it's going to be impossible for us to actually look at that space.

And even pick a small subset of that space.

So if we're just randomly picking some g primes in the denominator,

the chance that we're actually picking a representative sample is tiny.

Because unless we have an algorithm that really walks the space very well,

given how many different networks there are, it's going to cause difficulty, okay?

And in particular, there is a recent theorem by Bhamidi, Bresler and Sly.

There are related theorems elsewhere in the statistics literature.

And what it shows is that if you've got a dense enough

exponential random graph model, this type of MCMC technique, in particular

using what is known as Glauber dynamics, which is a form of Gibbs sampling,

is going to give accurate estimates

7:41

only if you're able to look at an exponential number of graphs,

unless you have approximately independent links, okay?

So that's a mouthful, but what it's saying is: if you want to be able to do

this with less than an exponential number of graphs,

the network's going to have to be one that has almost independent links.

And of course the reason we went to exponential random graph models is because

we didn't want to have independent links.

We wanted to allow for triangles, we wanted to allow for

different kinds of stars to be more likely than one would see at random.

And so, exponential random graph models that are really interesting

are going to be ones where we do have dependencies.

And those are going to be ones which, according to this theorem,

are not going to be estimable in less than exponential time, okay?

Now this applies for dense enough graphs but we can also do

some simulations to see whether there's going to be problems with sparser ones.

Let's turn to a simulation next.

So this is actually a simulation that comes out of some work I have been doing,

and here what we're doing is looking at the probability of a graph.

And let's keep things simple, we're going to look at links,

triangles and isolates.

And part of the reason we're going to look at isolates, is when you look at graphs

you often find more isolated nodes than you would find at random.

And so in order to really estimate things well this is something you often have to

take into account.

9:07

So let's look at 50 nodes.

And we're going to fit a very simple model.

This is an exponential random graph model where we keep track of how many isolated

nodes there are, how many links there are, and how many triangles, okay?

So a very simple model.

And what we're going to do is we're actually going to draw a bunch of graphs.

So basically, we're going to

feed into the program a bunch of networks that have roughly 20 isolated nodes.

9:34

So, each node is isolated with a probability of about 0.4.

We'll feed 10 triangles in, so we'll put 10 triangles down at random,

45 links down at random.

So, we're going to draw links so that on an average you get 45 links.

We'll draw triangles so, on average, you get 10 triangles.

And we drew these by picking binomials, so we'll just do this at random.

And then 20 isolated nodes, okay?

So we'll start by picking 20 isolated nodes,

on what remains we'll put down a series of triangles, and

then on what remains we'll put down a series of links, okay?
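The drawing procedure might look roughly like this (a hypothetical reconstruction, not the actual simulation code; the lecture draws the counts from binomials, while here the expected counts are hard-coded for clarity):

```python
import random

random.seed(7)
n = 50
nodes = list(range(n))

# Step 1: each node is isolated with probability 0.4 (roughly 20 of the 50).
isolates = {v for v in nodes if random.random() < 0.4}
active = [v for v in nodes if v not in isolates]

edges = set()

def add_link(u, v):
    edges.add((min(u, v), max(u, v)))

# Step 2: put down roughly 10 triangles at random on what remains.
for _ in range(10):
    a, b, c = random.sample(active, 3)
    add_link(a, b); add_link(b, c); add_link(a, c)

# Step 3: put down roughly 45 links at random on what remains.
for _ in range(45):
    u, v = random.sample(active, 2)
    add_link(u, v)
```

Repeating this 1,000 times with fresh random draws gives the family of similar graphs that gets fed to the estimator.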

We do this 1,000 times, so

we're drawing 1,000 different graphs that are very similar to each other.

And each time we'll feed it into the program and

then ask it to estimate this function, okay?

10:18

So we'll give it just a bunch of different graphs that all have roughly 20 isolated

nodes, 10 triangles, 45 links and see what comes back.

So here's the link parameter estimates.

So it gives us estimates, often, in the range somewhere between 0 and 4.

But actually it sometimes even gives estimates in the negative 3 range, so

it's popping all over in terms of the link parameter.

And when you look at the estimates that it's giving back to us, in terms

of what our standard errors are, it thinks that it's picking things very accurately.

So part of the way that it picks standard errors is by looking back at

the graphs that it wandered through and trying to figure out what

the variation in possible parameters could be,

given the variation that you could see across different possible graphs,

and the noise that would be present if the parameters were

in the vicinity that you're looking at.

Okay, so

it's using a bootstrap method to basically estimate these confidence intervals.

And when we look at this, it thinks that it's very accurate.

So the median left and right confidence intervals are right on top of each other.

So it would tell you that it's very sure that the thing should be in a certain

range even though it's popping all over the place.

If we look at the isolates parameter,

it has much less variation in it.

So it's always coming out somewhere in the range between 15 and 19, basically.

11:52

Again, it overestimates how accurate it's being, because if the confidence intervals

were accurate, they should be capturing the variation we're getting here.

And finally, then, when we look at the triangle parameter,

the triangle parameter ranges somewhere between 0 and about 5 and a half.

And again, we see that sometimes it's very close to 0,

sometimes it's close to 5 and a half, so it's popping all over the place.

So it's giving us unstable parameter estimates, and

it's coming up with inaccurate standard error estimates.

And one thing we can do is then recreate the network.

So take the parameters that we get out; the exponential random graph

model gives us a set of parameters each time we estimated.

So we feed in the network with 20 isolates,

10 triangles, and 45 links, and it spits out some parameters.

And then you can actually allow the program to randomly draw a network

using those parameters.

And, hopefully, if you randomly draw a network with those parameters, you ought

to get one back that has fairly similar characteristics to what you started with.

So when we look, the isolates parameter was pretty tight; it should, hopefully,

give us back something with the right number of isolates.

13:17

But the link parameter was all over the place, the triangle parameter was all over

the place, maybe it's not going to be so good at doing that.

Well, indeed, if you recreate the number of isolates, most of the time it

gives you something in the range of around 20 isolates, which is good.

So it's recreating the isolates fairly well.

Occasionally, it gets stuck near 50, so it gives you an empty graph back.

So isolates are doing okay.

When it does triangles, it does miserably.

So remember, we fed it graphs that pretty much have, on average, 10 triangles.

Somewhere between 5 and 15 most of the time, so very few triangles.

And when it recreates them, occasionally it gets 10 triangles, but

most of the time it's somewhere between 1,000 and 3,500.

So it's packing these networks full of triangles, even though it shouldn't be.

When we do links, something similar is happening.

So the actual number of links it should be generating is about 45, and yet

it either gets very close to 0, or somewhere between 150 and 500.

So it's having problems estimating these parameters.

And then when you try and recreate and draw random networks from it, the networks

that it's giving back to us look nothing like the ones we fed in, okay.

And that shouldn't be surprising given the theorem that we had by Bhamidi

et al. that said there are going to be difficulties

in actually wandering through and estimating accurately what that

denominator is in the exponential random graph model.

So there's some fundamental difficulties here.

So MCMC techniques can be inaccurate here.

And so we're facing this problem of how we compute parameters.

And there's also a question: even if we could do the computations,

are we confident that, for a reasonable class of models

with large numbers of nodes, these parameter estimates would be accurate?

So there's a computational issue, and then there's also just an open question of

how consistent these estimators are.

15:18

And it's also important for us to be able to generate networks back.

So if we want to do questions of counterfactuals, so what would happen if

we had a program to encourage certain kinds of ties to form?

Say we want to subsidize research and development;

we put in subsidies to try to create research parks.

That's going to create certain kinds of ties in a research and

development network.