0:32

So, let's look at what we have. We want to represent a distribution over

trajectories. So the first thing we want to do when

representing a distribution over continuous time is, in most cases though not always, to try and forget that time is actually continuous.

Because continuous quantities are harder to deal with.

So, we're going to discretize time. And specifically, we're going to do that by picking a particular time granularity delta, which is the granularity at which we're going to measure time.

Now, in many cases, this is something that is already given to us by the

granularity of our sensor. So, in many cases, for example, we have a

video or a robot, there is a certain time granularity at which we obtain

measurements and so that's usually the granularity that we'll pick.

But in other cases, we might want to have a different granularity,

so there is a choice here. So, here's our time granularity, for

example. And now, we have a set of template random

variables, X of t. And X of t denotes the value of a particular variable X: X is a template variable, and X of t is a particular instantiation of that variable at the time point t times delta, so that we have

multiple copies, one for each time point. Now, here's some notation that we're

going to end up using later on. So, let's introduce it.

Where X of t denotes the variable X at time t, X of t to t prime denotes the set of variables between t and t prime. So, a discrete set, in this case, because

we've discretized time. So, a finite set of random variables that

spans these two time points, inclusive. Now, our goal is that we would like to be

able to have a concise representation that allows us to represent this

probability distribution over the trajectory,

over a trajectory of the system of any duration.

So, we want to start at a particular time point.

Usually, this is going to be zero. And then ask, what is the probability

distribution over trajectories of arbitrary length?

So, how do we represent what is a, first of all, an infinite family of probability

distributions because you could look at trajectories of duration 2, 5, 10, a

million. So, that's an infinite family of

distributions. And each of these is a distribution over

an unbounded number of random variables. Because if you have a distribution over a

trajectory of length a million, you have to represent a million-dimensional

distribution. So, how do we compactify that, how do we

make that a much more concise representation?

So, there are different pieces to this. The first of those is what's typically

called the Markov assumption. And the Markov assumption is

effectively a type of conditional independence assumption.

So, it's the same building block that we use to compactify general purpose

graphical models, we're going to use here in the context of time course data.

So here, we're looking at the probability of the set of variables spanning the time from zero all the way to capital T. Now, I haven't made any assumptions yet in

this statement so I'm just writing this down.

I'm re-expressing it in terms of the chain rule for probabilities. This is not the chain rule for Bayesian networks; it's just the chain rule for probabilities.

And the chain rule for probabilities in this context says that the

probability is equal to the probability of X at time zero times the probability

of each consecutive time point, t + 1, given, so this is the state at t + 1,

given the state at all previous time points, zero up to t.

So, this is not in any way an assumption. This is just a way of re-expressing this

probability distribution in the way that time flows forward.

4:46

But it's not an assumption. You can represent any probability

distribution over these random variables in this way.

But now, we're going to add this assumption.

And this is an assumption. This is an independence assumption.

And this independence assumption tells me that X of t plus one,

that is the state of time t + 1, the next step.

So, this is the next step, is independent of the past,

given the present. So, this is a forgetting assumption.

Once you know the current state, you don't care anymore about your past,

okay? If you do that, we can now go back to

this chain rule over here and simplify it, because whereas before, we

conditioned on X from time zero up to time t, now everything up to t minus 1 is conditionally independent of X at t plus one, given X of t,

which means that I've allowed myself to keep X of t as the only thing that I'm conditioning on in order to determine the probability distribution of X at t plus one. So, to what extent is this assumption

warranted? So, is this true?

And let's take as an example, X equals the location or pose of a robot or an

object that's moving. Is it the case that the location of the

robot at t plus one. So, L of t plus one plus one is

independent, of say, L of t minus one, to simplify our

lives, given L of t.

Is this a reasonable assumption? Well, in most cases, probably not.
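To make the simplification concrete, here is a minimal numeric sketch of the chain rule under the Markov assumption. The two-state system and its tables below are invented for illustration; they are not from the lecture.

```python
# Toy two-state system; p0 and T are made-up numbers for illustration.
p0 = [0.6, 0.4]            # P(X^(0))
T = [[0.9, 0.1],           # T[i][j] = P(X^(t+1) = j | X^(t) = i)
     [0.3, 0.7]]

def trajectory_prob(states):
    """Chain rule under the Markov assumption:
    P(x0, ..., xT) = P(x0) * product over t of P(x_{t+1} | x_t)."""
    p = p0[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= T[prev][nxt]  # each factor conditions only on the present state
    return p
```

Without the assumption, each factor would have to condition on the entire history X of zero up to t; the Markov assumption is exactly what lets each factor look only at the previous state.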

8:38

So now, we only need a probability of X of t + 1, given X of t. But it's still an unbounded number of

conditional probabilities. Now, at least each of them is compact, but there's still a separate probabilistic model for every t.

And this is where we're going to end up with a template-based model.

We're going to stipulate that there is a probabilistic model, P of X prime given X, where X prime denotes the next time point and

X denotes the current time point. And we're going to assume that that model

is replicated for every single time point.

That is, when you're moving from time zero to time one, you use this model.

When you're moving from time one to time two, you also use this model.

And that assumption, for obvious reasons, is called time invariance.

Because it assumes that the dynamics of the system, not the actual position of

the robot, but rather the dynamics that move it from

state to state, don't depend on the current time

point t. And once again, this is an assumption,

and it's an assumption that's warranted in certain cases and not in others.
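As a small illustration of what time invariance buys us, here is a sketch with a made-up two-state traffic model: the transition probability is a function of the states only, never of t.

```python
# Made-up transition table for a hypothetical two-state traffic model.
TRANS = {("light", "light"): 0.8, ("light", "heavy"): 0.2,
         ("heavy", "light"): 0.4, ("heavy", "heavy"): 0.6}

def step_prob(x, x_next, t):
    """P(X^(t+1) = x_next | X^(t) = x); under time invariance, t is ignored."""
    return TRANS[(x, x_next)]

# Moving from time 0 to 1 uses the identical model as moving from time 7 to 8.
same = step_prob("light", "heavy", t=0) == step_prob("light", "heavy", t=7)
```

If we enriched the model with, say, a time-of-day variable, that variable would become part of the state, and the table itself could remain time-invariant.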

So, let's imagine that this represents now the traffic on some road.

Well, do the dynamics of that traffic depend on, say, the current time point of the system?

On most roads, the answer is probably yes.

It might depend on the time of day, on the day of the week,

10:17

on whether there is a big football match, on all sorts of things that might affect

the dynamics of traffic. The point being that just like in the

previous example, we can correct inaccuracies in our assumption by

enriching the model. So, once again, we can enrich the model

by including these variables in it. And once we have that, then the, again,

the model becomes a much better reflection of reality.

So now, how do we represent this probabilistic model in the context of a graphical model like we had before? So, let's now assume that our state

description is composed of a set of random variables.

And so, we have a little baby traffic system where we

have the weather at the current time point, the location of, say, a vehicle,

the velocity of the vehicle. We also have a sensor, whose observation

we get at each of those time points. And the sensor may or may not be failing

at the current time point. And what we've done here is we've encoded

the probabilistic model of the next state.

So, W prime, V prime, L prime, F prime, and O prime, given the previous state.

So, given W, V, L, and F. Why is O not here on the right-hand side?

It's not here on the right-hand side because it doesn't affect any of the next

state variables. So, it would be kind of hanging down here

if we included it. But since it doesn't affect anything, we choose not to represent it.

So, this model represents a conditional distribution.

Now, we have a little network fragment. So, this is a network fragment.

And it doesn't represent a joint distribution,

it represents a conditional distribution. The conditional distribution of the t + 1

given t. But in order to represent that, we still use the same tools that we have in the context of graphical

models. And so, we can write that as the same

kind of chain rule that we used before. So, this would be the probability of W

prime, given W, based on this edge over here,

times the probability of V prime, the velocity.

So, the first one says that the weather at time

t plus one depends on the weather at time t.

The second one says that the velocity at time t plus one depends on the weather at time t and the velocity at time t, which indicates a certain persistence in the velocity, as well as the fact that if it's raining, you might slip sideways, so the velocity might change.

Also if you're careful, you might slow down if it's raining.

And so again, there might be an effect of the weather on the velocity.

The probability of the location at time t + 1, given the location at time t and the velocity at time t. The probability of a sensor failure at time t + 1, given the failure at the previous time and the weather, which indicates that, once the sensor has failed, it's probably more likely to stay failed, but maybe rain can make the sensor behave badly. And then, finally, the probability of the observation at time t + 1, given the location at time t + 1 and the failure at time t + 1. So, there are several important things to

note about this diagram that are worth highlighting.

First of all, we have dependencies both within and across time.

So here, we have a dependency that goes from t to t plus one.

And here, we have a dependency that is within t plus one alone.

What induces us to make a modeling choice like this go one way versus the other? The assumption here is that this is a fairly fast dependency, so that the observation is relatively instantaneous

compared to our time granularity. And so, we don't want that to go

across time but rather we want it to be within a time slice because it's a better

reflection for which variable is it that actually influences the observation.

Is it the current location or the previous location?
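The chain-rule factorization of this fragment can be sketched numerically. The CPD tables below are invented, and every variable is made binary purely to keep the example small; the structure follows the edges described above (W to W prime; W, V to V prime; L, V to L prime; W, F to F prime; L prime, F prime to O prime).

```python
import itertools

# Made-up CPDs for the network fragment, each giving P(var' = 1 | parents).
P_W = {0: 0.7, 1: 0.6}                                        # P(W'=1 | W)
P_V = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.2, (1, 1): 0.6}    # P(V'=1 | W, V)
P_L = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.2, (1, 1): 0.8}    # P(L'=1 | L, V)
P_F = {(0, 0): 0.02, (0, 1): 0.95, (1, 0): 0.2, (1, 1): 0.97} # P(F'=1 | W, F)
P_O = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.5}    # P(O'=1 | L', F')

def bern(p, v):
    """P(var = v) when P(var = 1) = p."""
    return p if v == 1 else 1 - p

def slice_prob(w2, v2, l2, f2, o2, w, v, l, f):
    """Chain rule for the fragment:
    P(W',V',L',F',O' | W,V,L,F) = P(W'|W) P(V'|W,V) P(L'|L,V) P(F'|W,F) P(O'|L',F')."""
    return (bern(P_W[w], w2) * bern(P_V[(w, v)], v2) * bern(P_L[(l, v)], l2)
            * bern(P_F[(w, f)], f2) * bern(P_O[(l2, f2)], o2))

# Sanity check: the conditional distribution sums to 1 over all next-slice states.
total = sum(slice_prob(*nxt, 0, 1, 0, 0)
            for nxt in itertools.product((0, 1), repeat=5))
```

Note that only the primed variables carry CPDs here, which is exactly why the fragment defines a conditional, rather than a joint, distribution.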

So let's give these kinds of edges names. These are called intra-time-slice edges, and these are called inter-time-slice, or between-time-slice, edges. And the model can include a combination

of both of these. A particular type of inter-time-slice edge that's worth highlighting

specifically are edges that go from a variable at one time point to the value

of that variable at the next time point. These are often called persistence edges

because they indicate the tendency of a variable to persist in

state from one time point to another. Finally, let's just go back and look at

the parameterization that we have in this model. So, what CPDs did we actually need

to include in this model? And we can see that we have CPDs for the

variables on the right-hand side, the prime variables.

But there are no CPDs for the variables that are unprimed,

the variables on the left. And this is because the model doesn't

actually try and represent the distribution over W, V, L, and F.

It doesn't try and do that. It tries to represent the probability of

the next time slice, given the previous one.

So, as we can see, this graphical model only has CPDs for a subset of the

variables in it. The ones that represent the next time

point. So, that represents the transition

dynamics. If we want to represent the probability

distribution over an entire system, we also need to provide a distribution over

the initial state. And this is just the standard generic

Bayesian network, which represents the probability over the state at time zero

using some appropriate chain rule. So, nothing very fancy here.

16:56

With those two pieces, we can now represent probability distributions over

arbitrarily long trajectories. So, we represent this by taking, for time slice zero, a copy of the time zero Bayesian network, which represents the probability distribution over the time zero variables. And now, we have a bunch

of copies that represent the probability distribution at time one, given time

zero. And here, we have another copy of exactly

the same set of parameters that represents time two given time one.

And we can continue copying this indefinitely and each copy gives us the

probability distribution of the next time slice given the one that we just had and

so we can construct an arbitrarily long Bayesian network.
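The copying process can be sketched as a sampler for a toy binary chain; both tables are invented for illustration. Each loop iteration applies one copy of the same transition model, so the ground network can be unrolled to any length.

```python
import random

P0 = [0.6, 0.4]            # made-up P(X^(0))
TRANS = [[0.9, 0.1],       # made-up P(X^(t+1) | X^(t)), the same table at every t
         [0.3, 0.7]]

def sample_trajectory(T):
    """Sample X^(0:T) from the unrolled (ground) network."""
    x = 0 if random.random() < P0[0] else 1   # time-zero network
    traj = [x]
    for _ in range(T):                        # one copy of the transition model per step
        x = 0 if random.random() < TRANS[x][0] else 1
        traj.append(x)
    return traj

traj = sample_trajectory(10)   # 11 states: time 0 through time 10
```

The model size is fixed no matter how large T gets, which is the compactness we were after.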

So, to make this definition slightly more formal, we define the notion of a

two-time slice Bayesian network, also known as a 2TBN.

And the 2TBN over a set of template variables X1 up to Xn, is specified as a

Bayesian network fragment along exactly the same lines that we used in the

example. The nodes include two copies: the next-time-state variables, X1 prime up to Xn prime, and some subset of the variables X1 up to Xn, the time t variables that directly affect the state at time t plus one, okay?

18:44

And because we want this to represent a conditional probability distribution,

only the time t + 1 nodes have parents and a CPD.

Because we don't really want to model the distribution over the variables of time

t. And the 2TBN defines a conditional

distribution using the chain rule. You can see it looks exactly like the chain rule: the probability of X prime given X is the product, over each variable at time t plus one, of the probability of that variable given its parents, which may be at time t plus one, time t, or a combination of both. A dynamic Bayesian network is now

basically defined by a 2TBN, which we just defined, and a Bayesian

network over times zero. So, this is the dynamics and this is the

initial state. And we can use that to define probability distributions over arbitrarily long trajectories using

what's called the unrolled network or also called the ground network.

And this is exactly as in the example that I showed: the dependency model for time zero is copied from the Bayes net for time zero, and the transitions are copied from the 2TBN, with the conditional probabilities for the transitions. So, before we conclude this lecture,

let's look at an example of a dynamic Bayesian network that is a more realistic

one than the simple examples that we've shown before.

This is a network that was actually designed for tracking vehicles in a

traffic situation. And so, we can see that there are

multiple variables here that represent both the position and velocity of the

vehicle in an absolute sense, for example, Xdot and Ydot are the velocities, as well as various more semantic notions of location, like

whether you're in the lane. There are contextual variables such as

left clear and right clear, the engine status, as well as

what the driver is currently doing, for example the forward action and the

lateral action. We can see that there are persistence

edges that denote the persistence of various forms of the state from time t to

time t plus one, as well as a variety of these

intermediate variables over here that allow us to represent the probability

distribution in a more compact way by incorporating variables that do not

persist, or at least, in this simplified model, do not persist.

And finally, we see that there are a large number of

sensor observations, such as turn signal, whether the car is clear on the right and

on the left, or appears to be clear on the right and left, and so on.

So, this is a much more realistic model of how traffic evolves than the

simplified one that we saw before. To summarize, dynamic Bayesian networks

provide us with a language for encoding structured distributions over time.

And by making the assumptions of Markovian evolution as well as time invariance, we can use a single compact network to encode distributions over arbitrarily long time sequences.