0:32
So, let's look at what we have: we represent a distribution over template
trajectories. The first thing we want to do when
representing a distribution over continuous time is, in most cases (not
always), to try and forget that time is actually continuous,
because continuous quantities are harder to deal with.
So, we're going to discretize time. Specifically, we're going to do that
by picking a particular time granularity, delta,
which is the granularity at which we're going to measure time.
Now, in many cases, this is something that is already given to us by the
granularity of our sensor. In many cases, for example, with a
video or a robot, there is a certain time granularity at which we obtain
measurements, and so that's usually the granularity that we'll pick.
But in other cases, we might want to have a different granularity,
so there is a choice here. So, here's our time granularity, for
example. And now, we have a set of template random
variables, X of t. X of t denotes the value of a
particular variable X, X being a template variable and X of t being a
particular instantiation of that variable at the time point t times delta, so that we have
multiple copies, one for each time point.
Now, here's some notation that we're
going to end up using later on, so let's introduce it:
X^(t) denotes the variable X at time t, and X^(t:t') denotes the set of
variables between t and t'. This is a discrete set, in this case, because
we've discretized time: a finite set of random variables that
spans these two time points, inclusive.
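In symbols, the notation amounts to this:

```latex
% X^{(t)}: the instantiation of template variable X at time point t\Delta
% X^{(t:t')}: the set of variables spanning time points t through t', inclusive
X^{(t:t')} = \{ X^{(t)}, X^{(t+1)}, \ldots, X^{(t')} \}, \qquad t \le t'
```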
Now, our goal is to have a concise
representation that allows us to represent the
probability distribution over a trajectory of the system of any duration.
So, we want to start at a particular time point,
usually zero, and ask: what is the probability
distribution over trajectories of arbitrary length?
So, how do we represent what is, first of all, an infinite family of probability
distributions? Because you could look at trajectories of duration 2, 5, 10, a
million. And each of these is a distribution over
an unbounded number of random variables: if you have a distribution over a
trajectory of length a million, you have to represent a million-dimensional
distribution. So, how do we compactify that, how do we
make that a much more concise representation?
So, there are different pieces to this. The first of those is what's typically
called the Markov assumption, which is
effectively a type of conditional independence assumption.
It's the same building block that we used to compactify general-purpose
graphical models; we're going to use it here in the context of time-course data.
So here, we're writing the probability of
the set of variables spanning the time
from zero all the way to capital T. I haven't made any assumptions yet in
this statement;
I'm just re-expressing it in terms of the chain rule for probabilities.
This is not the chain rule for Bayesian networks;
it is just the chain rule for probabilities.
And the chain rule for probabilities in this context says that that
probability is equal to the probability of X at time zero, times the probability
of each consecutive time point t + 1 (so this is the state at t + 1),
given the state at all previous time points, zero up to t.
So, this is not in any way an assumption. This is just a way of re-expressing this
probability distribution in the way that time flows forward.
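Written out, the chain-rule re-expression is:

```latex
P\big(X^{(0:T)}\big) = P\big(X^{(0)}\big) \prod_{t=0}^{T-1} P\big(X^{(t+1)} \mid X^{(0:t)}\big)
```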
4:46
But it's not an assumption; you can represent any probability
distribution over these random variables in this way.
But now, we're going to add this assumption.
And this is an assumption, an independence assumption.
This independence assumption tells me that X at t + 1,
that is, the state at time t + 1, the next step,
is independent of the past,
given the present. So, this is a forgetting assumption:
once you know the current state, you don't care anymore about your past,
okay? If you do that, we can now go back to
this chain rule over here and simplify it. Whereas before, we
conditioned on X from time zero to time t,
now everything up to t - 1 is conditionally independent of X at t + 1, given X at t,
which means that I've allowed myself to keep X at t as the only thing that
I'm conditioning on in order to determine the probability
distribution of X at t + 1.
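In symbols, the Markov assumption and the simplification it licenses are:

```latex
\big(X^{(t+1)} \perp X^{(0:t-1)} \mid X^{(t)}\big)
\;\;\Longrightarrow\;\;
P\big(X^{(0:T)}\big) = P\big(X^{(0)}\big) \prod_{t=0}^{T-1} P\big(X^{(t+1)} \mid X^{(t)}\big)
```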
So, to what extent is this assumption warranted? Is this true?
And let's take as an example X being the location or pose of a robot or an
object that's moving. Is it the case that the location of the
robot at t + 1,
L at t + 1, is
independent of, say, L at t - 1 (to simplify our
lives), given L at t?
Is this a reasonable assumption? Well, in most cases, probably not:
the next location depends not just on where the robot is now but also,
for example, on its velocity, which earlier locations would help us infer.
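One way to state the standard repair sketched here: enrich the state so that what we condition on carries the relevant history, for example by folding velocity in alongside location (an illustrative choice, not the only one):

```latex
% Enriched state (illustrative): location together with velocity
\tilde{X}^{(t)} = \big(L^{(t)}, V^{(t)}\big),
\qquad
\big(\tilde{X}^{(t+1)} \perp \tilde{X}^{(0:t-1)} \mid \tilde{X}^{(t)}\big)
\;\text{ becomes a much better approximation}
```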
8:38
So now we have a probability of X at t + 1 given X at t, but it's still an unbounded number of
conditional probabilities. At least each of them is now compact,
but there's still a separate probabilistic model for every t.
And this is where we're going to end up with a template-based model.
We're going to stipulate that there is a single probabilistic model, P of X prime given
X, where X prime denotes the next time point and
X denotes the current time point. And we're going to assume that that model
is replicated for every single time point.
That is, when you're moving from time zero to time one, you use this model.
When you're moving from time one to time two, you also use this model.
And that assumption, for obvious reasons, is called time invariance,
because it assumes that the dynamics of the system, not the actual position of
the robot, but rather the dynamics that move it from
state to state, don't depend on the current time
point t. And once again, this is an assumption,
and it's an assumption that's warranted in certain cases and not in others.
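In symbols, time invariance stipulates a single transition model that is reused at every time step:

```latex
P\big(X^{(t+1)} = \xi' \mid X^{(t)} = \xi\big) = P(\xi' \mid \xi) \qquad \text{for all } t
```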
So, let's imagine that this now represents the traffic on some road.
Well, do the dynamics of that traffic depend
on, say, the current time point of the system?
On most roads, the answer is probably yes.
It might depend on the time of day, on the day of the week,
10:17
on whether there is a big football match, on all sorts of things that might affect
the dynamics of traffic. The point being that, just like in the
previous example, we can correct inaccuracies in our assumption by
enriching the model. So, once again, we can enrich the model
by including these variables in it. And once we have that, the
model again becomes a much better reflection of reality.
So now, how do we represent this probabilistic model in the context of
a graphical model like we had before? Let's now assume that our state
description is composed of a set of random variables.
So, we have a little baby traffic system where we
have the weather at the current time point, the location of, say, a vehicle,
and the velocity of the vehicle. We also have a sensor whose observation
we get at each of those time points. And the sensor may or may not be failing
at the current time point. What we've done here is encode
the probabilistic model of the next state:
W prime, V prime, L prime, F prime, and O prime, given the previous state,
that is, given W, V, L, and F. Why is O not here on the right-hand side?
It's not here because it doesn't affect any of the next-state
variables. It would just be hanging down here
if we included it; since it doesn't affect
anything, we choose not to represent it.
So, this model represents a conditional distribution.
We have a little network fragment,
and it doesn't represent a joint distribution;
it represents a conditional distribution: the conditional distribution of time t + 1
given time t. But in order to represent that,
we still use the same tools that we have in the context of graphical
models. And so, we can write that as the same
kind of chain rule that we used before. So, this would be the probability of W
prime given W, based on this edge over here,
times the probability of V prime, the velocity.
The first factor says that the weather at time
t + 1 depends on the weather at time t.
The second says that the velocity at time t + 1 depends on the weather at time
t and the velocity at time t, which indicates a certain persistence in the
velocity, as well as the fact that if it's raining you might
slip sideways, so the velocity might change.
Also, if you're careful, you might slow down if it's raining.
And so again, there might be an effect of the weather on the velocity.
Then we have the probability of the location at time t + 1, given the location at time t and
the velocity at time t; and the probability of a sensor failure at
time t + 1, given the failure at the
previous time and the weather,
which indicates that, once the sensor has
failed, it's probably more likely to stay failed,
but maybe rain can make the sensor behave badly.
And then, finally, the probability of the observation at time t + 1, given the
location at time t + 1 and the failure at time t + 1.
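Collecting the factors just described, the network fragment encodes:

```latex
P(W', V', L', F', O' \mid W, V, L, F) =
P(W' \mid W)\; P(V' \mid W, V)\; P(L' \mid L, V)\; P(F' \mid F, W)\; P(O' \mid L', F')
```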
So, there are several important things to
note about this diagram that are worth
highlighting. First of all, we have dependencies both
within and across time.
So here, we have a dependency that goes from t to t + 1,
and here, we have a dependency that is within time t + 1 alone.
What induces us to make a modeling choice like this, one way
versus the other? The assumption here is that the
observation is relatively instantaneous
compared to our time granularity. And so, we don't want that dependency to go
across time, but rather we want it to be within a time slice, because it's a better
reflection of which variable actually influences the observation:
is it the current location or the previous location?
So, let's give these kinds of edges names.
These are called intra-time-slice edges, and these are called inter-time-slice (or between-
time-slice) edges.
And the model can include a combination
of both of these. A
particular type of inter-time-slice edge that's worth highlighting
specifically is an edge that goes from a variable at one time point to the value
of that variable at the next time point. These are often called persistence edges,
because they indicate the tendency of a variable to persist in
state from one time point to another. Finally, let's go back and look at
the parameterization that we have in this model. What CPDs did we actually need
to include in this model? We can see that we have CPDs for the
variables on the right-hand side, the prime variables,
but there are no CPDs for the variables that are unprimed,
the variables on the left.
And this is because the model doesn't
actually try to represent the distribution over W, V, L, and F.
It doesn't try to do that. It tries to represent the probability of
the next time slice, given the previous one.
So, as we can see, this graphical model only has CPDs for a subset of the
variables in it: the ones that represent the next time
point. So, that represents the transition
dynamics. If we want to represent the probability
distribution over an entire system, we also need to provide a distribution over
the initial state. And this is just a standard, generic
Bayesian network which represents the probability distribution over the state at time zero,
using some appropriate chain rule. So, nothing very fancy here.
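That is, the time-zero network is an ordinary Bayesian network chain rule:

```latex
P\big(X^{(0)}\big) = \prod_{i=1}^{n} P\big(X_i^{(0)} \mid \mathrm{Pa}_{X_i^{(0)}}\big)
```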
16:56
With those two pieces, we can now represent probability distributions over
arbitrarily long trajectories. We represent this by taking, for time
slice zero, a copy of the time-zero Bayesian network, which represents the
probability distribution over the time-zero variables. And then we have a
copy of the transition model that represents the probability distribution at time one, given time
zero. And here, we have another copy of exactly
the same set of parameters that represents time two given time one.
And we can continue copying this indefinitely, and each copy gives us the
probability distribution of the next time slice given the one that we just had, and
so we can construct an arbitrarily long Bayesian network.
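As a minimal sketch of this unrolling (a hypothetical toy model with a single binary state variable, not the traffic network from the lecture), here is how a time-zero distribution plus one time-invariant transition CPD assigns a probability to an arbitrarily long trajectory:

```python
# Hypothetical toy DBN: one binary state variable ("sunny"/"rainy").
# P0 is the time-zero distribution; P_trans is the single transition
# CPD P(X' | X) that gets copied at every time step when we unroll.
P0 = {"sunny": 0.7, "rainy": 0.3}
P_trans = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def trajectory_prob(states):
    """P(X^(0:T)) = P(X^(0)) * prod_t P(X^(t+1) | X^(t))."""
    p = P0[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= P_trans[prev][nxt]  # same CPD at every step: time invariance
    return p

# Works for trajectories of any length, with only two fixed tables of parameters.
print(trajectory_prob(["sunny", "sunny", "rainy"]))  # 0.7 * 0.8 * 0.2 = 0.112
```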
So, to make this definition slightly more formal, we define the notion of a
two-time-slice Bayesian network, also known as a 2TBN.
A 2TBN over a set of template variables X1 up to Xn is specified as a
Bayesian network fragment along exactly the same lines that we used in the
example. The nodes come in two groups:
the next-time-state variables X1 prime up to Xn prime,
and some subset of X1 up to Xn, the time-t
variables that directly affect the state at time t
plus one, okay?
18:44
And because we want this to represent a conditional probability distribution,
only the time t + 1 nodes have parents and a CPD,
because we don't really want to model the distribution over the variables at time
t. And the 2TBN defines a conditional
distribution using the chain rule. You can tell that it looks exactly like the
chain rule: the probability of X prime given X
is the product, over each variable at time t + 1 (only the prime variables
have CPDs), of its probability given its parents, which may be in time t + 1,
in time t, or a combination of both.
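Written out, with Pa denoting a variable's parents in the fragment:

```latex
P(X' \mid X) = \prod_{i=1}^{n} P\big(X_i' \mid \mathrm{Pa}_{X_i'}\big)
```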
A dynamic Bayesian network is now
defined by a 2TBN, which we just defined, and a Bayesian
network over time zero. So, this is the dynamics, and this is the
initial state. And we can use that to define
probability distributions over
arbitrarily long trajectories using
what's called the unrolled network, also called the ground network.
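The unrolled network then defines, for any duration T:

```latex
P\big(X^{(0:T)}\big) = P_{\mathrm{BN}_0}\big(X^{(0)}\big) \prod_{t=0}^{T-1} P_{\mathrm{2TBN}}\big(X^{(t+1)} \mid X^{(t)}\big)
```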
And this works exactly as in the example that I showed:
the dependency model for time zero is copied from the Bayes net for time zero,
and the transition model is copied from the 2TBN, along with the
conditional probabilities for the transitions. So, before we conclude this lecture,
let's look at an example of a dynamic Bayesian network that is a more realistic
one than the simple examples that we've shown before.
This is a network that was actually designed for tracking vehicles in a
traffic situation. We can see that there are
multiple variables here that represent the position and velocity of the
vehicle in an absolute sense; for example, Xdot and Ydot are
the velocities; as well as various more
semantic notions of location, like whether you're in the lane.
There are contextual variables, such as
left clear and right clear, the engine status, as well as
what the driver is currently doing, for example, the forward action and the
lateral action. We can see that there are persistence
edges that denote the persistence of various parts of the state from time t to
time t + 1, as well as a variety of
intermediate variables over here that allow us to represent the probability
distribution in a more compact way by incorporating variables that do not
persist, or at least do not persist in this simplified model.
And finally, we see that there are a
large number of sensor observations, such as the turn signal, and whether the car is clear (or
appears to be clear) on the right and
on the left, and so on.
So, this is a much more realistic model of how traffic evolves than the
simplified one that we saw before. To summarize, dynamic Bayesian networks
provide us with a language for encoding structured distributions over time.
And by making the assumptions of Markovian evolution and time
invariance, we can use a single compact network to encode distributions over
arbitrarily long time sequences.