0:16

Now, it is important to define uncertainty, or entropy,

rigorously, because in information theory, the main thing we're concerned about

is quantifying changes in uncertainty after receiving a message.

So maybe in a card game, you're interested in how much your uncertainty

about what's in another player's hand changes after they play a certain card.

Or maybe you're interested in how much your uncertainty about

who your secret admirer is changes after you read one of their notes.

Or maybe, in a more neuroscience context,

you're interested in seeing how much your uncertainty about

who's speaking to you changes after you hear them say a few words.

Now, in order to discuss this with math,

we need to quantify this uncertainty.

You may have heard of other quantities, like standard deviation or variance, referred to as some kind of uncertainty, and in certain contexts those are very good descriptions of uncertainty.

But today, we are going to think about a new kind of uncertainty and

that uncertainty is called entropy.

So, basically, the entropy of a probability distribution tells us how uncertain we are about what state that distribution will produce.

So let's look at some examples.

To start, let's just think about discrete distributions.

Let's draw two different distributions and

ask, which one has a higher uncertainty?

Which one has a higher entropy?

2:08

Remember, we can think of distributions, probability distributions, as these random

object generators that when you push a button, spit out a random object.

So, let's say this is distribution machine one and

this distribution can spit out four different objects,

(0,0), (0,1), (1,0) and (1,1).

And let's say that the chance that this distribution spits out

(0,0) is the same as the chance that it spits out (0,1),

or (1,0), or (1,1).

So, in other words, the probability of any one of these is 0.25.

So, let's draw that out.

There's our probability, and this axis is what we'll call the state.

We'll say that the object the distribution spits out currently is its state.

So we have four different states: (0,0), (0,1), (1,0) and (1,1).

And the probability for each state is .25,

so that's our first distribution.

Our second distribution is a little bit simpler.

Our second distribution spits out either (0,0) or (1,1).

So it only has two different possible states.

So if we plot those, this looks like this.

And, again, let's say that the states are equally probable.

If they're equally probable, the probability of each one is .5.

So, the question I ask you is if you had to play a game

where you bet on the object that comes out, and

you get a dollar if you get it right and you get nothing if you don't,

4:07

which distribution would you choose to play the game with?

Hopefully you picked distribution number two.

With distribution number two,

if you choose randomly there is a 50% chance you'll get it right.

So on average you'll make 50 cents every time you play the game.

If you had chosen distribution number one,

there are simply more choices to choose from.

So the chance that you get it right is only 0.25.

So on average you'll make a quarter every time you play the game.

4:40

Since, if you were to play the game with distribution one, you would, on average, be less certain about what state comes out, we say that distribution one has a higher entropy.

Since, if you played the game with distribution two, you would be more certain about what state was going to come out, we say that distribution two has a lower entropy.

5:07

So, what exactly was it about distribution one that gave it a higher entropy?

Well, distribution one had more options,

it had more possible states than distribution two.

So if we write out the number of states, how many states did distribution one have?

It had four.

And distribution two had two states.

5:32

So, in a way, the uncertainty is just the number of possible states the distribution has.

And, in fact, that's all entropy does.

All that entropy does is count the number of states, and

we write entropy as H, in information theory.

However, the entropy does not just count the number of states directly,

it actually takes the log of the number of states.

So let's write H = log(# of states).

And we usually use log base 2.

So what's the entropy of the first distribution?

Log 2 of 4 is just 2.

And what's the entropy of the second distribution?

Log 2 of 2 is just one.
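This count-the-states definition is easy to check in code. Here's a minimal Python sketch (the function name uniform_entropy is my own, not something from the lecture):

```python
import math

def uniform_entropy(num_states):
    """Entropy, in bits, of a distribution whose states are all equally probable."""
    return math.log2(num_states)

print(uniform_entropy(4))  # distribution one: 2.0 bits
print(uniform_entropy(2))  # distribution two: 1.0 bits
```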

6:32

So the second distribution has a lower entropy,

it has a smaller number of available states.

So here is our definition: the entropy, H, of a random variable, let's say X, with equally probable states, is the log of the number of states X can take on.

7:19

So every time you push your random number generator button you get a color out.

And we'll call the color X.

So your options are red, green, blue, orange, yellow or violet.

And they're all equally probable with a probability of one-sixth.

How many states are there?

Well, there are only six states.

That's not too hard to fill out.

What is the entropy of this distribution?

7:50

That's just log of the number of states.

Log base 2 of six, that's about 2.6. What if we have a distribution over a three-dimensional vector?

So x1, x2, x3.

And let's put a constraint that says in this vector

8:13

you are always going to have two zeros and one one.

So maybe x1 is one, and x2 and x3 are zero.

And let's say that there is an equal probability that the one is in any one

of those places.

So our three possible states are (1,0,0), (0,1,0), or (0,0,1).

And the probability of each one of those is one-third.

How many states are there?

One, two, three, three states.

What's the entropy, log2(3),

that's about 1.6.

What if our distribution is over three dimensional vectors, where

each element has a 50% chance of being a one and a 50% chance of being a zero?

So the probability that xi, any arbitrary element, is 0 is 0.5.

And the probability that any arbitrary element is 1 is 0.5.

How many states do we have?

There are eight possible states.

You can go ahead and count those up.

The entropy of the distribution over eight equally

probable states is log2(8), which is 3.
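If you'd rather not count those eight states by hand, here's a small Python check using the standard library:

```python
import itertools
import math

# Enumerate all 3-dimensional vectors where each element is 0 or 1.
states = list(itertools.product([0, 1], repeat=3))
print(len(states))             # 8 possible states
print(math.log2(len(states)))  # entropy: 3.0 bits
```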

What if we had the same deal, but with a ten dimensional vector.

9:49

Now, the number of states is 2 to the 10.

So log base 2 of 2 to the 10 is just 10.

What if it's a 100 dimensional vector?

Now there are 2 to the 100 possible states,

so the entropy of that is 100.

And maybe now you can see why we use the logarithm for

entropy rather than just the number of states.

If you wanted to, you could have defined entropy as just the number of states but

that would have made the math a little harder because you would be dealing with

these really big numbers all the time.

Using the logarithm means you don't have to deal with numbers like 2 to the 100; you can deal with just the number 100.
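To see that bookkeeping advantage concretely, here's a quick Python illustration:

```python
import math

# A 100-dimensional binary vector has 2**100 possible states,
# an astronomically large count...
print(2 ** 100)
# ...but the entropy in bits is just the exponent.
print(math.log2(2 ** 100))  # 100.0
```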

Okay, hopefully that is not too confusing.

10:45

The more possible states there are, the larger your entropy,

the larger your uncertainty about what state your distribution will spit out.

But, what if the states are not all equal?

11:21

What if the probability of (0,0) was 0.97, and the probability of any of the other options was 0.01?

In that case, the game suddenly becomes a lot more certain, and it's very easy for

you to just bet on (0,0) every time, and make quite a bit of money.

Therefore when the states do not have equal probabilities the entropy should

decrease.

You should be less uncertain about the distribution.

We need a way to generalize

the definition of entropy to take

into account situations where not

all states are equally probable.

So let's say we have some distribution, maybe there are three different states a, b, and c, and this is the probability.

Maybe a has a probability of 0.5,

b has a probability of 0.3, and

c has a probability of 0.2.

The entropy of this distribution should be less than the entropy

of the distribution where the probability of a, and

the probability of b and the probability of c, were all one-third.

So we need to generalize the formula

H(x) = log(# states).

12:58

And here's the answer. This is the full definition of entropy, the most general definition: minus the sum, over all of the states i, of the probability of each state, x equaling xi, times the log of that probability,

H(x) = -Σi P(x=xi) log2(P(x=xi)).

13:25

This is the general definition of entropy.
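As a sketch, the general definition translates directly into a few lines of Python (the function name entropy, and the choice to skip zero-probability states, are mine):

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i); zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The skewed three-state distribution a, b, c from above:
print(entropy([0.5, 0.3, 0.2]))   # about 1.49 bits
# The uniform three-state distribution, for comparison:
print(entropy([1/3, 1/3, 1/3]))   # about 1.58 bits, strictly larger
```

As expected, the unequal distribution comes out with less entropy than the uniform one.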

Now, just to make sure we're all on the same page, let's remind ourselves that when we calculate the entropy, the entropy is a function of something, right?

13:43

When we calculate the entropy the input to the function

is some probability distribution.

So the input is not just a single number, a single state.

The input is an entire distribution over states.

The output, however, is just a single number that characterizes how uncertain that probability distribution is.

14:15

So remember, it's important to think about what the inputs to functions are and what the outputs of functions are.

However even though the input is the entire probability distribution,

we usually just write H(x), where that random variable x stands in for

the whole probability distribution over x.

14:49

It might feel a little off.

If you're interested in seeing the full derivation,

you can look in the supplementary materials of the course web page.

But for now,

I just want to talk through a bit of the intuition about what's going on.

First of all, since this is a generalization it should work

in the specific case in which all of our probabilities were equal.

15:15

So let's just check and make sure.

So this is the special case that we started with.

So if all probabilities are equal, so let's say we have

N different states, and all of the probabilities are equal.

15:31

That means that P(x=xi) is just 1/N.

So we can plug in 1/N for P(x=xi) in our equation for entropy.

So we have entropy of x is equal to minus,

sum over i, 1/N log 1 over N.

We can bring the 1 over N outside, and the minus inside,

but -log of 1 over N is just equal to log of N.

So when we do that sum, we have 1 over N times

(N log N), which is just equal to log N.
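That special case is easy to verify numerically as well; here's a hedged sketch reusing the same kind of entropy function:

```python
import math

def entropy(probs):
    """General entropy: H = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# N equally probable states should give exactly log2(N).
for n in (2, 4, 6, 8, 16):
    uniform = [1 / n] * n
    assert abs(entropy(uniform) - math.log2(n)) < 1e-9
print("general formula matches log2(N) for uniform distributions")
```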

16:41

So let's just do this with an example.

Let's say we have 4 states.

If they're equally probable then we have the entropy

is equal to log2 (4), which is just 2.

What if the probability of the first state is one-quarter, the probability of the second state is one-quarter, the probability of the third state is 0, and the probability of the fourth state is one-half?

What does the entropy equal?

17:25

Well, let's plug in our formula.

So we have minus, we'll make everything negative, and now we're doing the sum.

One-quarter, that's the first probability,

log of one quarter + one quarter

log of one quarter + 0 log of 0

+ one half log of one half.

17:57

So the 0 term goes away and

we're left with minus one

half log of one quarter plus

one half log of one-half which

is equal to one-half log of

4 plus one-half log of 2.

Log of 4 is 2, so one-half of 2 is 1.

And log of 2 is 1.

So one-half of 1 is one-half.

So that equals three-halves, and

it is in fact less than the case where we had equally probable states.

So it's starting to feel like this definition might be okay after all.
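Here's that arithmetic checked in Python (same sketch of an entropy function as the formula above):

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i); the zero-probability state drops out."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four states with probabilities 1/4, 1/4, 0, 1/2:
print(entropy([0.25, 0.25, 0.0, 0.5]))  # 1.5 bits, less than the uniform case's 2.0
```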

18:55

And to start this one, we are going to note that if you have some random variable x, the expected value of a function of x is equal to a weighted sum of all of the possible values that function can take on for different x's, weighted by the probability that those x's are sampled from our distribution.

So from that perspective, we can say that the entropy is just the expected value of minus log of the probability, since minus log P(x=xi) is itself a function of x.

20:01

So we can take its expected value, just like we would any other function such as xi squared, or xi itself, which is what we were doing when we took the mean of the distribution.

21:08

What this means is that states with a very small probability, so

maybe over there, have a very high minus log P sub i.

And states with a very large probability have a very low minus log P sub i.

So one way to think of minus log P sub i is as the amount of information you would get if your random number generator spat out the state X sub i.

And if it spits out a very probable state,

well you figure that was pretty likely to happen anyway.

So it's nothing new.

You don't get much information out of it.

If it spits out something really improbable, something over here, then you get a lot of information out of that.

If you get a really improbable text message, such as "your house is on fire," that gives you a lot of information.
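This "surprise" interpretation of minus log P is easy to play with in a small sketch:

```python
import math

def surprise(p):
    """Information, in bits, gained by observing a state of probability p."""
    return -math.log2(p)

print(surprise(0.99))  # a near-certain event: about 0.01 bits, barely news
print(surprise(0.01))  # a rare event: about 6.6 bits, a lot of information
```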

22:04

One other simple way to think about this,

is pretend we have a distribution over colors again.

So, the probability that the color is red is 0.6.

The probability that the color is blue is 0.3,

and the probability that the color is green is 0.1.

One way to think of this distribution is as a bag of marbles where the marbles are labeled one through ten.

22:41

And let's say that marbles one through six are red,

marbles seven through nine are blue, and marble ten is green.

You know the drill.

Put all the marbles in the bag and reach your hand in the bag and

pull a random marble out.

23:05

If the marble you draw is red, which is the most probable color, then that doesn't give you very much information.

That only tells you that the marble is one of marbles one through six.

However, if the draw is green, which is an improbable state, then you know exactly which marble you've pulled out.

So this isn't a really crazy way to think about things.

So minus log P sub i tells you the amount of information you

would get about your system having seen the state X sub i.

And because this is probabilistic,

we just take an average over this amount of information.

And that is the entropy of a probability distribution.

23:47

So again, this is just a fancy mathematical way to talk about uncertainty.

And it will come up very often in information theory.

One last note is that we just talked about discrete distributions in this video.

So the entropy of a discrete distribution is minus the sum over all possible states

of the probability of the state times the log of that state's probability.

24:42

And in the continuous case, what we have is actually a probability density.

So remember, you have to take the integral, or the area under that probability density, to get an actual probability between 0 and 1.

So in the continuous case, going from sums to integrals isn't too tricky. We just take minus the integral of the probability density of our random variable at xi, times the log of that probability density, integrated with respect to little xi:

H(x) = -∫ p(xi) log2(p(xi)) dxi.
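As a numerical sketch of the continuous case (the midpoint-rule approximation and the choice of a uniform density on [0, 4] are my own, for illustration):

```python
import math

def differential_entropy(density, a, b, n=100_000):
    """Approximate -integral of p(x) * log2(p(x)) dx over [a, b], midpoint rule."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        p = density(a + (i + 0.5) * dx)
        if p > 0:
            total -= p * math.log2(p) * dx
    return total

# Uniform density on [0, 4]: p(x) = 1/4 everywhere, so the entropy is log2(4) = 2 bits.
print(differential_entropy(lambda x: 0.25, 0.0, 4.0))  # approximately 2.0
```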

So there you have it.

That's the basics of entropy, a way to quantify uncertainty.

When we get to information theory we're going to be primarily concerned with

how much a distribution changes after we receive a message.

So maybe the distribution starts out with all four states being equally probable; then you receive a message, and suddenly the first state becomes much more probable and the probability of the rest of the states decreases.

And the way we talk about how the distribution has changed

is by calculating the change in entropy.

That is, how much has our uncertainty decreased because of the message?

Okay, that's it for now.

See you next time.