Hey guys. Today, we are going to talk about entropy. Essentially, entropy is a fancy word for uncertainty. Now, it is important to define uncertainty, or entropy, rigorously, because in information theory, the main thing we're concerned about is quantifying changes in uncertainty after receiving a message. So maybe in a card game, you're interested in how much your uncertainty about what's in another player's hand changes after they play a certain card. Or maybe you're interested in how much your uncertainty about who your secret admirer is changes after you read one of their notes. Or maybe, in a more neuroscience context, you're interested in seeing how much your uncertainty about who's speaking to you changes after you hear them say a few words. Now, in order to discuss this with math, we need to quantify this uncertainty. You may have heard of other things, like standard deviation or variance, referred to as some kind of uncertainty, and in certain contexts those are very good descriptions of uncertainty. But today, we are going to think about a new kind of uncertainty, and that uncertainty is called entropy. So, basically, the entropy of a probability distribution tells us how uncertain we are about that distribution. So let's look at some examples. To start, let's just think about discrete distributions. Let's draw two different distributions and ask, which one has a higher uncertainty? Which one has a higher entropy? Remember, we can think of probability distributions as random object generators that, when you push a button, spit out a random object. So, let's say this is distribution machine one, and this distribution can spit out four different objects: (0,0), (0,1), (1,0) and (1,1). And let's say that the chance that this distribution spits out (0,0) is the same as the chance that it spits out (0,1), or (1,0), or (1,1). So, in other words, the probability of any one of these is 0.25. So, let's draw that out.
There's our probability, and this is our, we'll call it the state. We'll say that the object the distribution spits out currently is its state. So we have four different states: (0,0), (0,1), (1,0) and (1,1). And the probability for each state is 0.25, so that's our first distribution. Our second distribution is a little bit simpler. Our second distribution spits out either (0,0) or (1,1). So it only has two different possible states. So if we plot those, this looks like this. And, again, let's say that the states are equally probable. If they're equally probable, the probability of each one is 0.5. So, the question I ask you is: if you had to play a game where you bet on the object that comes out, and you get a dollar if you get it right and nothing if you don't, which distribution would you choose to play the game with? Hopefully you picked distribution number two. With distribution number two, if you choose randomly there is a 50% chance you'll get it right. So on average you'll make 50 cents every time you play the game. If you had chosen distribution number one, there are simply more choices to choose from. So the chance that you get it right is only 0.25. So on average you'll make a quarter every time you play the game. Since, on average, if you were to play the game with distribution one you would be less certain about what object comes out, we say that distribution one has a higher entropy. Since if you played the game with distribution two, you would be more certain about what state was going to come out, we say that distribution two has a lower entropy. So, what exactly was it about distribution one that gave it a higher entropy? Well, distribution one had more options, it had more possible states than distribution two. So if we write out the number of states, how many states did distribution one have? It had four. And distribution two had two states. So, in a way, the uncertainty is just the number of possible states the distribution has.
And, in fact, that's all entropy does. All that entropy does is count the number of states, and we write entropy as H in information theory. However, entropy does not just count the number of states directly, it actually takes the log of the number of states. So let's write H = log(# of states). And we usually use log base 2. So what's the entropy of the first distribution? log2(4) is just 2. And what's the entropy of the second distribution? log2(2) is just 1. So the second distribution has a lower entropy; it has a smaller number of available states. So here is our definition: the entropy, H, of a random variable, let's say X, is the log of the number of states that X can take on. This is the case if all states are equally probable. Let's do a couple more examples. What if your distribution is over colors? So every time you push your random object generator button you get a color out. And we'll call the color X. So your options are red, green, blue, orange, yellow or violet. And they're all equally probable, with a probability of one-sixth. How many states are there? Well, there are only six states. That's not too hard to count. What is the entropy of this distribution? That's just the log of the number of states: log2(6), that's about 2.6. What if we have a distribution over a three-dimensional vector? So x1, x2, x3. And let's put a constraint that says in this vector you are always going to have two zeros and one one. So maybe x1 is one, and x2 and x3 are zero. And let's say that there is an equal probability that the one is in any one of those places. So our three possible states are (1,0,0), (0,1,0) or (0,0,1). And the probability of each one of those is one-third. How many states are there? One, two, three, three states. What's the entropy? log2(3), that's about 1.6. What if our distribution is over three-dimensional vectors, where each element has a 50% chance of being a one and a 50% chance of being a zero?
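If you want to check these equal-probability examples numerically, here is a quick Python sketch (the code is my addition, not part of the lecture):

```python
import math

def uniform_entropy(n_states):
    """Entropy in bits of a distribution over n equally probable states."""
    return math.log2(n_states)

print(uniform_entropy(4))  # the first distribution machine: 2 bits
print(uniform_entropy(2))  # the second distribution machine: 1 bit
print(uniform_entropy(6))  # six equally probable colors: about 2.6 bits
print(uniform_entropy(3))  # three one-hot vectors: about 1.6 bits
```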
So the probability that xi, any arbitrary element, is 0 is 0.5. And the probability that any arbitrary element is 1 is 0.5. How many states do we have? There are eight possible states. You can go ahead and count those up. The entropy of the distribution over eight equally probable states is log2(8), which is 3. What if we had the same deal, but with a ten-dimensional vector? Now the number of states is 2 to the 10. So log2 of 2 to the 10 is just 10. What if it's a 100-dimensional vector? Now there are 2 to the 100 possible states, so the entropy of that is 100. And maybe now you can see why we use the logarithm for entropy rather than just the number of states. If you wanted to, you could have defined entropy as just the number of states, but that would have made the math a little harder, because you would be dealing with these really big numbers all the time. Using the logarithm means you don't have to deal with numbers like 2 to the 100; you can deal with the smaller number, just 100. Okay, hopefully that is not too confusing. The more possible states there are, the larger your entropy, the larger your uncertainty about what state your distribution will spit out. But what if the states are not all equal? What if some states are more probable than others? Well, this should change the entropy. So in that first example, when we were playing our game, what if the probability of (0,0) for this first distribution was 0.99, and the remaining 0.01 was split among the other options? In that case, the game suddenly becomes a lot more certain, and it's very easy for you to just bet on (0,0) every time and make quite a bit of money. Therefore, when the states do not have equal probabilities, the entropy should decrease. You should be less uncertain about the distribution. We need a way to generalize the definition of entropy to take into account situations where not all states are equally probable.
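The same pattern holds for binary vectors of any dimension; here's a small sketch (my own illustration) showing that the entropy just equals the number of dimensions:

```python
import math

# A d-dimensional vector whose elements are independently 0 or 1 with
# probability 0.5 has 2**d equally probable states, so H = log2(2**d) = d.
for d in (3, 10, 100):
    n_states = 2 ** d
    print(d, math.log2(n_states))  # entropy in bits equals d
```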
So let's say we have some distribution, maybe there are three different states a, b, and c, and this is the probability. Maybe a has a probability of 0.5, b has a probability of 0.3, and c has a probability of 0.2. The entropy of this distribution should be less than the entropy of the distribution where the probability of a, the probability of b, and the probability of c were all one-third. So we need to generalize the formula H(x) = log(# states). And here's the answer. This is the full definition of entropy, the most general definition: H(x) = -sum over i of P(x = xi) log P(x = xi). That is, minus the sum, over all of the states i, of the probability of each state times the log of that probability. This is the general definition of entropy. Now, just to make sure we're all on the same page, let's just remind ourselves that when we calculate the entropy, the entropy is a function of something, right? When we calculate the entropy, the input to the function is some probability distribution. So the input is not just a single number, a single state. The input is an entire distribution over states. The output, however, is just a single number that characterizes how uncertain that probability distribution is. So remember, it's important to think about what the inputs to functions are and what the outputs of functions are. However, even though the input is the entire probability distribution, we usually just write H(x), where that random variable x stands in for the whole probability distribution over x. So, this seems like kind of a weird definition. A weird way to generalize something. We no longer have the log of the number of states in there. It might feel a little off. If you're interested in seeing the full derivation, you can look in the supplementary materials of the course web page. But for now, I just want to talk through a bit of the intuition about what's going on.
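Here is a small Python sketch of the general definition (my own code, purely as an illustration of the formula above), applied to the three-state example:

```python
import math

def entropy(probs):
    """General entropy: H = -sum_i p_i * log2(p_i), in bits.

    Zero-probability states are skipped, following the 0*log(0) = 0 convention.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.3, 0.2]))   # about 1.49 bits
print(entropy([1/3, 1/3, 1/3]))   # log2(3), about 1.58 bits: the uniform case is higher
```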
First of all, since this is a generalization, it should work in the specific case in which all of our probabilities are equal. So let's just check and make sure. This is the special case that we started with. If all probabilities are equal, so let's say we have N different states and all of the probabilities are equal, that means that P(x = xi) is just 1/N. So we can plug in 1/N for P(x = xi) in our equation for entropy. So we have: the entropy of x is equal to minus the sum over i of 1/N log(1/N). We can bring the minus inside, and -log(1/N) is just equal to log(N). So when we do the sum, we are adding up N identical terms of (1/N) log(N), which is just equal to log(N). So, in fact, we get back the log of the number of states, so that's good. Second, let's check to make sure that if we increase the probability of a certain state, the entropy of the distribution goes down. Let's just do this with an example. Let's say we have 4 states. If they're equally probable, then the entropy is equal to log2(4), which is just 2. What if the probability of the first state is one-quarter, the probability of the second state is one-quarter, the probability of the third state is 0, and the probability of the fourth state is one-half? What does the entropy equal? Well, let's plug in our formula. So we have minus, we'll make everything negative, and now we're doing the sum: one-quarter, that's the first probability, log of one-quarter, plus one-quarter log of one-quarter, plus 0 log of 0, plus one-half log of one-half. The 0 log 0 term is taken to be 0 (that's its limit as the probability goes to 0), so it goes away, and we're left with minus one-half log of one-quarter minus one-half log of one-half, which is equal to one-half log of 4 plus one-half log of 2. Log of 4 is 2, so one-half of 2 is 1. And log of 2 is 1, so one-half of 1 is one-half. So that equals three-halves, and it is in fact less than the case where we had equally probable states. So it's starting to feel like this definition might be okay after all.
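To double-check that arithmetic, here's a quick numerical sketch (again, my addition, not from the lecture):

```python
import math

# The skewed example: p = (1/4, 1/4, 0, 1/2); the 0*log(0) term is taken to be 0.
p = [0.25, 0.25, 0.0, 0.5]
H = -sum(q * math.log2(q) for q in p if q > 0)

print(H)             # 1.5 bits, i.e. three-halves
print(math.log2(4))  # 2.0 bits for four equally probable states
```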
But let's go over one last intuition. And to start this one, we are going to note that if you have some random variable x, the expected value of a function of x is equal to a weighted sum of all of the possible values that function can take on for different x's, weighted by the probability that those x's are sampled from our distribution. So from that perspective, we can say that the entropy of a random variable x, or of the distribution over x, is equal to the expected value of what? Well, that function is minus log of the probability that x = xi. So remember, the probability that x = xi is a function of xi. So we can take its expected value, just like we would any other function, such as xi squared, or xi itself, which is what we wanted when we were taking the mean of the distribution. So the entropy is the average of this log. And let's just write P sub i for short. So the entropy is the average value of this weird quantity, minus log P sub i. To understand this, let's draw a plot of minus log P sub i. So this will be our minus log P sub i axis, and this is our P sub i axis. Since probabilities are always between 0 and 1 for a discrete distribution, we're only worried about what's going on between 0 and 1. So what does this function look like between 0 and 1? Something like that. What this means is that states with a very small probability, so maybe over there, have a very high minus log P sub i. And states with a very large probability have a very low minus log P sub i. So one way to think of minus log P sub i is as the amount of information you would get if your random object generator spat out the state x sub i. And if it spits out a very probable state, well, you figure that was pretty likely to happen anyway. So it's nothing new. You don't get much information out of it. If it spits out something really improbable, something over here, then you get a lot of information out of that. If you get a really improbable text message, such as "your house is on fire,"
that gives you a lot of information. One other simple way to think about this is to pretend we have a distribution over colors again. So, the probability that the color is red is 0.6, the probability that the color is blue is 0.3, and the probability that the color is green is 0.1. One way to think of this distribution is as a bag of marbles where the marbles are labeled one through ten. And let's say that marbles one through six are red, marbles seven through nine are blue, and marble ten is green. You know the drill. Put all the marbles in the bag, reach your hand in, and pull a random marble out. Well, if the draw is red, which is something very probable, then that doesn't give you very much information. That only tells you that the marble is one of marbles one through six. However, if the draw is green, which is an improbable state, then you know exactly which marble you've pulled out. So this isn't a really crazy way to think about things. So minus log P sub i tells you the amount of information you would get about your system having seen the state x sub i. And because this is probabilistic, we just take an average over this amount of information. And that is the entropy of a probability distribution. So again, this is just a fancy yet mathematical way to talk about uncertainty. And it will come up very often in information theory. One last note: we just talked about discrete distributions in this video. So the entropy of a discrete distribution is minus the sum, over all possible states, of the probability of the state times the log of that state's probability. But in neuroscience, we often have continuous distributions. So remember, a discrete distribution looked something like this, whereas a continuous distribution looks something more like that, some continuous curve. And in the continuous case it is actually a probability density.
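The marble picture is easy to play with numerically. This sketch (my own, using the probabilities from the example) prints the surprisal, minus log2 of p, for each color, and then the entropy as the average surprisal:

```python
import math

# Marble-bag distribution from the example: red 0.6, blue 0.3, green 0.1.
probs = {"red": 0.6, "blue": 0.3, "green": 0.1}

# Surprisal -log2(p): improbable states (green) carry more information
# than probable ones (red).
for color, p in probs.items():
    print(color, -math.log2(p))

# Entropy is the expected surprisal, weighting each state by how often it occurs.
H = sum(p * -math.log2(p) for p in probs.values())
print(H)  # about 1.30 bits
```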
So remember, you have to take the integral, or the area under that probability density, to get an actual probability between 0 and 1. So in the continuous case, going from sums to integrals isn't too tricky. We just take minus the integral of the probability density of our random variable being xi, times the log of that probability density, integrated with respect to little xi. So there you have it. That's the basics of entropy, a way to quantify uncertainty. When we get to information theory, we're going to be primarily concerned with how much a distribution changes after we receive a message. So maybe the distribution starts out with all four states being equally probable; then you receive a message, and suddenly the first state becomes much more probable and the probability of the rest of the states decreases. And the way we talk about how the distribution has changed is by calculating the change in entropy: how much has our uncertainty decreased because of the message? Okay, that's it for now. See you next time.
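In code, the continuous version just swaps the sum for a numerical integral. Here's a sketch using a standard normal density (the Gaussian is my choice of example, not from the lecture; any density would do), compared against its known closed-form differential entropy:

```python
import math

def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Differential entropy h = -integral of p(x) * log2(p(x)) dx,
# approximated by a Riemann sum over the wide interval [-10, 10].
dx = 0.001
h = -sum(normal_pdf(i * dx) * math.log2(normal_pdf(i * dx)) * dx
         for i in range(-10000, 10001))

print(h)                                       # about 2.05 bits
print(0.5 * math.log2(2 * math.pi * math.e))   # closed form for a standard normal
```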