Entropy was a fancy mathematical way to talk about uncertainty, a way to quantify the uncertainty associated with a probability distribution. Before we move into information theory, I just want to say that there are two ways you can think about a distribution with high entropy. So let's say the distribution over some message, some variable x, has high entropy. First, you can think of this as a bad thing: it's bad because the situation is very uncertain. However, you can also think of it as a good thing. Why? Well, a distribution with high entropy has more possible states, so if you get to sample, or observe, one of those states, you can learn a lot more about the system. For example, what's more informative, a message that can only be a one or a zero, or a message that can be an entire string of words? Certainly the string of words is more informative. But because of how we defined entropy, the distribution over that string of words, over the possible states the string could be in, has a very high entropy compared to the one-or-zero message. That high entropy is exactly what makes it capable of being informative. And this is what we're interested in when we study information theory: we're interested in using entropy to talk about how informative a message can be. So in general, information theory can be thought of as the study of how the entropy, the uncertainty, of a distribution changes when you receive a message.

So here is an example. Let's say one of your friends knocks on your door while you're in the shower. You can't come down to answer, so they leave a note. Let's say you have three friends, like me, whom we'll denote by the random variable F: Billy, Carol, and Francois. Before you read the note they left on your door, there is an equal probability that it could have been any one of your friends, and that probability is one third. So this is the probability distribution over F before you receive the note. But then you go downstairs and read the note, and you know that Billy and Carol only speak English and Francois only speaks French. When you read the note, you see that it is in English. This changes the distribution over who that friend could have been. Since the note was in English, it could not have been Francois. Therefore, the probability of Billy and the probability of Carol both go up to 0.5, and the probability of Francois goes down to 0. So this is the distribution over F given that the note was in English.

So let's calculate the entropy of these distributions. The entropy of the first one, before you read the note, is H(F) = log2 of the number of equally likely states, log2(3), which is about 1.6 bits. And the entropy of the second distribution, the conditional entropy once you've seen that the note was in English, is log2(2) = 1 bit, because now there are effectively just two possible options. So the entropy decreases once you receive the message, which fits nicely with our idea that receiving a message decreases your uncertainty about the situation. What would the entropy have been if the note had been in French? In that case, you would have known that the friend could only have been Francois. There would have been effectively one possible state, and the entropy would have gone to zero. So, in general, when you receive a message, your entropy decreases.
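To make those numbers concrete, here is a minimal sketch of that calculation in Python (the `entropy` helper and the variable names are my own, not part of the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Prior over friends F = {Billy, Carol, Francois}: equally likely
prior = [1/3, 1/3, 1/3]

# Distribution over F given that the note was in English: Francois is ruled out
given_english = [0.5, 0.5, 0.0]

print(entropy(prior))          # ~1.585 bits, i.e. log2(3)
print(entropy(given_english))  # 1.0 bit, i.e. log2(2)
```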
Let's look at a neuroscience example, example two. In this case, let's say you're wondering how cold it is outside. Before you go outside, there is some probability distribution over what the temperature is. Maybe it's springtime, so the mean of that distribution is 60 degrees Fahrenheit. This is before you go outside. But then you go outside and your sensory neurons send you a message, R, which could be a spike train, for example, or a group of spike trains. Once you receive that message, your certainty about what the temperature is increases. Maybe they send you the message that it's actually pretty hot out. You still don't know exactly what the temperature is, but now you're fairly sure it's centered around 80 degrees; it's a hot day in the spring. This is the distribution over temperatures given the spike train R, the message R. Now, I won't calculate the entropy exactly, but hopefully you can see that since P(T | R) is a lot narrower than P(T), the entropy of the prior temperature distribution is much greater than the entropy of the conditional temperature distribution. So in this case, you gain information about the temperature from the message. We call this conditional entropy, the entropy of the temperature given the neural response, the noise entropy, because whatever uncertainty remains about the temperature after you've gotten all the information you can from your neurons is due to noise; your knowledge will still not be perfect.

So it seems like this intuitive idea of information is closely tied to the change in uncertainty that results from receiving a message. How would we write that change in uncertainty, the change in entropy? We'll stay in neuroscience land for now: S is your stimulus and R is your response. The stimulus can be a scalar or a vector, and the response can be a firing rate, a spike train, a whole pattern of spikes, or whatever you like. At the start, you have some distribution, the probability that your random variable is equal to a specific value s. After you get the response, you have a conditional distribution: the same thing, but conditioned on the fact that your response was equal to a particular value r. Each of these has an entropy; remember, the entropy takes a distribution as input and produces just a number as output. So there's an entropy of the original stimulus distribution, H(S), and there's an entropy of the conditional stimulus distribution, H(S | R = r), given that you measured the response r. The amount the entropy decreases is simply H(S) minus H(S | R = r). However, as we saw with Francois, Billy, and Carol, different messages yield different conditional distributions, and therefore different conditional entropies, different noise entropies. So, to get a general quantity that describes the entire joint distribution of S and R, rather than the distribution of S given a single r, we take the average decrease in entropy, where the average is taken over all of the things the message, the response, could have been. So we write that the information between S and R is equal to the original entropy minus the average, the expected value, of the conditional entropy, the noise entropy.
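Written out in symbols (this notation is mine, not verbatim from the lecture), that definition is

\[
I(S;R) \;=\; H(S) \;-\; \big\langle H(S \mid R = r) \big\rangle_r \;=\; H(S) \;-\; \sum_r P(r)\, H(S \mid R = r),
\]

where each H(S | R = r) is a noise entropy.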
And that average is taken with respect to all of the possible things the response, the message, could have been. This value is called the mutual information. The information about S from R takes as input the joint distribution over S and R and outputs just a number, and that number tells you how much you can learn about S if you listen to the message R. An important thing to realize is that when you calculate the information, you don't need a model like a linear filter system, or a GLM, or anything like that. You just ask how much the response could tell you about the stimulus, regardless of the mechanism by which the response actually encodes the stimulus. So the mutual information doesn't depend on an encoding model. In other words, it's just a good way to characterize the system if we know the joint distribution over response and stimulus.

Now, maybe you've noticed, but there's a very nice symmetry in the mutual information. We defined the information about the stimulus given the response as the entropy of the stimulus distribution minus the expected value of the entropy of the conditional distribution; just recall that, if R is continuous, that expected value is ∫ H(S | R = r) p(r) dr, which is just the average conditional entropy. But when we defined this, I just said that S stood for the stimulus and R stood for the response. I could have said it the other way around. We have two random variables here, S and R. We have distributions over S, distributions over R, the conditional distributions P(s | r) and P(r | s), and a joint distribution, so it's very symmetrical. When we talk about the information about the stimulus given the response, which is what we've been talking about so far, we're talking about how much the uncertainty, the spread of the stimulus distribution, shrinks when we receive a message, averaged over all of the different responses you could have gotten; that's the information the response carries about the stimulus. But you can also think about it the other way around. Before you even see a stimulus, there is some prior distribution of responses. When the stimulus comes, that distribution changes. Maybe before you go outside and feel the temperature, there are 10,000 different messages your sensory neurons could be sending you. But once you go out and feel the temperature, and say it's 81 degrees, now there are only 20 different messages your sensory neurons could be sending you. So you can also think of the response distribution shrinking once a stimulus is presented; that is how much information about the response the stimulus carries. If you just knew the response, you could take a guess about the stimulus; and if you just knew the stimulus, you could also take a guess about the response. This works whenever the two are correlated, which hopefully they are if your neurons are sending you important information. So how would we write the information about the response given the stimulus? The same thing we had before, but with the variables flipped: the entropy of the response distribution minus the expected conditional entropy of the response distribution, given that a particular stimulus was presented.
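In the same (assumed) notation, that flipped quantity is

\[
I(R;S) \;=\; H(R) \;-\; \sum_s P(s)\, H(R \mid S = s),
\]

or, for a continuous stimulus, H(R) minus ∫ H(R | S = s) p(s) ds.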
And we average over all the different stimuli that could have been presented. So these are very closely related quantities; in fact, they are so related that they are the same. The mutual information about the stimulus from the response is equal to the mutual information about the response from the stimulus. This is a very cool thing. It means that H(S) minus the average value of H(S | R = r) is equal to H(R) minus the expected value of H(R | S = s). So there's a very beautiful symmetry in how we calculate the information, and all it tells us is that our labels of response and stimulus are rather arbitrary: you can consider a spike train to be the stimulus, and whatever caused that spike train to be the response, and it's the same mathematically.

Importantly, however, sometimes one of these quantities is easier to calculate than the other. Sometimes it's easier to calculate the stimulus entropies, and sometimes it's easier to calculate the response entropies. In week four, lecture two, we calculate the mutual information using the second form: first, we calculate the entropy of the response regardless of what stimulus was presented; second, we calculate the entropy of the response for a particular stimulus, do that for a bunch of stimuli, and then take the average over those stimuli weighted by their probabilities.

So this turns out to be a very useful calculation that tells us how much we can learn about one variable given the value of the other variable. It tells us how statistically dependent two random variables are: how related the temperature outside is to a pattern of spikes in your sensory neurons, or how related the motion of a rat's whisker is to the signal that gets sent to its barrel cortex. In general, it's a very useful quantity to be able to calculate and understand, because it gives us a model-free way of talking about how much two probabilistic quantities convey about one another. That's it for now, see you next time.
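As a self-contained sketch (the joint distribution below is made up purely for illustration), here is the same mutual information computed both ways on a toy discrete joint P(s, r); the two answers agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A made-up joint distribution P(s, r): rows are stimuli, columns are responses.
# The numbers are illustrative only; they just need to sum to 1.
P_sr = np.array([[0.30, 0.10, 0.05],
                 [0.05, 0.10, 0.40]])

P_s = P_sr.sum(axis=1)   # marginal over stimuli,   P(s)
P_r = P_sr.sum(axis=0)   # marginal over responses, P(r)

# First way: I(S;R) = H(S) - sum_r P(r) * H(S | R = r)
noise_H_S = sum(P_r[j] * entropy(P_sr[:, j] / P_r[j]) for j in range(len(P_r)))
I_S_from_R = entropy(P_s) - noise_H_S

# Second way (the form used in week four, lecture two):
# I(R;S) = H(R) - sum_s P(s) * H(R | S = s)
noise_H_R = sum(P_s[i] * entropy(P_sr[i, :] / P_s[i]) for i in range(len(P_s)))
I_R_from_S = entropy(P_r) - noise_H_R

print(I_S_from_R, I_R_from_S)   # both ~0.36 bits; equal up to floating point
```

In practice, the conditional response distributions P(r | s) are often what you can estimate from repeated presentations of each stimulus, which is one reason the second form is frequently the more convenient one to compute.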