So let's look at an example. Assume you're flipping a coin. We see the same expression here on this slide that we saw on the last slide, except that this should say x instead of i; x or i is fine, it doesn't make any difference, but they should be the same. If it's a fair coin, you expand this expression as 0.5 times the log of 0.5, plus 0.5 times the log of 0.5, with the minus sign out front as in the entropy formula. One term represents heads and the other represents tails. And if you compute that, the answer is 1. This is appropriate, because we imagine flipping a coin gives us exactly one bit of information: it tells us the result of one 50/50 choice. Things are either on or they're off, and when you tell me the answer, I've learned one bit of information.

All right, so now consider rolling a die. There are six outcomes, each having a probability of one sixth. So use the same expression, 1/6 times the log of 1/6, and add all those up, which in this case just means multiplying by 6, and you get an entropy of 2.58. What this says is that when I roll a die and learn the answer, I've learned more information. The die is less predictable; it's a higher number, more unpredictable than flipping a coin, which agrees with our intuition.

Now imagine rolling a weighted die with these probabilities: values 1 through 5 each have a 10% chance, and there's a 50% chance of rolling a six. Well, we can do the same thing, and there's that bug again, this should be x. So this is 10% times the log of 10%, times 5, plus 0.5 times the log of 0.5, and here you get 2.16. So a weighted die is, [LAUGH] goodness, less unpredictable than a fair die, which again agrees with our intuition, because if you had to bet on what was going to come up, you would bet that a six would be rolled.

Okay, so how do you apply entropy to a data set, as opposed to a random variable? Well, you can do it with respect to the class labels. If you have 891 passengers in the Titanic data set, and 342 of them survived, then you can compute entropy from the set: 342/891 times the log of 342/891, plus 549/891 times the log of 549/891, gives you an entropy of 0.96. So this is almost one bit of information, and that makes sense because 342 out of 891 is a fairly balanced split, not far from half: you learn a lot by identifying someone as having lived or died. But now say there were only 50 survivors. Then the data set is inherently more predictable, because most people did not survive, and that's reflected in this calculation: you get an entropy value of 0.31, so it's inherently less unpredictable, or more predictable.

So fine, that's entropy. Going back to decision trees: which attribute do we want to choose at each level? We're going to build a tree, and we have to make a choice of which attribute to split on. The one we're going to pick is the one with the highest information gain, and we're going to use entropy to compute the information gain. Another way of saying that is: it's the one that reduces the unpredictability the most.
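To make these numbers concrete, here is a minimal sketch in Python. It is not from the course materials; the function names entropy, entropy_of_labels, and information_gain are just illustrative. It reproduces the values quoted in the lecture and shows how entropy feeds into the information gain used to choose a split.

```python
# A minimal sketch (assumed helpers, not from the lecture slides) that
# reproduces the entropy numbers discussed above using only the standard library.
from collections import Counter
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def entropy_of_labels(labels):
    """Entropy of a data set with respect to its class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())


def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        len(group) / total * entropy_of_labels(group)
        for group in child_label_groups
    )
    return entropy_of_labels(parent_labels) - weighted_child_entropy


# Fair coin: exactly one bit of information.
print(entropy([0.5, 0.5]))                       # 1.0

# Fair six-sided die: more unpredictable than a coin.
print(entropy([1 / 6] * 6))                      # ~2.58

# Weighted die: 10% each for values 1 through 5, 50% for a six.
print(entropy([0.1] * 5 + [0.5]))                # ~2.16

# Titanic-style label entropy: 342 survivors out of 891 passengers.
print(entropy_of_labels([1] * 342 + [0] * 549))  # ~0.96

# A much more lopsided data set: only 50 survivors.
print(entropy_of_labels([1] * 50 + [0] * 841))   # ~0.31
```

When building the tree, information_gain would be evaluated for each candidate attribute's split of the labels, and the attribute with the highest value is chosen, i.e., the split that reduces the unpredictability the most.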