Suppose we have a page of text, and we want to store it in the computer. The computer understands zeroes and ones, and therefore we have to convert the letters and the words on the page of text into zeroes and ones. How shall we do it? One way is to use fixed binary codewords for each letter, uppercase and lowercase, and for each punctuation symbol: the commas, periods, quotes, and so on. This is exactly what ASCII, the American Standard Code for Information Interchange, does. ASCII uses seven bits per symbol, and therefore, if we count all the letters and punctuation symbols on the page of text and multiply them by seven, we'll find the total number of bits required to encode the whole page. Can we do better than that? If yes, how can we do it? What information do we need to know about the source that generated the symbols so that we can accomplish this? Well, the answer is yes, we can do better than fixed-length codes with the so-called variable-length codes, and what we need to know is the entropy of the source. So we'll cover these exciting concepts in this segment. Without further ado, let's begin.

We need some ideas from information theory, since they provide the framework for the development of lossless data compression schemes. We're therefore going to briefly review them. The first idea is the construct of a source which has a finite alphabet associated with it. This source emits a sequence of symbols which are chosen randomly from this finite alphabet. Such a source can model any information-generating process. Examples of sources are natural written languages, such as the English language; in this case, its alphabet has 26 letters, and we can assume that they're all, for example, capital letters. Another example is a source that generates an n-bit image. Then the alphabet has two to the n symbols. For eight-bit images, we have 256 symbols in the alphabet.

Now, the simplest source we can model is one that is discrete and without memory: a discrete memoryless source, or DMS. The successive symbols generated by this source are i.i.d., that is, statistically independent and identically distributed. This source, of course, may or may not be a good model. For the source that generates the words in the English language, or any language for that matter, we know that there is a dependency among letters in forming words, and also that certain combinations of letters are not encountered in practice. For example, in English, I don't believe one can have a combination of four letters like this one. Now, a DMS is completely specified by the source alphabet and the associated probability of each symbol in the alphabet.

We first address the question of how much information is generated by the occurrence of a certain event, or by the emission by the source of a certain symbol. Intuitively, this amount of information, the so-called self-information, should be a function of the frequency of occurrence of such an event or symbol; more specifically, it should be inversely proportional to it. Out of a number of possible choices, the function that fits the bill nicely, and also has additional desirable properties, is the log function. So, more specifically, the self-information is defined according to this formula: the self-information of the symbol s_i equals the log of 1 over its probability, I(s_i) = log(1/p_i), which is also clearly equal to -log p_i.
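To make the definition concrete, here is a minimal Python sketch (an illustration added alongside the lecture, not part of the slides) that evaluates the self-information I(s_i) = -log(p_i) for a few probabilities:

```python
import math

def self_information(p, base=2):
    """Self-information I(s) = log(1/p) = -log(p) of a symbol with probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p, base)

print(self_information(0.5))    # 1 bit: a fair coin flip
print(self_information(1.0))    # 0 bits: a certain event conveys no information
print(self_information(1/256))  # 8 bits: one of 256 equally likely symbols
```

Note how rarer symbols yield larger values, matching the inverse-proportionality intuition described above.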
So if I look at the log function, it looks like this: it goes through one. Now, if I want to plot the self-information, -log p_i, as a function of p_i, then the probability clearly takes values between 0 and 1, and therefore it's this part of the curve I'll be using; I'll take this part of the curve and invert its sign, so it will look something like this. I did not show the base of the log function. The result is independent of the base, since I can convert from one base to another. Historically, though, if the base of the log is 2, the units are bits; if the base is e, the units are nats; and if the base is 10, the units are referred to as Hartleys. From now on we'll be using log base two, and therefore the units will be bits.

So what this formula up here is telling me is that the occurrence of a less probable event provides more information. A certain event, an event with probability one over here, conveys zero information. That is, if in your conversations you report on facts that everybody in your group knows, you convey very little or zero information. If somebody says that day is followed by night, this is a zero-information event. If, on the other hand, an event has a very low probability, down here, such as a solar eclipse occurring tomorrow, then its information content is very high.

Now, due to the use of the log function, the information of independent events taken as a single event equals the sum of their individual information. That's true because, if I look at the information of two symbols, say s1 and s2, taken as a single event, then the self-information of this event equals I(s1, s2) = log(1 / p(s1, s2)). Since the events are independent, this probability equals the probability of s1 times the probability of s2. So we have -log(p(s1) p(s2)) = -log p(s1) - log p(s2), which equals the self-information of s1 plus the self-information of s2.

Moving away from the information conveyed by a single symbol, and toward characterizing the whole DMS, which is completely characterized by its alphabet and the associated symbol probabilities, the entropy of a DMS is defined as the average information per symbol, as you can see here. So, for the source S, we denote its entropy by H(S). We then take the expected value, or average, of the self-information of symbol s_i, and, due to the definition of self-information above, the entropy becomes H(S) = -sum over i of p_i log2 p_i. The units are bits, since we have used two as the base of the log.

We show here a plot of the entropy of a source that has two symbols in its alphabet. One is emitted with probability p, and clearly the other is emitted with probability 1 - p. So this is the entropy of this discrete memoryless source, plotted versus the value of p. We see that the entropy is positive, except at these locations where it equals zero. This is the situation when one symbol has probability one and the other has probability zero. It is a general result that the entropy is zero if and only if all the p_i's but one are equal to zero, with the remaining one equal to 1. This is the case when we are certain of the outcome, and that is when H vanishes. We also observe here that the maximum value of the entropy is attained for p equal to 0.5, in other words, when the two symbols have equal probabilities.
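The entropy definition and the binary-entropy behavior just described can be checked numerically. The following is a small sketch, not from the lecture, assuming the usual convention that 0 * log 0 = 0:

```python
import math

def entropy(probs):
    """H(S) = -sum_i p_i log2 p_i, with the convention 0 * log 0 = 0."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Binary source: H is zero when the outcome is certain, maximal at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f} -> H = {entropy([p, 1 - p]):.4f} bits")

# A uniform source with n symbols attains the maximum H = log2(n):
print(entropy([1/8] * 8))  # 3.0 bits = log2(8)
```

The uniform case at the end previews the general maximum-entropy result stated next.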
And this is a specific incarnation of a more general result: if I have a source with n symbols, then the entropy of this source is maximum when the probabilities of the symbols are all equal, and therefore equal to one over n. In that case the maximum value of the entropy equals log n. This is an important result to remember; we're going to use it in some of our arguments to justify some of the preprocessing, or some of the operations, we are going to apply to images and videos so as to improve their coding efficiency.

We describe here the source coding problem, which is a problem of lossless compression. We are considering a source which generates symbols s1 through sn. We can assume that this is a DMS, although the concept of coding is not constrained by the model of the source. So in this case, the alphabet and the associated probabilities define the source. We want to design a coder that will map the source symbols into binary codewords. The codewords can have fixed length, as shown here, in which case all codewords have length four, or variable length, as shown here. So this is a fixed-length code, and this is a variable-length code. There is therefore a length associated with each codeword, which we denote by l_i. With this setup, we can define the expected value, or average, of the codeword length, as shown here: L = sum over i of p_i l_i. So the coding problem is to design this coder and find the code, the set of codewords (this, for example, is code one, and this could be code two), so that the smallest possible average codeword length is achieved.

As an example of a fixed-codeword-length code, we can consider the source whose alphabet consists of the letters of the English language, uppercase and lowercase, together with the punctuation marks. The fixed-codeword-length code can then be the ASCII representation of these letters; this is how letters are represented in a computer. So, for example, the ASCII code for the lowercase letter a is shown here, while the ASCII code for the uppercase A is shown here.

In designing VLC codes, the Morse principle typically applies. I'm not sure if you know Morse code; I don't, but what I've copied here is the Morse code for the letter e, for the letter a, and for the letter q. So what is the Morse principle? It states that frequent symbols should be assigned short codewords, so that when we form the product p_i l_i, if the probability of a symbol is high, its codeword length should be low, and the product is kept low. This is clearly an intuitive idea.

In the discussion of coding on the previous slide, we were referring to first-order coders: each symbol is encoded one at a time, independently from the rest of the symbols. Alternatively, we can form block codes. In this case, capital N symbols of the original source are combined to generate a single new symbol of the combined source, S^N. Since all combinations of N original source symbols are considered, the alphabet of this new source S^N grows, and its size equals n^N. It can be shown that in this case the entropy of this new source, also called the extended source, is N times the entropy of the original source. This is actually a mechanism for increasing the efficiency of a code, as we'll discuss later and also demonstrate through specific examples.
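To tie together the average codeword length, the Morse principle, and the extended source, here is a short sketch; the four-symbol source and both codes are hypothetical examples chosen for illustration, not the ones on the slides:

```python
import math
from itertools import product

# Hypothetical 4-symbol DMS; probabilities and codes are illustrative assumptions.
probs = {"s1": 0.5, "s2": 0.25, "s3": 0.125, "s4": 0.125}
fixed = {"s1": "00", "s2": "01", "s3": "10", "s4": "11"}    # fixed-length code
vlc   = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}   # frequent -> short

def entropy(ps):
    """H = -sum p log2 p, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def avg_length(code):
    """Average codeword length L = sum_i p_i * l_i."""
    return sum(p * len(code[s]) for s, p in probs.items())

print(entropy(probs.values()))  # 1.75 bits/symbol
print(avg_length(fixed))        # 2.00 bits/symbol
print(avg_length(vlc))          # 1.75 bits/symbol: meets the entropy here

# Extended source: blocks of N symbols form a source with n**N symbols
# whose entropy is N * H(S) when the original source is memoryless.
N = 2
block_probs = [math.prod(probs[s] for s in blk) for blk in product(probs, repeat=N)]
print(entropy(block_probs))     # 3.50 = 2 * 1.75
```

Here the variable-length code reaches the entropy exactly because all the probabilities are negative powers of two; in general it can only get close.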
Finally, there are also non-block codes; arithmetic and Lempel-Ziv codes belong to this class, and we'll briefly discuss them later in the class.

Coding and the entropy of a source are connected through the celebrated source coding theorem, due to Shannon in his seminal 1948 work. According to this theorem, given a source S with entropy H(S), we consider block codes; that is, N original symbols are combined to form new symbols. Then, given a small positive delta, it is possible to choose N large enough to construct codes whose average codeword length L satisfies the relationship down here: H(S) <= L < H(S) + delta. This expression tells me two important things. The first is that the entropy is a lower bound on the efficiency of the code: no matter how inventive we are, we cannot beat entropy. Designing a code, or a lossless compression scheme, whose average codeword length, also called the rate, equals the entropy is the holy grail, and it is the best we can do. The second thing this relationship tells us is that we can keep increasing the size of the block to obtain a code that performs very close to the entropy; when delta is very small, the rate, or average codeword length, becomes close to the entropy. Of course, increasing N comes at a price, which is the increase in the dictionary size and also the loss of instantaneous decodability, as we'll discuss later. By taking larger and larger blocks, in essence we extract the structure in the data. Of course, an important question is how accurate an estimate of the source entropy we can obtain. So let's look at this in the next slide.
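As a supplement to this segment (my own numerical sketch, not part of the lecture), the convergence promised by the theorem can be observed by assigning each block of N symbols a Shannon-code length ceil(-log2 p(block)), an assumed code construction rather than one built in the lecture:

```python
import math
from itertools import product

# Hypothetical binary DMS; the probabilities 0.9 / 0.1 are illustrative.
probs = {"a": 0.9, "b": 0.1}
H = -sum(p * math.log2(p) for p in probs.values())

def shannon_rate(N):
    """Per-symbol rate of a Shannon code on blocks of N symbols.

    Lengths l = ceil(-log2 p(block)) satisfy the Kraft inequality, and the
    resulting rate obeys H <= rate < H + 1/N, shrinking toward H as N grows.
    """
    total = 0.0
    for block in product(probs, repeat=N):
        p_block = math.prod(probs[s] for s in block)
        total += p_block * math.ceil(-math.log2(p_block))
    return total / N

for N in (1, 2, 4, 8):
    print(f"N = {N}: rate = {shannon_rate(N):.4f} bits/symbol (H = {H:.4f})")
```

The shrinking gap between the rate and H plays the role of the delta in the theorem, while the 2^N-entry codeword table illustrates the growing dictionary size mentioned above.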