In this video, we will design a very simple and basic version of our approxpointquery algorithm. The procedure that we design in this part of the lecture will not be strong enough to solve our final problem. Nevertheless, it will serve as a building block for our final algorithm. Or to simplify our teaching, in this part of the lecture and after that, we'll make the following assumption. We will assume that the elements in our data universe are ordered in descending order of their frequency in string. So we think of element one as being the most frequent in the string, element two being the second most frequent etc, etc. So the stream looks like this. Again, at the bottom of the stream, we see data items arriving in the stream one by one. And a frequency histogram is updated on the fly. So this is the stream that our algorithm is processing. And at the end of the stream, we see that the elements are ordered in descending order of their frequency. Our final goal in this section is to design a small 15 structure that at the end of the string will let us output the top few most frequent items that we have seen. So, for example, if we were looking for the top four most frequent items, we would like to output these items shown in green on the screen. And we will refer to these items as the head of the stream because they contribute a bulk of the mass. The other items are shown in red, so these are the items that don't occur very frequently, we will refer to those as the tail of the stream. Okay, so our goal for now is to design a very basic estimate, very basic version of our approximate point career primitive. This version that we design, in this part of the lecture, will have excellent space requirements. It'll only use a constant amount of space. The problem with it will be analyze precision. Nevertheless, later on, we will show how to use it as a building block for the final hour. Our basic estimate works like this. Well we start by choosing a uniformly random hash function s, that maps our universe of data items. And remember that we associate our data items with numbers, integers, between one of them. Our has function maps integers from 1 and m to -1 and +1 independently. Now the algorithm our data structure is very simple, it only maintains a single counter that we call C. At the beginning of the stream, we initialized the count to 0 and then the update procedure upon receipt of a new data item i is extremely simple. We basically just increment the counter by the sign of i. So in fact, if the sign of i happens to be plus, the counter's incremented and if it's a minus the counter's decremented. Now the processing of the stream now becomes very simple. For every position p in the stream, between 1 and N, we run the update procedure, which simply increments the counter by the sign of the current data item i sub p at the end of the stream. If we're asked to return an estimate for the frequency of item i, what we do is, we take the counter and multiply it by the sin of i and return that as the estimate. So this is a very simple algorithm and indeed it is somewhat hard to believe that this will have the right precision and it will not, but it will have very interesting properties and it will give non-trivial precision that we will later boost into our final algorithm. Okay, so we want to analyze the precision of this estimator. This is a randomized estimator so the first question we need to ask is, well, how does one actually argue that a randomized estimate works? So we have a random variable which is the counter times the sine of a fixed data item i that we want to query. We want to prove that this random variable is close to the true answer, which is the frequency of i with high probability. Well typically, such proofs go as follows. They proceed in two steps. Well first, we'll look at our random variable, the counter times the sin of i, and we want to prove that in expectation this random variable equals the true answer fi. Another way of putting this is saying that our estimator is unbiased and it has the right expectation. If we're able to do this, then the next step is to show that our estimator in fact is close to it's mean with high probability. One way to do that is to bound the variants of our estimator. If we're able to show that the variants is appropriately small, we get the result by standard concentration qualities. Okay, so let's proceed and we'll proceed to step one. We would like to bound the mean of our randomized estimator. Now before we take any expectations with respect to our sine function, I want to look at our estimator which is the counter times the sine of i. And I want to write it in a somewhat more convenient way. Okay, well by definition, the counter, times the sine of i, can be written as the summation over all positions in the stream from 1 to N of the sine of the item that arrives in position p times the sine of i. Well, we can equivalently write this as a summation over all data items j in existence of the frequency of data item j in the stream times the sine of data item j, times the sine of i. Well, since we're estimating fi, it's convenient to extract from destination the contribution of item i itself. So we get the frequency of i times the sine of i squared which is nothing but one. Plus some contributions from the other data items. Namely, a summation over j and m minus i, data items j other than i. The frequency with which j occurred, times the product of science, s j times s i. Okay, and this is already looking great because what we see is that our estimator can be written as the true answer, plus some contribution from the other elements in the universe that, hopefully, we can say looks like noise. And, indeed, it does, as we will see in the rest of this lecture. Good, so our first step should be to show that the expectation of this other contribution that we hope to say is noise is zero. So let's just take expectations of both sides of this equality. Well to take the expectation of the counter times s(i) with respect to the sine function. Of course, the expectation of fi is fi. Fi is a deterministic quantity. Now, here we can take the expectation inside the sum. And now, because the sine of j and the sine of i are independent random variables when j is distinct from i, we get the product of the expectation of the sine of j, and the expectation of the sine of i. But both of these quantities have mean zero, each data item was mapped to minus one or plus one, with equal probability. So indeed, the contribution of this noise like term in expectation is 0. This means that the mean of our randomized estimate is exactly correct. The mean is fi. So this is great. So our next step, and this is what we will do in the next video, is to estimate the variance of our estimator and show that its small. This will show us that the estimator is actually close to the true answer, fi, with high probability