So this is the problem that we would like to solve. It is the heavy hitters problem. We're given a single pass over a stream of N data items, i1, i2, through iN. We would like to design a small space algorithm that will scan the entire stream and at the end of the stream will tell us the top k most frequent items that occurred in the stream. We also want this algorithm to have very low space complexity, and we're shooting for space complexity on the order of (k log N). Okay, so this is definitely a very nice and clean mathematical problem to consider. But are there good applications of this problem? Now it turns out that there are quite a few, so I would like to mention two of them in this lecture. The first one that I will talk about is to estimating network traffic. For this application, think about a switch, as pictured on this slide. And think about the traffic, the IP traffic that goes to through that switch. Well, this traffic can be quite conveniently visualized using the so-called traffic matrix, pictured on the slide. The rows of this matrix correspond to source IP addresses. The columns correspond to destination IP addresses. And the entry in i,jth element of this matrix is exactly the number of packets that IP address i sent to IP address j that went through our switch over a period of time. Now note that the amount of traffic that goes through a typical internet switch is a mess. At the same time, these switches tend to have fairly limited memory. On the other hand, it would be very useful to have statistical estimation parameters that would let us analyze the network traffic through the switch. And one particular question that we would like to ask is whether or not some IP flows that go through that switch actually contribute the bulk of the traffic. The reason for this is that if a certain source IP address talks a lot to a certain destination IP address, and occupies the bulk of this switch, this could be an indication of a denial of service attack or some spamming activity going through that switch. So we definitely would like to design algorithms that are able to operate using the memory available to the switch and detect such situations and find such source destination pairs. These pairs are known as dominant or elephant flows. Let me show you an example. So this switch is observing internet traffic going through it, basically packets go from various sources to destinations, and every packet updates an entry of the traffic matrix. So as shown in this example, it could be that at the end of the day, the number of flows, the number of source IP destination pairs that send packets through the switch is immense. But on the other hand, a handful of these flows contribute the bulk of the traffic that goes through the switch. This is exactly the situation shown on the slide. Here, we have six flows or seven flows going through the switch, and two of them contribute a bulk of the traffic. One contributes 4 and the other contributes 5. Well, we would like to be able to detect such situations and find these dominant flows. On the other hand, if you think about the trivial solution to this problem, this would need a lot of space because the trivial solution would just remember all the source IP destination pairs that ever sent a packet through our switch over a period of time. And the space is linear in the number of such pairs, which is very large. So in this lecture, we will design an algorithm for solving this problem approximately that is for detecting the dominant flows through a network switch using space only logarithmic in the total number of source IP destination pairs that are sending traffic flow through our switch over this period of time. And the improvement from N to log N is exponential, this is great savings. Okay, the other application that I would like to talk about, this one was to estimating network traffic, is to estimating query statistics in search engines. And for this application, think of our data steam as a stream of queries on a search engine, think of say google.com, over a period of time. So here's an example stream, three queries. Geneva to New York, coffee in Geneva, Geneva to New York. So it could have been me searching for flights from Geneva to New York wanting to go to a conference. And despairing and thinking that this is a very hard process. And going for coffee and then coming back to search for flights. So the question of finding the heavy hitters in this stream of three items is exactly the problem I'm figuring out, which one occurred the most. And if I want to find the one heavy hitter here, the answer is Geneva to New York because this query occurred twice and the other query occurred only once. Of course, this problem of finding heavy hitters in search logs is more interesting than the example that's shown on the slide. So if you think of solving this problem at scale, that is, think of queries on google.com over a period of time. And this becomes a very hard problem because the length of the stream is much, much longer than three. So again, the trivial solution would amount to just storing all the distinct search queries that we see over this period of time, and the space complexity is linear in the number of search queries. Now, the solution that we will see in this lecture, known as the CountSketch algorithm, has space complexity logarithmic in the length of the stream. And that's again, an exponential improvement. Well, this is great. So I'm claiming exponential improvements from linear space to order log N, but there's big O notation on this slide so the natural question to ask is are the constants actually small? So are these algorithms practical? Now in fact, it turns out that a lot of the algorithms that have been developed in the streaming literature actually are, including our algorithm in this lecture. Let me give a specific example which relates to a somewhat different problem. This is the so-called distinct elements problem. Given a stream of data items, count approximately how many distinct queries did you see. With this problem we have a very theoretically efficient algorithm, and we also have an extremely good implementation known as HyperLogLog. To put things in perspective, this algorithm HyperLogLog can be used to estimate the number of words in Shakespeare's vocabulary using only 128 bits of memory. So the constants in big O notation are actually quite small. Okay, so as a consequence of that, a lot of the algorithms that have been developed in the streaming literature are in fact now widely used in practice for scalable data analytics. So we definitely want to study these algorithms, and so enough motivation, so let's get to the actual problem. So in the rest of the lecture, we will design a small space algorithm for our heavy hitters problem. That is, given a single pass over N data items, i1 through iN, output at the end of the day, the k most frequent items that occurred in this string, using a small amount of space.