2:46

You can sort them, and then just scan the sorted list. So either of these solutions would work. And of course, this is a very simple problem, one that you might encounter in a very first algorithms course. A standard solution is to use a hash table. You could also use sorting, right? A hash table is just the same thing as a symbol table. As you encounter elements, you put them into the hash table, and obviously, if you encounter a duplicate, the lookup will tell you so.
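The two textbook approaches just described can be sketched like this (a minimal sketch; the function names are my own):

```python
def has_duplicate_hashing(items):
    """One pass with a hash-based set: O(n) expected time, O(n) memory."""
    seen = set()
    for x in items:
        if x in seen:          # duplicate encountered
            return True
        seen.add(x)
    return False

def has_duplicate_sorting(items):
    """Sort, then compare adjacent elements: O(n log n) time."""
    s = sorted(items)
    return any(s[i] == s[i + 1] for i in range(len(s) - 1))
```

Both are perfectly fine on a thousand elements; the point of what follows is what happens when they stop being fine.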

Okay, so this is a very simple problem. What makes this problem very hard is: what happens if the list is very big? What if you had to do this not on, you know, a thousand elements, but a billion elements? Or a trillion elements? Then all of a sudden, all of these simple approaches, like using a hash table or doing sorting or whatever, go out of the window. Number one, you need a whole lot of memory to actually keep a hash table. Modern computers are perhaps capable of holding hash tables even of this size, but at these scales it's a stretch, okay? And sorting? Sorting such a huge list takes a very long time. Okay?

So, why would you care about a problem like this? Well, think about Google, for example. Some statistics they released earlier this year say that Google in its entire lifetime has seen something like 30 trillion URLs. They crawl something like twenty billion pages a day, and they serve about three billion searches a day, which means, you know, 100 billion searches a month. Okay. So that's the scale at which they are operating. And if you had to answer the simplest kinds of questions about what they do, like: how many queries did they get in a day? How many distinct queries did they get in a day? How many distinct queries did they get in a month, or a year? You're faced with solving what seems to be a very simple problem, but one that is very difficult just because of the immense scale at which you need to operate. Okay. Of course, answering this question of how many distinct queries you have is the very simplest statistic that you might want to compute and keep track of as time goes by. But you might want more sophisticated statistics, like: how much did the queries change from day to day? How much are they different today compared to yesterday? This month compared to the previous month? This month compared to the same month last year, and so on and so forth. And all of these very, very simple questions just seem incredibly difficult when you're dealing with data at the sizes that Google deals with. Okay.

You know, there are other settings where you have lots of data and you'd like to compute really simple things. A network router is one example. It sees a lot of packets whizzing by at fast speed. Current home routers are capable of speeds of about one gigabit per second; commercial routers are capable of speeds of something like ten gigabits per second. The router is a really simple computing device: all it does is receive packets and forward them along. It could perhaps do some very simple analysis on the data as it streams through. How could you use this very, very simple computing device to do simple monitoring? Like, can you tell me how many active flows there are in the router? Which are the flows contributing most of the traffic? And these things are important because monitoring these sorts of things is essential to various security measures, like early detection of denial-of-service attacks, okay? It's important to keep track of what's going on, what sort of traffic you're seeing, because changes in this might signal that something strange is happening and you might want to take some security measures. Okay? So very, very simple statistics like this are a huge challenge, just because of the amount of data that this router is seeing and the fact that the router itself is a very, very simple device.

Okay? So, in general, there's this talk of big data everywhere. If you read the news, read magazines, every other day there's an article about how we're surrounded with data and we need to find ways to make sense of it. What I am really going to talk about is: what are some of the algorithmic techniques that we can use to tame these big data problems?

Okay, alright. So that's the introduction of what the problems are. Let me talk about what models we could envision for designing algorithms. So clearly, our traditional notion of having all of the data available to us, having an algorithm with random access to the data, fitting all the data in memory: none of this really applies when we're dealing with data at such a large scale. So the two operative constraints are, number one, the data size is very large; it far exceeds the available memory, and that's just a reality we have to face. We can't actually store all the data in memory. And the other thing is that there's a new emphasis on efficiency. Usually when you design algorithms, polynomial-time algorithms are considered efficient and implementable. But in reality, when you're dealing with data of this scale, your algorithms have to run in linear, maybe near-linear time. Even something like quadratic time, at the scales we talked about, is out of the question, okay? So those are, sort of, the barometers. Okay.

So one kind of model that's used, and we'll see it later on in the lecture, is the idea of taking our data and somehow reducing it to a smaller size, okay? It turns out that we can do this in many cases: replace the original data by some kind of compact summary, such that, by discarding the original data and just looking at this compact summary, we can actually answer the questions we care about. Now, it would be wonderful if you could do this, because you would drastically reduce the data size, and it turns out that in many cases this is actually possible, okay? Now, one thing you can give up on is the idea that you really want to solve the problem exactly. In most cases, like, if you want to find out how many distinct elements there were among the queries, do you really care whether I give you a one percent approximate answer, or whether I give you the exact answer? It doesn't really matter. So in most cases of interest it's okay to not give the exact answer, but to give something that is fairly accurate. And it turns out that this relaxation, moving from getting the exact answer to tolerating some small error, is something that actually makes many problems tractable. Okay? So we'll see some examples of this idea. Another idea that's been very useful is a certain model called the streaming model. And the streaming model is the following.

It says, you have to design your algorithm so that it operates in the following way. Imagine the data as being one stream, and you make one pass over the data, or perhaps a few passes over the data. But you can remember only a small amount of information relative to the size of the data. So, you can afford to remember very little, and you are only allowed to see the data in one pass, or in several passes. You can't, for example, zig-zag back and forth: you don't have random access to the data, okay? And the point is, if you've actually designed an algorithm that works in this very, very restrictive way, you can actually run it on very large data sets very efficiently.
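As a toy illustration of this mode of operation (my own example, not one from the lecture), here's an algorithm that makes a single pass and remembers only a constant amount of information:

```python
def stream_stats(stream):
    """One pass, O(1) memory: count, sum, and maximum of a stream of numbers."""
    count, total, biggest = 0, 0, None
    for x in stream:           # each element is seen exactly once, in order
        count += 1
        total += x
        if biggest is None or x > biggest:
            biggest = x
    return count, total, biggest

# Works even when the data could never fit in memory:
stats = stream_stats(x * x for x in range(10**6))
```

The interesting question, which the rest of the lecture addresses, is which statistics that seem to require storing everything can still be estimated this way.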

Okay. So it seems like we're really, really tying our hands and, you know, putting a blindfold on. How could we possibly do anything in this very, very restricted mode of operation? But it turns out that there are very interesting things you can do, even when you restrict yourself to interacting with the data in this very controlled way. Just a little bit of history of where this notion of streaming algorithms came from. Actually, this idea of trying to design algorithms that work in one pass over the data, remembering very little,

Â 11:35

predates the internet, even, and big data sets, and so on. It was really sort of a fun problem that people worked on. In fact, there's some work by Flajolet and Martin that exactly deals with this problem of estimating distinct elements, and we're actually going to see the idea they came up with, because it's actually relevant today, at the data sizes we're faced with today. Another piece of work, back in 1980, was by Munro and Paterson. They looked at the problem of how much space you need in order to compute the median. So, you have a data set of size n, you have only a small amount of space, and you can make a few passes over the data. With k passes, how much space do you need to actually compute the median? And they found a fairly tight answer to this question: it turns out that if you are allowed to make k passes, you can do it using something like n^(1/k) space. Okay, and these algorithms, as I said, were developed very, very early on, even before this model was really formalized. The model was really formalized by the work of Alon, Matias, and Szegedy in 1996. This was a landmark paper because it really formalized the model; it talked about what you can and cannot do in this model, and gave examples of how to prove negative results. And for this work, they won the Gödel Prize in 2005, which is a very prestigious prize in computer science. Okay, so let me also mention that some of you might have seen or heard about MapReduce. That's another model of interacting with large data sets, something that was proposed by Google, in fact.

And it is the basis for a bunch of practical implementations of distributed algorithms; there's something called Hadoop that some of you might have heard of. All of this is relevant to the discussion of how you deal with large data sets very efficiently, but unfortunately I'm not going to have time to talk about all of these things, okay? So, in the rest of the lecture I just want to give you a glimpse of some of the ideas, the ingredients of the algorithms that we'll be working with.

So far I've given you the problems, and I've told you how we might interact with data sets in order to solve these problems. Now I want to tell you: now that we've tied our hands to one pass over this very large data set, how do we actually do it? So, I talked about data reduction, the idea of taking a data set and reducing it to a smaller data set. Actually, if we think about it, this is a concept that we know and are very, very familiar with in our lives.

So, if you were to ask the question: what fraction of voters today prefer Obama over Romney? These are the kinds of questions people care about, but they're very difficult to answer if you really want to know what the exact answer is. Right? You'd have to ask every single voter: what do you think? Whom do you prefer? It takes a lot of time and effort to actually get the exact answer; we really do this only once every four years, for a good reason. On the other hand, this is something that people actually make predictions about every day. Every day, you look at the newspaper and you get some estimate: what's the approval rating of Obama today? What was the approval rating yesterday? And various other questions you might think about. So this is something that we solve, or certainly estimate, every day, by sampling, okay? We don't actually ask everybody what their opinion is. But the point is, if you pick a random set of people and you ask them their opinion, then the percentage you compute on the random sample is actually a very good estimate of the actual percentage for the entire population.

Okay, that was George Gallup's insight, and all of you have heard about, read about, Gallup polls, done by the Gallup organization that he set up. Okay, so let's think about sampling. What does sampling do? The idea is, if you pick a random voter and you ask them, do you prefer Obama, then the probability that this random voter says yes is just the fraction of voters that prefer Obama, right?

Â 16:20

It's a basic fact. Do we agree on this? Okay, alright. It's a very simple thing, but it's actually very useful. Why? Because now, if you pick several random voters and independently ask them the same question, then the estimate you get from the random sample is actually a very good estimate of the actual fraction. Okay? That's the basis for polling; that's the basis for random sampling, okay? So let's all take a step back and think about what this actually did. The idea is that you produce a random variable which has the correct expectation. Okay? In this case, the expectation is the fraction of voters who prefer Obama; this is the quantity we actually want to ask about. So you design a random variable that has the correct expectation, you take many independent copies, samples of this random variable, and you average. Alright. Because each copy has the right expectation, the expectation of the average is exactly what you want; but the point is that by taking many copies and averaging, you get tight concentration. Okay.
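Here's a small simulation of that recipe (an illustrative sketch; the function and its parameters are assumptions of mine, not Gallup's actual methodology). Each voter's answer is an indicator random variable with the correct expectation, and averaging many independent copies concentrates around it:

```python
import random

def poll(true_fraction, sample_size, rng=None):
    """Average of sample_size independent 0/1 'answers', each of which is 1
    with probability true_fraction; the average has expectation true_fraction."""
    rng = rng or random.Random(0)
    yes = sum(rng.random() < true_fraction for _ in range(sample_size))
    return yes / sample_size

estimate = poll(0.52, 10_000)   # concentrates near 0.52 as the sample grows
```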

Now, in reality, if you really want to analyze this, you need to think about the variance of the random variable, what confidence bounds you want on your estimate, and so on and so forth. This is a topic from probability which some of you may have seen, some of you may not; I'm not going to go into it. But believe me that, once you design a random variable with the right expectation, with a little more calculation you can determine what sample size you need in order to get confidence bounds, and so on and so forth. So I'm only going to focus on this part: for all the problems we care about, we're going to try to design random variables that have the right expectations. Okay. And further, the important thing is that these random variables will be computable very, very efficiently, by making just one pass over the data. Okay? That's the key idea we're going to use. Alright, so, having said that, let's go back to our problem of distinct elements.

It turns out that the solution Flajolet and Martin came up with is tied to a simple puzzle. Okay? So let me tell you what the puzzle is. I know some of you probably know the answer to this question; if you haven't seen it before, I encourage you to think about it. So here's the experiment. Okay. Let's say you have the interval [0, 1], and you throw k random darts inside this interval. So you pick k random points in the interval [0, 1]. Okay? Everyone, is the experiment clear? Okay. So the question is: what's the expected value of the minimum of these k points? Any guesses? If you had to guess, what should the minimum be?

Â 19:21

Okay, so zero seems a little extreme, because for it to be zero you'd need some dart to land right at the left endpoint. What I'd like to know is, what is it as a function of k? And you're right that, in some sense, as k gets larger and larger, this minimum is going to get closer and closer to zero; I want to understand what the behavior is.
>> 1/k is actually a very good guess.
>> What about k + 1?
>> 1/(k+1) is actually a better guess. So, okay, which do you guys want to go with, 1/k or 1/(k+1)?
>> The probability that a new dart lands lower than the current lowest is 1/(k+1), because each of the k+1 darts is equally likely to be the lowest. And that's the same as the probability of it being lower than the lowest dart...
>> Okay, okay. But how does that help you compute the expected value of the minimum?
>> Because then you know that the probability of a new dart landing lower than the lowest dart is 1/(k+1), so the expected length of that segment is 1/(k+1).
>> I think you're suggesting a very slick proof which escapes me at this moment. And when you stand in front of the class, your IQ usually drops by like 50 percent, so I think you're probably right. But maybe we should talk about this after class. Okay, so the right answer is indeed 1/(k+1), and we might have a very clever proof of it right here. If you didn't know what the answer should be, the way I usually think about this problem is: well, it's kind of reasonable to assume that these k darts are sort of uniformly spread in the interval [0, 1]. Okay? And if that's really the case, if we believe this, then each of the gaps they create should be about 1/(k+1), since k darts cut the interval into k+1 gaps. So, in particular, the first gap, which is the minimum, would be about 1/(k+1). Okay, this is very crude; it's by no means a correct derivation. But if we had to guess, 1/k or 1/(k+1) is a reasonable guess, and if you actually do the calculation, it turns out that the answer is indeed 1/(k+1). Alright, good.
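You can check the 1/(k+1) claim with a tiny simulation (a sketch of my own; the trial count is arbitrary):

```python
import random

def mean_min_of_darts(k, trials, rng=None):
    """Throw k uniform darts into [0, 1], take the minimum; average over many trials."""
    rng = rng or random.Random(0)
    return sum(min(rng.random() for _ in range(k)) for _ in range(trials)) / trials

# For k = 9 the claimed expectation is 1/(9+1) = 0.1; the estimate lands close.
approx = mean_min_of_darts(9, 100_000)
```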

So if you have the minimum of k random numbers in [0, 1], the expected value is 1/(k+1). Why is this actually important to the question we started out with? The question was: how do you estimate how many distinct elements there are in your data stream?

Â 23:16

Okay? Here's the idea: you apply a random hash function that maps each element to a random point in [0, 1], and as the stream goes by, you just remember the minimum hash value you've seen. Duplicates hash to the same point, so the minimum depends only on the distinct elements; it behaves like the minimum of K random darts in [0, 1]. Repeat this with several independent hash functions and average, and you get a very good estimate of 1/(K+1), where K is the number of distinct elements. Invert this, and you have a very good estimate of the number of distinct elements, okay? It's a really beautiful trick, originally discovered by Flajolet and Martin. And it's amazing how it completely bypasses the need to store any sort of hash table, or do any sort of sorting, and comes up with a very good estimate of the number of distinct elements. Okay? Alright.
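Here's one way this minimum-based estimator could be sketched in code. This is my own illustrative implementation of the idea just described, not Flajolet and Martin's exact algorithm (theirs works with bit patterns); SHA-256 merely stands in for a random hash function into [0, 1]:

```python
import hashlib

def to_unit_interval(x, salt=""):
    """Deterministically hash x to a 'random-looking' point in [0, 1)."""
    digest = hashlib.sha256((salt + repr(x)).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct(stream, num_hashes=64):
    """For each hash function, track the minimum hash value seen, in one pass.
    With K distinct elements, E[min] = 1/(K+1), so 1/average_min - 1 estimates K."""
    mins = [1.0] * num_hashes
    for x in stream:
        for i in range(num_hashes):
            v = to_unit_interval(x, salt=str(i))
            if v < mins[i]:
                mins[i] = v
    avg_min = sum(mins) / num_hashes
    return 1.0 / avg_min - 1.0

# Duplicates hash to the same point and never change the minima,
# so only the distinct elements matter:
est = estimate_distinct(i % 500 for i in range(5_000))   # K = 500
```

Note that the memory used is just `num_hashes` floats, no matter how long the stream is.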

Next I want to talk a little bit about, and actually I don't have very much time, so I'm going to have to breeze through this section, a class of methods that are used to figure out, in a very large collection of documents, which documents are similar. Again, this is actually an important problem. Number one, when a web search engine's crawler goes out, it brings in a bunch of documents, and many are duplicates and near-duplicates. You want to get rid of these guys; there's no point in Google returning 100 results which are almost the same. It wants to return results that are distinct, and ideally, actually, not even store the duplicate documents. So that's number one. And additionally, at query time you might want to figure out that some of the results returned to the user are almost the same, and suppress the ones that are almost the same. If you've searched on Google, you might often see: there were some results very similar to the ones we have shown you, and we suppressed them; if you'd like to see them, click here. Right? So there are various reasons why you'd like to eliminate near-duplicate documents.

Now, this is sort of a difficult question: how do you find and eliminate near-duplicate documents? I'm going to describe a scheme that was originally developed by Andrei Broder and his collaborators, in the context of a search engine called AltaVista, which, in the old days, predated Google and was the leading search engine of the time. Now, I'd be surprised if very many of you know about AltaVista at all. So let me very quickly describe the main idea behind this; it was a very, very cute idea. First, let me ask you: how do you define similarity of documents? Right? What does it mean for two documents to be similar? There are many ways of doing this. One way is the following. You take your document, say it's the Declaration of Independence, and you slide a window of three words across the document, and look at all the three-word sequences you encounter in doing this. That's your set. So now I've taken the document and I've produced a set, and we'll think of the set as representative of the document. Having produced sets, I can now define similarity of two sets, and I'll use that as a measure of similarity of the original documents. One reasonable measure of similarity of two sets is the following: look at how much they overlap, the size of the intersection over the size of the union. 'Kay. So henceforth let's agree that this is the measure of similarity we'll use.
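The windowing step and the intersection-over-union measure can be written down directly (a minimal sketch; the three-word window is from the lecture, the names are mine):

```python
def shingles(text, w=3):
    """Slide a window of w words across the document; collect each w-word sequence."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

doc1 = "we hold these truths to be self evident"
doc2 = "we hold these truths to be very evident"
sim = jaccard(shingles(doc1), shingles(doc2))   # identical documents give 1.0
```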

Okay. If two documents are identical, the sets they produce are identical, so the similarity would be one, and we're going to work with the hypothesis that similarity defined this way is a reasonable measure of document similarity. Okay? So here's the question, right? Can you produce a sketch, a very short summary of a document (a set, in this case), such that you can estimate the similarity just by looking at two sketches, instead of actually looking at the original documents? Clearly, I could look at the full text of the documents and compare them, but that would be too slow, and not acceptable. So, can I get rid of the original documents, replace them all by short sketches, and estimate the similarity just by looking at these short sketches? This is something you could actually hope to do at run time. Right? If you actually had to go and look at the original documents at run time, that's not a good idea. Alright.

The first thing that might come to mind is random sampling. You might say, well, how about storing a random subset of the elements? This just does not work. Okay. Say you have two sets with ten thousand elements each, and they're identical. If you take a random sample from set A and a random sample from set B, say you sample 100 elements from each, the expected intersection of these two samples is one. It gives you no information whatsoever that these two sets are identical. So, random sampling per se doesn't work. Okay? But here's a little variant of random sampling that actually works. Okay? And it's actually very, very cute, and it's related, in some sense, to what we did before. So here's the idea. Imagine you have a random hash function on the universe. You apply it to every element of the set, and you just remember the minimum.

Â 29:07

Okay. So, great: we're able to design a random variable that has exactly the expectation we want. The key fact is that the probability that two sets get the same minimum hash value is exactly their similarity, the size of the intersection over the size of the union. To get a high-confidence estimate, you would do this not with one hash function but with, say, 100 hash functions, and for every set you would store a sketch consisting of the minimum under each of the 100 hash functions. Okay. And now you can use these sketches to get a very good estimate of similarity. Alright.
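A minimal sketch of this scheme (my own illustration of what was just described; SHA-256 stands in for the random hash functions): each set is replaced by its list of minima, and the fraction of coordinates on which two sketches agree estimates the similarity, since each coordinate matches with probability exactly intersection over union.

```python
import hashlib

def h(x, i):
    """The i-th simulated random hash function (SHA-256 is an assumption of
    this sketch; real systems use cheaper hash families)."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(s, num_hashes=100):
    """For each hash function, remember only the minimum over the whole set."""
    return [min(h(x, i) for x in s) for i in range(num_hashes)]

def estimated_similarity(sk_a, sk_b):
    """Fraction of coordinates where the two sketches agree."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

a = set(range(100))       # true similarity of a and b:
b = set(range(50, 150))   # |a & b| / |a | b| = 50 / 150 = 1/3
est = estimated_similarity(sketch(a), sketch(b))
```

The sketches are tiny and fixed-size, so the original documents can be discarded and comparisons done at query time.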

I had hoped to talk about a lot more, which I will skip; maybe I'll give the slides to the organizer, and he can post them. There's another scheme to estimate similarity that I actually worked on when I was at Google; I'm not going to be able to tell you about it, but you can look at the slides a little later. I just wanted to say that many of these techniques are really useful beyond the problems we talked about. The idea of random projections, which we didn't quite discuss, is really useful in dealing with very large data sets; it's a very useful technique to reduce data sizes. Nearest-neighbor data structures are very useful in searching large, complex, high-dimensional data sets. And designing hash functions like the ones we saw for set similarity: they're actually the building blocks of very, very efficient data structures for these search problems. There's a little issue with randomness: we've sort of assumed that everything is completely random, that we have completely random hash functions. It turns out that in practice this is not really true. There's a way to fix this, which I'm afraid we won't be able to discuss.

And so, in conclusion, I just wanted to say that there are a lot of cool theoretical ideas, which are also very practical, that are useful for designing algorithms and working with very large data sets. I wanted to give you a glimpse of this algorithmic toolkit. It was a shorter glimpse than I had thought I would give, but there are many interesting problems here, and I hope I've motivated some of you to think about them.
