
You can sort them, and then just scan the sorted list.

So either of these solutions would work. And of course, this is a very simple problem, one that you might encounter in a very first algorithms course. A standard solution is to use a hash table; you could also use sorting. A hash table is just the same thing as a symbol table: as you encounter elements, you put them into the hash table, and if you encounter a duplicate, you realize it immediately.
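As a sketch of this standard approach (my own illustration, not from the lecture), here is the hash-table idea in Python, using a built-in set as the symbol table:

```python
def find_duplicate(stream):
    """Return the first duplicated element in `stream`, or None if none exists.

    A set plays the role of the hash table: as we encounter elements we
    insert them, and a repeat is detected the moment we see it again."""
    seen = set()
    for x in stream:
        if x in seen:
            return x
        seen.add(x)
    return None
```

This works in one pass, but it needs memory proportional to the number of distinct elements, which is exactly what becomes a problem at billion-element scale.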

Okay, so this is a very simple problem. What makes this problem very hard is: what happens if the list is very big? What if you had to do this not on 1,000 elements, but on a billion elements? Or a trillion elements? Then all of a sudden, all of these simple approaches, like using a hash table or sorting, go out of the window.

Number one, you need a whole lot of memory to actually keep a hash table. Modern computers are perhaps capable of holding hash tables even of this size, but at this scale it becomes very costly. And sorting such a huge list takes a very long time.

So, why would you care about a problem like this? Well, think about Google, for example. Some statistics they released earlier this year say that Google, in its entire lifetime, has seen something like 30 trillion URLs. They crawl something like twenty billion pages a day, and they serve about three billion searches a day, which means about 100 billion searches a month.

So that's the scale at which they are operating. And if you had to answer the simplest kinds of questions about what they do, like: how many queries did they get in a day? How many distinct queries did they get in a day? How many distinct queries did they get in a month, or a year? You're faced with solving what seems to be a very simple problem, but one that is very difficult just because of the immense scale at which you need to operate.

Of course, answering this question of how many distinct queries you have is the very simplest statistic that you might want to compute and keep track of as time goes by. But you might want more sophisticated statistics, like: how much did the queries change from day to day? How different are they today compared to yesterday? This month compared to the previous month? This month compared to the same month last year? And so on. All of these very simple questions just seem incredibly difficult when you're dealing with data at the sizes that Google deals with.

There are other settings where you have lots of data and you'd like to compute really simple things. A network router is one example. It sees a lot of packets whizzing by at fast speed: current home routers are capable of speeds of about one gigabit per second, and commercial routers are capable of speeds of something like ten gigabits per second. The router is a really simple computing device; all it does is receive packets and forward them along. It could perhaps do some very simple analysis on the data as it goes by. How could you use this very simple computing device to do simple monitoring? Like: can you tell me how many active flows there are in the router? Which are the flows that are contributing most of the traffic? These things are important because monitoring them is essential to various security measures, like early detection of denial-of-service attacks.

It's important to keep track of what's going on and what sort of traffic you're seeing, because changes in this might signal that something strange is happening, and you might want to take some security measures. So very simple statistics like these are a huge challenge, just because of the amount of data the router is seeing and the fact that the router itself is a very simple device.

So, in general, there's talk of big data everywhere. If you read the news or read magazines, every other day there's an article about how we're surrounded with data and we need to find ways to make sense of it. What I'm really going to talk about is: what are some of the algorithmic techniques that we can use to tame these big data problems?

Okay, alright. So that's the introduction to what the problems are. Let me talk about what models we could envision for designing algorithms. Clearly, our traditional notions of having all of the data available to us, having an algorithm with random access to the data, fitting all the data in memory: all of this doesn't really apply when we're dealing with data at such large scale. So the operative facts are, number one, the data size is very large. It far exceeds the available memory, and that's just a reality we have to face. We can't actually store all the data in memory.

And the other thing is that there's a new emphasis on efficiency. Usually when you design algorithms, polynomial-time algorithms are considered efficient and implementable. But in reality, when you're dealing with data of this scale, your algorithms must run in linear, or perhaps near-linear, time. Even something like quadratic time, at the scales we're talking about, is infeasible. So those are sort of the parameters.

So one kind of model, which we'll see later in the lecture, is the idea of taking our data and somehow reducing it to a smaller size. It turns out that in many cases we can do this: replace the original data by some kind of compact summary, such that, by discarding the original data and just looking at this compact summary, we can still answer the questions that we care about. It would be wonderful if you could do this, because you would drastically reduce the data size, and it turns out that in many cases this is actually possible.

Now, one thing you can give up on is the idea that you really want to solve a problem exactly. In most cases, like if you want to find out how many distinct elements there were among the queries, do you really care whether I give you a one percent approximate answer or the exact answer? It doesn't really matter. So, in most cases of interest, it's okay to not give the exact answer, but to give something that is fairly accurate. And it turns out that this relaxation, moving from getting the exact answer to tolerating some small error, is something that actually makes many problems tractable. We'll see some examples of this idea.

Another idea that's been very useful is a certain model called the streaming model. The streaming model is the following.

You have to design your algorithms to operate in the following way. Imagine the data as being one stream, and you make one pass over the data, or perhaps a few passes over the data. But you can remember only a small amount of information relative to the size of the data. So, you can afford to remember very little, and you are only allowed to see the data in one pass or in several passes. You can't, for example, zigzag back and forth; you don't have random access to the data. And the point is, if you can actually design an algorithm that works in this very restrictive way, you could actually run it on very large data sets very efficiently.

So it seems like we're really tying our hands and putting a blindfold on. How could we possibly do anything in this very restricted mode of operation? But it turns out that there are very interesting things you can do, even when you restrict yourself to interacting with the data in this very controlled way. Just a little bit of history of

where this notion of streaming algorithms came from.

Actually, this idea of trying to design algorithms that work in one pass over the data, remembering very little, predates the internet and even big data sets, and so on. It was really sort of a fun problem that people worked on. In fact, there's some work by Flajolet and Martin that deals exactly with this

problem of estimating distinct elements. We're actually going to see the idea they came up with, because it's still relevant today, at the data sizes we're actually faced with. Another piece of work, back in 1980, was by Munro and Paterson. They looked at the problem of how much space you need in order to compute the median.

So, you have a data set of size n, and you have only a small amount of space. You can make a few passes over the data. With k passes, how much space do you need to actually compute the median? They found a fairly tight answer to this question: it turns out that if you are allowed to make k passes, you can do it using something like n^(1/k) space. And these algorithms, as I said, were developed very early on, even before this model was really formalized. This model was really formalized by the work

of Alon, Matias, and Szegedy in 1996. This was a landmark paper because it really formalized this model. It talked about what you can and cannot do in this model, and gave examples of how to prove negative results. For this work, they won the Gödel Prize in 2005, which is a very prestigious prize in computer science. Let me also mention that some of you

might have seen or heard about MapReduce. That's another model for interacting with large data sets, proposed by Google, in fact. And it is the basis for a bunch of practical implementations of distributed algorithms; there is something called Hadoop that some of you might have heard of. All of this is relevant to the discussion of how you deal with large data sets very efficiently, but unfortunately I'm not going to have time to talk about all of these things. So, in the rest of the lecture I just want to give you a glimpse of some of the ideas, the ingredients of the algorithms used to deal with these problems.

So far I've given you the problems, and I've told you how we might interact with data sets in order to solve them. Now that we've tied our hands to one pass over this very large data, how do we actually do it? Let me talk about data reduction: the idea of taking a large data set and reducing it to a smaller data set. Actually, if we think about it, this is a concept that we know and are very familiar with in our lives.

So, if you were to ask the question: what fraction of voters today prefer Obama over Romney? These are the kinds of questions people care about, but they are very difficult to answer if you really want to know the exact fraction. Right? You'd have to ask every single voter: what do you think? Whom do you prefer? It takes a lot of time and effort to actually get the exact answer; we really do this only once every four years, for good reason.

On the other hand, this is something that people actually make predictions about every day. Every day, you look at the newspaper and you get some estimate: what's the approval rating of Obama today? What was the approval rating yesterday? And various other questions that you might think about. So, this is something that we solve every day, by sampling. We don't actually ask everybody what their opinion is. But the point is, if you pick a random set of people and ask them their opinion, the percentage that you compute on the random sample is actually a very good estimate of the actual percentage over the entire population. That is George Gallup, and all of you have heard about or read about Gallup polls, done by the Gallup organization that he set up.

Okay, so let's think about sampling. What does sampling do? The idea is: if you pick a random voter and ask them, do you prefer Obama, then the probability that this random voter says yes is exactly the fraction of voters that prefer Obama, right?


It's a basic fact. Do we agree on this?

Okay, alright. It's a very simple thing.

But this is actually very useful. Why?

Because now, if you pick several random voters and independently ask them the same question, then the estimate that you get from the random sample is actually a very good estimate of the actual fraction.

Okay? That's, that's the basis for polling.

That's the basis for random sampling, okay?

So let's take a step back and think about what this actually did. The idea is that you produce a random variable which has the correct expectation. In this case, the expectation is the fraction of voters who prefer Obama; this is the quantity that we actually want to ask about. So you design a random variable that has the correct expectation, you take many independent copies of this random variable, and you average. Because each copy has the right expectation, the expectation of the average is exactly what you want; but the point is that by taking many copies and averaging, you get tight concentration.
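As a hedged illustration (my own example, not the lecturer's), here is the sample-and-average pattern for estimating a fraction of a population:

```python
import random

def estimate_fraction(population, predicate, num_samples, rng=None):
    """Estimate the fraction of `population` satisfying `predicate` by
    averaging independent uniform samples (sampling with replacement).

    Each sample is a 0/1 random variable whose expectation is the true
    fraction, so the average concentrates around that fraction."""
    rng = rng or random.Random()
    hits = sum(1 for _ in range(num_samples)
               if predicate(rng.choice(population)))
    return hits / num_samples
```

With 10,000 samples, an estimate of a 30% fraction typically lands within about a percentage point of the truth, regardless of how large the population is.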

Now, in reality, if you really wanna analyze this.

You need to think about what's the variance of the random variable.

You know, what's the confidence balance that you want on your estimate, and so and

so forth. This is a topic for probability which some

of you may have seen, some of you may not. I'm not gonna go into this.

But believe me that, you know, once you design a variable [inaudible] with the

right expectations. With you know, a little more calculations,

you can sort of determine what sort of sample size you need in order to get.

Confidence bonds and so on and so force. Okay so, I'm only going to focus on.

For all the bonds we care about, we're gonna try to design random variables that

have the right expectations. Okay.

And further, the important thing is that these random variables, will be computable

very, very efficiently by making just one pass over the data.

Okay? That's the, that's the key idea that we're

gonna [inaudible]. Alright, so let's, having said that, let's

go back to our version of distinct elements.

It turns out that the solution that [inaudible] and Martin came up with is, is

tied to a simple puzzle. Okay?

So let me tell you what the puzzle is. I know some of you probably know the answer to this question; if you haven't seen it before, I encourage you to think about it. So here's the experiment. Let's say you have the interval [0, 1], and you throw k random darts inside this interval. That is, you pick k random points in the interval [0, 1]. Everyone, is the experiment clear? So the question is: what's the expected value of the minimum of these k points? Any guesses? If you had to guess, what should the minimum be?


Okay, so zero seems a little extreme, because for it to be zero, you'd want some point to land right at zero. What I'd like to know is: what is it as a function of k? And you're right that, in some sense, as k gets larger and larger, this minimum is going to get closer and closer to zero; we want to understand what the behavior is. >> 1/k is actually a very good guess. >> What about 1/(k+1)? >> 1/(k+1) is actually a better guess. So, okay, which do you guys want, 1/k or 1/(k+1)? >> The probability that a new dart lands lower than the current lowest is going to be 1/(k+1), because each of the k+1 darts is equally likely to be the lowest. And that's the same as the probability of it being lower than the lowest dart ...

>> Okay, okay. But how does that help you compute the expected value of the minimum? >> Because then you know that the probability of a new dart landing lower than the lowest dart is 1/(k+1), so the expected length of that lowest segment is 1/(k+1). >> I think you're suggesting a very slick proof which escapes me at this moment. When you stand in front of the class, your IQ usually drops like 50%. So I think you're probably right, but maybe we should talk about this after class. Okay, so the right answer is

indeed 1/(k+1), and we might have a very clever proof of that here. If you didn't know what the answer should be, the way I usually think about this problem is: it's reasonable to assume that these k points are roughly uniformly spread in the interval [0, 1]. And if that's really the case, then each of the gaps should be about 1/(k+1); there are k+1 gaps. So, in fact, the minimum, which is the first gap, would be about 1/(k+1). This is very crude; it's by no means a proof. But if we're to guess, 1/k or 1/(k+1) is a reasonable guess, and if you actually do the calculations, it turns out that it is indeed 1/(k+1). Alright, good.
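You can check the 1/(k+1) answer empirically. This simulation is my own sketch, not from the lecture:

```python
import random

def avg_min_of_darts(k, trials, rng):
    """Empirical estimate of E[min of k uniform random points in [0, 1]]."""
    return sum(min(rng.random() for _ in range(k))
               for _ in range(trials)) / trials
```

For k = 9 darts, the average minimum over many trials comes out very close to 1/10, matching the 1/(k+1) formula.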

So if you take the minimum of k random numbers in [0, 1], the expected value is 1/(k+1). Why is this actually important to the question that we started out with? The question was: how do you estimate how many distinct elements there are in your data stream?


Here's the idea: apply a random hash function that maps each element to a random point in [0, 1], and keep track of the minimum hash value you see. Duplicates hash to the same value, so the minimum behaves like the minimum of K random darts in [0, 1], where K is the number of distinct elements. Repeat with several hash functions and average, and you get a very good estimate of 1/(K+1). Invert this, and you have a very good estimate of the number of distinct elements. There's a really beautiful trick,

originally discovered by Flajolet and Martin. It's amazing how it completely bypasses the need to store any sort of hash table or do any sort of sorting, and comes up with a very good estimate of the number of distinct elements.

Okay? Alright.
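Here is a minimal one-pass sketch of this distinct-elements idea in Python. Simulating independent random hash functions via salted hashing is my own simplification, and averaging the minimums before inverting is a crude, slightly biased estimator, but it shows the shape of the algorithm:

```python
import random

def estimate_distinct(stream, num_hashes=64, seed=0):
    """Estimate the number of distinct elements in one pass.

    Each simulated hash function maps an element to a pseudo-random point
    in [0, 1); duplicates map to the same point, so only distinct elements
    matter. We track the minimum per hash function. Since E[min] = 1/(K+1)
    for K distinct elements, we average the minimums and invert."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    mins = [1.0] * num_hashes
    for x in stream:
        for i, salt in enumerate(salts):
            # Derive a reproducible pseudo-random point in [0, 1) for (salt, x).
            h = random.Random(hash((salt, x))).random()
            mins[i] = min(mins[i], h)
    avg_min = sum(mins) / num_hashes
    return 1.0 / avg_min - 1.0
```

The memory used is just `num_hashes` floats, no matter how long the stream is or how many times elements repeat.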

Next I want to talk a little bit about, and I don't have very much time, so I'm going to have to breeze through this section, a class of methods that are used to figure out, in a very large collection of documents, which documents are similar.

Again, this is actually an important problem. Number one, when a web search engine's crawler goes out, it brings in a bunch of documents, including many duplicates and near-duplicates, and you want to get rid of these. There's no point in Google returning 100 results which are almost the same; it wants to return results that are distinct, and ideally not even store these duplicate documents. And in addition to not storing them, at query time you might want to figure out that some of the results being returned to the user are almost the same, and suppress them. If you've searched on Google, you might often have seen a message like: there are some results that are very similar to the ones we have shown you, and we suppressed them; if you'd like to see them, click here. So there are various reasons why you'd like to detect near-duplicate documents.

Now, this is sort of a difficult question: how do you find and eliminate near-duplicate documents? I'm going to describe a scheme that was originally developed by Andrei Broder and his collaborators, in the context of a search engine called AltaVista, which in the old days predated Google and was the leading search engine of its time. I'd be surprised if very many of you know about AltaVista at all. So, let me very quickly describe the main idea behind this.

It was a very cute idea. First, let me ask you: how do you define similarity of documents? What does it mean for two documents to be similar? There are many ways of doing this. One way is the following. You take your document, say the Declaration of Independence, and you slide a window of three words across the document, looking at all the three-word sequences that you encounter. The collection of these sequences is your set. So now I've taken the document and produced a set, and we'll think of the set as representative of the document. Having produced sets, I can now define similarity of two sets, and I'll use that as a measure of similarity of the original documents. One reasonable measure of similarity of two sets is the following: look at how much they overlap, the size of the intersection divided by the size of the union.

So henceforth let's agree that this is the measure of similarity that we'll use.
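To make this concrete, here is a small sketch (my own, with made-up example sentences) of three-word shingling and the intersection-over-union measure:

```python
def shingles(text, k=3):
    """All k-word sequences obtained by sliding a window over the text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)
```

Two sentences that differ in a single word still share most of their shingles, so their similarity stays high, while unrelated sentences share almost none.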

If two documents are identical, the sets they produce would be identical, so the similarity would be one. And we're going to work with the hypothesis that similarity defined this way is a reasonable measure of document similarity.

So here's the question: can you produce a sketch, a very short summary of a document (a set, in this case), such that you can estimate the similarity just by looking at two sketches, instead of actually looking at the original documents? Clearly, I could look at the full text of the documents and compare them directly, but that would be too slow, and not acceptable. So, can I get rid of the original documents, replace them all by short sketches, and estimate the similarity just by looking at these short sketches? That is something you could actually hope to do at run time; if you had to go and look at the original documents at run time, that would not be a good idea. Alright.

The first thing that might come to mind is random sampling. You might say, well, how about storing a random subset of the elements? This just does not work. Say you have two identical sets with ten thousand elements each, and you take a random sample of 100 elements from set A and a random sample of 100 elements from set B. The expected intersection of these two samples is one. It gives you no information whatsoever that these two sets are identical. So random sampling, per se, doesn't work.

But here's a little variant of random sampling that actually works.

It's actually very cute, and it's related, in some sense, to what we did before. So here's the idea. Imagine you have a random hash function on the universe of elements. You apply it to every element of the set, and you just remember the minimum hash value.


So, great. The key fact is that the probability that the minimum hash value of set A equals the minimum hash value of set B is exactly the similarity of A and B: the size of the intersection over the size of the union. So we're able to design a random variable that has exactly the right expectation. To get a high-confidence estimate, you would do this not with one hash function, but with, say, 100 hash functions, and for every set you would store a sketch consisting of the minimum under each of the 100 hash functions. And now you can use these sketches to get a very good estimate of similarity. Alright.
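Here is a minimal min-hash sketch along these lines. Simulating the 100 random hash functions with salted built-in hashing is my own assumption; the similarity estimate is simply the fraction of hash functions on which the two minimums agree:

```python
import random

def minhash_sketch(s, num_hashes=100, seed=0):
    """For each simulated random hash function, store the minimum hash
    value over the set. The sketch has fixed size regardless of |s|."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_similarity(sketch_a, sketch_b):
    """Fraction of hash functions whose minimums agree. For a random hash
    function, the minimums agree exactly when the overall minimum lands in
    the intersection, so this fraction estimates intersection over union."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Once every document is replaced by its 100-number sketch, pairwise similarity can be estimated without ever touching the original texts.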

I had hoped to talk about a lot more, which I will skip. But maybe I'll give the slides to [inaudible], and he can post them. There's another scheme to estimate similarity that I actually worked on when I was at Google; I'm not going to be able to tell you about it, but you can look at the slides a little later.

Many of these techniques are really useful beyond the problems that we talked about. The idea of random projections, which we didn't quite discuss, is really useful in dealing with very large data sets; it's a very useful technique for reducing data sizes. Nearest-neighbor data structures are very useful in searching large, complex, high-dimensional data sets. And designing hash functions like the ones we saw for set similarity is actually the building block of designing very efficient data structures for these search problems. There's a little issue with randomness: we've sort of assumed that everything is completely random, that we have completely random hash functions. It turns out that in practice this is not really true. There's a way to fix this, which I'm afraid we won't be able to discuss.

And so, in conclusion, I just wanted to say that there are a lot of cool theoretical ideas which are very practical, useful for designing algorithms and working with very large data sets. I wanted to give you a glimpse of this algorithmic toolkit. It was a shorter glimpse than I had thought it would be, but there are many interesting problems here, and I hope I've motivated some of you to think about them.