You can sort them, and then just go through the sorted list.

So either of these solutions would work. And, of course, this is a very simple

problem, one that you might encounter in a first

algorithms course. You know, a standard solution is to use a

hash table. You could also use sorting, right?

A hash table is just the same thing as a symbol table.

As you encounter elements, put them into the hash table.

And obviously, if you encounter a duplicate, you'll realize it.
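To make the two standard approaches concrete, here is a minimal sketch in Python; this is my own illustration, not code from the lecture, and the function names are hypothetical:

```python
def find_duplicate(items):
    """Return the first duplicated element, or None if all are distinct.

    A Python set is a hash table, so each membership check and insert
    takes O(1) expected time; the whole scan is O(n)."""
    seen = set()
    for x in items:
        if x in seen:
            return x        # duplicate spotted the moment we see it again
        seen.add(x)
    return None

def has_duplicate_by_sorting(items):
    """The sorting alternative: after sorting, any duplicate must sit
    next to its twin, so one linear scan suffices (O(n log n) total)."""
    s = sorted(items)
    return any(a == b for a, b in zip(s, s[1:]))
```

Either one works fine at small scale; the hash table finds the duplicate in one pass, while sorting pays an extra log factor but needs no auxiliary table.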

Okay, so this is a very simple problem. What makes this problem very hard is:

what happens if the list is very big? What if you had to do this not on

1,000 elements, but on a billion elements?

Or a trillion elements. Okay?

Then all of a sudden, all of these simple approaches, like using a hash

table or doing sorting, go out of the window.

Well, number one, you need a whole lot of memory to actually keep

a hash table. And of course modern computers are perhaps

capable of holding hash tables even of this size.

But at this size, you know, it becomes very costly, okay?

Sorting, again: sorting such a huge list takes a very long

time. Okay?

So, why would you care about a problem like this?

Well, think about Google, for example. Some statistics that

they released earlier this year say that Google, in its entire

lifetime, has seen something like 30 trillion URLs.

They crawl something like twenty billion pages a day, and they serve about three

billion searches a day, which means about 100 billion searches a month.

Okay. So that's the scale at which they are

operating. And if you had to answer

the simplest kinds of questions about what they do,

Like, you know, how many queries did they get in a day?

How many distinct queries did they get in a

How many distinct queries did they get in a month or a year?

You're faced with solving what seems to be a very simple problem,

but one that is very difficult just because of the immense scale at which you need to

operate. Okay.

Of course, answering this question of how

many distinct queries you have is a very simple statistic that you might want to

compute and keep track of as time goes by.

But you might want more sophisticated statistics, like: how much did

the queries change from day to day? How much are they different today compared

to yesterday? How much are they different this month

compared to the previous month? This month compared to the same month last

year, and so on and so forth. And all of these very, very simple

questions just seem incredibly difficult when you're dealing with, you know,

data at these sizes. Okay.

You know, there are other settings where you have lots of data and you'd like to

compute really simple things.

You know, a network router is one example. It sees a lot of packets whizzing by at

high speed. Current home routers are capable of

speeds of about one gigabit per second.

Commercial routers are capable of speeds of something like ten gigabits per

second. The router is a

really simple computing device: all it does is receive packets and forward

them along. It could perhaps do some very simple sort

of analysis on the data as it goes by. How could you use this very, very simple

computing device to do simple monitoring? Like, can you tell me how many

active flows there are in the router?

Which are the flows that are contributing most of the traffic?
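One classical way a memory-limited device could track heavy flows in a single pass is the Misra-Gries algorithm. The lecture hasn't introduced it, so treat this as a preview sketch; the flow labels and the budget k are my own illustrative choices:

```python
def misra_gries(stream, k):
    """Track at most k-1 counters in one pass over the stream.

    Any item occurring more than len(stream)/k times is guaranteed to
    survive as a candidate (its reported count may undercount the
    true frequency, but it will never be dropped)."""
    counters = {}
    for flow in stream:
        if flow in counters:
            counters[flow] += 1
        elif len(counters) < k - 1:
            counters[flow] = 1
        else:
            # No room left: decrement every counter and drop the zeros.
            for f in list(counters):
                counters[f] -= 1
                if counters[f] == 0:
                    del counters[f]
    return counters
```

With k = 4, any flow carrying more than a quarter of the packets is guaranteed to appear among the surviving counters, while the router remembers at most three labels at any time.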

And these things are important because monitoring these sorts of things is

essential to various security measures, like early detection of denial-

of-service attacks. Okay?

It's important to keep track of what's going on, what sort of traffic

you're seeing, because changes in this might signal that something strange is

happening, and you might want to take some security measures.

Okay? So very, very simple statistics like this

are a huge challenge, just because of the amount of data that this router is

seeing, and the fact that the router itself is a very, very simple device.

Okay? So.

You know, in general, there's this talk of big data everywhere.

If you read the news or read magazines, every other day

there's an article about how we're surrounded with data and we need to

find ways to make sense of the data. What I am really going to talk about is:

what are some of the algorithmic techniques

that we can use to tame these big data problems?

Okay, alright. So that's the introduction of

what the problems are. Let me talk about what models we

could envision to design algorithms in. So clearly, our traditional notion

of having all of the data available to us,

having an algorithm that has random access to the data,

fitting all the data in memory: all of this doesn't really apply when we

are dealing with data at such large scale. So the two operative points are,

number one, the data size is very large.

It far exceeds the available memory. And that's just a reality we have to face:

we can't actually store all the data in memory.

And the other thing is that there's a new emphasis on efficiency.

Usually when you design algorithms, polynomial-time algorithms are considered

efficient and implementable. But, in reality, when you are dealing with

data of this scale, your algorithms have to run in linear,

or perhaps near-linear, time.

Even something like quadratic time, at the scales we talked about, is

infeasible, okay?

So those are, sort of, the barometers. Okay.

So one kind of model that's used, and we'll see it later on in the

lecture, is the idea of taking our data and

somehow reducing it to a smaller size, okay?

So, it turns out that we can do this in many cases.
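For instance, here's a toy illustration of this reduction (my own example, not one from the lecture): three running numbers are enough to answer mean and variance queries after the raw data is thrown away.

```python
def summarize(stream):
    """One pass over the data, keeping only three numbers:
    the count, the sum, and the sum of squares."""
    n = s = s2 = 0.0
    for x in stream:
        n += 1
        s += x
        s2 += x * x
    return n, s, s2

def mean(summary):
    n, s, _ = summary
    return s / n

def variance(summary):
    # E[X^2] - E[X]^2, computed purely from the compact summary
    n, s, s2 = summary
    return s2 / n - (s / n) ** 2
```

The stream could be a trillion elements long; the summary is still just three numbers, and any mean or variance query can be answered from it exactly.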

Replace the original data by some kind of compact summary,

such that, by discarding the original data and just

looking at this compact summary, we can actually

answer the questions that we care about. Now, it would be wonderful if

you could do this, because you would drastically reduce the

data size. It turns out that in many cases, this is

actually possible, okay? Now, one thing you do give up

on is the idea that you really want to solve the problem exactly.

In most cases, like, if you want to find out how many distinct

elements there were in the queries, do you really care

whether I give you a one percent approximate answer, or whether I give you

the exact answer? It doesn't really matter.

So, in most cases of interest it's okay to not give the exact answer, but to give

something that is fairly accurate. And it turns out that this relaxation, moving

from getting the exact answer to tolerating some small

error, is something that actually makes many problems tractable.
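As a taste of how tolerating a small error helps, here is a sketch of a classical "bottom-k" (KMV) distinct-counting estimator; the lecture hasn't named any algorithm yet, and the memory budget k is a parameter I'm choosing for illustration. It keeps only the k smallest hash values ever seen: if the n distinct elements hash to roughly uniform points in [0, 1), the k-th smallest value lands near k/n, so (k - 1) divided by that value estimates n.

```python
import hashlib
import heapq

def kmv_distinct(stream, k=256):
    """Estimate the number of distinct elements using O(k) memory."""
    heap = []     # max-heap (negated values) of the k smallest distinct hashes
    kept = set()  # the same hashes, for O(1) duplicate checks
    for item in stream:
        digest = hashlib.sha1(str(item).encode()).digest()
        x = int.from_bytes(digest[:8], "big") / 2.0**64  # pseudo-uniform in [0, 1)
        if x in kept:
            continue                      # duplicates change nothing
        if len(heap) < k:
            heapq.heappush(heap, -x)
            kept.add(x)
        elif x < -heap[0]:                # smaller than current k-th smallest
            kept.discard(-heapq.heappushpop(heap, -x))
            kept.add(x)
    if len(heap) < k:
        return len(heap)                  # fewer than k distinct: exact count
    return int((k - 1) / -heap[0])        # k-th smallest hash is about k/n
```

The standard error is about 1/sqrt(k), so k = 256 gives you roughly six percent accuracy using a few kilobytes of memory, no matter how long the stream is.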

Okay? So we'll see some examples of this

idea. Another idea that's been very

useful is a certain model called the streaming

model. And the streaming model is the following.

It says you have to design your algorithm

to operate in the following way. Imagine the data as being one

stream. And you make one pass over the data,

or perhaps a few passes over the data.

But you can remember only a small amount of information relative to the size of the

data. So, you can afford to remember very little,

and you are only allowed to see the data in one pass or in several

passes. You can't, for example, zig-zag

back and forth. You don't have random access to the data,

okay? And the point is, if you've actually

designed an algorithm that works in this very, very restrictive way, you could actually

run it on very large data sets very efficiently.
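Here is one small example of what is possible under exactly these restrictions (my own illustration, not something the lecture has covered yet): reservoir sampling keeps a uniformly random element of the stream using a single pass and constant memory, without ever knowing the stream's length in advance.

```python
import random

def reservoir_sample(stream, rng=None):
    """Return one uniformly random element of the stream.

    One pass, O(1) memory: the i-th element replaces the current
    sample with probability 1/i, which makes every element equally
    likely to be the survivor at the end."""
    rng = rng or random.Random()
    sample = None
    for i, x in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:
            sample = x
    return sample
```

Note the algorithm never rewinds and never stores more than one element and one counter, yet the guarantee holds for streams of any length.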

Okay. So it seems like we're really, really

tying our hands and putting a blindfold on.

How could we possibly do anything in this very, very restricted mode of operation?

But it turns out that there are very interesting things that you can do, even

when you restrict yourself to interacting with this data in this very controlled

way. Now, just a little bit of history of

where this notion of streaming algorithms came from.

Actually, this idea of trying to design algorithms

that work in one pass over the data, remembering very little.