We're going to begin by putting our programming knowledge and

our ability to be able to scientifically study programs in operation,

real applications to good use by studying one of

the most important applications of computing: sorting and searching.

We'll start with a typical example of a client program.

It's called a whitelist filter.

So, this is something that comes up often in applications.

Maybe you're familiar with the idea of a blacklist.

So, that's a list of things that are going to be rejected for service.

So, for example, if your credit card is over

overdrawn or if you have spam email, you want to reject that.

The opposite of that is called a whitelist,

that stuff that's going to be accepted for service.

And it's very similar.

So that's when your account is in good standing,

the charge will go through any one email for your friends and relatives.

So, what we're going to talk about just as

an example of a typical client is a whitelist filter.

So, the idea is you're going to have a file that has

a strings of all the things that you want to accept the whitelist.

And then, we're going to read strings from standard in, say,

that's a much larger input,

and we're going to write to standard out only the ones that are in the whitelist.

And, of course, then there might be information attached and we can use

this to accept email or to accept an account charge.

So, in our examples,

we'll leave out this associated action or message contents,

say, in an email spam,

but here's an example.

So, let's say, I want to only accept messages from these email addresses: Alice,

Bob, Carl, and Dave.

So now, a standard in is the stream of emails that I'm presented with.

So, if I get Bob.office,

that's in the whitelist.

So, I'll check that and write it out to standard out.

And that's just a surrogate for processing it further.

Carl, that's in the whitelist.

So that gets written out.

Marvin, it's spam.

That's not in the whitelist.

So, we ignore it at this point.

That's the purpose of the whitelist filter,

is to keep those bad emails off the list that gets processed.

There's another one from Bob that gets checked.

Still another one.

Here's some more spam, that doesn't make it.

There's one from Dave that gets checked and goes to standard out.

There is our friend Eve.

That one's not going to make it.

And there's Alice, and so forth.

So, that's the whole idea,

is to filter out the things that we want to

reject or to filter through the things that we want to accept for service.

So, we want to write Java program that has this functionality,

and where both the whitelist and the list of transactions could potentially be very huge.

So, here's what the client code is going to look like.

What we need is a search method,

and that's going to be our focus for a lot of this lecture.

And so, that's going to be a method that takes a string and an array as input,

and it's going to let us know whether that string called key is in the array.

So, given that we have that method, and remember,

this is a client and we're going to look at the implementation of

that method much later in this lecture.

Here's what the main looks like.

So, the first argument is a file that contains the whitelist.

And so, we use our read all strings method from our in and

our standard IO library to take all the strings on

standard input and put it in an array called words, that's the whitelist.

Then from standard input,

as long as it's not empty,

we're going to read a key and then call

our search method to see if that key is in the list of words.

The interface for that method is to return minus one if it's not there.

If it is there, it returns its index in the array.

So, in this case, all we need to know is is it not equal to minus one?

That means it is there.

That's when we print it back out to standard output.

And that's it. That's our client.

So, these are the examples that I just gave.

If white4 is our whitelist and test is that list that I gave the example,

if we call this program white filter with white4.text as its argument,

then it will read that whitelist in,

and then piped in from test.tech

goes into standard input or we get printed are the ones that are in the whitelist.

So, that's the setup.

Our goal now is to write this search program that can enable this behavior,

particularly, when both the whitelist and the list from standard input is very large.

So, here are our friends Alice and Bob again.

So, Bob says, " Well, Alice.

I think I'm going to start an Internet company."

And Alice says, " Sure. Me too.

I'm going to try to have 1,000 customers by next month and maybe a million next year."

And Bob says, " Well, we're going to grow even faster than that."

Alice says, " Good luck.

By the way, you know you're going to need a whitelist filter,

because you're going to have accounts and

clients and you're going to have to accept and reject."

And he said, " Yes, I know.

I'm going to go to a hackathon to knock it out."

And Alice says, " Well,

I'm going to take a few CS courses,

maybe an online course like this one first."

So, we'll follow Alice and Bob through this process.

Okay. So here's our first try,

which is a simple strawman implementation called Sequential search.

And the whole idea is you've got the array that's got the whitelist and you

just go through the array to see if you have a match with the search string.

If the match is found,

you return that index,

and if it's not, you return minus one.

That's got the functionality that we need for our whitelist filter.

And the code for that is pretty simple.

Go through all the array,

check if the array element is equal to the key,

if it is, return it, if you don't find it, return minus one.

So, say we have this array of 15 words,

the names, and we're looking to see if Oscar is in there.

So, we're just searching through the array looking for Oscar.

Maybe this is the code that Bob came up with in his hackathon.

First thing to notice about this is it didn't stop at Oscar,

didn't even find it.

Well, so, let us just getting our heads back together on

our basic parameters for how to do things in Java.

Think about why this didn't work,

didn't stop at Oscar.

If you think about it for a minute,

you'll see that what it's doing is with equal equals,

that doesn't actually compare the strings,

that compares the references.

So, it's checking to see if those two things refer to precisely,

the same sequence of characters, whereas,

what we want is to check if the strings

that they refer to are equal character by character.

So what we need is to use compare to.

So, okay, that's all right.

So, now, let's fix it by putting in a [i].compareTo

which is going to return zero if and only if the strings match character by character.

And in that case, then this works,

it finds Oscar and returns 10.

And Bob says, "Okay.

This was even easier than I thought."

But as we know,

we have the ability to test the performance of this before

we try to run it in production on large inputs.

And so, we want to do two things.

We want to do a rough mathematical analysis

and then we want to do some empirical testing,

and we want to know if those models match,

and then we can proceed from there.

So, let's just think of a simple mathematical model for

how this thing performs with sequential search.

So, let's say that we've got our whitelist length is N,

and then for some constancy,

we have cN transactions.

So, 10 times or 100 times as many transactions as there are on the whitelist.

This situation might be different in an application but this will give us an indication.

And we'll also assume the string length is not too long compared to N. Really,

the number of names is what's huge.

So, here's a quick analysis of that.

So, first thing is if it's on the whitelist, what's going to happen?

Well, if everything's random,

you're going to go about halfway through the list on the average.

So, that's an easy thing to analyze.

Even easier is if it misses,

then it's going to go all the way through the whitelist on the average,

either halfway through or all the way through.

Well, if you've got a constant times N transactions and either you're checking N strings,

or N over 2 strings,

the expected order of the growth of the running time is going

to be proportional to N squared, it's quadratic.

So that's a mathematical analysis we should expect

the running time for random strings to be quadratic.

Well, let's check that with empirical tests.

So first thing we need is an input generator,

so this is a short program that we're going to use to

generate random strings of a given length from

a given alphabet and this will give us

the flexibility to test our programs in lots of different situations.

So, it's got a method random string

that takes the length as one argument and an alphabet,

the characters that make up the string as the second argument.

So, what we're going to do is make an array of characters

that is the length that we want and then we're going to go

through that array and what we're going to do for

each position in that array is generate a random integer,

between zero and the length of

the alphabet and then pick out that character in the alphabet.

So, it's a random character out of the alphabet and we're going to assign that to a of i,

and then when we're done we'll just make a new string from that array of characters.

So that's a simple method to return random strings.

Then our driver will take the number of strings that we want from the command line,

and then the length as the second argument,

and the alphabet as a third argument and then

just generate N strings of a given length from the given alphabet.

A pretty simple program,

once you've seen it,

and here's the type of things we can do.

So we can say, give me 10 strings of length

3 from the alphabet ABC and that's what it does.

And that might be useful in a situation where we have not many customers,

but lots of transactions from those customers.

N turns out that the performance of sorting and

searching algorithms can vary depending on how many duplicates there are,

so that one is a good chance of duplicate.

Or if we take say eight digits and we take our alphabet to be from zero through nine,

then we're going to have much less of a chance of duplicate.

Or in another application that we're going to look at later we might have

a really long string just one from the alphabet ACTG,

we use that to test algorithms from genomics where it's a long string,

there might be only one of them,

and all of those situations are covered by this generator.

Okay, so let's take a look at what

the test client for our sequential search program will look like.

So, we'll match what we assume for a mathematical model and

say we'll do 10 N searches in a whitelist of length N. So,

we'll first have our search routine,

that's the code that we just discussed and then

here's what the main routine looks like for this.

So again, we take our whitelist from

standard input and so that's the length of the whitelist is N,

and now we'll keep track of the time

using System.currentTimeMillis divided by a 1,000 system in seconds.

We could also use our stopwatch class that we developed earlier

and now for I from 0 to 10 times N,

so that we're going to do 10 times N in this loop.

And what we'll do is just pick a word at random out of the whitelist,

so we know that our search will be successful so there'll be no output.

So, that's good for running a big and parallel test like this to have no output.

So then we're just going to check whether or not it's successful.

We know it's going to be successful so we're not going to print it out,,

but we did the search,

and then after we've done 10 N searches,

then we'll check the time again and what we want to print out is

the number of seconds that it takes to do this process.

So with this test client now we can run this for huge files using our generator.

So for example this generator generates

10,000 words of length 10 all lowercase characters,

and I just abbreviated the alphabet with A to Z in this,

so it would fit in the slide.

So in this case, if we run this we're

generating 10,000 10-letter words and we're going to print out

the time to do 100,000 searches in that list of

10,000 10-letter words and we get the result three seconds.

So now, we can use this test client to go ahead and do doubling tests of

the type that we learn when studying performance in the first part of the course.

So again, weightless of size N, 10 N transactions,

and all I'm going to change is the size of the file from that last example,

so we get three seconds for 10,000.

If we run it for 20,000 it takes nine seconds and also

keeping track of the transactions per second just as a measure of how well we're doing.

Run it for 40,000 and it takes 35 seconds.

Now we start to compute the ratio of the time for

N over the time for N over to 80,000 149 seconds.

It's taking a fair amount of time

for sure and the transactions per second is going way down.

So, Bob wants to have his million customers,

well, it's going to take 38,000 seconds.

That's 10 hours just to do this whitelist thing

and what's worse is it's down to 34 transactions per second and dropping.

So, Bob can see that's maybe not too

good and if we go and just look a little more carefully at what the doubling test means,

remember that this ratio tells us the exponent

in the growth since their running time is proportional to

a constant times N to the B with this doubling test,

if you take a log to the base two of this ratio,

that's going to give the exponent.

So this says, that the order growth is N squared log based to a force two.

So, that validates the mathematical model and we have a lot of

confidence that the running time in this thing is going to be proportional to

the square of the size of the whitelist and that

doesn't scale and that's not going to work too well for Bob's business.

So we're going to have to look at better methods than

sequential search for solving this problem.