0:02

Now we're going to take a look at what happens when we have significant numbers of duplicate keys, which is not at all unusual in practical applications. In fact, often the purpose of a sort is to bring items with equal keys together, like the example that I gave where we had cities and times: there's a lot of detailed data for each time, and maybe the whole goal of the sort is to group the records by city so we can ship out the data for each city to each city. And there's plenty of other examples like that in data processing, where maybe we want to remove duplicates from a mailing list, or sort all the job applicants that we get by the college attended. So sorts are run on huge files with huge numbers of duplicate keys, and it's worthwhile to take a careful look at what the implications of that are. So again, the typical characteristics: a huge file, but a small number of different key values. So let's look at our algorithms in

that situation. For Mergesort, it doesn't matter that much what the key values are like; we can show that Mergesort always uses between 1/2 N log N and N log N compares. For Quicksort, though, up until the 1990s the most widely used implementation took quadratic time for files with large numbers of equal keys. That was actually found by a user in an application, and it was the standard Quicksort that was in almost all the textbooks: if you did not stop the partitioning scans on equal keys, it would run in quadratic time. So we want to do better than this. So the mistake happens if we

put all the items equal to the partitioning item on one side, which is a natural way to implement it. The consequence is that if all the keys are equal, partitioning doesn't really do anything: it just peels off one key, so a file of size n gives a subfile of size n - 1, then n - 2, and so forth, and the result is a quadratic-time algorithm. In our implementation, we stop the

partitioning scans on items equal to the partitioning item, and in that case, when all the keys are equal, partitioning divides the file exactly in the middle, and you get N log N compares. But actually, when you

think about it, why don't we just put all the items equal to the partitioning item

in place? That's really a desirable way to look at it, so let's take a look at that option. The goal is so-called 3-way partitioning. What we

want to do is get the array into three parts, so now we have two pointers into the middle: one that is the boundary between the keys that are less than the partitioning element and those that are equal to it, and another that's the boundary between the keys that are equal to the partitioning element and those that are greater. In the middle are all the equal keys, and that's what we'd like to arrange. Now, until the 1990s, the conventional wisdom among people implementing systems was that it wasn't worth doing this. But it's actually a problem that Edsger Dijkstra had proposed in the 70s as an example of a

programming problem. He was really interested in analyzing the correctness of programs, and in showing how you could convince yourself that a program was operating as expected. But in the 1990s, taking a look at the qsort that a user found was broken, we figured out that this was really going to be an effective way to sort, and now this method is incorporated into plenty of system sorts. So let's take a look at how it works with a demo. It's a bit more complicated than standard Quicksort partitioning, because there's more to do. Now we have our i pointer, which is just to the left of the stuff we haven't seen yet, and we have two other pointers that maintain these boundaries: everything to the right of gt is known to be greater than the partitioning element, everything to the left of lt is known to be less, and everything between lt and i is known to be equal. All the method does is maintain this invariant, so let's do an example or two and see how that works. So we increment i and

then figure out what to do. So, now it's less than the partitioning element. So,

5:15

what we want to do is exchange it with the item at lt. So now we have one that's definitely less than the partitioning element, and lt is pointing at the partitioning element. So now increment both lt and i; that's the first case. So now we

have a second item, b, that's less than the partitioning element, so exchange it with lt and increment them both. So we're moving the items smaller than the partitioning element to the left of lt and keeping lt pointing at the partitioning element. Now, what about when we get one that's greater than the partitioning element? In that case, we exchange it with the item over at the right, at gt, and decrement gt.

So now, we have one over to the right that we know is greater than the partitioning

element. Notice that we didn't increment i, because the element z that came over from the right really hasn't been compared to the partitioning element yet. But we did decrement gt, so we made progress. So now what happens here? i is pointing to a

bigger one so we're going to exchange it with the one at gt and decrement gt again.

And again, y is bigger, so exchange it and decrement gt. Now we have the c; that one's smaller, so that's the first case: we move it to the left of lt and increment both lt and i. W is a bigger one, so it goes over to the right. Now we have i

pointing to an element that's equal to the partitioning element. And what are we

supposed to do then? Well, to maintain the invariant, we just need to increment i. And again it's pointing to one that's equal to the partitioning element, so increment i.

And now, one more time: it's pointing to one that's greater, so we exchange that with the item at gt and decrement gt; now i is pointing to the one that was there, and that one's smaller, so we exchange it with the item at lt and increment both i and lt. And now the pointers i and gt have crossed, so there's nothing that we haven't examined yet, and our partitioning is complete. Here's a

trace of Dijkstra 3-way partitioning for his problem, which is when there are just three different values in the file. In our problem, we're treating keys equal to the partitioning element as one value, keys less than it as another, and keys greater than it as a third. And so this trace illustrates how we always make some progress and eventually get the file sorted. The implementation is amazingly

simple. The whole partitioning process for 3-way partitioning, in a modern programming language like Java, simply maintains the invariant described in the demo: if we find one that's less, we exchange the items at i and lt and increment them both; if it's greater, we exchange the items at i and gt and decrement gt; otherwise, we increment i.
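Those three cases translate almost line for line into Java. Here's a minimal sketch (the class and method names are mine, not the course's code; the recursion on the two outer subarrays is included so the example is runnable):

```java
// Sketch of Quicksort with Dijkstra 3-way partitioning.
// Invariant inside the loop: a[lo..lt-1] < v, a[lt..i-1] == v,
// a[gt+1..hi] > v, and a[i..gt] is not yet examined.
public class Quick3way {

    public static void sort(char[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(char[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi;
        char v = a[lo];               // partitioning element
        int i = lo;
        while (i <= gt) {
            if      (a[i] < v) swap(a, lt++, i++);  // less: grow the left region
            else if (a[i] > v) swap(a, i, gt--);    // greater: send to the right,
                                                    // don't advance i (new a[i] unexamined)
            else               i++;                 // equal: leave in place
        }
        sort(a, lo, lt - 1);          // sort keys less than v
        sort(a, gt + 1, hi);          // sort keys greater than v
        // a[lt..gt] (all equal to v) are already in final position
    }

    private static void swap(char[] a, int i, int j) {
        char t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        char[] a = "PABXWPPVPDPCYZ".toCharArray();
        sort(a);
        System.out.println(new String(a));  // prints ABCDPPPPPVWXYZ
    }
}
```

Note that the equal keys between lt and gt are never touched again after the partitioning step, which is exactly why duplicate-heavy files are handled so cheaply.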

It could hardly be simpler; in fact, it's simpler in many ways than the standard implementation of Quicksort. There's an argument for just using this implementation of Quicksort and forgetting about the others, because it performs so well in so many practical situations. Here's a visual trace of 3-way Quicksort for a situation with plenty of equal keys. And again, when there are a lot of equal keys, then whenever one of them is chosen as the partitioning element, a big chunk of the array gets handled just in the partitioning process.

9:20

Now, there are actually some deeper reasons why this method is important. One thing to realize is that the lower bound we talked about before depended on the keys being distinct; the worst case for that lower bound is when the keys are all distinct, and if we have a situation where there are a lot of equal keys, that model is wrong. It's not too difficult to get a similar lower bound for the model where we know that there are some equal keys. So, for example, there's a rather complicated formula, but not too bad, that says if you know that the ith key value occurs xi times, you can write down a lower bound for the number of compares that are going to be required in the worst case. And what's interesting about 3-way partitioning is that the number of compares it uses is equal to this lower bound, within a constant factor. So it's entropy-optimal, and what that means is that whatever the distribution of equal keys, this method uses a number of compares proportional to the best you could possibly do. The proof of this fact is quite beyond the scope of this course, but it's still an important fact. The bottom line is that if you randomize the order and use 3-way partitioning, then there are lots of applications where your sort routine is going to be linear, not N log N, so it will be much faster than Mergesort; it's an effective method for a really broad class of applications. So, taking a careful look at equal keys is something that can lead us to a very efficient Quicksort.
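The "rather complicated formula" referred to above isn't read out in the lecture; I believe it is the standard information-theoretic bound, sketched here in LaTeX (with n the number of distinct key values, x_i the number of times the ith value occurs, and the x_i summing to N):

```latex
% Any compare-based sort must distinguish among the N!/(x_1! x_2! \cdots x_n!)
% distinct arrangements of the keys, so in the worst case it needs at least
\lg \frac{N!}{x_1!\, x_2! \cdots x_n!}
\;\sim\; -\sum_{i=1}^{n} x_i \lg \frac{x_i}{N}
\quad \text{compares.}
```

When the number of distinct values n is a small constant, this bound is linear in N, which is consistent with the claim that 3-way Quicksort runs in linear time on such files.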