0:01
Now, we'll take a look at how the sorting algorithms that we talked about or
expressed in the systems that we use everyday. Now, the key point is that
sorting algorithms rhythms are essential in a very broad variety of applications
and, and all of us use sorting algorithms pretty much every day. Many obvious out
applications like or, organizing your music library or displaying your search
results or listening feeds in your in your web browsers. There's some other
applications that are not so obvious where we use sorting as a to make a problem easy
once you know that they're sorted. And so, for example, finding the median and if
it's already sorted, it's much easy to find the median. And now, the statistical
problems are like that or finding duplicates. Probably finding duplicates by
itself is not quite obvious what to do but the easy way to solve it is to just go
ahead and sort. And then there are plenty of applications that we'll see later in
this course like data compression or computer graphics like finding the convex
hull, applications in science such as computational biology or, or in systems
development. We're having a efficient sort as absolutely crucial. So, because there's
all these applications most programming systems have a fast sort as an important
part of their infrastructure and Java is no exemption. So, Java has a method called
arrays.sort and it's intended to be a general purpose sorting method for use by
Java programmers. And now, that you have some understanding of the classic methods
you can have a better idea of why arrays.sort is the way that it is. So it
has the infrastructure that allows us to be used for all types of data types and
all types of ordering so it's got a method that implements comparable then its got
methods easy compare order. So that you can use the natural order or you can
provide a compare order and provide your own order for any type of data. It uses
actually both quicksort and mergesort. It uses two quick sort for primitive types of
data and a two mergesort for objects. Why two different well it's just the
designer's assessment of the idea that if a programmer is using object maybe spaces,
not a, a critically important consideration. And so, the extra space
used by merge sort maybe is not a problem. And if the program is using primitive
types, maybe performance is the most important thing. And so, then we'll use
the quicksort. To invoke arrays that sort, you have to import the name space from
java.util.Arrays and then all you need to do is invoke Arrays.sort. So I just
answered this question, why do we use different algorithms for the two types?
And this is, is maybe arguable. Now are referred to this idea of a good system
sort, there was a good system sort that a lot of people used for many years. And in
1991, there were some scientists that, that Bell Labs that were using qsort for a
scientific problem and they were used to taking just a few minutes and then they
realized that it was taking hours of CPU time. And the fact was that all the qsort
implementations at that time in Unix had this flaw well, there are two flaws and
one of them is a little complicated about the way they are raised order and the
other one was for a raise that had lots of equal keys and this is Wilks and Becker
problem and have lot of equal keys, it was quadratic time. So, the system designer,
Jon Bentley was one of the designers to take a look at these problems and that
lead ultimately to the development of the 3-way quick sort that were used today. He
worked with Doug McIlroy and they wrote a, a, a paper that outline this problem and
talk about some of these things and they had a three-way partitioning method that
was somewhat like the Dijkstra method that we showed but a bit more complicated.
Anoth er thing they did was rather than shuffling the array. They [cough] used
what's called a method for choosing partitioning element called Tukey's
ninther. Tukey is a statistician and he had this particular method for order
statistics that has some interesting properties and use that for the
partitioning element. This paper was very influential and, and that basic method is
widely used. And Tukey's ninther is just pick nine items out of the array and take
the median of the mediums and that's the ninther. So very inexpensive and they had
macros to do this so and use not too much cost to find a partitioning element that's
much closer to the middle than, and if you use a, a random one. So the, the reason
they used that is they thought they got them closer to the middle and they also
don't like the, some system designers don't like the idea of using random
choices in a system method because of way that it changes the state of the system.
So they felt that they got better partitioning than a random shuffling and
it was also less costly and then generating random numbers including this
change of state problem. But there's a problem so you would think that the system
sort would be completely solid with all this resource with all these research and
all of the development that's going into it. In fact there's a file out there in
your book site and get it that will actually break the Java system sort. There
was a killer input that will make the thing run in quadratic time. Actually it
usually crashes because it's recursive and it crashes the system stack. And it can
cause all sorts of problems. There's a killer input for the system sort and, and
it can be disastrous in various systems and the reason is, they didn't do the
random shuffling. Mcilroy, himself, actually found this problem that you could
while the sort is running figuring out an inp ut that would make it crash. And so,
you can just run that program and if the sort doesn't use randomness then it's
vulnerable to this attack. So which algorithm should we use to sort that's,
that's really a key question. We've looked at lot of sorting algorithms and actually,
there's hundreds of sorting algorithms out there and we have chosen the most
important and the most interesting for you but you could literally spend a year
reading all the papers on sorting and then you still continue to be invented new
algorithms are developed and that are found to have good characteristics all the
time. And really, the key idea is really important to think about cuz it applies to
all sorts of algorithmic problems. On our way, we've been talking about rules of the
game. What is it that we care about in a sort? It's a little bit more complicated
than just put stuff in order. There's a lot of attributes, the different
applications have. Like stability, that's a fairly sophisticated attribute that you
really have to think about, you maybe not be aware of. Maybe your computer is
parallel and the sort has to be parallel and we found that equal keys make a huge
difference. And I didn't really think about that at the beginning but it can
make a huge difference. Maybe the way your computer's memory is organized make a
difference. Maybe the keys are small and the items are large or maybe the keys are
really huge. Do we need guaranteed performance? Are we happy with random
performance? Do we know, is the array randomly ordered? You can think of a
matrix shown in the right here where we list out all the possible attributes and
then there's algorithms that worked well for different combinations of attributes.
But the thing is, there is way more possible combinations of attributes than
there are algorithms. So there is going to be situations that are going to require an
understanding of what it takes to engineer a, a sort method that's appropriate for
your application. There can't be a system sort out there that's going to cover all
possible combinations of attributes. Now, usually it's going to be good enough but
it's definitely worth while to understand what's going on with different sorting
algorithms in order to even find improved performance over the system sort. So,
here's the summary of some of the things that we've talked about. We've talked
about six different sorting algorithms. Then, we've talked about a bunch of
attributes. For example, the top line, selection sort is in place it always takes
about N^2 / two comparisons. And one of the things to remark about it is that it
only uses N exchanges and so forth. Insertion sort best case linear,
quadratic, and place unstable. And it'll work well if the file is small or
partially ordered. Shellsort, we don't know it's a running time, it's not stable
but it's a fast method for intermediate size files and not much code. Mergesort
and log N guarantee unstable but not in place, need that auxiliary array.
Quicksort not stable. Quadratic time worst case but that's unlikely to occur if you
do the random shuffling. And its the fastest and most useful in practice
particularly if you make improvements to deal with duplicate keys. And it's
interesting to note we've looked at important and classic algorithms that are
widely deployed but we don't have a, a useful, practical algorithms that are
widely used that's got all of these characteristics that's in place and stable
worst case N log N. There's versions of merge sort that come close but they are
too complex for practitioners to have adopted them. Those are some of the
situations that we encounter when developing a system sort. And, quicksort
certainly plays a role in most system sorts.