The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

来自 Stanford University 的课程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

487 个评分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

从本节课中

Week 3

Huffman codes; introduction to dynamic programming.

- Tim RoughgardenProfessor

Computer Science

So in this video we'll finally discuss Huffman's Algorithm which is a greedy

Algorithm that constructs the prefix free binary code minimizing the average

encoding length. So, let me just briefly remind you the

formal statement of the compositional problem that I left you with last time.

So, the input is just a frequency for each symbol I, coming from some alphabet

sigma. So the responsibility of the algorithm is

to get optimal compression, so to compute an optimal code.

So the code has to be binary. We have to use only zeros and ones.

We have to be prefix free meaning the encodings of any two characters, neither

one can be a prefix of the other, that's to facilitate unambiguous

decoding. And finally, the average number of bits

needed to encode a character, where the average is with respect to the input

frequencies, should be as small as possible.

Now remember these kinds of codes correspond to binary trees.

The prefix free condition just says that the symbols of sigma should be in one to

one correspondence with the leaves of the tree.

And finally remember the encoding lengths of the various symbols just correspond to

depths of the corresponding leaves. So we can formally express the averaging

coding length. Which given a legal tree capital T.

I'm using the notation capital L of T. So what do we do?

We just take the average over the symbols of the alphabet weighted by the provided

frequencies and we just look at the number of bits used to encode that

symbol. Equivalently depth of the corresponding

leaf in the tree T. So we want the tree that makes this

number as small as possible. So this task is a little bit different

than any we've seen so far in this course, right?

All we're given as an input is an array of numbers and yet we have to produce

this actual full blown tree. So how can we just take a bunch of

numbers and produce them in a sensible, principled way?

Some kind of tree whose leaves correspond to symbols of the alphabet.

So let's spend a slide just thinking about, you know, at a high level where

would this tree come from, how would you build it up given this unstructured

input. So there's certainly no unique answer to

this question and one idea which is very natural but that turns out to be sub

optimal is to build a tree using a top down approach, which you can also think

of as an. Initiation of the divide and conquer

algorithm design paradigm. The divide and conquer paradigm, you'll

recall, involves breaking the given sub-problem into usually multiple,

smaller sub-problems. They're solved recursively, and the

solutions are then combined into one for the original problem.

Because trees, the desired output here, have a recursive substructure, it's

natural to think about applying this paradigm to this problem.

Specifically you love to just lean on recursive call to construct for you the

left sub tree, another sub call constructing the right

sub tree. And then you can fuse the results

together under a common root vertex. So it's not clear how to do this

partitioning of the symbols into two groups, but one idea to get the most bang

for your buck, the most information out of the first bit of your encoding. You

might want to split them in, the symbols, into groups that have roughly, as close

to as possible, of 50% of the overall frequency.

So this topped out approach is sometimes called Fanno-Shannon coding.

Fanno was Huffman's graduate adviser. Shannon is the Claude Shannon inventor of

information theory. But Huffman in a term paper believe it or

not, realized the topped down is not the way to go.

The way to go is to build the tree from the bottom up.

Not only are we going to get optimal codes, but we're going to get the

blazingly fast greedy algorithm that constructs them.

So what do I mean by bottom up? I mean we're going to start with just a

bunch of nodes, each one labelled with one symbol of the alphabet.

So, in effect, we're starting with the leaves of our tree.

And then we're going to do successive mergers.

We're going to take at each step two sub-trees thus far and link them together

as sub-trees under a common internal node.

So, I think you'll see what I mean in a picture.

So imagine we want to build a tree where the leaves are supposed to be just A, B,

C, and D. So one merger would be oh, well let's

just take the leaves C and D and link them as siblings under a common ancestor.

Now in the second step let's merge the leaf B with the sub-tree we got from the

previous merge, the sub-tree comprising the nodes C, D, and their common

ancestor. So now of course we have no choice but to

merge the only two sub-trees we have left,

and then that gives us a single tree. Which is in fact exactly the same

lopsided tree we were using in the previous video as a running example.

So let me explain what I hope is clear and what is maybe unclear at this

juncture. I hope what's intuitively clear is that

the bottom of approaches, a systematic way to build trees that have a prescribed

set of leaves. So what do we want?

We want trees whose leaves are labeled with the symbols of the alphabet sigma.

So if we have an alphabet with n symbols, we're going to start with just the N

leaves. What does a merger do?

Well, there's two things. First of all, it introduces a new

internal node, an unlabelled node. And secondly, it takes two of our old

subtrees and fuses them into one, merges them into one.

We take two subtrees, we make one the left child of this new internal node, the

other, the right child of this new internal node.

So that drops the number of subtrees we're working with by one.

So if we start with the N leaves and we do N minus one successive mergers, what

happens? Well on the one hand, we introduce N

minus one new unlabeled internal nodes. And on the other hand, we construct a

single tree. And the leaves of this tree that we get

are in one to one correspondence with the alphabet letters as desired.

Now what I don't expect you to have an intuition for is what should we be

merging with what and why? Forget about, you know, how do we get an

optimal tree at the end of the day. I mean, even just if we wanted to design

a greedy algorithm. If we just wanted to make a myopic

decision, that looks good right now, how would we even do that?

What's our greedy criteria that's going to guide us to merge a particular pair of

trees together? So we can re-frame this quandary in the

same kind of question we asked for minimum cost spanning trees and really

more generally with greedy algorithms. When you're making irrevocable decisions

which strikes fear in your heart, is that this decision will come back and haunt

you later on. You'll only realize at the end of the

algorithm that you made some horrible mistake early on in the algorithm.

So just as for MST's, we ask, you know, when can we be sure that including an

edge irrevocably is not a mistake, it's safe in the tree that we're building?

Here we want to ask, you know, we have to do a merger, we want to do successive

mergers and how do we know that a merger is safe?

That it doesn't prevent us from eventually computing an optimum solution.

Well here's the way to look at things that will at least give us an intuitive

conjecture for this question. We'll save the proof for the next video.

So what are the ramifications when we merge two subtrees, each containing a

collection of symbols? Well, when we merge two subtrees, we

introduce a new internal node which unites these two subtrees under them, and

if you think about it, at the end of the day on the final tree, this is yet

another internal node that's going to be on the root to leaf path,

for all of the leaves in these two sub trees.

That is, if you're a symbol and you're watching your subtree get merged with

somebody else. You're bummed out, your like, man that's

another bit in my encoding. That's yet one more node I have to pass

through to get back to the root. I think this will become even more clear

if we look at an example. So naturally, we'll use our four symbol

alphabet A, B, C, D. And initially, before we've merged

anything, each of these is just its own leaf A, B.

C, D. So there's no internal nodes above 'em.

In the sense, everybody's encoding length at the beginning is zero bits.

So now, imagine we've merged C and D, we introduce a new internal node, which is

the common an-ancestor of these two leaves.

And as a result, C and D are bummed out. They said, well, there's one bit that

we're going to, to have to incur our encoding length, there's one new internal

node we're always going to have to pass through enroute back to the root of the

eventual tree. Now suppose next we merge B with the

subtree containing both C and D. Well everybody but A is bummed out about,

about this merger because we introduce another internal node, and it contributes

one bit to the encoding of each of B, C, and D.

It contributes an extra one to the encoding of C and D, and it contributes a

zero to the encoding of B. So all of their encodings in some sense

inc, get incremented, go up by one, compared to how things were before.

Now in the final iteration, we have to merge everything together.

We have no choice, there's only two sub-trees left.

So here, everybody's encoding length gets bumped up by one.

So, A finally picks up a bit at zero to encode it and B, C, and D each pick up an

additional bit of one, which is prepenned into their encodings thus far.

And, in general what you'll notice is that the final encoding length of a

symbol, is precisely the number of mergers that its subtree has to endure.

Every time your subtree gets merged with somebody else you pick up another bit in

your encoding, and that's because there's one more internal node that you're going

to have to traverse enroute to the root of the final tree.

In this example, the symbols C and D, well they got merged with somebody else

in each of the three iterations. So, they're the two symbols that wind up

with the encoding length of three. The symbol B, it didn't get merged with

anybody in the first iteration, only the second two.

That's why it has an encoding length of two.

So this is really helpful. So this, lets us relate, the operations

that our algorithm actually does, namely mergers back to the objective function

that we care about, namely the average encoding length.

Mergers increase the encoding lengths of the participating symbols by one.

So this allows us to segway into a design of a greedy heuristic for how to do

mergers. Let's just think about the very first

iteration. So we have our N original symbols, and we

have to pick two to merge. And remember the consequences of a merge

is going to be an increase in the encoding length by one bit, whichever two

symbols we're going to pick. Now we want to do is minimize the average

encoding length with respect to the frequencies that were given.

So which pair of symbols are we the least unhappy to suffer an increment to their

encoding length, was going to be the symbols that are the least frequent.

That's going to increase your averaging poding length by the least.

So that's the green merging hiuristic. Somethings gotta increase.

Pick the ones that are the least frequent to be the ones that get incremented.

So that seems like a really good idea of how to proceed in the first iteration.

The next question is, how are we going to recurse?

So, I'll let you think about that in the following quiz.

So let's agree that the first iteration of our greedy heuristic is going to merge

together the two symbols that possess the lowest frequencies.

Let's call those two symbols little A and little B.

The question is then how do we make further progress?

What do we do next? Well, one thing that would be really nice

is if we could somehow recurse on a smaller subproblem.

Well, which smaller subproblem? Well, what does it mean that we've merged

together the symbols A and B? Well, If you think about it, in the tree

that we finally construct by virtue of us merging A and B, we're forcing the

algorithm to output a tree in which A and B are siblings, in which they have

exactly the same parent. So what does it mean for the encoding

that we compute that A and B are going to be siblings with the same parent?

It means that their encodings are going to be identical,

save to the lowest order bits. So A will get encoded with a bunch of

bits followed by a zero, B will be encoded by exactly the same prefix of

bits followed by A1. So they're going to have almost of the

same encoding, so for our recursion let's just treat them as the same symbol.

So let's introduce a new meta symbol, let's call it AB, which represents the

conjunction of A and B. So it's meant to represent all of the

frequencies of either one, all of the occurrences of either one of them.

But remember the input to the computational problem that we're studying

is not just an alphabet, but also frequencies of each of these symbols of

that alphabet. So my question for you is when we

introduce this new meta symbol A B. What should be the frequency that we

define for this meta symbol? All right, so hopefully your intuition

suggested answer C, that for this recursion to make sense, for it to

conform to our semantics of what this merging does.

We should define the frequency of this new meta symbol to be the sum of the

frequencies of the two symbols that it's replacing.

That's because, remember this meta symbol AB is meant to represent all occurrences

of both A and B. So it should be the sum of their

frequencies. So I'm now ready to formally describe

Huffman's greedy algorithm. Let me first describe it on an example

and then I think of the general code will be, self evident.

So let's just use our usual example, so we're going to have letters A, B, C, D,

with frequencies 60, 25, 10, and 5. So we're going to use the bottom up

approach, so we begin with each symbol just as its own node, A, B, C, D.

I'm annotating in red the frequencies 60, 25, 10, 5.

The greedy heuristic says initially we should merge together the two nodes that

have the smallest frequencies. So that would be the C and the D with

their frequencies of 10 and 5. Now based on the idea of the last slide

when we merged C and D, we replaced them with a meta symbol cd whose frequency is

the sum of the frequencies of C and D, namely fifteen.

Now we just run another iteration of the greedy algorithm meaning we merge

together the two nodes, the two symbols that have the smallest frequencies.

So this is now B 25, and CD 15. So now we're down to just two symbols,

the original symbol A, which still has frequencies 60 and the in some sense,

meta, meta symbol BC, D who is now cumulative frequency is 40.

So when we're down to two nodes, that's going to be the point in Huffman's

algorithm where we hit the base case and the recursion begins to unwind.

So if you just have two symbols to encode there's pretty much only, one sensible

way to do it. One is a zero, and one is a one.

So at this point, the final recursive call just returns now a tree with two

leaves corresponding to the symbols A and BCD.

And now is the recursion on lines we in effect undo the mergers.

So for each merge we do a corresponding split of the meta node and we replace

that with an internal node. And then two children corresponding to

the symbols that were merged to make that meta node.

So for example, when we want to undo the merge of the B and the CD, we take the

node labeled BCD and we split it into two.

The left, left child being B, the right child being CD.

So the original outer most call, what it gets from its rehersive call is this tree

with three nodes and it has to undue it's merger, it merged C and D.

So it takes the leaf labeled with symbol CD and splits it into two.

It replaces it with a new unlabeled internal node, left child C, right child

D. So how does the algorithm work in

general? Well just as you'd expect given the discussion on this concrete example.

I'm going to take as the base case when the alphabet has just two symbols, in

that case the only thing to do is encode one with a zero, the other with a one.

Otherwise, we take the two symbols of the alphabet that have the smallest

frequencies. You can break ties arbitrarily, it's not

going to matter, it's going to be optimal either way.

We then replace these two low frequency symbols A and B with meta symbol AB,

intended to represent both of them in some sense.

As we just discussed with those semantics, we should be defining the

frequency of the meta symbol AB as the sum of the frequencies of the symbols

that it comprises. We now have a well defined, smaller sub

problem. It has one fewer symbol than the one we

were given. So, we can recursively compute a solution

for it. So, what the recursive call returns is a

tree who's leaves are in one to one correspondence with sigma prime.

That is, it does not have a leaf labeled A, it does not have a leaf labeled B.

It does have a leaf labeled AB, so we want to extend to this tree T prime

to be one who's leaves correspond to all of sigma.

And the obvious way to do that is to split the leaf labeled AB, replace that

with a new internal unlabeled node with two children which are labeled A and B.

The resulting tree capital T with leaves and correspondents to the original

alphabet sigma is then the final output of Huffman's Algorithm.

As always, with a greedy algorithm we may have intuition, it may seem like a good

idea. But we can't be sure without a rigorous

argument. That is the subject of the next video.