
Hello and welcome back. During this video I want to talk about

the implementation of sparse modeling; in other words, how do we compute the alpha

that we have seen in the previous videos? During the video we are also going to

discuss some theoretical results; I'm just going to mention their existence and

why they are so important. Let's start with that.

Assume that we have just produced a signal X as this product, basically as a

sparse model. So assume that we know the dictionary D

and we randomly pick an alpha with a few non-zero coefficients, and we generate X.

And now our goal is to recover this alpha. So now we know the dictionary, and we know the signal, and we want to compute its sparse representation.

That's our goal right now. As we said, later on we are going to discuss how to compute the dictionary itself.

As we have seen in the previous video, we want to compute the alpha that has the

smallest possible number of non-zero coefficients; those locations are also called the support. And we assume that there is no noise right now.

We have just generated this signal, so it's kind of the ideal scenario.
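As an illustration (this sketch is mine, not from the lecture), the ideal scenario can be written in a few lines of Python: a hypothetical random dictionary D with unit-norm columns, a randomly chosen sparse alpha, and the signal x they generate.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, L = 20, 50, 3               # signal dimension, dictionary size, sparsity
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)    # normalize the atoms (columns) of D

alpha = np.zeros(k)
support = rng.choice(k, size=L, replace=False)  # pick a few active atoms
alpha[support] = rng.standard_normal(L)

x = D @ alpha                     # the noiseless signal we observe
```

Recovering this alpha from x and D alone is exactly the problem discussed next.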

So that's our goal right now. And there are a number of questions.

One of the questions is: with whatever algorithm we produce here, which we are going to explain in the next few slides, why are we expected to get exactly the same alpha that produced the signal?

In other words, is that alpha unique? Can you think about a scenario where we

know that alpha does not necessarily have to be unique?

And can you think of a scenario where we know that alpha will be unique?

Let's think for a second about that problem.

One example where alpha is not unique is when, for example, several of the columns of D are identical.

Let's assume that this was generated by one of those columns.

I can actually pick any other column, because they're identical, and I can

replace that alpha by, by the alpha corresponding to that column.

Assume that columns one and two are identical.

Assume that, basically, I generated this signal with exactly this pattern; so, I generated it with the second column. But when I go to reconstruct, I can pick column one instead of column two, and therefore I get a different alpha.

So that's an example where basically, the solution is not unique.

An example where the solution is unique is in DCT, Discrete Cosine Transform or

in the Fourier Transform. We know that the transform of a signal in the Fourier or the Discrete Cosine Transform domain is unique.

So those are two cases; now, what was the difference? In one of the scenarios, D basically had repetitions. In the other scenario, we know that in the Fourier and the cosine transforms every column is independent of the rest; in fact, each is orthogonal to the rest.

So it looks like the structure of the dictionary D has a lot to do with the

uniqueness of alpha. Here comes the second question.

In the previous case, at least the number of non-zero coefficients, meaning the size of the support, was the same. Could I maybe get an even sparser representation than the one that produced the signal? Once again, it's not hard to think of an example of that. Assume that basically you generated a

signal with all these three, and this is X.

Now assume that, by chance, X was part of the dictionary D.

Then when you're going to try to recover, your sparsity goes back to 1.

So you went from three to one. What's the problem there, once again? One of the columns is basically a combination of the others; so if you produce a signal that has such a combination, then you might be able to get it back with actually fewer coefficients. So these two problems, these two questions, depend a lot on a number of things.

The dictionary, meaning its size K, and also its general structure.

And the basic idea, as we have seen from this example, is that we want the

dictionary D to have atoms, columns, that are as uncorrelated as possible, to guarantee that basically we have unique solutions.

And that basically we get back the sparsest representation, the one that actually produced the signal. So, depending on the sparsity level.

Basically, depending on the number of non-zero coefficients.

And on some conditions on the dictionary, there is beautiful theoretical research out there that guarantees uniqueness, and that guarantees that we get, basically, the smallest possible support when we do the type of reconstruction

algorithms that I'm going to show next. So it's important for us to know the

existence of these very important theoretical results in the background that support the type of techniques that we are developing, and that support the use and the exploitation of sparse modeling.
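One standard quantity used by many of those theoretical results to measure how "uncorrelated" the atoms are, although it is not named in the video, is the mutual coherence of the dictionary: the largest absolute inner product between two distinct normalized atoms. A small illustrative sketch (function name is mine):

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between two distinct unit-norm atoms.
    Identical columns give coherence 1; orthogonal columns give coherence 0.
    Lower coherence goes with stronger uniqueness guarantees for alpha."""
    Dn = D / np.linalg.norm(D, axis=0)   # normalize the columns
    G = np.abs(Dn.T @ Dn)                # magnitudes of the Gram matrix
    np.fill_diagonal(G, 0.0)             # ignore each atom's product with itself
    return G.max()
```

Note how this matches the two examples above: a dictionary with repeated columns has coherence 1 (no uniqueness), while an orthogonal dictionary like the DCT has coherence 0.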

Our goal now, assuming the theoretical results, is actually to solve sparse

modeling. So, here is what we want to find. Now we are given a signal y; remember, it is not the ideal case anymore. x is going to be represented by this, so this is our x. We want to find the representation of x, given the dictionary D, such that we get the smallest possible support, meaning the smallest possible number of non-zero coefficients, and we don't get too far away from the observation.

And that's our goal right now: to solve this problem. This is actually what's called a combinatorial problem, and it is NP-hard. You actually cannot solve this problem, as I just posed it, in realistic time. Let's just explain why.

We're going to basically do the following, which is the only way to solve

exactly this problem. We start with L equal to one, so we look for a representation that has sparsity one. That's what we are going to do first.

So, we generate all the possible supports for that signal. Because L is equal to one,

basically the support means only the first one is active, or only the second

one is active, or only the third one is active, and so on.

Now, how many possibilities do we have? We have already seen that: we have L chosen out of K. In the case of L equal to one, we have basically K possibilities. When we are going to iterate this, because then we are going to do L equal to two, we have of the order of K squared possibilities. And as L increases, we're going to have

more and more possibilities. But for now, we basically have all the

possible supports. Now, for each one of those possible supports, we will basically go and solve the fitting problem. So we say, okay, what happens if I'm only going to use the first atom? All that's left is to pick the scale:

How much of that first atom will I use? And then I solve this problem.

This is a least squares problem. It's a very simple quadratic problem.

There are many ways to solve it, but the basic idea is that we are going to project the signal onto the atom that we have just selected.
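For a single atom, this projection has a closed form: setting the derivative of the quadratic to zero gives the scale directly. A small illustrative sketch (the function name is mine, not from the course):

```python
import numpy as np

def best_single_atom_fit(d, y):
    """Least squares fit with one atom: minimize ||alpha * d - y||^2.
    Setting the derivative to zero gives alpha = (d . y) / (d . d),
    i.e. the projection of y onto the direction of d."""
    alpha = float(d @ y) / float(d @ d)
    residual = float(np.linalg.norm(alpha * d - y))
    return alpha, residual
```

For example, fitting y = (3, 4) with the atom d = (1, 0) gives alpha = 3 and a leftover error of 4.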

And basically we try each one of those, and we actually see what the error is. If the error is less than what we want, then we're done: we basically pick the alpha that produced that small error. Now, if it is not, then we increase the

support. So, we have just tried, basically, with L equal to one. Now let's try with L equal to two.

So we pick all the possible supports that have two active, two non-zero coefficients. So we say, okay, how can I approximate the signal Y with the first and second atoms together, or with the first and third atoms together, or with the seventh and the twenty-fifth atoms together? So we pick every possible pair.

Again, there are two chosen out of K, which is of the order of K squared, and we try all of them. We solve this problem for each one of them; again, it is just a least squares problem, or a projection onto the subspace generated by the selected atoms. We try again, and we keep going, and we keep going; and thus we solve this.
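The procedure just described, every support of size L with a least squares fit on each, could be sketched as follows in Python (illustrative only; as explained next, it is hopeless for realistic sizes):

```python
import numpy as np
from itertools import combinations

def exhaustive_sparse_code(D, y, eps):
    """Brute-force L0 search: try supports of growing size L; for each
    candidate support, solve a least squares fit and check the error.
    The number of candidates is combinatorial in k, so this is only
    feasible for tiny dictionaries."""
    n, k = D.shape
    for L in range(1, k + 1):
        for support in combinations(range(k), L):
            sub = D[:, support]
            coef, *_ = np.linalg.lstsq(sub, y, rcond=None)  # project y on the span
            if np.linalg.norm(sub @ coef - y) <= eps:
                alpha = np.zeros(k)
                alpha[list(support)] = coef
                return alpha
    return None
```

On a toy problem with a handful of atoms this runs instantly; the next paragraph shows why it explodes for realistic dictionary sizes.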

Basically, no problem at all. So what's the problem with this technique? The problem with this technique is the following. Assume that K = 1000; that's a standard size.

It's a good size; as I say, it's about the size that we're going to be using when we talk about image denoising. And assume that I tell you there have to be ten atoms, so basically it's ten chosen out of 1000. The number of possibilities starts being about of the order of 1000 to the tenth, something like that.

So it's a very, very large number. Assume that each one of the computations that we have here takes about one nanosecond. So you try all possible supports of size ten, and you solve each least squares problem in one nanosecond.

These are all reasonable numbers. What you get is that you're going to need

a long, long time to solve this problem. Look at this number.

It's a huge number. That's how long you're going to need to solve the problem if I already gave you that L is equal to ten.
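A rough back-of-the-envelope of that number in Python, using the figures from the lecture (K = 1000, L = 10, one nanosecond per fit):

```python
import math

k, L = 1000, 10
supports = math.comb(k, L)        # number of candidate supports, about 2.6e23
seconds = supports * 1e-9         # one nanosecond per least squares fit
years = seconds / (3600 * 24 * 365)
print(f"{supports:.2e} supports, roughly {years:.1e} years")
```

That works out to millions of years for a single fixed sparsity level.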

Imagine what would happen if you have to run over all possible L's.

It's impossible. That's why the problem is very hard.

It's impossible to solve as it is. So, are we done with this week? Have we just presented a great model that we cannot solve? No, of course we are not done.

Now we are going to have to find a way to solve this.

How do we do that? There are actually a couple of fundamental ways to solve this problem; again, a good fitting with a representation that is as sparse as possible. One way is what's called relaxation methods: can we change this for something that is solvable, something that we can actually do on our computer? Meaning, can we relax our condition in such a way that it's almost the same, but we can solve it?

The other way is to do what are called greedy algorithms, and that means, okay,

forget about finding all of them at the same time.

Give me the most important atom, then the second, then the third, and so on.

So let us basically briefly present both of these techniques.

Relaxation methods are also sometimes called basis pursuit methods. This is our problem, this is what we want to solve, and relaxation methods will change this. Look at what I'm doing: I'm replacing the zero norm, or pseudo-norm, by the one norm. Now I'm not just counting the number of non-zero coefficients; I'm basically counting them with their magnitude.


Remember, in the L0 norm we had a penalty which was something like this: flat. Okay? So there was no penalty when you are zero, and the same penalty whenever you are non-zero. Here, in the L1 norm, this is the type of penalty: it increases with the magnitude. Now comes a bit of a surprise.

The beauty of this is that under certain conditions on D, and certain conditions on L, the level of sparsity, we can basically guarantee that these two are equivalent. This is a fantastic result. It says that I can move from an NP-hard, unsolvable problem to an L1 problem. This is a convex problem; it's solvable. And sometimes I obtain here exactly the solution I was looking for there, but in a reasonable time.
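As a hedged illustration (this is one standard way to pose basis pursuit, not necessarily what the course packages implement), the L1 problem with an exact-fit constraint can be written as a linear program, splitting alpha into positive and negative parts, and handed to a generic solver. A Python sketch using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, y):
    """min ||alpha||_1  subject to  D @ alpha = y, posed as a linear program.
    Write alpha = u - v with u >= 0, v >= 0; then ||alpha||_1 = sum(u) + sum(v)
    at the optimum, and the equality constraint becomes [D, -D] @ [u; v] = y."""
    n, k = D.shape
    c = np.ones(2 * k)                    # objective: sum(u) + sum(v)
    A_eq = np.hstack([D, -D])             # D @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    u, v = res.x[:k], res.x[k:]
    return u - v
```

The theory mentioned above tells us when the alpha returned by this convex program coincides with the sparsest one.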

There is very beautiful theory behind this. It's not the only case, but it is a powerful case where, by doing relaxation, we actually don't lose anything. And this problem has been studied quite a lot. As I say, it is called basis pursuit; it is convex, so we can solve it. There are a lot of algorithms in the literature to solve it.

And in the packages I mentioned, which are free for you to download and to play with sparse modeling, you have implementations, and very efficient implementations, of these. This actually has been the source of a lot of research in the community because, now that we know that this is convex and that, under certain conditions, we're solving the original problem, let's try to solve it in the best possible way. This opened the door to a lot of excellent research in this area, and I'm just mentioning a few works here. The literature is huge, and you might actually know about it from your own research area. So this is the relaxation way, and as I say, it is extremely powerful.

So you can pick one to use. The second one is the greedy approach, often called matching pursuit, and the idea is very,

very simple. Again, remember we have the signal.

That we are trying to approximate. That would be Y.

We have our dictionary D. And we are trying to find the sparsest

possible way to represent this signal. Now, what do we do? Matching pursuit is a standard technique in the statistics community, and it was brought into signal processing about twenty years ago or so.

The basic idea is: let's first find the most important atom in the dictionary. So you go and travel the dictionary. Let's observe what's happening here. You travel the dictionary and you find the first one. So once again, you go, you try the dictionary, and you find the most important one. What do I mean by the most important one? The one that minimizes ||D alpha - y||^2 with a support of size one.

So, we have the norm of D alpha minus y, squared: this is alpha, this is D, this is y, and the support is one.

So you try the first one, you try the second one, you try the third one, you

try the fourth one. You keep the one that does this the best. Actually, because of the metric we're using, this one will be the one that maximizes the inner product with the signal Y; basically, it is the one which is most parallel to the signal Y.

That's going to be this one. Now I keep it.

I cannot give it back. In contrast with what we saw in the previous slide, where we basically tried L equal to one, then L equal to two, then L equal to three, this one I'm forced to keep.

That's why the technique is called greedy.

Once you select it, you keep it. So now I selected this, I keep it.

I keep it, and there is still an error here. If the error is below what I wanted, I stop; if not, I still have an error. Now the next step is to pick the atom that helps you make that error the smallest possible.

So, for the first atom, it was Y that we were trying to approximate. For the second one, we're trying to approximate the error. When we approximate the error, we add that atom to our collection, and we start to shrink the error. So the next time, you keep this one.

And basically you go to the one that has the best fit to the error so far.

So again, you try all of them, and you keep the one that minimizes the error. Now, please note, I've kept the first one

and I'm now keeping the second one. Okay.

So I kept the first one and I kept the second one. Now I have a new error, because I'm using two atoms. I have a new error, and I'm going to pick the third atom that basically approximates that error again.

And every time, all that you're doing is inner products to approximate those errors. And now the algorithm is doable because, for every candidate you have to try, it is basically one inner product; you are picking one atom at a time, instead of picking L at a time and interchanging and putting back and trying new ones. The one you pick, you stay with.

And now this algorithm is very efficient, it's just a bunch of inner products all

the time. And you stop when you get to a reasonable error, an error that satisfies your requirements. Now, this is basically what I wrote here: you stop when your error requirement is satisfied.
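The greedy loop just described might be sketched like this in Python (a plain matching pursuit, assuming unit-norm atoms; the function name is mine, not from the course):

```python
import numpy as np

def matching_pursuit(D, y, eps, max_iter=100):
    """Plain matching pursuit, assuming the atoms (columns of D) are unit-norm.
    At each step, keep the atom with the largest inner product with the
    current residual, subtract its contribution, and repeat until the
    residual is small enough."""
    k = D.shape[1]
    alpha = np.zeros(k)
    residual = y.copy()
    for _ in range(max_iter):
        if np.linalg.norm(residual) <= eps:
            break
        inner = D.T @ residual
        j = np.argmax(np.abs(inner))   # atom most parallel to the residual
        alpha[j] += inner[j]           # greedy: this contribution is kept
        residual -= inner[j] * D[:, j]
    return alpha
```

On an orthogonal dictionary this recovers the exact sparse coefficients in as many steps as there are active atoms.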

There are some variations of this. The most important one is that every time you pick a new atom, because it's the most important one, you take your signal Y again and you project it onto all the selected atoms. In other words, when I pick the first atom, I get a certain coefficient alpha here; when I pick the second atom, I get a certain coefficient there. Now, these two are already selected, so maybe I can take the signal and change the actual value here, and the actual value here, to get a better error. If I'm already selecting the atom, why stay with the coefficient that I got when I selected it? I'm allowed to change it. So, every time, you do a projection onto the subspace of all the selected atoms. Again, that's a least squares procedure, very fast to implement.

So, in orthogonal matching pursuit, you select the best atom so far, and you project onto the selected atoms; that gives you an error so far. In the next step you select, once again, the best atom to minimize the error that you are getting so far, you project onto all the selected atoms, and then you keep going.
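This orthogonal variant, re-fitting all selected coefficients by least squares after each new atom, might be sketched as follows (again an illustrative sketch of mine, not the packaged implementation):

```python
import numpy as np

def orthogonal_matching_pursuit(D, y, eps, max_atoms=None):
    """OMP sketch: pick the atom most correlated with the residual, then
    re-fit ALL selected coefficients by projecting y onto the span of the
    selected atoms (a least squares problem), and repeat."""
    n, k = D.shape
    max_atoms = max_atoms or n
    support = []
    residual = y.copy()
    coef = np.zeros(0)
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        j = int(np.argmax(np.abs(D.T @ residual)))      # best new atom
        support.append(j)
        sub = D[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)  # project y on the span
        residual = y - sub @ coef                       # orthogonal to selected atoms
    alpha = np.zeros(k)
    alpha[support] = coef
    return alpha
```

Because the residual is kept orthogonal to every selected atom, the same atom is never picked twice, and each iteration is again dominated by inner products plus one small least squares solve.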

So that's a very simple algorithm. Again, it's implemented in some of the

software that we give you the link for. So, the basic idea is now we have

techniques to solve these problems. We have techniques which are basically of

several classes. There are the greedy algorithms that I just mentioned to you, where you go one atom at a time. There are variations of that; there are many, many variations. Some people say: why don't we pick two at a time? I can spend the computation time to do that. And things like that; or there is the Orthogonal Matching Pursuit that I mentioned. Then there are the relaxation algorithms.

The literature is huge in that topic as well.

And the basic idea was to relax this by something that will hopefully give us the same result that we're looking for. And there are hybrid algorithms that do both of them at the same time, in some very clever combinations.

Now, why should these work? Because there is theory behind them that tells us that sometimes the relaxation, although it's a relaxation, actually produces the right results. There are theorems that give us conditions for that. There is also theory that gives us conditions for the greedy approach: although it's greedy, sometimes it produces the right result. For example, if the matrix D is a Fourier matrix, you get the right results there. So there are ways of transforming this basically L0 type of approach, which is NP-hard and not solvable as posed, and there is beautiful theory telling us when the relaxation actually gives us the result that we were looking for.

If our problem, for example the matrix D, doesn't satisfy those conditions, in practice we still use one of these two classes of algorithms. Why? Because solving the real problem is not doable; it's impossible. So the best we can do is either a relaxation or a greedy approach. And if we are in the right conditions, we get the exact solution we were looking for. If not, hopefully we get a very good approximation of it.

So where are we so far? We started with image denoising and saw the need for modeling. We derived sparse modeling as one way of modeling images that will help us in denoising. Now we discussed some of the problems, and basically we saw how to compute the alpha that will give us the sparse representation.

The next step is the dictionary, and that's what we are going to do in the

next video. Thank you very much.

I'm looking forward to that.