0:01

All right, so

it's good to start by saying a couple of words about how sound propagates in rooms.

So here we're going to have to deal with two effects, mainly.

And the first one of them is something that we call free space propagation,

and this is something that you observe in practice, right?

The further you are from the source of sound, the lower the volume.

It turns out that for

a point source of sound, this can be described by the following statement.

Sound pressure is inversely proportional to the distance

that the sound had to travel.
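As a minimal sketch of this 1/r law (the function name and the reference pressure at 1 m are illustrative assumptions, not something from the lecture):

```python
def pressure_at(distance_m, pressure_at_1m=1.0):
    """Sound pressure of a point source, inversely proportional to distance."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return pressure_at_1m / distance_m

# Doubling the distance halves the sound pressure.
ratio = pressure_at(2.0) / pressure_at(1.0)
```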

0:39

Another effect is reflections.

Okay, in rooms, sound reflects off the walls.

And every reflection attenuates the sound.

In practice, these reflections,

these attenuations, depend on the frequency, and

on some other things, like the different materials of different walls.

Well, we are going to model them using a single coefficient, alpha, here,

so we're going to say that every reflection attenuates the sound by alpha,

and what we mean by that is that if the sound pressure of the wave

just before hitting the wall is P, then the sound pressure,

just after bouncing off the wall, is going to be alpha times P.

Okay, very simple.

So implicitly, this alpha is strictly smaller than one.
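A small sketch of this reflection model, assuming one frequency-independent coefficient alpha as in the lecture (the function name is hypothetical):

```python
def pressure_after_reflections(p, alpha, k):
    """Pressure after k wall reflections, each one scaling the pressure by alpha."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1)")
    return (alpha ** k) * p

# Three bounces with alpha = 0.5 leave one eighth of the original pressure.
p3 = pressure_after_reflections(1.0, 0.5, 3)
```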

So it turns out that we can describe the room as a linear filter.

And what does this mean?

Well, this means that if we want to describe the system

between the source of sound and some sound sink, for example, a microphone,

1:48

then we can simply write out what the microphone picks up.

So the thing recorded by the microphone is the emitted sound,

the input sound, convolved with some impulse response, okay, corresponding to the room,

describing the room. We call this impulse response the room impulse response, or RIR.

Okay, this is a very common abbreviation.
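To make the linear-filter view concrete, here is a toy sketch with NumPy. The signal and the impulse response values are made up for illustration; they are not measurements from any real room.

```python
import numpy as np

x = np.array([1.0, 0.5, -0.25])            # emitted (dry) sound, toy values
h = np.array([1.0, 0.0, 0.6, 0.0, 0.36])   # toy room impulse response
y = np.convolve(x, h)                      # what the microphone picks up
```

The recorded signal is longer than the input (len(x) + len(h) - 1 samples) because the room keeps "ringing" after the source stops.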

2:46

This room is designed in special ways so

that there are no reflections off the walls.

There are no echoes.

So it has the name anechoic, and, you know, for example, having a conversation

in such a room is really weird because you really have to look at the speaker.

Once you turn your head around, there are no reflections.

So, we also say that this room sounds very dry.

Let us listen to a sound sample recorded in this room.

>> [FOREIGN] >> All right.

So, without any context, you might say, okay,

this was recorded somewhere, I don't know, in some room.

But in fact, you might have noticed that it is extremely dry.

3:34

All right, now we move on to the next room, which is not anechoic anymore.

So this is a small classroom that we emptied here at EPFL and

we were running some experiments in it, so we had the impulse response measurements.

Now, just for fun, we can try listening to the impulse response itself,

so to the signal H.

It sounds like this.

[SOUND] As if someone fired a small starter pistol or something.

So, you might notice that it's relatively short.

We say that this room has a relatively short reverberation time.

And the same sound sample from earlier, reproduced in this room, sounds like this.

>> [FOREIGN]

4:58

All right, here it comes.

[NOISE] So it is very different from the impulse

response of a small classroom, right?

It's much, much longer.

These are very reflective surfaces and there's a huge volume.

So the reverberation time is very long.

And the same sound sample reproduced in the cathedral sounds like this.

>> [FOREIGN]

>> As mentioned earlier,

these sound samples were obtained simply by convolving the original sample

from an anechoic chamber with the impulse response of the corresponding room.

So basically, either the small classroom or the cathedral.

5:56

Now we can start having some fun.

Let us analyze a very simple room

that will allow us to write some formulas explicitly.

This is not really a room.

So imagine that you're standing half way between two very long walls.

Say that these walls are infinite, or

just very long, and these walls are at a distance, d, from one another, okay?

And the fact that you're standing exactly half way between them means that

(and you're holding a microphone, okay)

the reflection from this wall, okay,

will arrive at the microphone at exactly, or approximately, but we're

going to treat it as exactly, the same time as the reflection from the other wall.

This simplifies some things, okay?

And the impulse response of this room, where you're standing,

is given by this formula.

So notice that it is just a bunch of shifted delta functions.

Each delta function models one reflection.

6:55

And we know that the k-th reflection must be scaled by alpha to the k because

it was attenuated k times by the walls, and there is also this free space propagation.

So notice that in the denominator, there is

this k times T term, which models the free space propagation.

And this epsilon, here, it just helps us.

It's like a patch for a formula to avoid division by zero, or if you want,

it models the first direct path to the microphone,

because the microphone is not exactly at the mouth.

It's not exactly colocated.

And what is T?

Well, it is just the time necessary for

the sound to go from the source or from the mouth to the wall and back.

So what is it?

The distance that the sound has to travel is exactly two times d over two,

so it's d, and the time it takes then is d over c, where c is the speed of sound.

Okay, so T is equal to d over c.

Okay, the speed of sound.

8:06

So, what is capital N here?

It's the time measured in samples that it takes for one reflection to occur, okay?

And it has to be an integer because we're working in discrete time, and

you want everything to be on the grid.

So we want it to be shifted by an integer number of samples.

So, we just round, okay?

So, we round the time in seconds, multiplied by the sampling frequency,

which will correspond to the number of samples.
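The formula itself is on the slide and not in the transcript, so the following is a sketch reconstructed from the spoken description: the k-th delta sits at sample kN, is scaled by alpha to the k, and is divided by kT plus epsilon, with T = d/c and N = round(T times the sampling frequency). The value of epsilon, the speed of sound of 343 m/s and the function name are assumptions.

```python
import numpy as np

def realistic_rir(d, alpha, fs, num_reflections, c=343.0, eps=1e-3):
    """Impulse response between two walls at distance d, with 1/(kT + eps) decay."""
    T = d / c                      # time for one wall-and-back trip, in seconds
    N = int(round(T * fs))         # same delay in samples, rounded to the grid
    if N < 1:
        raise ValueError("walls too close for this sampling frequency")
    h = np.zeros(num_reflections * N + 1)
    for k in range(num_reflections + 1):
        # k-th reflection: attenuated k times by the walls, plus free-space decay
        h[k * N] = alpha ** k / (k * T + eps)
    return h, N

h_real, N_real = realistic_rir(d=34.3, alpha=0.8, fs=8000, num_reflections=3)
```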

And finally, you should not see this formula as being

a very exact model for something.

But it's a very good model, so it describes very well what happens in this

situation, and it's going to serve us to derive some interesting things.

8:49

In fact, this formula is still a bit too complicated.

We don't really like this 1/kT term,

it will wreak havoc in the z-transform.

So, what we want to do is we're just going to get rid of it.

It's difficult to handle in the z-domain, it will make us struggle, so

why not simplify further?

So we're going to assume that the dominant attenuation is due to reflections only and

arrive at this approximate impulse response that we're going to use

in the beginning.

Okay, we just get rid of it, and this approximation is not very good.

But we'll see that even if it's not very good, it gives some very good results.

Okay, and the nice thing is that this has a simple z-transform.
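Dropping the free-space term leaves a train of deltas at multiples of N, scaled by powers of alpha. A minimal sketch, truncated to a finite number of reflections (the truncation and the function name are assumptions made to keep the array finite):

```python
import numpy as np

def approximate_rir(alpha, N, num_reflections):
    """Simplified impulse response: h[kN] = alpha**k, zero elsewhere."""
    h = np.zeros(num_reflections * N + 1)
    h[::N] = alpha ** np.arange(num_reflections + 1)
    return h

h_approx = approximate_rir(alpha=0.5, N=3, num_reflections=2)
```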

9:35

Okay, so now we want to hear how these things sound and see what they look like.

Okay, so this here is the simplified impulse response.

And this here, the right hand side,

is the realistic impulse response with the 1/kT term.

And we can see that they are quite different.

And they also sound quite different.

So here, we can first listen to the original sound,

this is going to be our benchmark sound.

>> One, two, three, four.

[MUSIC]

>> Okay, it's some voice and some guitar, then if you play this sound and

convolve it with the approximate room, it sounds like this.

>> One, two, three, four.

[MUSIC]

>> Sounds bad.

It sounds as if the room is really large, so we can actually

hear the individual reflections, and the walls are quite reflective.

And in what we call a realistic room, it'll sound like this.

>> One, two, three, four.

[MUSIC]

>> Okay, so it is much more natural, even though, obviously, it is not a real room.

Our goal now is to invert the room, so we have the reverberated sound.

And we want to get rid of the room influence, so

we want to remove the echoes, the reverberation from this sound.

And we're going to do it using signal processing, of course.

So the reverberated sound is given as a convolution, okay,

and here, we say that it is given as a convolution between the input sound x and

the approximate impulse response, okay?

And our goal is to design a filter that dereverberates this sound.

So we're going to have a simple linear scheme, nothing complicated.

So we want to design a new filter that we call the inverse filter, h_i here,

that when convolved with the output signal, with the reverberated sound,

gives us back the dry sound, the original input signal, okay?

11:57

So let's play with this expression, and

then get a very simple solution to this problem.

So we want the inverse filter convolved with y to give us x, okay?

But we have the expression for y,

it's just the input signal convolved with the room, okay?

12:17

And we convolve it with the inverse filter.

And now we use the properties of the convolution,

okay, and the particular one that we use here is associativity.

So we put the parentheses differently, we just parenthesize these guys.

And so we see that what we actually ask for

is that x is equal to x, convolved with something, okay?
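The regrouping step can be checked numerically. In this sketch all three signals are random toy data, and h_i is just an arbitrary filter, not yet the inverse filter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)     # input sound
h = rng.standard_normal(8)      # room impulse response
h_i = rng.standard_normal(5)    # any second filter

left = np.convolve(h_i, np.convolve(x, h))    # h_i * (x * h)
right = np.convolve(x, np.convolve(h, h_i))   # x * (h * h_i)
```

Both groupings give the same signal, so asking for h_i convolved with y to equal x is the same as asking for h convolved with h_i to be a delta.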

12:41

Okay, so first, we simply write out an expression,

a definition of the z-transform.

Then we plug in what we had computed for the room impulse response for

the approximate room impulse response, okay?

And here we just use the properties of the delta function, or the delta sequence.

So we can switch the sums, so this here is actually equal to, first, the sum over k,

and then we can put alpha to the k here, and

then we can put sum over n and whatever depends on n, inside.

So this is just z to the -n, delta of n-kN.

All right, and now the delta sequence will sieve out the values of

whatever is left here, multiplied with it at kN, right?

So we can write this out as being equal to sum over k,

and then alpha to the k, and then z to -kN, okay?

And this is exactly this expression here, with k replaced by n for whatever reason.

Okay, what remains to be done is just to sum up this geometric series,

and we did it many times, so we know how to do it, okay.
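Written out, the steps just described look as follows. This is a reconstruction from the spoken derivation, assuming the approximate impulse response is the sum over k of alpha to the k times delta[n - kN]:

```latex
\begin{aligned}
H(z) &= \sum_{n} h[n]\, z^{-n}
      = \sum_{n} \sum_{k=0}^{\infty} \alpha^{k}\, \delta[n - kN]\, z^{-n} \\
     &= \sum_{k=0}^{\infty} \alpha^{k} \sum_{n} z^{-n}\, \delta[n - kN]
      = \sum_{k=0}^{\infty} \alpha^{k} z^{-kN} \\
     &= \frac{1}{1 - \alpha z^{-N}}, \qquad |z| > \alpha^{1/N}.
\end{aligned}
```

The last step is the geometric series with ratio alpha times z to the -N, which converges when that ratio has magnitude smaller than one.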

All right, now what happens because now we can invert the room,

and as we said, z-transform of our inverse filter is just 1

over the z-transform of the room impulse response.

This is a transform of the room impulse response, okay.

And luckily, it turns out to be a very simple filter.

It's a finite impulse response filter that has only two taps different from 0,

one at position 0 and another one at position capital N.
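A quick numerical check of this inverse filter, assuming its two taps are 1 at position 0 and minus alpha at position N (which is what 1 over the geometric-series sum gives). The room response is truncated to K reflections here, so a tiny leftover tap appears at the very end:

```python
import numpy as np

alpha, N, K = 0.7, 4, 30

# Truncated approximate room: h[kN] = alpha**k for k = 0..K
h = np.zeros(K * N + 1)
h[::N] = alpha ** np.arange(K + 1)

# Two-tap FIR inverse filter: 1 at position 0, -alpha at position N
h_i = np.zeros(N + 1)
h_i[0] = 1.0
h_i[N] = -alpha

equalized = np.convolve(h, h_i)   # should be (almost) a delta
```

Everything cancels except the first sample and a residual of size alpha to the K+1 caused only by the truncation.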

14:37

And now, even if this observation might seem very innocent,

that finite impulse response filters cancel exponential impulse responses,

it's, in fact, at the basis of some modern sampling theories,

of something that is called finite rate of innovation sampling,

that you might want to look up if you're interested.

15:13

room impulse response on the left-hand side and

the realistic room impulse response on the right-hand side, okay?

And as we were designing our inverse filter exactly for

the approximate impulse response, it comes as no surprise that

we get the delta function on the left-hand side.

What may be surprising is that even on the right-hand side,

we get something that's not too far.

Even if the room impulse responses are very different,

they appear to look very different.

The bottom part shows the magnitude of the DTFT

15:54

We would expect this to be constant, to be close to 1.

And we see that it is indeed, of course,

the case for the approximate impulse response, but even when we apply it to

the realistic impulse response, we get something that somehow stays close to one.
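The frequency-domain check can be sketched by sampling the DTFT with an FFT. This example only covers the approximate room, with illustrative parameter values; the helper name dtft_magnitude is hypothetical, and the same helper could be applied to any other equalized response.

```python
import numpy as np

def dtft_magnitude(h, n_fft=4096):
    """Magnitude of the DTFT of h, sampled on n_fft // 2 + 1 frequencies."""
    return np.abs(np.fft.rfft(h, n=n_fft))

alpha, N, K = 0.7, 8, 40
h = np.zeros(K * N + 1)
h[::N] = alpha ** np.arange(K + 1)        # truncated approximate room
h_i = np.zeros(N + 1)
h_i[0], h_i[N] = 1.0, -alpha              # two-tap inverse filter

room_mag = dtft_magnitude(h)                        # strongly peaked
equalized_mag = dtft_magnitude(np.convolve(h, h_i)) # flat, close to 1
```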

16:11

And now let's hear how these things sound.

So first let us hear how it sounds if we convolve the sound with

the approximate impulse response, and then apply our inverse filter to it.

Okay, here it comes.

>> One, two, three, four.

[MUSIC]

It sounds exactly like the original sound, and this is absolutely no surprise, since,

for the equalized room impulse response, we can see that it's a delta function.

And what if the sound was convolved with the realistic impulse response, but

then we apply the filter that was designed for a different RIR, that was designed for

the approximate one?

17:34

What happens if we have a different kind of model mismatch?

Assume that we designed everything almost perfectly, but

somehow when we were designing our inverse filter,

we thought that the room has a different size than what it has in reality.

If we made just 1% error in the room size, then the things would sound like this.

Okay, first I will play the original sound,

just so that you remember how it sounded.

One, two, three, four.

[MUSIC]

Okay, and now we equalize it with a filter that was correctly designed but

for a slightly different room with 1% size error.

>> One, two, three, four.

[MUSIC]

It's just that.

And you can see what happened if you look at the equalized impulse response,

it looks nothing like the delta sequence.

Also, the frequency domain plot shows that some frequencies

are very amplified around here, and some frequencies are very

attenuated very close to these amplified frequencies.

It's nothing like the constant that it should be, right around one.

So, it's clear that something bad happened, and

it's also something to think about.

It tells us that our design method is not really robust, it's quite brittle,

actually.
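The brittleness can be reproduced with a small experiment. This sketch assumes the approximate room model and that a 1% error in room size turns into a 1% error in the delay N; the specific values of alpha, N and K are illustrative.

```python
import numpy as np

alpha, N_true, K = 0.7, 1000, 30
N_wrong = int(round(1.01 * N_true))       # inverse filter designed for a 1% larger room

h = np.zeros(K * N_true + 1)
h[::N_true] = alpha ** np.arange(K + 1)   # the actual room

h_i_wrong = np.zeros(N_wrong + 1)
h_i_wrong[0], h_i_wrong[N_wrong] = 1.0, -alpha

equalized = np.convolve(h, h_i_wrong)     # no longer a delta
mag = np.abs(np.fft.rfft(equalized, n=2 ** 16))
num_taps_left = np.count_nonzero(np.abs(equalized) > 1e-6)
```

With the wrong delay, the second tap never lines up with the reflections, so nothing cancels: many taps survive, and the magnitude response has strong peaks sitting right next to deep dips instead of staying near one.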

19:47

Notice also that what gets transmitted from Room 1 to Room 2 is not only

the voice directed into the microphone, but when the person in Room 1 speaks,

his voice gets bounced off the walls, so it gets reflected.

It gets convolved with Room 1, and

this is what gets transmitted with some delay to Room 2.

Now in Room 2 it's reproduced over the loudspeaker, and

then this sound is again convolved with Room 2, picked up by the microphone and

transmitted back into Room 1 with some delay, okay.

So what the person in Room 1 hears in his headphones is his own voice,

I mean, coming from his mouth, then he hears the person in Room 2 talking,

but he also hears the delayed version of his own voice, colored with Room 1 and

Room 2, which is extremely annoying.

And this is how it sounds.

21:05

Well, somehow the most natural idea that first comes to mind is: we know what gets

transmitted from Room 1 to Room 2.

We know what comes in to be reproduced over the loudspeaker.

So call this s of n.

So why not just subtract s of n from whatever is being sent back to Room 1?

And this is a very nice idea.

Makes a lot of sense, except that it does not sound very good.

So here's how it sounds.

>> One, one, two, three, four, four.

[MUSIC]

It doesn't help much.

So the reason why it doesn't help much is that not only s of n gets transmitted back

to Room 1, but also the reflections of s of n off the walls in Room 2, okay?

Okay, so s of n gets convolved with Room 2, and this is what gets transmitted back.

So in order to correctly do the echo cancellation,

we must first estimate the impulse response of Room 2 and

this is the role of the estimated filter here, okay.

So we must first estimate the impulse response of Room 2, and

convolve s of n with this impulse response and

then subtract this convolved signal from whatever is being sent back to Room 1.
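A toy comparison of the two ideas, naive subtraction of s of n versus subtracting s of n convolved with the room estimate. The Room 2 response g, the local speech v and the assumption of a perfect estimate g_hat are all illustrative; transmission delay and the loudspeaker response are ignored.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 200
s = rng.standard_normal(L)                  # signal played over the loudspeaker in Room 2
v = rng.standard_normal(L)                  # local speech in Room 2 (what we want to keep)
g = np.array([1.0, 0.0, 0.5, 0.0, 0.25])    # toy impulse response of Room 2

mic = v + np.convolve(s, g)[:L]             # microphone in Room 2 picks up both

naive = mic - s                             # subtracting s alone leaves the reflections
g_hat = g.copy()                            # assumption: perfect knowledge of Room 2
cancelled = mic - np.convolve(s, g_hat)[:L] # subtract s convolved with the estimate
```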

The situation is actually a bit more complicated,

since we also need to estimate the impulse response, for example of the loudspeaker,

which is not just a simple delta, okay?

And it's even further complicated by the fact that the conditions in Room 2 change.

So people move around, the temperature changes and so

on, so we must re-estimate the impulse response all the time.
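The lecture does not say how this re-estimation is done. One common choice, used here purely as an illustrative assumption, is a normalized LMS (NLMS) adaptive filter that updates the estimate at every sample. The step size mu, the regularizer eps and the toy Room 2 response are made up for this sketch.

```python
import numpy as np

def nlms(s, mic, num_taps, mu=0.5, eps=1e-8):
    """Track the echo path from s to mic with a normalized LMS update."""
    g_hat = np.zeros(num_taps)
    buf = np.zeros(num_taps)          # most recent samples of s, newest first
    err = np.zeros(len(s))
    for n in range(len(s)):
        buf = np.roll(buf, 1)
        buf[0] = s[n]
        err[n] = mic[n] - g_hat @ buf                 # echo-cancelled output sample
        g_hat = g_hat + mu * err[n] * buf / (buf @ buf + eps)
    return g_hat, err

rng = np.random.default_rng(0)
s = rng.standard_normal(5000)                 # far-end signal, white noise for the demo
g = np.array([1.0, 0.0, 0.5, 0.0, 0.25])      # toy Room 2 response
mic = np.convolve(s, g)[:len(s)]              # echo only, no local speech
g_hat, err = nlms(s, mic, num_taps=len(g))
```

In this noise-free toy setting the estimate converges to g and the residual echo goes to zero; with local speech present, a real system would also need to decide when to freeze or slow down the adaptation.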

If we take all these things into account, then after properly

doing the echo cancellation, this is the sound that we get.

>> One, two, three, four.

[MUSIC]

As expected, the result is near perfect, right?

Because we assumed perfect knowledge of Room 2.

And so the only information that we have in this sound now is coming from

the fact that it is convolved with Room 1.