0:47

We need a way to take any sound that we might want to look at and represent it as a series of sine waves. And we'll talk about some implications of this algorithm, in particular two parameters, the frame size and the bin width, that we need to think about very carefully as we're configuring it, because they have some serious implications for what we get out of it. So, it's pretty obvious now that we know how sound is represented digitally on a computer.

It's pretty obvious how a waveform representation like this comes about.

You know, we simply take the successive amplitude values and plot them over time on the x axis, and then we have our waveform. We can connect the dots if we want, to make it look a little nicer.

But how we get from this to that is not obvious, because when we represent sound digitally we're encoding a series of amplitude values over time. We're not encoding any information about frequency at all.

So that's why we need to think about this a little bit more carefully, and think about how we get to this. We're going to revisit the Fourier Theorem, which we looked at in the timbre video earlier. I want to look at it in a little more depth now.

Just to recap, we said the Fourier Theorem states that any periodic waveform can be represented as a sum of sine waves at frequencies that are integer multiples of a fundamental frequency. And we looked at examples of this with a sawtooth wave, and with a trombone sound, of how we could combine these sine waves together.

I mean, we wouldn't hear them anymore as individual sine waves; we'd hear them coming together to create this single sound for us, because of the special relationship they had to each other, being integer multiples of this base frequency, and because of the way that they were linked.
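As a sketch of the theorem in action, we can build an approximation of a sawtooth wave by summing sine waves at integer multiples of a fundamental. The 1/k amplitude for the k-th harmonic is the standard sawtooth series, not something stated in the video:

```python
import numpy as np

sample_rate = 44100
f0 = 220.0                                  # fundamental frequency, Hz
t = np.arange(sample_rate) / sample_rate    # one second of time points

def sawtooth_approx(num_harmonics):
    """Sum sine waves at integer multiples of f0, each at amplitude 1/k."""
    wave = np.zeros_like(t)
    for k in range(1, num_harmonics + 1):
        wave += np.sin(2 * np.pi * k * f0 * t) / k
    return wave

one = sawtooth_approx(1)     # a plain sine wave
many = sawtooth_approx(30)   # audibly brighter, visibly saw-shaped
```

The more harmonics you add, the closer the sum gets to an ideal sawtooth; a perfect one would need infinitely many, which is exactly the problem discussed below.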

At the time I also mentioned a really important limitation here, this periodic limitation: this only works for periodic waveforms, like a perfect sine wave or a perfect square wave or something like that. And that isn't how sounds work in the real world. They're not perfectly periodic; they don't repeat a cycle infinitely, over and over and over again, without any variation.

So that's one problem: we've captured the spectral aspect of timbre here, but not the envelope of timbre, not the changing-in-time aspect of it. The other problem, and this is actually something I didn't write into this text for the Fourier Theorem at the time, is that when we say "sum of sine waves" there's an important caveat: a potentially infinite number of sine waves may be required to do this.

And computers don't tend to like infinity very much; they're not continuous beings. They're discrete; they do things as sets of zeros and ones. So if we need a potentially infinite number of sine waves to do this, that's also going to be really problematic for us.

And so what we do instead is use this basic idea of the Fourier Theorem, but we tweak it a little bit; we fake it out, if you will, to pretend that we're working with periodic waves. We do a process that doesn't do things perfectly, but doesn't use an infinite number of sine waves either. And so there are three stages to this that I'm going to talk about in detail. Windowing is when we take a waveform and split it up into tiny little bits. Then we take each of those tiny little bits and do this thing called periodicization.

There's really nothing to this: we just pretend that that little bit repeats infinitely, so that it is a periodic sample. And then on each of those little windows we apply a method called the Fast Fourier Transform, which you'll often see abbreviated as FFT. We apply this process in order to convert our time-domain set of amplitude values into information about frequency. So, I'm going to go through each of these steps in more detail now.

The first step is windowing. What we do is divide the audio into equal-size, overlapping frames. So, let me show you what I mean. We pick a number of samples that would be included in each frame; our frame size might be 1024 samples, for instance. So these are tiny frames: 1024 samples, if our sampling rate were 44,100 Hz, is about a 43rd of a second.

So, tiny fractions of a second. And so if we were taking this waveform and splitting it up, we might have that be one frame. And then we're going to overlap them with each other: that might be another, that might be another, and another, and so on and so forth.

Well, all the way through our file. But it's more complicated than this, actually, because these frames are overlapping and we want smooth transitions from one to the next. As we're doing this, each of them kind of fades in and fades out. So the first one fades in and out with an amplitude envelope like that; this one fades in and out too, and this one, and so on. There's always one that's fading in and always one that's fading out, with an overlap like that, and so on and so forth. So that's what windowing is: we end up with these windows that fade in and fade out, each a tiny fraction of a second long. Then we take each of those windows and,

this is the easy part, we pretend that it's a periodic function.

So we take this tiny little window here, and we repeat it, and repeat it, again and again. Okay, we just pretend that this goes on forever. So now we've met the periodic requirement of the Fourier Theorem.
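Here's a minimal sketch of windowing in Python. The specific fade shape (a Hann window) and the 50% overlap are my assumptions, since the video doesn't name them, but they're common defaults:

```python
import numpy as np

def make_frames(signal, frame_size=1024, hop=512):
    """Split a signal into overlapping frames, each faded in and out
    by a window so adjacent frames cross-fade smoothly."""
    window = np.hanning(frame_size)   # the fade-in / fade-out envelope
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size] * window)
    return np.array(frames)

# One second of audio at 44,100 Hz, with frames overlapping by half.
audio = np.random.randn(44100)
frames = make_frames(audio)
print(frames.shape)   # (85, 1024)

# The FFT then treats each frame as if it repeated forever --
# that's the "periodicization" step; nothing to compute, just an assumption.
```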

7:13

The details of how this algorithm works are a little bit beyond the scope of this course; I encourage you to look up more details if you're interested, and I'll point you towards some references. But right now I just want to treat it as a black box, and explain what goes in and what goes out. What goes in are the amplitude samples over time in the frame: if our frame size is 1,024, we'd have 1,024 amplitude values going in. And what comes out is a set of amplitudes and phases, one for each frequency bin. So in other words, the algorithm divides the frequency space into a series of linearly spaced bins, and I'll get into more of how this works in a second. Then I'm going to look at what's going on in each of those bins: how much energy there is in each one, and also the phase of the sine wave represented by each bin. And there are some simple ways to calculate how the algorithm divides this up. My number of frequency bins is half of my frame size, and the width between each of these bins, from one to the next, is my Nyquist frequency, the highest frequency I can represent at my sampling rate, divided by my number of bins. So let's work through an example here

just to make sure this is totally clear. If my frame size is 1,024 samples and my sampling rate is 44,100 Hz, then my Nyquist frequency would be 44,100 divided by two, so 22,050 Hz. My number of bins is the frame size, 1,024, divided by two, so 512. And my bin width is my Nyquist frequency, 22,050 Hz, divided by my number of bins, 512. This comes out to a little bit more than 43 Hz.
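A quick sketch of this black box using NumPy's FFT. One detail to note: `np.fft.rfft` actually returns `frame_size // 2 + 1` values, because it includes both the 0 Hz bin and the Nyquist bin; the 440 Hz test tone is just my example input:

```python
import numpy as np

sample_rate = 44100   # Hz
frame_size = 1024     # samples per frame

nyquist = sample_rate / 2         # 22050.0 Hz
num_bins = frame_size // 2        # 512
bin_width = nyquist / num_bins    # ~43.07 Hz

# Feed one frame through the FFT: a 440 Hz sine wave.
t = np.arange(frame_size) / sample_rate
frame = np.sin(2 * np.pi * 440 * t)
spectrum = np.fft.rfft(frame)     # one complex number per bin

amplitudes = np.abs(spectrum)     # how much energy in each bin
phases = np.angle(spectrum)       # phase of each bin's sine wave

# 440 Hz falls nearest bin 10 (440 / 43.07 is about 10.2).
print(np.argmax(amplitudes))      # 10
```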

So that means my frequency bins are going to be spaced at 0, 43, 86, 129 Hz, and so on and so forth, all the way up to 22,050 Hz. So that's how the frequency space is divided up, and I then have information at that point about what's going on in each of those frequency areas. And so you can see how I could generate a sonogram from there: I could take each of these frames and generate one vertical strip of the frequency view in my sonogram, based on the data that's coming back. I'm going to show you how that works in a second. But before I do that I want to talk about

some of the issues with this process, because it is not a perfect process.

First of all, it's a lossy process: I lose data in this process. If I do this Fast Fourier Transform and then go back to my waveform, I've lost something in the process, because I've split these things up into these linear frequency bins. So I only know what's happening at a fairly low resolution as I move up in frequency. And I also only know things at a fairly low resolution in terms of time, because I only know what's happening frame by frame, 1,024 samples at a time in the example we've been using.

And so there's actually a big trade-off here when I pick my frame size, in terms of how much resolution I want in the time domain versus how much I want in the frequency domain. If I want to know exactly when things are happening in time, along my x axis, I can pick a very low frame size, so my frames are really tiny and I get a lot of time resolution horizontally. But then my bin width gets huge, and I know very little about what's happening vertically, in my frequency dimension. If I want to know a lot vertically, in my frequency dimension, I can pick a really high frame size. But then a lot of time passes from one frame to the next, and I lose a lot of resolution horizontally, in the time domain. I'll show you this in a demo in a second.

The one point I wanted to make first, before I go there, is this word here: linearly. These bins divide the frequency space linearly. But if you remember from our video on psychoacoustics, we actually hear pitch not linearly but logarithmically. And so a lot of these frequency bins are, kind of, wasted, if you will, on things very high up in frequency space: half of the bins are for what we would hear as just the final octave of our frequency space.
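You can check that claim with a quick calculation: with linearly spaced bins, everything from half the Nyquist frequency up to the Nyquist frequency, which we hear as a single octave, takes up half of the bins:

```python
sample_rate = 44100
frame_size = 1024

nyquist = sample_rate / 2      # 22050 Hz
num_bins = frame_size // 2     # 512
bin_width = nyquist / num_bins

# Centre frequency of each bin: 0, ~43, ~86, ... Hz
bin_freqs = [k * bin_width for k in range(num_bins)]

# The final audible octave runs from Nyquist/2 up to Nyquist.
top_octave = [f for f in bin_freqs if f >= nyquist / 2]
print(len(top_octave), "of", num_bins)   # 256 of 512
```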

So this isn't a great match either, but that's how this particular algorithm works. So let me go ahead and open this up

in Reaper for you. I'm going to play a sound here, and I want to show you the sonogram for it. And we have an option here to pick our frame size, so I'm going to show you how things start to look different, that time-versus-frequency resolution trade-off, as I pick different frame sizes.

13:03

[MUSIC] So we're at 1024 samples right now, which is a good compromise. But what if I want really, really good frequency resolution? I might go up to 32,768 [MUSIC], and you can see how much clearer things are in the vertical dimension. It's a lot less grainy; I can tell exactly where things are happening vertically, in the frequency dimension [MUSIC], but now it's kind of blurred in the horizontal dimension. I don't get a very good sense of the rhythm at all anymore, and this is a very rhythmic sample. So what if I went down to something really low, like 16? Now I see the rhythm very precisely. You can see all those peaks representing every single note that's playing, but I see almost nothing [MUSIC] to represent what's happening in the vertical dimension. It's all just these bars [MUSIC] that are more or less the same height as each other; I'm not seeing much at all about where specifically things are happening in frequency space.

So I just wanted to show that to you to illustrate that trade-off, that decision you have to make when you pick the frame size. And that's why pretty much anything that does frequency-domain analysis is going to give you the option of choosing the frame size.
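That trade-off can be put into numbers. For the three frame sizes used in the demo, at the same 44,100 Hz sampling rate, the time resolution (frame duration) and frequency resolution (bin width) pull in opposite directions:

```python
sample_rate = 44100  # Hz

for frame_size in (16, 1024, 32768):
    time_res = frame_size / sample_rate                 # seconds per frame
    bin_width = (sample_rate / 2) / (frame_size // 2)   # Hz per bin
    print(f"{frame_size:>6} samples: {time_res * 1000:8.2f} ms per frame, "
          f"{bin_width:8.2f} Hz per bin")
```

At 16 samples each frame lasts a fraction of a millisecond but the bins are over 2,700 Hz wide; at 32,768 samples the bins are narrower than 1.5 Hz but each frame lasts nearly three-quarters of a second.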