0:01

In the last section we argued that a good basic coding

model for many neural systems is a combination of a linear filter,

or feature, that extracts some component from the stimulus, and a

nonlinear input-output function that maps the

filtered stimulus onto the firing rate.

Our goal in this section is to understand

how to find the components of such a model.

You'll be doing this for yourself in the homework.

We'll then go on to think about how to modify

this model to incorporate other important neuronal properties.

0:31

Let's step back to the original problem, which is to build a model like this.

To build this general model, our problem is dimensionality.

Let's cast our minds back to the case of the movie we showed the retina.

We can define a movie in terms of the intensity of

three colors in every pixel in, say, a one megapixel image.

And to capture any time dependence, we'll also need to keep

enough frames of the movie to go back for maybe a second.

So each example of a stimulus is given by about 3 million values per frame times

maybe 100 time points, on the order of 300 million values.

That's just one stimulus.

To sample the distribution of possible stimuli, when each is

specified by hundreds of millions of values, is just impossible.

It would be impossible to fill up that response distribution,

even if our stimulus had just 100 dimensions.

The amount of data needed is unmanageable.

So we need a strategy to find a way to pull out one or two or a

few meaningful components in that image, so that we

have any hope of even computing this response function.

So to proceed at all we need to find the feature that drives the neuron.

To do this, we'll sample the responses of this

system to many stimuli, not enough to build the complete

model, but just enough so we can learn what it is that really drives the cell.

That will let us go from a model that depends on arbitrary characteristics

of the input to one that depends only on the key characteristics.

1:57

So, we're going to start with a very high dimensional description.

Let's say, a time-varying waveform or an image.

And pick out a small set of relevant dimensions, that's our goal.

So how do we think

about our arbitrary stimulus as a high dimensional vector?

2:15

So we start with our s(t).

What we're going to do is discretize it, so we take time t1.

We take

the value of the stimulus at that time, and we'll call it s1. At time t2, we take

the value of the stimulus at that time, and

we plot these two values as a point in this 2-dimensional space.

2:35

As we keep taking more and more time points, that gives us more

and more axes in this diagram in which we're now plotting that stimulus.

So this is s(t), plotted as

the components of its representation at these different time points.

3:07

One common and useful method to use is Gaussian white noise.

Gaussian white noise is a randomly varying input, which is generated

by choosing a new Gaussian random number at each time step.

In practice, the time step sets a

cut-off on the highest frequency that's represented

in the signal.

White noise, therefore, contains a very broad spectrum of

frequencies, and in fact, depending on how the noise

is smoothed in practical applications, almost all the frequencies

present in the signal appear with equal power.

Here's an example of a white noise input that's been smoothed a little bit.

You'll be using an example of white noise in your problem sets.
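
As a rough sketch, here's what generating such a stimulus might look like in Python; the time step, duration, and smoothing width here are illustrative choices, not the ones used in the problem sets.

```python
import numpy as np

dt = 0.001                      # time step; in practice this sets the highest frequency represented
duration = 10.0                 # seconds of stimulus
n = int(duration / dt)

# Gaussian white noise: an independent Gaussian draw at every time step
stim = np.random.randn(n)

# optional light smoothing, as in the plotted example
width = 5                       # boxcar width, in time steps
stim_smooth = np.convolve(stim, np.ones(width) / width, mode='same')
```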

3:47

Now each chunk of white noise, let's say a hundred

time units long, can be plotted in a hundred dimensional space.

The axes I've drawn here might describe the value

at time t1, the value at time t2, et cetera.

As we continue to stimulate with new examples of white

noise, the different examples are plotted as different points.

And they start to fill up a distribution,

because, remember, each one of these examples

is chosen randomly.

4:46

Now, a multidimensional distribution that's Gaussian in

all directions is called a multivariate Gaussian.

The beauty of such a distribution is that

it's Gaussian no matter how you look at it.

Suppose we chose to look at the distribution of stimuli projected onto

some other dimension, one that's not among our original time points, but maybe

some linear combination of them.

Let's take a new vector and now project our stimuli onto that new vector.

We would find that even along that

new vector, the distribution is again Gaussian.

5:23

Now let's take a look at the stimuli that trigger spikes to happen.

Here's one, and let's say there are a bunch more.

You'll notice that there's some structure in this group of points.

Ordinarily, if I were really plotting an arbitrary

choice of three of the hundred possible dimensions,

I wouldn't be able to see this.

I need to search for the right way to rotate

this hundred dimensional cloud, so that I can see that structure.

One way to find a good coordinate axis is to take the average of these points.

6:07

So let's imagine we now take this vector through the data.

And let's project all these spike-triggering points onto that vector.

They're all going to have projections that are large and

similar to one another, so this will be

the distribution of points projected onto the spike-triggered average.

6:33

While I wanted to give you a geometrical perspective

on what you're doing, that might seem a little abstract.

Operationally it's quite straightforward and intuitive.

Let's say you gave this system a long random white noise stimulus like

this one, which is just a scalar quantity that's varying randomly in time.

And this neuron spiked during the presentation several times.

Here's a spike, here's another spike, here's another spike.

7:00

Every time there's a spike, we look back in time

at the chunk of stimulus preceding that spike, and grab it.

Put it down in this list.

This will be one example of your spike-triggering stimulus set.

7:24

So what you're doing

is approximating whatever is common to all of the stimuli that triggered a spike.

So if all goes well, you'll see that

this average is much less noisy than the examples.

And it's generally quite sensible looking.

So what this system apparently likes to see is an

input that generally ramps up a bit, and then goes down.
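
As a minimal sketch of this grab-and-average procedure in Python, assuming the stimulus has already been discretized into an array `stim` and the spike times converted to sample indices `spike_indices` (both names are illustrative, not from the course materials):

```python
import numpy as np

window = 100                                   # how many time steps to look back before each spike

def spike_triggered_average(stim, spike_indices, window):
    """Average the chunks of stimulus that precede each spike."""
    chunks = [stim[i - window:i]               # the stimulus segment leading up to spike i
              for i in spike_indices
              if i >= window]                  # drop spikes too early to have a full window
    return np.mean(chunks, axis=0)

# e.g. spike_indices = (spike_times / dt).astype(int)
# sta = spike_triggered_average(stim, spike_indices, window)
```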

7:52

Here's an example of the same procedure, but when the stimulus

is not just a scalar value, but more like an image.

Every column here is an image with pixels of different colors,

maybe one that's been unwrapped into a single vector of values.

The spike-triggered average, now an average over these chunks of spatiotemporal

data that precede every spike, now has both

a time dimension and also a space dimension.

8:25

Now let's go back to dealing only with time.

So back in the time representation that we

introduced before, our spike-triggered average is some vector.

We'll take it to be a unit vector. Let's call it f.

This is the object of our desire, the single

feature that captures a relevant component of the stimulus.

Now, recall the previous section of the lecture.

What do we do with this identified feature?

We used it as a linear filter.

Linear filtering we said is the same as convolution.

And it's also the same as projection.

Let's take some arbitrary stimulus s; remember, we can

represent it as a point in this high dimensional space.

9:02

And if we filter it by

this spike-triggered average, that's the same as

projecting it as a vector onto the

spike-triggered average, which is also a vector.

So what does that mean?

We have this vector

of our stimulus s. To project it onto f means that we

take its component that's aligned along the

direction of f. This is s·f.
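
In code, that projection is just a dot product, and filtering the whole stimulus with the time-reversed feature computes it for every chunk at once. A sketch, reusing the illustrative `stim`, `window`, and `sta` from the earlier sketches; `t_index` is an arbitrary time index.

```python
import numpy as np

f = sta / np.linalg.norm(sta)          # take the feature to be a unit vector

# projecting one stimulus chunk s onto the feature is just a dot product
s = stim[t_index - window:t_index]     # the chunk of stimulus leading up to time t_index
s1 = np.dot(s, f)

# filtering the whole stimulus with the time-reversed feature gives
# that same projection for every length-`window` chunk at once
s1_all = np.convolve(stim, f[::-1], mode='valid')
```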

9:46

Okay.

Now we've seen that a good way to find a feature that drives the neural

system is to stimulate with white noise and

use reverse correlation to compute the spike-triggered average.

This is a good approximation to our feature, f1.

Now, how do we proceed to compute the input/output

function of the system with respect to this feature?

10:07

Remember that we're trying to find the probability of

a spike, given the stimulus, but where the stimulus, here,

is now replaced only by the component of the

stimulus that's extracted by the linear filter that we've identified.

We can find this relationship from quantities we can measure in data

by rewriting it using an identity

about conditional distributions known as Bayes' rule.

We can rewrite this

probability of a spike given s1 in terms of the probability of s1 given a spike,

times the probability of a spike, divided by the probability of s1: P(spike | s1) = P(s1 | spike) P(spike) / P(s1).

10:41

Let's see what this means.

We have this now in terms of two distributions: here the prior, again, now

the prior only with respect to that

one variable that we've extracted from the stimulus.

And here what we call the spike-conditional distribution, the distribution of s1 conditioned on a spike.

11:07

We run a long stimulus and collect a bunch of spikes.

We project the stimulus onto our feature, f1, extracting component s1.

Here's s1 and here are the spikes.

We use this long stimulus run to make a histogram of s1 here.

11:47

Hopefully, that spike-conditional distribution, the histogram of s1 values just before each spike, is different from the prior.

We take their ratio, as we see here, and

scale it by the overall probability of a spike.
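
A minimal sketch of that ratio, assuming `s1_all` holds the projection onto the feature at every time step (as above) and `s1_spikes` holds the projections just before each spike; both names, and the number of bins, are illustrative.

```python
import numpy as np

bins = np.linspace(s1_all.min(), s1_all.max(), 30)

# prior: distribution of the projection over the whole stimulus, P(s1)
p_prior, _ = np.histogram(s1_all, bins=bins, density=True)

# spike-conditional: distribution of the projection just before spikes, P(s1 | spike)
p_cond, _ = np.histogram(s1_spikes, bins=bins, density=True)

# overall probability of a spike per time bin, P(spike)
p_spike = len(s1_spikes) / len(s1_all)

# Bayes' rule: P(spike | s1) = P(s1 | spike) * P(spike) / P(s1)
with np.errstate(divide='ignore', invalid='ignore'):
    nonlinearity = p_cond * p_spike / p_prior      # bins with no prior samples come out as nan
```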

12:09

Let's say that our neuron fires at random times. Then when we build a histogram of our

stimulus, that is, the prior distribution, and a histogram

of the special stimuli that trigger spikes, we'll find

that those stimuli are actually not so special: since

the stimulus points associated with spike times are just a

random sampling from the Gaussian prior, their distribution is

just the same as the distribution of the prior.

This could mean either that the stimulus

had nothing to do with the firing of the

neuron in the first place or else that we chose

the wrong component and we filtered out whatever it was

about the stimulus that this neuron is actually responding to.

12:54

What we want to see is a nice

difference between the prior and the spike-conditional distribution,

which is going to result in an

input/output curve that has interesting structure.

So here our input/output curve tells us that

the neuron, as we saw previously, tends to fire,

that is, the model predicts a high firing rate, when

the projections onto our identified feature are large.

This is success.

13:20

So now let's go back to the basic coding

model that we developed and think about what's missing here.

We managed to get our dimensionality all the way down to 1.

Was that necessarily a good idea? Let's relax that a bit

and add back something potentially important:

the possibility of sensitivity to multiple features.

Now the need for this should be intuitive.

We base all our decisions on many input

features and here's one of the most important ones.

Unless you have a brain full of Pamela Anderson neurons, though

personally I think I have only one Pamela Anderson neuron,

generally we choose a partner or a friend on the basis

of many characteristics: flexibility, generosity,

the ability to cook, political affinity.

There are also many characteristics that enter into the

description of a person that may not matter to you

at all for their suitability as a friend, maybe

their eye color or their height or their typing speed.

14:24

To express this in terms of the models that we've

been looking at so far, what we mean is that

now we want to consider that there's not just one but

several filters, each selecting a different component of the input.

The non-linear response function now combines the responses

of those different components in maybe non-trivial ways.

Let's take a simple auditory example.

Let's imagine we have a chord-detecting neuron.

So f1, the first feature,

selects frequency one, f2 selects the second frequency.

But only when both frequency one and frequency two are present in

the input will we get a large firing rate, given this nonlinearity.
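
As a toy sketch of such a two-feature model, with everything invented for illustration: two sinusoidal features tuned to different frequencies, and an AND-like nonlinearity that gives a large rate only when both projections are large.

```python
import numpy as np

dt = 0.001
t = np.arange(100) * dt                          # a 100-sample stimulus window

# two made-up features tuned to two different frequencies
f1 = np.sin(2 * np.pi * 100 * t); f1 /= np.linalg.norm(f1)
f2 = np.sin(2 * np.pi * 250 * t); f2 /= np.linalg.norm(f2)

def firing_rate(s, f1, f2, r_max=100.0):
    """AND-like nonlinearity: a high rate only when both projections are large."""
    s1, s2 = np.dot(s, f1), np.dot(s, f2)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return r_max * sigmoid(s1) * sigmoid(s2)
```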

15:12

Let's go back to our picture of the white noise

experiment to think how we could find these features in data.

So we saw that we could take the

average of the points and compute the spike-triggered average.

But we can extract more information from that cloud of points.

15:27

One could also, for example, compute the next order moment, or its covariance.

To do this, we apply a method something like principal component analysis, or PCA.

15:37

I realize that most of you probably aren't familiar with this technique, and we don't

really have time here to build up the tools that we need to derive it properly.

So I'll just describe a little bit about what it does.

Its job is to find low dimensional structure, in that cloud of points.

15:52

So PCA is a general, famous, and kind of magical

tool for discovering low dimensional structure in high dimensional data.

Here's an illustration of what it gives you.

Let's say you have a cloud of data where each data point

has an x, y, z coordinate, and we plot it in this three dimensional space.

But

16:13

in fact, unbeknownst to us, all the data actually lie on a two dimensional plane.

So if we run PCA on this data

we'll discover that there are two so-called principal

components, and these components correspond to an orthogonal

set of vectors that span that two-dimensional cloud.
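
A bare-bones sketch of what PCA computes for such a cloud of points, using a plain eigendecomposition of the covariance matrix rather than any particular library routine; `data`, with one point per row, is an illustrative name.

```python
import numpy as np

def principal_components(data, k=2):
    """Top-k principal components of a cloud of points (one point per row of `data`)."""
    centered = data - data.mean(axis=0)            # subtract the mean of the cloud
    cov = np.cov(centered, rowvar=False)           # covariance matrix of the cloud
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: the covariance is symmetric
    order = np.argsort(eigvals)[::-1]              # sort directions by variance, largest first
    return eigvecs[:, order[:k]]                   # columns are the principal components

# projected = (data - data.mean(axis=0)) @ principal_components(data, k=2)
```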

16:32

So this feat of discovery doesn't look super-impressive when

all we're doing is reducing three dimensions to two.

We could have just

rotated our axes around and noticed that.

But what if, as is generally the case, we start with hundreds

of dimensions, and we're hoping that

our data has some low dimensional structure?

We'll never find it by plotting one coordinate against another,

as those dimensions that are important are some

unknown linear combination of the original coordinates.

Here we had x, y, and z and our plane

is defined by some linear combination of our original axes.

17:05

Generally the dimensions that pick out the relevant structure in the data will be

some linear combination of our stimulus coordinates

in their original basis, perhaps time or space.

For those of you with some linear algebra, PCA gives us a new basis set in which

to represent our data; a basis set that

generally is a lot smaller than our original representation.

So we get a lot of compression. And also, it's a basis set that's well

matched to our particular data set, unlike a

standard basis set, like a Fourier basis, for example.

17:47

And most faces can be pretty well reconstructed from a small set, maybe

seven or eight, of principal components, computed from a big bunch of faces.

So these are called eigenfaces.

If we have a new face that we want to fit

with these faces, we can construct it as a linear combination

of Fred, George, Bob, and Bill.

18:06

So if we represent any new face in

terms of sums of these computed eigenfaces, instead of the

intensity values of each pixel in the image, we can

represent the face using seven or eight numbers instead of hundreds.
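
That reconstruction is just a projection onto the eigenfaces plus the mean face, something like the following sketch, where `eigenfaces` (one component per column) and `mean_face` are assumed to have been computed from the big bunch of faces; all the names are illustrative.

```python
def reconstruct(face, eigenfaces, mean_face):
    """Rebuild a face from a handful of eigenface coefficients."""
    coeffs = eigenfaces.T @ (face - mean_face)   # seven or eight numbers instead of every pixel
    return mean_face + eigenfaces @ coeffs
```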

18:22

Dimensionality reduction using PCA has a lot

of practical uses in neuroscience experiments too.

For example, it can be used to sort out spike waveforms

that were recorded on the same electrode from two or more different neurons.

Let's say one neuron gives a spike that

has a nice clean signal that looks like this.

18:54

PCA can pick out two components that capture

the largest amount of variance in the data.

Now you project each noisy data point, each

example of a recording onto these two components.

19:08

Usually, this will keep the two components that

span the waveforms of the two neurons' spikes.

All the components that get thrown away are just noise.

You can then plot all of the different

data points that were recorded, projected onto those two

features, so now you're seeing every data point projected

into the space defined by feature one and feature two.

And in this new two-dimensional coordinate

frame, the waveforms from the two cells are now clearly separable.
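
A sketch of that spike-sorting step, reusing the illustrative `principal_components` helper from the PCA sketch above; `waveforms` would hold one recorded spike snippet per row.

```python
# waveforms: an (n_spikes, n_samples) array, one recorded spike snippet per row
components = principal_components(waveforms, k=2)            # the two directions with most variance
scores = (waveforms - waveforms.mean(axis=0)) @ components

# scatter-plotting scores[:, 0] against scores[:, 1] gives the two-dimensional
# view in which the waveforms from the two cells fall into separate clusters
```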

19:40

So let's go back to white noise and neural coding.

Here's an example from neural coding where PCA was used to

find multiple features and where that turned out to be very helpful.

Here you're looking at a scatter plot of all

the stimuli that drove a retinal ganglion cell to fire.

Each stimulus, each blue dot, was 100 time steps of a white noise flicker.

So just a scalar that varied in time.

But now we've reduced each

one of those stimuli to a point in two dimensions by projecting

it onto the two features that we found, feature one and feature two.

20:14

For this particular retinal ganglion cell, the spike-triggered average

was close to zero and this picture shows you why.

When we look at the stimuli that trigger spikes, it turns out that two

groups of stimuli drove the neuron, and the

average of the entire set is approximately here.

It's near zero.

21:15

So this neuron both likes it when the light goes

on, and it likes it when the light goes off.

If we averaged all of those stimuli together, we'd get nothing.

But if we use this technique, where we can now pull out two different

features, and plot our stimuli in that

two dimensional space, now that structure is revealed.

21:34

It's important to realize that the two features, f1 and f2, that

we found here are not themselves the on and the off features,

but the analysis allowed us to find a

coordinate system in which we could see that structure.

21:49

Okay.

I've been making a lot of use of your linear algebra neurons.

Let's give them a bit of downtime with

the relaxing view of a little eigenpuppy.

Although we were not necessarily able to go

into the details, I hope you got the flavor

of the construction of these kinds of models, and

a sense for why multidimensional models can be useful.

There are a lot of good resources to learn more

about these techniques, and we will post them on the website.