0:00
In this section we'll be addressing how information-theoretic ideas can help us
to understand how the neural code may be specially adapted to the structure of
natural signals. We'll first look briefly at some of the
special properties of natural inputs, and then at some theories of how the code should
behave. Finally, we'll sum up with some
suggestions about the principles that may be at work in shaping the neural code.
So I'm going to show you some photos that were taken by one of our postdocs, Fred
Sue, as he was sitting in his apartment on one of our typical sunny Seattle
afternoons, looking out at the view. He tried to take a picture that both
encompassed his beautifully furnished apartment and the grand view outside.
You can see that he had to change his f-stop over a wide range in order to be
able to capture information both about the scene inside and about the world
outside. Now this is something that our eye does
effortlessly. If you were sitting here at this table,
you would be able to see both the inside and the outside with perfect fidelity.
So looking even at this familiar example, we can see two properties that are
characteristic of natural inputs. One is that there's a huge dynamic range.
There are variations in light level and contrast that range over orders of
magnitude. We can see signs of another property by
comparing these two boxes. Because of effects of depth and
perspective, there's similar structure, similarly well-defined shapes and objects
at very different length scales. This is reflected in the power spectrum
of natural images. If one computes the power in different spatial frequency
components, this function has a power-law form. That is, it scales like the
frequency to the power minus two, P(f) ∝ 1/f². This reflects the lack of any
characteristic scale: similar structure appears at every scale.
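If you'd like to check this yourself, here's a minimal sketch in Python of how one might estimate that exponent from an image. The random array below is only a stand-in for a real natural image; a genuine natural scene is where the slope near minus two would appear.

```python
# Sketch: estimate the power-law exponent of an image's spatial power
# spectrum. A natural image should give a slope near -2; the random
# stand-in used here will give a slope near 0 (it's white noise).
import numpy as np

def radial_power_spectrum(image):
    """Radially averaged power spectrum of a 2-D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(f) ** 2
    ny, nx = image.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx // 2, y - ny // 2).astype(int)
    # average the power over annuli of constant spatial frequency
    radial = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    freqs = np.arange(len(radial))
    return freqs[1:], radial[1:]            # drop the DC term

image = np.random.rand(256, 256)            # stand-in for a natural image
freqs, spectrum = radial_power_spectrum(image)
slope, _ = np.polyfit(np.log(freqs), np.log(spectrum), 1)
print(f"fitted exponent: {slope:.2f}")      # ~ -2 for natural scenes
```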
Despite these scale differences and the very large variations in light and
contrast across the image, we'd like to be able to distinguish detail at every
point in it, unlike this camera.
These basic issues arise for almost all of our senses.
Here's an audio track of a chunk of speech.
The signal is full of complex fluctuations that carry detailed
information about pitch and nuance. However, these fast variations are
modulated by the relatively huge variations in amplitude that make up the
envelope of speech. We're perfectly capable of understanding
all of these signal components regardless of the overall amplitude, even when there
are multiple speakers, or they're far away.
So how can a neural system, with a limited range of responses, manage to
convey the relevant information about details in the face of these huge
variations of scale? Recall that we found that the entropy reached its maximum
when the two symbols were used equally often. Now, if we're thinking about
maximizing the mutual information, we also have to take into account the noise
term. But generally the amount of noise for a given stimulus is not something
that's easily controlled, while the total response entropy is something that's
in the hands of the encoder. Let's see how.
Let's imagine that the stimulus that a system needs to encode is varying in
time; this is s(t), and it has some distribution, p(s), over here.
Our job as an encoder is to map the stimulus onto the symbols that we have at
our disposal. Let's imagine that we're constrained to
use some maximal firing rate, so we have some limited range of possible symbols at
our disposal, say zero to 20 hertz. How should we organize that mapping so
that we end up with the most efficient code?
We'll get the most information by maximizing our output entropy.
That is, by using all of our symbols about equally often.
So what does that imply for the shape of this curve?
So what we should do is move along our stimulus distribution and encode equal
shares of that distribution with each symbol.
If we have 20 symbols, let's count up 1/20th of our total area under this curve,
and assign that to symbol one. What this amounts to is a response curve
that's given by the cumulative integral of the stimulus distribution.
Another name for this is histogram equalization.
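For those who want to see this concretely, here's a small Python sketch of the idea. The 20 Hz ceiling is the illustrative number from above, and the Gaussian stimulus is just an example distribution.

```python
# Sketch of histogram equalization: map stimulus values through the
# empirical cumulative distribution, scaled to the output range, so
# that every firing-rate "symbol" is used about equally often.
import numpy as np

def equalizing_response(stimulus_samples, r_max=20.0):
    """Return r(s) = r_max * P(S <= s), the cumulative integral of the
    stimulus distribution scaled to a 0..r_max Hz firing-rate range."""
    s_sorted = np.sort(stimulus_samples)
    def response(s):
        cdf = np.searchsorted(s_sorted, s, side="right") / len(s_sorted)
        return r_max * cdf
    return response

# A Gaussian-distributed stimulus yields a sigmoidal response curve:
s = np.random.randn(100_000)
r = equalizing_response(s)
print(r(0.0))   # ~10 Hz: the median stimulus lands mid-range
print(r(2.0))   # ~19.5 Hz: rare, large stimuli approach saturation
```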
4:26
So this implies that for a good coding system, its input-output function, this
function here, should be determined by the distribution of natural inputs.
So here's a classic study in which this idea was tested directly.
In the early 1980s, Simon Laughlin went out into the fields with a camera and
measured the typical contrasts, that is deviations in the light level, divided by
the mean light level, that would be experienced in the natural world, for
example, by a fly. So, that's this distribution here.
If the response does indeed follow the distribution of natural inputs, then the
response curve here should look like the cumulative probability determined by
integrating p(c). And in fact, that's a very good match to
what he did actually observe in the response properties of the fly large
monopolar cells, the neurons that integrate signals from the fly's
photoreceptors. Now, a study like this poses a challenge.
While it makes sense that our sensory systems would, over evolution or
development, set up response codes that are adjusted to natural input
statistics, it seems that much more work is needed to handle the problems posed
by the huge variation that natural stimuli take on as one moves from indoors to
outdoors, or even moves one's eyes around a room: the contrast distribution
varies widely. Might sensory systems rather adjust themselves on much shorter
timescales to take these statistical variations into account? So let's take a
patch of the image and
look at the variations in contrast in that image. Here, for example, the
contrast distribution might be narrow like this, whereas over here it might be
much broader. What our code should do is take the widths of these distributions
into account in setting up a local input-output curve that accommodates the
currently measured statistics of the input. So that's the question that we tested
here, in the H1 neuron. In this experiment, we took a white-noise input of the
type that you used in the problem sets, some s(t) that looks like that, and we
multiplied it by a slowly varying envelope, call it σ(t); that's what you see
here. This is a 90-second-long chunk of stimulus. We repeated the same σ(t) in
every trial, but we changed the specific white-noise stimulus. That allowed us
to pick out spikes that occurred at different time points throughout the
presentation of σ(t), where in every trial the cell would have seen a different
specific stimulus, and to calculate the input-output function described by
those spikes in those different
windows of time. Now, when one analyzes spikes across these different windows
and pulls out their input-output function using the methods that we talked
about in week two, one finds, for example, that here in this window one gets a
very broad input-output curve, whereas when the stimulus is varying very
little, one finds a very sharp input-output curve. Now, it turns out that if
one normalizes the stimulus by its standard deviation, that is, by this
envelope σ(t), all of these curves collapse onto the same curve. What that says
is that the code has the freedom to stretch its input axis such that it
accommodates these variations in the overall scale of the stimulus, and it's
able to do that in real time as this envelope is varying.
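To get a feel for this curve collapse, here's a toy simulation in Python. The sigmoidal input-output function and all its numbers are invented for illustration, not fit to the H1 data.

```python
# Toy simulation of the curve collapse: a neuron whose gain stretches
# with the stimulus envelope sigma. Binning responses against the
# *normalized* stimulus s/sigma gives the same curve in both epochs.
import numpy as np

rng = np.random.default_rng(0)

def rate(s, sigma):
    # adaptive input-output function: depends only on s / sigma
    return 20.0 / (1.0 + np.exp(-s / sigma))

bins = np.linspace(-3, 3, 13)
for sigma in (0.5, 2.0):                     # two envelope amplitudes
    s = sigma * rng.standard_normal(10_000)  # white noise in this epoch
    idx = np.digitize(s / sigma, bins)
    curve = [rate(s, sigma)[idx == i].mean() for i in range(1, len(bins))]
    print(np.round(curve, 1))                # the two rows nearly coincide
```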
This has been seen in several other systems, including the retina and the
auditory system. But here's an example from rat barrel cortex, the
somatosensory cortex of the rat; in particular, the part that encodes the
vibrations of the whiskers. In extracellular in vivo recordings of responses to
whisker motion, whiskers were stimulated with a velocity signal, again some
s(t), that looked like this. This is a slightly simpler experiment:
The standard deviation was varied between two different values.
Now one can pull out spikes that are generated in these two epochs of the
presentation, the high-variance case and the low-variance case, and one can
compute input-output curves
for spikes that occurred under these two different conditions.
9:00
So in the low-variance case one sees this input-output curve, and in the
high-variance case one sees this input-output curve. And hopefully you won't be
surprised that if I now divide the stimulus by its standard deviation, we see a
common curve. So again, this input-output curve has the freedom to stretch
itself such that it's able to encode stimuli over their natural dynamic range.
So what I've shown you is that as one changes the characteristics of the
stimulus, in the cases we've talked about by changing its overall amplitude,
changes can occur in the input-output function. Here we've found that if a
stimulus, say, took on this dynamic range, it might be encoded with an
input-output curve like that. Now you should be able to see that if one
increased the range of the stimulus and stayed with that same input-output
curve, most of the time your stimuli would be giving responses that were either
zero or at the saturation point. Similarly, if you now decreased the range of
the stimulus, you'd be hovering in the central part of the curve. So ideally
one would like to use one's entire dynamic range, as defined by this
input-output curve, and so one would like to match it to the range of the
stimulus. And that's exactly what we saw in the experiments. Now, this adaptive
representation of information is not confined to changes in the input-output
function.
It's also been seen that the feature that's selected by a neural system can
adapt as the statistics of the inputs are changed.
And information theory has also been used to explain the way in which this occurs.
For example, it's been used to explain how the spatial filtering properties of
neurons in the retina and in the LGN change with light level. Joe Atick and his
colleagues posed the following question: if we consider that the retina imposes
a linear transfer function, or filter, on its inputs, what's the shape of the
filter that maximizes information transmission through the retina? The solution
turns out to depend on two things: the power spectrum of natural images and the
signal-to-noise ratio. At high light levels, or high signal-to-noise, one would
predict a filter shape like the one we've seen already, the Mexican-hat shape.
This acts like a differentiator, looking for edges in the stimulus; but at low
light levels, the predicted optimal filter is integrating, and simply averages
its inputs to reduce noise. And indeed, in retinal receptive fields it's seen
that the surround becomes weaker at low light levels and the center broader,
which qualitatively matches these predictions.
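Here's a rough numerical sketch of that trade-off. The "whitening times Wiener" form used below is a common simplification of the optimal-filter solution, not the exact result from the paper, and all the numbers are illustrative.

```python
# Rough sketch of the predicted filter: whitening of the 1/f^2 image
# spectrum combined with Wiener-style suppression of noisy frequencies.
import numpy as np

f = np.linspace(0.1, 10, 200)        # spatial frequency, arbitrary units
P = 1.0 / f**2                       # natural-image power spectrum

def predicted_filter(P, noise_power):
    wiener = P / (P + noise_power)   # suppress frequencies buried in noise
    whiten = 1.0 / np.sqrt(P)        # flatten the signal spectrum
    return whiten * wiener

high_snr = predicted_filter(P, noise_power=0.01)
low_snr = predicted_filter(P, noise_power=10.0)
print(f[np.argmax(high_snr)])   # peaks at high f: an edge-enhancing bandpass
print(f[np.argmax(low_snr)])    # peaks at low f: a smoothing integrator
```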
We can also use information theory to find out what it is about a stimulus that
drives a neuron to fire.
We looked at this method in week two; it's called the method of maximally
informative dimensions. One can choose a filter, that is, extract from the
stimulus some component, that maximizes the Kullback-Leibler divergence between
the spike-conditional and the prior distributions. This turns out to be
equivalent to maximizing the information that the spike provides about the
stimulus. One can use this method to search for the optimal feature that
explains the coding properties of a system when it's being presented with
stimuli of a particular distribution.
So, for example, if one initially starts with a Gaussian white noise
distribution, that's the vertical Gaussian in this representation, one might
find a particular feature. But now if one changes the distribution, to say
natural images, which will have some very different distribution, the filter
that maximizes the information between spike and stimulus may be different, and
that has been shown to be the case for cortical receptive fields, among other
systems.
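To make the method concrete, here's a toy two-dimensional sketch in Python. The "true" feature, the spiking nonlinearity, and the histogram-based KL estimator are all invented for illustration; real applications use gradient ascent in high dimensions.

```python
# Toy maximally-informative-dimensions search: scan candidate directions
# and keep the one maximizing the KL divergence between the
# spike-conditional and prior distributions of the projected stimulus.
import numpy as np

rng = np.random.default_rng(1)
stim = rng.standard_normal((50_000, 2))
true_feature = np.array([0.8, 0.6])
p_spike = 1.0 / (1.0 + np.exp(-3.0 * (stim @ true_feature - 1.0)))
spiked = rng.random(len(stim)) < p_spike

def kl_along(v):
    """KL( P(s.v | spike) || P(s.v) ), estimated with histograms."""
    proj = stim @ v
    bins = np.linspace(proj.min(), proj.max(), 30)
    p_cond, _ = np.histogram(proj[spiked], bins, density=True)
    p_prior, _ = np.histogram(proj, bins, density=True)
    ok = (p_cond > 0) & (p_prior > 0)
    dx = bins[1] - bins[0]
    return np.sum(p_cond[ok] * np.log(p_cond[ok] / p_prior[ok])) * dx

angles = np.linspace(0.0, np.pi, 180)
best = max(angles, key=lambda a: kl_along(np.array([np.cos(a), np.sin(a)])))
print(np.cos(best), np.sin(best))   # recovers roughly (0.8, 0.6)
```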
13:01
So, let's finish up by discussing briefly an influential idea that Rajesh
mentioned in the first lecture, one that might explain why cortical receptive
fields have the shape that they do. Many years ago, Horace Barlow proposed that
because spikes are expensive, neural systems should be trying to encode stimuli
as efficiently as possible. What does this mean for a population of neurons?
If you consider the joint distribution of the responses of many neurons, here
let's just take two, maximizing their entropy should imply that they code
independently; that is, their joint distribution should factor into the product
of the two marginal distributions. This is the strategy that would maximize
their entropy. Why is that? Because the entropy of a joint distribution is
always less than or equal to the sum of the entropies of the marginal
distributions. This idea is known as redundancy reduction: the neural system
should be optimized so that its neurons code as independently as possible.
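Here's a quick numerical check of that inequality, with a made-up joint distribution for two binary "neurons":

```python
# The joint entropy never exceeds the sum of the marginal entropies,
# with equality only when the responses are independent.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# a correlated joint distribution for two binary "neurons"
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
p_x = p_joint.sum(axis=1)               # marginal of neuron 1
p_y = p_joint.sum(axis=0)               # marginal of neuron 2

print(entropy(p_joint.ravel()))         # H(X,Y) ~ 1.72 bits
print(entropy(p_x) + entropy(p_y))      # H(X) + H(Y) = 2.00 bits
```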
However, in recent years it's been realized that correlations between neurons
can have some advantages.
For one, having many neurons that encode the same thing may allow for error
correction and more robust coding. It's also been realized that correlations
can actually help discrimination, and indeed, neurons in the retina have been
observed to be redundant; that is, their joint distribution is very different
from the product of independent distributions. More recently, Barlow proposed a new
idea, that neuron populations should be as sparse as possible.
That is, their coding properties should be organized so that as few
neurons as possible are firing at any time.
14:38
This idea was developed formally by Olshausen and Field, and also Bell and
Sejnowski. Here's the idea.
Let's say that one can write down a set of basis functions, phi_i, with which
to reconstruct a natural scene. Then any image can be expressed as a weighted
sum, with coefficients a_i, over these basis functions, with perhaps the
addition of some noise. Now these basis functions should be chosen so that, in
general, as few coefficients a_i as possible are needed to represent an image.
This is carried out by minimizing a function that includes the reconstruction
error, here the root mean squared difference between the reconstructed image
and the image itself, so that one gets a good match to the images, but that
also includes a cost term whose role is to count how many coefficients are
needed. One simple choice of this cost function is just the sum of the absolute
values of the coefficients.
The coefficient lambda weights the strength of that constraint. The job of this
term is to penalize solutions that require too many basis functions to
represent an image, that is, too many coefficients a_i that are different from
zero.
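Here's a compact sketch of solving for the coefficients under that cost function, using iterative soft thresholding, which is one standard approach though not necessarily the original algorithm. The basis Phi and the "image" are random stand-ins; learning Phi itself would alternate this step with updates to the basis functions.

```python
# Sketch: find sparse coefficients a minimizing
#   ||I - Phi a||^2 + lambda * sum_i |a_i|
# via ISTA (iterative soft thresholding).
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_coefficients(image_vec, Phi, lam=0.1, n_iter=200):
    a = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2     # safe gradient step size
    for _ in range(n_iter):
        residual = image_vec - Phi @ a           # reconstruction error
        a = soft_threshold(a + step * (Phi.T @ residual), step * lam)
    return a

rng = np.random.default_rng(2)
Phi = rng.standard_normal((64, 128))                  # overcomplete basis
truth = (rng.random(128) < 0.05).astype(float)        # few active elements
image_vec = Phi @ truth                               # "image" they build
a = sparse_coefficients(image_vec, Phi)
print(np.sum(np.abs(a) > 1e-3), "of", len(a), "coefficients active")
```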
A Fourier basis, for instance, represents images as a sum of sines and cosines.
While the Fourier basis is guaranteed to be able to represent any image, one
might already be able to guess that coding with such a basis is not sparse,
because, as you probably recall, the power spectrum is broad, which means that
many coefficients are needed. When one runs an algorithm to find the best basis
functions, the best values of phi_i, for natural images, one finds instead a
set of functions that look like this: localized, oriented features, like those
that we see in V1.
So this implies that when we view an image using neuronal receptive fields
that look like this, this excites on average a minimal number of neurons.
This is called a sparse code. So we've touched upon several different
ideas about coding principles. The idea of coding efficiency, that
neural codes should represent input stimuli as efficiently as possible.
We've seen that this implies adaptation to stimulus statistics.
As one changes the statistics of the stimulus, one should see aspects of the
coding model changing to ensure that it remains efficient.
We've also brought up the idea of sparseness.
That it would be ideal if the neural code needed as few neurons as possible to
represent its input. And this brings us to the end of our
discussion of coding. I've shown you some classic and state-of-the-art methods
for predicting how stimuli are encoded in spikes.
We've seen models for decoding stimuli from neural responses. We've discussed
information theory and
how it's used to evaluate coding schemes, and we've taken a very quick glance at
how coding strategies might be shaped by the statistics of natural inputs.
There's a lot that we've missed. In particular, let's just go through the
typical cycle of behavior of an organism. Where we have invested some time is
the idea that, starting from complex environments, animals extract some
features from the environment to solve problems, and that this is represented
in neural activity.
What the brain is then doing is extracting that information and
synthesizing it to drive decisions. We talked about some examples of using
maximum likelihood methods that might in fact have a neural implementation.
These decisions then generate motor activity: muscles work together to perform
actions that drive behavioral output. And these actions affect subsequent
sensation.
So, we didn't really address this part of the behavioral feedback loop. Next
week, we'll be moving on to a new topic. Rather than handling data analysis,
we'll be moving more into the realm of modeling. We'll start that with a brief
introduction to the biophysics of coding: how do single neurons generate action
potentials? We'll talk about neuronal excitability.
And we'll close up with some simplified models that capture neuronal firing
before moving on to the second part of the course where you'll be learning about
network modeling. So that's all for this week.
Looking forward to seeing you next week.