
Hello. This lesson introduces the kernel density estimation technique by using the Python scikit-learn module. This is an important module; it's one of the most, if not the most, important machine learning frameworks in existence, certainly within the Python world, if not beyond. So this will be your introduction to that particular module, and we're going to use it to build density estimates for one- and two-dimensional data sets. The last thing we're going to do is take this ability to build a density estimate function from data using the scikit-learn module and apply it to generate new data from a density estimate. Now, this is a really cool thing and a really powerful technique, and hopefully you'll get an idea of what I'm saying by the end of this lesson.

The notebook for this lesson is the Advanced Density Estimate notebook, and what we're going to do here is build on the introductory density estimate lesson by doing more complex techniques. So first we're going to set up our notebook just as we've been doing: we say all plots inline, we do our standard imports, ignore the warnings that might appear, set our seaborn style, and load the iris data set.

Now, we saw in seaborn how we could construct a kernel density estimate for our data, but that just made it easier to visualize what was going on. With scikit-learn, we actually build a functional representation of the density estimate, which means we can use that function to do new things. So in this first code cell, we actually do that. We create a kernel density estimate of our data set by using a bandwidth that we calculated and the data that we read in, which we turn into a NumPy matrix. We then get a function that represents that kernel density estimate. We can then fit it to our data and extract the results. So what does this do? Well, it does the same thing we did before with seaborn in one line, but we have many more lines in this particular code. So here's our histogram, and here's our kernel density function estimate.
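The steps in that first code cell can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: it uses scikit-learn's bundled copy of the iris data rather than the seaborn loader, and the bandwidth of 0.25 is an illustrative value, not the one calculated in the lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

# Sepal length is the first column of the iris feature matrix;
# reshape it into an (n_samples, 1) NumPy matrix for scikit-learn.
iris = load_iris()
data = iris.data[:, 0].reshape(-1, 1)

# Build the functional representation of the density estimate and fit it.
kde = KernelDensity(kernel='gaussian', bandwidth=0.25)  # illustrative bandwidth
kde.fit(data)

# Extract the results: score_samples returns the *log* density at each
# point, so exponentiate to get the curve you would plot over the histogram.
grid = np.linspace(data.min() - 1, data.max() + 1, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))
```

The `density` array is what you would plot against `grid` to overlay the smooth curve on the histogram.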

Now, if that's all we did, that wouldn't be that exciting. But what is exciting is this: we can now sample from that model. What we do here is take the model, this curve you see up here, and say, give us 15 new samples from the model. If we were to overplot these on this plot, you would see that they follow the model. In other words, most of the data are going to be in the central range here, say between 5 and 6.5. And if we look at the data, you can see that.
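Sampling from the fitted model is one method call. The sketch below assumes the same illustrative setup as before (bundled iris data, bandwidth 0.25); the `random_state` argument is added here only so the draw is reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

# Fit the density estimate on sepal length, as before.
data = load_iris().data[:, 0].reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=0.25).fit(data)

# Draw 15 brand-new sepal-length values from the model, not from the data.
new_samples = kde.sample(n_samples=15, random_state=0)
```

Each row of `new_samples` is a new value drawn from the estimated density, so most of them land in the high-density central range.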

Now, one choice when you're doing kernel density estimation is what type of kernel to use; we often use the Gaussian, or normal, kernel. The second choice is the bandwidth. Generally, we've just been using what seaborn defines as a standard, but there are many different approaches, and there's a lot of literature on how to choose the bandwidth properly. So here what we do is step through a number of different bandwidths, in particular four of them, plot the different kernel density estimates that come out, and show the results.

So first, if we have a really small bandwidth, 0.1, you can see how it captures the fluctuation in the histogram. But as soon as we start getting bigger, we start smoothing over those fluctuations. The green one looks more like what seaborn did. And then if we increase it, we get this red one, which you can see actually looks like a standard functional form; it's almost a Gaussian, and the purple one here is even closer to that. This shows you that by changing the bandwidth you can smooth over fluctuations. So if, for instance, you think the fluctuations are due solely to the fact that you don't have a large sample, you might want a larger bandwidth.
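The bandwidth sweep can be sketched like this. The four bandwidth values below are illustrative stand-ins for the notebook's choices; the pattern of fitting one estimator per bandwidth and evaluating each on a common grid is the point.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

data = load_iris().data[:, 0].reshape(-1, 1)
grid = np.linspace(3, 9, 200).reshape(-1, 1)

# Fit one density estimate per bandwidth and evaluate it on the grid.
curves = {}
for bw in [0.1, 0.25, 0.5, 1.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(data)
    curves[bw] = np.exp(kde.score_samples(grid))
# Small bandwidths track every fluctuation in the histogram; large ones
# smooth everything toward a single Gaussian-like bump.
```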

Now, so far we've looked at univariate, or one-dimensional, kernel density estimation. We can also do this in multiple dimensions. The easiest way to understand what this means is to look at a scatterplot, which we're going to do with the seaborn joint plot method. The joint plot makes a scatterplot of two dimensions, in this case sepal width versus sepal length, and it also adds in these univariate histograms, so you can see the distribution of the data along each axis.

Now, there are lots of points here. If you imagine you had tens of thousands or even more points, it would become hard to distinguish where the points are. Even if we added jitter, there would still be too many points to visualize. That's where a two-dimensional density estimate comes into play. In this case, what we would do is generate a smoothed version in two dimensions, which is a contour plot. That's what you see here: where there's a high region in the contour, that's where there are lots of points. So there's a cluster of points here and a cluster of points here, and you can see how this falls off.

This should be familiar to anyone who has used a map and seen topography on it. If you're hiking and want to know the easiest way to get from one valley to another, you would want to go through the saddle point, because it's in between the two peaks. So that shows you, hopefully, how a two-dimensional kernel density estimate can be useful, at least in visual terms.

You could also, of course, construct a two-dimensional kernel density estimate with scikit-learn and end up with a functional representation of this visualization, and then you could sample from it and use that in subsequent calculations.
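The two-dimensional version uses exactly the same scikit-learn calls as the one-dimensional case; only the shape of the input changes. This is a sketch with an illustrative bandwidth of 0.3, not a value from the lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

# Fit the KDE on both sepal columns at once: each row is a 2-D point.
data = load_iris().data[:, :2]          # sepal length and sepal width
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(data)

# Each sampled row is a new (length, width) pair drawn from the estimate.
new_points = kde.sample(n_samples=100, random_state=0)
```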

Now, in order to help you understand the true power of this density estimate, we're going to switch to a different data set. We're going to use a sample of handwritten digit data to create new, fake digit data. This was originally done in the scikit-learn documentation, and we've changed it slightly for this particular course and this particular notebook structure. So first, we're going to use some helper code that I wrote and included along with the notebook to get the data. It's going to return the data in terms of features, or columns, and labels, along with the actual original image data that we can use for plotting. And that's what we do here: we simply read the data in and we plot it. You can see each column is a different type of image data: zeros, ones, twos, et cetera.

There are 1,797 instances of these data, and what we do then is construct a kernel density estimate. We just choose a bandwidth of 1.5 here; one thing you should definitely try is changing that bandwidth and seeing what happens. Then we create a kernel density estimate on these images, that is, on the features that go along with all of those 1,700-plus images. Using that kernel density estimate, we then sample. In this case we're only grabbing 60 samples, though you could grab more if you wanted to, and then we simply plot the data.

So first we plot one row of real data, and below that are fake digits. Now, these are not organized in columns; these are just random digits that come out. Also, note that these are eight-by-eight-pixel images, and we've actually blown them up for this particular visualization. If you come up here, you'll notice these are smaller. But even with just this data you can still infer: here's a three that's been simulated, this is perhaps a seven, this is perhaps a two, this is a one. And as you change the bandwidth, you'll see how the clarity of these generated images changes.
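The digit-generation steps above can be sketched as follows. This uses scikit-learn's bundled digits data in place of the course's helper code (an assumption, since that helper isn't shown here), but it keeps the lesson's bandwidth of 1.5 and sample count of 60.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity

# 1,797 handwritten digit images, each 8x8 pixels flattened to 64 features.
digits = load_digits()
features = digits.data                  # shape (1797, 64)

# Build the density estimate over the 64-dimensional feature space.
kde = KernelDensity(kernel='gaussian', bandwidth=1.5).fit(features)

# Sample 60 fake digits, then reshape each 64-vector back into an
# 8x8 image for plotting.
fake = kde.sample(n_samples=60, random_state=0)
fake_images = fake.reshape(60, 8, 8)
```

Plotting `fake_images` with matplotlib's `imshow` is what produces the rows of generated digits described above.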

So let me take a step back and say this again: these images that you're seeing were generated from a model that we built on handwritten digit data that had been scanned into a computer. We took the original data and built a model representation. If you thought about this in terms of a matrix, these images are eight by eight, so they have 64 pixels. If we were to do this in a spreadsheet, we would have 64 columns: each column would represent one pixel in these images, and each row would represent a single image. We've taken that data, turned it into a model representation, and then we can sample from that model and say, give us a bunch of images from that model. Some of them will be ones, some of them will be twos, et cetera. And as you can see, sometimes the model makes pretty realistic-looking images.

I hope you've gotten a sense of the excitement and importance of this. We're actually starting to move beyond simple data analytics into more complex model building, which really allows deeper insights into data. If you have any questions on this, let us know on the course forums. And good luck.