Hello. I'm standing outside the Siebel Center for Computer Science which opened in 2004.

This center has nearly 225,000 square feet of research, office, and

educational space, and features a fully interactive environment and an intelligent building system.

Inside this building, faculty,

staff, and students conduct state-of-the-art research in

many areas of computer science, including machine learning.

Now, so far in this course,

you have learned to generate machine learning models to classify or regress on data.

You've also learned about dimensionality reduction as part of feature engineering.

In this module, we switch to the second type of

unsupervised learning task, cluster finding.

Cluster finding is an important task since it can

reveal insights into data that would otherwise be hard to find.

For example, if you are running a business and have collected data on your customers,

you can employ cluster finding to group sets of similar customers together.

This can enable you to generate special programs

targeted at the different customer clusters to increase revenue.

You also might use a similar approach to prevent or minimize customer churn,

which is when customers leave your business for a competitor.

By using clusters, you can identify those customers who might be

thinking of leaving and offer them rewards or special offers to stay.

This module starts with several readings on how clustering can be used in business,

including for customer segmentation,

as well as a discussion of how clustering can be performed correctly.

After this, we will explore three cluster-finding algorithms:

K-means, DBSCAN, and mixture models.

The first algorithm, K-means,

is similar to the k-nearest neighbor algorithm discussed earlier in this course.

The K-means algorithm attempts to find the best K clusters in a dataset.

This process is iterative: cluster centers are chosen at random,

points are assigned to the nearest center,

new cluster centroids are computed,

and the process repeats until convergence is reached.

This process is simple to follow and generally results in reasonable clusters,

especially if K is chosen well,

such as when a priori knowledge is applied.
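To make this concrete, here is a minimal sketch of the process just described, using scikit-learn's KMeans implementation. The synthetic two-blob data and the choice of K = 2 are illustrative assumptions, not part of the course material:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# K chosen with a priori knowledge: we know the data came from two groups.
# KMeans repeats the assign-points / recompute-centroids loop until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

labels = kmeans.labels_             # cluster assignment for each point
centers = kmeans.cluster_centers_   # final centroids after convergence
```

Here each row of `centers` should land near one of the two blob centers, since the blobs are far apart relative to their spread.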

One of the readings included in the K-means lesson provides

an interactive graphical demonstration of how this algorithm works.

Be sure to explore this site as it can be very helpful.

The second algorithm is DBSCAN,

which is an acronym that stands for

Density-Based Spatial Clustering of Applications with Noise.

This algorithm is different in that the number of clusters is determined from the data.

This algorithm is density-based:

when enough points lie sufficiently close to one another, forming a dense region,

they are grouped together into a cluster.

This algorithm can find arbitrary shaped clusters and is robust to outliers.

In fact, it can identify points as

outliers when they are not part of any cluster.

Thus, DBSCAN is a powerful algorithm

that nicely complements the K-means cluster finding algorithm.
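A short sketch of this behavior, using scikit-learn's DBSCAN. The `eps` and `min_samples` values and the synthetic data are illustrative assumptions; DBSCAN infers the number of clusters itself and flags isolated points as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense groups plus one far-away point that should be flagged as noise
data = np.vstack([
    rng.normal([0, 0], 0.2, size=(40, 2)),
    rng.normal([4, 4], 0.2, size=(40, 2)),
    [[10.0, 10.0]],  # isolated point, far from both dense regions
])

# eps = neighborhood radius, min_samples = points needed to form a dense region
db = DBSCAN(eps=0.5, min_samples=5).fit(data)
labels = db.labels_  # the label -1 marks points DBSCAN treats as noise

# Number of clusters is discovered from the data, not specified up front
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Notice that, unlike K-means, no cluster count is passed in: the two groups emerge from the density structure, and the isolated point is labeled -1.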

Once again, the readings for this lesson include a link to

an interactive graphical demonstration of how this algorithm works.

By using this tool,

you will gain a better understanding of how

this algorithm differs from the K-means algorithm.

The final algorithm we will explore in this module is the mixture model,

which is a probabilistic model for representing subgroups or clusters in a dataset.

Essentially, this model works by assuming that the data

are generated by a mixture of N distributions,

for example, N different Gaussian distributions,

whose parameters are unknown.

The algorithm estimates these unknown parameters from the data,

which results in a parametric representation of

the groups of points or clusters in the original dataset.

One nice benefit of this approach is that

a parametric model is generated for the clusters,

allowing new data to be sampled as needed.
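As a sketch of this idea, here is scikit-learn's GaussianMixture fit to data drawn from two Gaussians. The component count, seed, and synthetic data are illustrative assumptions; the key point is that the fitted model recovers the distribution parameters and can generate new samples:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Data drawn from a mixture of two 1-D Gaussians centered at -3 and +3
data = np.vstack([
    rng.normal(-3, 0.5, size=(100, 1)),
    rng.normal(3, 0.5, size=(100, 1)),
])

# Fit a two-component Gaussian mixture; the means, covariances, and
# mixing weights are the "unknown parameters" estimated from the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
means = gmm.means_.ravel()

# Because the result is a generative, parametric model,
# new points can be sampled from it as needed.
samples, _ = gmm.sample(10)
```

The estimated means should land close to the true centers of -3 and +3, and `sample` draws fresh points from the fitted mixture.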

Cluster finding is an important skill that, when done right, can lead to amazing insights.

Finding clusters enables large datasets to be reduced to

much smaller ones since we often treat all members of a cluster in the same fashion.

We also can gain new insights into our data by finding and studying clusters.

Once you have completed this module,

you'll have learned about the four major types of machine learning algorithms.

That will be a real accomplishment. Good luck.