If you think about it, we have come a long way. We loaded a data set. We explored it and extracted recency, frequency, and monetary value using SQL statements. We prepared and transformed the data to be ready for segmentation. So I'm just going to rerun everything here, since I started fresh, and now we are right where we left off at the end of the prior tutorial. The next thing we are going to do is compute the distances among customers, knowing that the closer two customers are, the sooner they will be clustered together into the same segment. But here is an issue. Our customer data set contains 18,000 lines, 18,000 customers. So if we want to compute distances among these customers, we'll basically ask R to compute the distances between 18,000 customers and themselves, which is roughly 340,000,000 distances. It could work on your machine, maybe not. It works on mine. But on many machines, that would be just too much to handle in terms of memory requirements. So if you just compute the distances, using the dist function here, it might generate out-of-memory problems. We are not going to do that; instead, we're going to take a sample of the data set. To create a sample that is not random, and that is always the same whenever we'd like to rerun the code, we're going to take one customer out of every ten in the data set. So we are going to create a sequence from one to the total number of customers, taking only one in every ten. That sampling mechanism will be stored in a variable called sample, which looks like this: 1, 11, 21, 31. So we are going to take only the first, the eleventh, the twenty-first customer, and so on, in our data set. And we're going to use the _sample suffix to make clear that we only analyze a sub-sample. You take the customers, and you keep only the rows that match the sampling mechanism.
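The sampling step described above can be sketched in R as follows; the data frame names customers and new_data are assumptions carried over from the earlier data-preparation steps:

```r
# Deterministic sampling: keep one customer out of every ten.
# 'customers' (original RFM data) and 'new_data' (standardized data)
# are assumed to exist from the prior tutorial.
sample <- seq(1, nrow(customers), by = 10)   # 1, 11, 21, 31, ...
customers_sample <- customers[sample, ]      # sampled rows, all columns
new_data_sample  <- new_data[sample, ]
```

Because seq() is deterministic, rerunning the script always selects exactly the same customers, unlike a random sample.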
So we take only rows 1, 11, 21, 31, and by leaving the second part of these brackets empty, we basically mean that we take every column, every variable available. We do that both for customers_sample, which, remember, contains all the original data, with recency expressed in days, frequency expressed in number of purchases, and monetary value expressed in dollars, and for the new data, which contains pretty much the same thing but scaled, with the log of the amount. Now we are ready. We're going to compute the distances among, not 18,000 customers, but the 1,800 customers which have been sampled from the larger data set, and store the distances among all these customers, computed on the standardized data, in a variable called d. We're not going to look at that variable specifically; we're just going to use it as a parameter later in the segmentation process. Okay, so we have a distance matrix here, which already contains 1.7 million elements, which is quite significant. And we need to use that distance in the hclust function. hclust stands for hierarchical clustering; that's the heart of the clustering method we are going to use. There are many methods available; the one that I suggest is ward.D2. You can look at the help pages for what it means exactly, but that's a pretty robust way of using distances to group clusters together. The output of that clustering algorithm will be stored in a variable called c. And it's done already. Now if you plot c, you'll see a very specific plot that you probably know: a dendrogram. Let me zoom in on it. So here you have the 1,800 customers. With that many, obviously, it's not readable. You can make it prettier if you'd like, and then you can see how all these individuals have been clustered together progressively, step by step, up to a stage where there is only one big cluster here.
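As a rough sketch, the distance and clustering steps could look like this, assuming new_data_sample is the standardized sub-sample built earlier:

```r
# Pairwise (Euclidean) distances on the standardized sample,
# then hierarchical clustering with Ward's criterion.
d <- dist(new_data_sample)
c <- hclust(d, method = "ward.D2")
plot(c)   # draws the dendrogram
```

The "ward.D2" method squares the distances before merging, which is the implementation of Ward's criterion recommended in the hclust help page.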
So if you stop at four, you'll have that cluster over here, that smaller cluster over there, that pretty big cluster over here, and then a fourth cluster here to the extreme right. If you stop at nine, then basically you cut the clustering tree, the dendrogram, much lower, where you have nine clusters. Where should you stop? Well, here it becomes tricky, because there are multiple criteria to use, in terms of statistical fit, in terms of managerial relevance, and in terms of targeting ability. Just for this example, we're going to cut at nine segments. So you take the output of the hclust function, that c here, and you cut the tree at nine. The members variable will then contain, for each individual's ID, the segment it belongs to. If we show the first 30 elements of that data, you probably remember that earlier we took the row names, replaced them with the customer_id, and removed that column. The reason we did that is that the customer_id now appears here, on top. And so you know that individual number 10 belongs to cluster 1, individual number 510 belongs to cluster 1 as well, and so on. If you run table, it will count how many individuals, how many customers, belong to each cluster. That's obviously a very important thing to know. So here, cluster number 5 contains only 49 individuals, while cluster number 7 contains 236. Of course, that's not very useful if you don't know what these clusters are all about, so what you'd like to do next is to compute the average profile of each segment. And here is the trick: we don't care about the standardized variables any more. What we care about are the averages of the original values: recency in terms of days, monetary value in terms of dollars, frequency in terms of number of purchases. So we're going to compute the aggregate of the actual original data, customers_sample.
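Cutting the tree and counting segment sizes might look like this, a minimal sketch assuming c is the hclust output from the previous step:

```r
# Cut the dendrogram into 9 segments.
members <- cutree(c, k = 9)   # segment membership for each customer
members[1:30]                 # first 30 customers; customer IDs appear as names
table(members)                # how many customers fall in each segment
```

Because the row names of the data were set to customer_id earlier, cutree() carries those IDs along as the names of the members vector.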
We take the first few variables, which are relevant to us, meaning recency, frequency, and monetary value, and we group them by cluster membership, cluster membership which comes from where we decided to cut the tree, okay? And because we are running some kind of segment profiling, the function we'd like to use is mean. So that line means: take the variables here, recency, frequency, monetary value, group them by cluster membership, and for each group, compute the mean. If you do that, what you see, for instance, is that cluster number 4, which contains 306 individuals, has an average recency of 162 days, an average frequency of 2.4 purchases made in the past, and an average purchase amount of $41. And you can see big differences: cluster number 6 spends much more in the shop, cluster number 2 spends much less. Cluster number 3 made a huge number of purchases in the past, whereas cluster number 8 made only one, and a pretty long time ago, and so on and so forth. So if you study that more carefully, you can see that the segmentation mechanism grouped people into clusters of homogeneous customers, each with a different profile, and each of which can be characterized in terms of managerial interest. And that's basically the idea. The idea is that we took 1,800 customers and found which were alike and should be grouped into clusters. And these nine clusters summarize pretty well the diversity of profiles you have in your database. That's the core of segmentation. But as we'll see in the next module, many companies do not use exactly that kind of segmentation. They use what is called ad hoc segmentation, or managerial segmentation. That's the next topic.
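The profiling step above can be sketched like this; note that the column selection customers_sample[, 2:4] is an assumption about where recency, frequency, and amount sit in the data frame, so adjust it to your own column layout:

```r
# Average (unstandardized) profile of each segment.
aggregate(customers_sample[, 2:4],       # recency, frequency, amount (assumed columns)
          by = list(cluster = members),  # group by segment membership from cutree()
          FUN = mean)
```

Profiling on the original variables, rather than the standardized ones, is what makes the output directly interpretable: days, purchase counts, and dollars.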