[MUSIC] We have seen the factorization of data. Now it is time to see clusterization, or segmentation, basically the clustering of data. So let's have a look at that.
Take a look at an ideal clustering situation, just to give an example of what really happens and how you can compare it to what happens in the real world. Take a look at this: the location of 24 people on a usage versus willingness-to-pay (WTP) map in the cooking oil category. On the x axis you have usage: the farther along the x axis you are, the higher the usage of cooking oil in your household. And the higher up on the y axis you are, the greater your willingness to pay. You're willing to pay a premium for quality or brand and so on. Question: how many clusters do you see there? How many distinct, coherent groups of customers do you see? This one is very easy to answer. There are four distinct groups.
Actually, just this can give you a lot of insight into who is who. If you look at the top right of the graph, that particular segment is probably a restaurant. The usage is high, and the willingness to pay is high, and I don't think a household can consume that much. And so on: you can profile the customer groups that emerged based on the dimensions upon which you have separated them. So that is basically the idea there. Clustering depends on the distances between points, whereas back in factorization we were grouping variables based on the correlation between them. So the two are not the same. It's not as if I can transpose the matrix and get the same thing. I have to do the analysis again.
Now, that was the ideal clustering situation. That was an easy example. There were four clear clusters emerging. In the real world, things are seldom that clear-cut. In the real world, this is what you might see, and now the question would be, how many clusters are there? And don't say one, because it's probably more than one there.
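To make the distance-based idea concrete, here is a minimal sketch of k-means on a made-up version of the usage versus WTP map. Everything in it is an assumption for illustration: the 24 coordinates are synthetic, not the lecture's data; the function names are hypothetical; and the farthest-point starting rule is just one simple, deterministic way to pick initial centroids. It is not the method used by the course's desktop app.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Within-cluster sum of squares: how tight the clusters are.
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss

# Synthetic (usage, WTP) map: 4 groups of 6 customers each (assumed values).
rng = random.Random(42)
group_centres = [(2.0, 2.0), (2.0, 8.0), (8.0, 2.0), (8.0, 8.0)]
points, true_groups = [], []
for g, (cx, cy) in enumerate(group_centres):
    for _ in range(6):
        points.append((cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)))
        true_groups.append(g)

labels, centroids, wss = kmeans(points, k=4)

# One rough way to approach "how many clusters?": watch how the
# within-cluster sum of squares changes as k grows.
for k in range(1, 7):
    print(k, round(kmeans(points, k)[2], 2))
```

On clean data like this, the within-cluster sum of squares typically drops sharply up to the true number of groups and flattens afterwards. In a real-world cloud of points the drop is usually much less obvious, which is exactly the difficulty raised above.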
How do we know, how would we know, where to start separating or drawing boundaries in this mass of points that we see there? Some questions of interest: how many clusters are there? What is their size? What are their other characteristics? Could it be that we are missing some variables? Why are we clustering them only on two dimensions? Suppose I added a third dimension, which is perpendicular to the other two. You might see that the top half would then go way back and the bottom half would come out. In some sense, we could see a separation, maybe, along the third dimension. The reason we are not seeing a separation now may be that we are missing some important dimension, and so on. Suppose that instead of two or three there were ten basis variables, that is, variables on the basis of which we are going to cluster. How would we even visualize a ten-dimensional object? So keep these questions in mind, because we are going to return to them.
Basic questions of segmentation. Segmentation is basically the marketing terminology, the customer analytics terminology, for clustering. That's basically what it is. There are four basic questions. One, what is segmentation at a conceptual level? Two, why would you segment? Three, how do we segment? And four, what do we do after segmentation?
What is segmentation, conceptually? It is nothing but grouping together certain units of analysis, in this case customers, who share certain characteristics, which are your basis variables. Why would you segment? It is more effective and efficient to pitch a value proposition to a relatively homogeneous segment than to a heterogeneous mass.
Well, I can treat each customer as a segment in their own right, or I can treat the entire market as one segment only. Probably a better middle way would be to divide it into a handful of segments and then tailor marketing campaigns individually to those segments. How do we segment? Several bases can be considered. You can segment on a demographic basis, on a psychographic basis, and so on. The ideal basis is customer need. The problem with customer need is that it is latent. We can't really see it up front, so it's going to be hard to segment on that basis. What after segmentation? Once you are done with segmentation, a lot of insight can be brought out and a lot of things can be done. You can activate interventions based on segment membership, and so on. We will see some of that. The 4Ps in this case are the marketing 4Ps: product, promotion, price, and place.
Segmenting respondents based on their personality. Now, do you remember the factorizing example where we took the 208 by 45 data and separated it into 5 factors? Can I use those personality factors as a segmenting basis? Yes, it's psychographic segmentation, and it can be done. So what did we obtain there from that big file, a psychographic survey? We obtained a variable transformation and respondents' scores. It is possible to combine these two into a single two-dimensional map that shows both respondents and variables in the same space. I want to show you what that will look like. What we are going to look for is the emergence of natural groupings of respondents with similar factor scores. This is what it looks like, and this will come through when you run the desktop app for clustering. What do you see? Well, basically you see the dots: you have the red dots and the blue ones, and they are separated in space. And then you have those ovals there, which in some sense are the clusters.
So we have not just the factors and the variables but also people, responding and separating in that space.
Segmentation applications. What are the general applications of segmentation? Where can it be used and applied? Well, basically there are people applications: customers, prospects, and households can all be segmented in some sense. Things: products, product attributes, brands, firms. Places: countries, stores, cities, regions, PIN codes. All of that can be segmented. Ideas and entities can actually be clustered, and in some sense even intangible things. So we'll come to a bit of that.
With people, you could say the elements of the social graph can be studied and exploited. It gets a little complicated there because it's no longer an independent sample, and I'm not going to be covering it in this course. It would require network analytics, in some sense, once we start to segment people that way. Things: products, product attributes, brands. This is the very basis for recommendation systems, cross-selling, up-selling, all of that. How does it start? Well, the starting point is the segmentation of things. Places: think about it, this is at the heart of investment and location decisions. That is basically where it all comes from, and places are segmented quite often, actually. Ideas and entities: text, image, audio, IP, strategy, and so on can be clustered. In principle they can be clustered. What do the clusters mean? It depends on how you interpret them.
Where we are going next, in some sense, is the segmentation of text and consumer opinion. That's basically where I'm heading. Here's a quick example: segmentation of text data, an ice cream example. This comes from a 2008 survey of flavor preferences in ice creams. 4,900 respondents in upstate New York, at a midsize regional retail chain. Wows is basically the brand we talked about, the store brand. The question that was asked was this:
If Wows offered a line of light ice cream, what flavors would you want to see? Please be as specific as possible.
A dataset of 5,900 rows. Some of the rows are empty, because some people didn't answer. But even if only 20% are non-empty, think of the manual analysis effort required, right, sifting through something this big. Now, we're going to use some standard text-analytic procedures, for which I will provide the code and the desktop app that you can use to replicate this example. So let's see how this works.
You can see the first few lines of the dataset that I have. There are some empty lines, and there are lines where people have written stuff. There are typos in there. Somebody has written vanilla with a single l. Somebody has forgotten the second o in chocolate, and so on. You can see all of that happening. This is the corpus-level wordcloud that will come through, and it will show up in the first tab of your app. You can see that the size of the font there is proportional to the frequency of the term. Vanilla and chocolate are clearly favorites. Strawberry is up there, and so on.
Segment-level wordclouds. Now, what do I do? I basically segment people based on what their flavor preferences are, and I got six segments. This is Segment 1, 28% of the sample. And this is Segment 2, 31% of the sample. My question to you: can you interpret these segments? It's not hard to do, it's quite clear. Prima facie, it seems like Segment 1 people are the vanilla people. Vanilla is important to them. Along with vanilla, you could combine or bring together any of the other flavors and they are okay with that. But vanilla is important to them. Segment 2 people are basically the coffee people. The coffee flavor is big for them. Along with that they are willing to try other things, but coffee is important. Take a look at Segment 3, size 10%. These are the cookies and cream people.
The words have appeared as separate tokens, but it's actually one flavor. And Segment 4 people are chocolate chip and mint. Actually, this is a distinct flavor, sold as such in the US. These are not three different words, they're actually one phrase. So simply by clustering text output I'm able to get some insight into who these people are and what their flavor preferences are. Segment 5, Segment 6. Segment 6 clearly is peanut butter. Segment 5 is hard to interpret, because one cluster is going to act, in some sense, like a vacuum cleaner: it collects together everything else that didn't make sense and makes a cluster out of it. That's basically what I got there. It will happen, and it's nothing to worry about. You will have one odd cluster that is basically junk.
Which brings me to a quick summary of descriptive text analysis. Just think back about what we accomplished with this desktop app, with just elementary text analysis. One, we were able to rapidly, scalably, and cheaply crunch through raw text input, 5,900 rows of it, within seconds. We were able to reduce this open-ended mass of unstructured data into a finite-dimensional object. I haven't gone through this in detail. I'll put out the notes for you that you can read and see. We were able to apply standard analysis techniques like k-means and so on to this object. We were able to sense what might be the major preference groups. All of that. Also, something else: we are leveraging the advantages of the survey method. Large sample sizes, segment sizing possibilities, market sizing possibilities. All of them coming through one procedure. [MUSIC]
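The pipeline summarized above (raw text, to a finite-dimensional object, to k-means, to interpretable segments) can be sketched in a few lines. This is only an illustration under stated assumptions, not the code or the desktop app from the course: the responses are invented toy answers, not the survey data; the helper names, the one-word stopword list, and the farthest-point starting rule for k-means are all hypothetical choices made to keep the example small and deterministic.

```python
import re
from collections import Counter

# Toy open-ended answers (invented for illustration, not the survey data).
responses = [
    "vanilla", "Vanilla bean", "vanilla, strawberry", "",
    "coffee", "coffee and mocha", "Coffee, caramel",
    "peanut butter", "peanut butter cup", "peanut butter swirl",
]
STOPWORDS = {"and"}  # assumed minimal stopword list

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

# Drop empty answers, then count terms over the whole corpus
# (these counts are what a wordcloud's font sizes would reflect).
docs = [toks for toks in map(tokenize, responses) if toks]
term_counts = Counter(t for toks in docs for t in toks)

# Document-term matrix: one row per answer, one column per vocabulary term.
vocab = sorted(term_counts)
dtm = [tuple(toks.count(term) for term in vocab) for toks in docs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iter=100):
    """Plain k-means with a deterministic farthest-point start."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

labels = kmeans(dtm, k=3)

# Segment-level term counts: the raw material for segment-level wordclouds.
segment_terms = {j: Counter() for j in set(labels)}
for toks, lab in zip(docs, labels):
    segment_terms[lab].update(toks)

for j, counts in sorted(segment_terms.items()):
    share = labels.count(j) / len(labels)
    print(f"segment {j}: {share:.0%} of answers, top terms {counts.most_common(2)}")
```

Note that "peanut" and "butter" end up as separate tokens here, just as "cookies" and "cream" did in the lecture's example, so reading the segments still takes some human interpretation. Real answers would also need typo handling, which this sketch leaves out.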