[SOUND] In this video we will talk about the Spark machine learning library. Now, we've talked about other machine learning platforms, and we've talked about some specific algorithms in the previous videos. Throughout the course, we've also introduced different aspects of the Spark system. Spark is a very nice system: it lets you write a quick, simple program to achieve a lot of functionality, it has a lot of expressive power, and it's scalable. Now, the Spark machine learning library, MLlib for short, is specifically designed for ease of use and scalability when you want to do machine learning on big data sets. It provides different sets of libraries for specific things. And I would like to refer you to the Spark MLlib documentation. This one short video is not going to cover everything; it's really just introducing the whole idea. We could even have a whole new Coursera course on this, so I'm just going to quickly mention some interesting ideas in the Spark Machine Learning Library collection and let you figure the rest out.

It gives you different classes of machine learning algorithms already implemented and lets you play with them. A couple stood out for me, and I'm putting them here. So one is Classification and Regression. Classification and regression are hugely important machine learning algorithms in industry, and of course academia, but in industry they are used everywhere. Things like logistic regression, linear regression, support vector machines. These are very important algorithms, and you'll see that they're implemented in a nice way, so we can easily use them in Spark. Other famous algorithms in this class of machine learning algorithms are, for example, decision trees. Decision trees are being used everywhere, across a vast number of industries. You can find them in nitty-gritty computer architecture details, where there are algorithms that use decision trees inside processors, all the way up to IT industries, other industries, the oil and gas industry. Different sectors of industry use decision trees. Ensembles of trees are used there too: basically, you take the idea of the decision tree and build on top of it to build random forests of trees. Easily implemented, and I'll have an example in the following slides to show you how easy it is to use MLlib.

The other category that stood out for me, and these are not the only two categories, there are more algorithms implemented in Spark machine learning, the second category is clustering. And clustering is, again, used very widely across computer science; everywhere you look around you can find instances of clustering. The most famous, and maybe the simplest, though it can be quite complex if you want to really analyze it, is the K-means algorithm. In a previous video, Professor Campbell showed you how the K-means algorithm really works, and I'll just mention it for a few seconds for those of you who may have skipped over that video. But I'll show you in upcoming slides how easy it is to use the K-means algorithm without writing the core algorithm: you just say, hey, I want to use K-means, and Spark says, boom, go, we have this implemented for you. Other famous algorithms, like Latent Dirichlet allocation and Gaussian mixture models, are all there already, waiting for you to just go and say, hey, I want to use Latent Dirichlet allocation on this data set, and boom, okay fine, this is the result. The last thing is Dimensionality Reduction, and that's also very important when you are dealing with data sets that have multiple aspects.
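To give a feel for how little code one of those classifiers takes, here is a minimal sketch of training a decision tree with the RDD-based MLlib API in Scala. This is not the slide from the video: the input path is just a placeholder, and sc is assumed to be an existing SparkContext.

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled data in LibSVM format (placeholder path).
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Tree parameters: 2 classes, all features continuous, Gini impurity.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    // One call trains the whole decision tree.
    val model = DecisionTree.trainClassifier(
      data, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    // Use the trained tree to predict the class of each point.
    val predictions = data.map(point => model.predict(point.features))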
So for example, sometimes you have data sets that have simple aspects: you have a number of people enrolled in a course, and the aspect that you want to look at is, for example, their age. Okay, that's one number. But what if you want to look at their age, and say their computer science background, and the country they're coming from, and how many hours they've spent on the course, and how many other previous courses they've taken? Each of these is a different axis in a representation space. And sometimes you have data sets that have so many different axes, so many different properties, that if you just want to look at them and try to come up with some representation of them, it's very hard to say, okay, I want people who are like this or that. So, dimensionality reduction tries to look at all those dimensions and pick out the most important ones, at least for some specific problems. PCA, Principal Component Analysis, is also very important here.

So let's take a quick look at an example: K-means. You have seen this example in a previous video by Professor Campbell, so I will take less than a minute to refresh your mind. So assume that we have a lot of different points spread around in space, in this case a two-dimensional space. We have green dots, blue dots, and red dots. You and I can see that. The computer can't, right? I mean, the computer can't see that we have a cluster of green dots and a cluster of blue dots and a cluster of red dots. So the algorithm that we can use, for example, is K-means clustering. I would like to refer you back to Professor Campbell's video. Basically you have to pick a certain number of initial centroids and then iterate on them, and you can do the iteration in MapReduce. Part of it is implemented in a map: each map task takes the centroids, which were initially selected at random, and computes the distance from each centroid to every point, assigning each point to its nearest centroid. Then we go to the reduce; the reduce gets all the points for a given centroid. There is a lot of algorithm code that you need to write, right? And we've covered that in a previous video.

So now I want to show you how easy it is to implement this in Spark. The couple of lines that you see here are all that is required. And if you take out the first two lines that import libraries and take out the comment lines, what you're left with is basically nine lines of code that read the data, perform the algorithm, boom, write the results out. Right? So let's take a quick look at this. First, what do we do? Well, of course we import the MLlib clustering library back in the first line. And then specifically we import the linear-algebra representation of data, in this case Vectors, which we need to store the data. So, we've seen Spark before; line number four, you see that? Okay, I'm creating an RDD using the Spark context and reading my input dataset from a text file that could be stored in HDFS. The data size could be, what, a terabyte of data? See that? That's how powerful and expressive Spark can be. You can just say, hey, read it, put it in an RDD, and everything else is handled by the framework. Second line: we want to apply some sort of pre-processing, so that when we read the data we parse it a little bit, split it based on, for example, the space character in this case, and get the parsed data out of it. Fine, one line: data dot map. Map says: apply whatever function I'm going to give you on every data element. What function are we dealing with here? It's that function that says s goes to a vector of s split, something something.
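The slide being described isn't reproduced in the transcript, but a minimal Scala sketch of that K-means example, following the standard spark.mllib pattern, would look roughly like this. The input path is a placeholder and sc is assumed to be an existing SparkContext.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Read the input dataset (could live in HDFS) into an RDD of lines.
    val data = sc.textFile("data/kmeans_data.txt")

    // Parse each line: split on spaces, convert to doubles, build a dense vector.
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // Cluster the data into 2 clusters with 20 iterations of K-means.
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate the clustering by its cost (within-set sum of squared errors) and print it.
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)

The KMeans.train call in the middle is the "magic line" the rest of the walkthrough focuses on.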
So, given an input value s, apply the dense function of the Vectors library to the result of splitting s based on the space character. And then once you split, so it's a string, you split it, take each piece, turn it into a double, and then apply the dense vector, and boom, you have all of your data stored in the parsedData RDD. Next, we have just two lines that say the number of clusters is 2 and the number of iterations is 20. Fine. Now here's where the magic line comes, right? You say clusters equals KMeans.train using parsedData, this number of clusters, and this number of iterations. All of the stuff about this K-means algorithm, the iterations, figuring out what the centroids are, randomly assigning them, finding the distances, finding the average distance for each centroid, boom, it's all implemented in the library. All you need to do is just say KMeans.train. Awesome. Next line: now that I have it trained and stored in a model called clusters, I can say, hey, clusters.computeCost, based on that, and print the output. I won't go into other simple examples of other things, but basically MLlib is chock-full of these sorts of algorithms already implemented for you. It's just waiting for you to load data and say, hey, K-means, singular value decomposition, PCA principal component analysis, whatever, just run this on my data and store the output in HDFS, for example. So, Spark machine learning is actually very active these days. It's one of the most active Apache projects on GitHub and, as you can see, there is very good reason for that and very good reason for its popularity. All right, so I would suggest that you take a look at the documentation on their website for the Spark machine learning library, MLlib, and see how you can use this in your projects or, hopefully, in your day-to-day work. [MUSIC]