[MUSIC] Hello, I'm Neil Clark, and in this talk I'm going to give a brief introduction to the ideas of machine learning. First I'll give the big picture, the broad aims of machine learning. Then I'm going to define what I mean by regression and classification, and I'm going to give two of the building blocks for the more advanced machine learning methods, which are the linear methods of regression and classification. Then I'm going to go on to give a brief sketch of some of the more advanced ideas in machine learning that are very often used in the literature: neural networks and support vector machines. And finally I'm going to talk about how we assess our machine learning models and estimate their errors. The field of machine learning is one of a number of interrelated fields and, broadly speaking, it sits on the interface between statistics and computer science. As the name suggests, it's descended from the field of artificial intelligence. But broadly speaking, the aim of machine learning is to learn from data in an automatic way. So typically, the machine will take as input a training dataset, and there will be an algorithm which automatically extracts meaningful relationships from the data. These relationships can then be used to make predictions about future data, and they can also be interpreted to gain some kind of understanding of the system at hand. And broadly speaking, there are two main categories of machine learning methods. First, there are supervised learning methods. This is when the data that we receive takes the form of an input set and an output set. On the right here, I've got a figure to illustrate a specific example. Here, the input data is an image of a handwritten digit, and the corresponding output is the digit that has been written by hand. Now, the aim of a machine learning approach here might be to find a way to predict, from an image of a handwritten digit, which digit has been written.
The results of this analysis could then be used to make predictions from future data. For example, we might receive a new image, and our machine could then predict which digit has been written in that image. The other broad class is unsupervised learning. In this case, there is no output set. We only have input. So the job of the machine is to discover meaningful structure in the data itself. An example of this is clustering. In the image on the right here, we're showing a heat map of gene expression profiles from a number of breast cancer tumors. A clustering algorithm, which is a form of unsupervised learning, might be able to divide the tumors into breast cancer subtypes, for example. So now I'm going to try to give the general framework for machine learning. Suppose we're given an input datum, a vector X with p real-valued attributes. For concreteness, you might want to think about a gene expression profile with p genes. Then let's say that for each of these data vectors, we also receive some output value Y. This might be the measurement of a physiological parameter, such as survival, for example. Then the aim of this machine learning approach would be to try to find the function f which takes our input and optimally predicts the value of the output. One way to formulate this problem is in terms of a loss function. The loss function quantifies how close the prediction is to the truth. So to make this concrete, a simple example of a loss function would be to take the value of the output Y, subtract the predicted value f(X), so that Y minus f(X) is the difference between the truth and your prediction, and then square that. Because of the square, the loss is never negative, and it has its smallest value, zero, when your prediction is exactly the same as the truth. So then the idea is to find the predictor function f which minimizes the loss. And by minimizing the loss, we're optimally predicting the output.
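To make the squared-error loss described above concrete, here is a minimal Python sketch. The function names, the toy data, and the two candidate predictors are all hypothetical and chosen only for illustration; they are not from the talk. It averages the loss (Y minus f(X)) squared over a small data set for a predictor that matches the data exactly and for one that does not.

```python
def squared_error_loss(y, prediction):
    """Squared difference between the observed output and the prediction."""
    return (y - prediction) ** 2


def mean_loss(f, inputs, outputs):
    """Average squared-error loss of the predictor f over a data set."""
    losses = [squared_error_loss(y, f(x)) for x, y in zip(inputs, outputs)]
    return sum(losses) / len(losses)


# Hypothetical toy data: a single attribute per input, outputs exactly 2 * x.
inputs = [1.0, 2.0, 3.0, 4.0]
outputs = [2.0, 4.0, 6.0, 8.0]


def good_predictor(x):
    """Candidate predictor that matches the toy data exactly."""
    return 2.0 * x


def poor_predictor(x):
    """Candidate predictor that is only right for x = 1."""
    return x + 1.0


good_loss = mean_loss(good_predictor, inputs, outputs)
poor_loss = mean_loss(poor_predictor, inputs, outputs)
print(good_loss)  # 0.0: every prediction equals the truth
print(poor_loss)  # 3.5: the mean of the squared errors 0, 1, 4, 9
```

The predictor with the smaller average loss is the better one under this criterion, which is the sense in which minimizing the loss means optimally predicting the output.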
When you minimize the expected value of this squared loss, the result is that your optimal predictor function is the conditional expectation of the output, conditioned on the value of the input. We can write this as the equation f(x) = E[Y | X = x]. Now typically, the input value at which we want a prediction will not appear in our training data set. So the most direct way to approach this situation is to look at those training inputs that are closest to the value that you want, and then take the mean of the corresponding outputs. This is called the nearest neighbor approach. Another very commonly applied approach is to restrict the form of the predictor function f by assuming that it is linear. Now, when the output variable is a categorical variable, this process is called classification. Broadly speaking, what you're trying to do in a classification problem is to take some input data and predict from that a category. So for example, we might take a gene expression profile and predict what subtype of cancer this expression profile may have come from. When the output variable is a continuous variable, it's typically called regression. I'm going to talk about two of the most fundamental approaches to classification and regression next.
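The nearest neighbor approach mentioned earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_predict`, the choice of Euclidean distance, the value k = 3, and the toy data are all hypothetical and not from the talk. The idea is to approximate the conditional expectation at a query point by averaging the outputs of the k training inputs closest to it.

```python
import math


def knn_predict(x_query, train_inputs, train_outputs, k=3):
    """Average the outputs of the k training inputs nearest to x_query.

    Uses Euclidean distance (an assumption; other metrics are possible).
    """
    # Pair each training output with the distance from its input to the query.
    scored = [
        (math.dist(x_query, x), y) for x, y in zip(train_inputs, train_outputs)
    ]
    # Keep the k closest training points.
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # The local mean of the outputs stands in for the conditional expectation.
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: inputs with p = 2 attributes and continuous outputs.
train_inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_outputs = [1.0, 2.0, 3.0, 10.0, 12.0]

prediction = knn_predict((0.2, 0.1), train_inputs, train_outputs, k=3)
print(prediction)  # 2.0: the mean of the outputs 1.0, 2.0, 3.0
```

Because the output here is continuous, this is a regression example. For a categorical output, the same neighbors could instead vote on the most common category.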