In this video, I would like to move on to an important topic in analysis: dealing with missing values. Data sets often come with missing values, where no data is available in a row or column. In many cases, we cannot simply throw away missing values because we need to have enough data for meaningful analysis. We often leave them in and fill the values with some smart estimates, in which case it is important to minimize biases or distortions. Furthermore, missing values may themselves be informative; that is, the fact that a data point is missing can have high predictive power.
Let's look at a concrete example. The table on the slide shows the sales data for a newspaper at a newsstand. The first column lists the date and the second column shows the number of copies sold each day. Curiously, there are no sales recorded on March 27th. This missing value can cause difficulty when the modeling approach requires a data value for each column or row. For example, this is the case for linear regression. Although it is straightforward to remove all records with missing values, doing so can lead to a significant loss of data. The issue is especially severe for data sets with many columns: even if there is only a small fraction of missing values in each column, there will be many rows with at least one missing value. As a result, too many rows will be removed in the process.
The graph on this slide is a plot of the sales data, where the x axis is the date and the y axis is the number of copies sold. The plot shows the variation in daily sales, with the highest value of 50 on March 28th. There are no sales recorded on March 27th; we have a missing value on that day, which is filled with a value of zero on the graph. The sales data appear to follow some pattern. The value is high initially, gradually decreases over the next few days, and then increases again. The pattern then repeats. This slide shows the same graph with weekday information, which turns out to be quite useful.
It is apparent from this graph that sales follow a weekly pattern. The sales number is highest on Mondays, then gradually decreases over the next few days, but picks up over the weekend. Note that the missing value on March 27th is quite disruptive to this pattern. Without correcting for this value, our impression of the data can be misleading. Suppose that we would like to use this data for sales forecasting, a common business task. Our forecast would be brought down by the zero sales value on March 27th. This again shows the importance of dealing with missing values as a data preprocessing step in predictive modeling.
What happened on March 27th? There are many possible causes of missing values. Sometimes a missing value is simply a result of negligence in data recording, when we forget to record a value. In this particular situation, however, after some investigation, we discovered that March 27th was Easter Sunday, and the newsstand was closed. Although we have a perfect understanding of what caused the missing value on March 27th, we still need to decide what to do about it. If we include zero sales in our data set, it will certainly distort our sales forecast.
There are many possible ways to deal with missing values, and here we discuss a few of them. The first one is to simply remove the data. As we mentioned earlier, however, this is not always feasible, as we may throw away too much data. The second approach is to impute, or guess, the value. We can fill in the missing value with zero, with average sales, or with a smart guess from some interpolation. For example, we can use sales on the same day last year to fill in the value. In general, we can use observations from similar data points to intelligently guess the value. Finally, we can also make "missing" its own category; such an approach is typically more appropriate for categorical data. Let's return to our little example and try to impute a value for March 27th.
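The options just listed can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sales figures are made up rather than taken from the slide, the year 2016 is an assumption chosen so that March 27th falls on a Sunday, and every function name here is hypothetical.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List, Optional

Series = Dict[date, Optional[float]]

# Made-up daily sales for three weeks starting Monday, March 14th.
# The year 2016 is an assumption; None marks the missing Sunday, March 27th.
start = date(2016, 3, 14)
values = [48, 40, 34, 30, 28, 36, 40,
          47, 41, 33, 31, 29, 35, None,
          50, 42, 35, 30, 27, 37, 41]
sales: Series = {start + timedelta(days=i): v for i, v in enumerate(values)}


def drop_missing(series: Series) -> Series:
    """Option 1: remove every record whose value is missing."""
    return {d: v for d, v in series.items() if v is not None}


def impute_constant(series: Series, fill: float) -> Series:
    """Option 2a: fill each gap with a fixed value such as zero."""
    return {d: (fill if v is None else v) for d, v in series.items()}


def impute_mean(series: Series) -> Series:
    """Option 2b: fill each gap with the mean of the observed values."""
    observed = [v for v in series.values() if v is not None]
    return impute_constant(series, mean(observed))


def impute_previous_week(series: Series) -> Series:
    """Option 2c: fill each gap with the value from seven days earlier.

    A gap stays None when the earlier day is absent or missing too.
    """
    return {
        d: (series.get(d - timedelta(days=7)) if v is None else v)
        for d, v in series.items()
    }


def missing_indicator(series: Series) -> Dict[date, bool]:
    """Record which days were missing, so that fact itself can be used."""
    return {d: v is None for d, v in series.items()}


def missing_as_category(labels: List[Optional[str]],
                        token: str = "missing") -> List[str]:
    """Option 3 (categorical data): treat a missing label as its own category."""
    return [token if label is None else label for label in labels]


if __name__ == "__main__":
    gap = date(2016, 3, 27)
    print("zero fill:     ", impute_constant(sales, 0)[gap])
    print("mean fill:     ", impute_mean(sales)[gap])
    print("previous week: ", impute_previous_week(sales)[gap])
```

Each helper returns a new mapping and leaves the original untouched, so the strategies can be compared side by side before committing to one.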
Filling in the value as zero, as is already done, is probably not a good approach. Were the newsstand not closed for Easter Sunday, it is quite unlikely that sales would have been zero. What about filling in the value with the average sales across all dates? If we do that, we get a value of 37.23, which seems to be a reasonable value to use. Yet another option is to use some other interpolation approach. Since our data demonstrates a weekly pattern, we can use the sales on the last Sunday as an estimate, in which case we will fill in a value of 40.
It can be argued that either of the two values, 37.23 and 40, is a reasonable estimate for the sales on March 27th. The first one, 37.23, uses more data to come up with the estimate and therefore may be more reliable. The second one, 40, takes advantage of the weekly pattern in the data. However, it only uses one data point to estimate the value. If the sales on Sunday, April 3rd, are exceptionally high or low for reasons we do not know, the same bias will certainly be carried through to this estimate.
Missing values should be contrasted with censored values, which are only partially observed and therefore are not accurate. Censored values also need to be carefully treated in data analysis. Going back to our little example, the sales on Monday, March 28th, are 50, which is also the largest sales number in our data set. After talking to the store manager, we discovered that only 50 copies of the paper were available for sale on that day. This data point is a censored value. The sales on this day are not fully observed because we ran out of inventory. A sales record of 50 suggests that sales would have been at least 50 had more copies been available. However, we do not know exactly what they would have been.
Now let me briefly recap our discussion. We talked about the importance of dealing with missing values. There's no single accepted solution. It is often helpful to consider the problem context.
It also helps to dig deeper in order to understand the causes and plan your remedies for missing values. It is important to minimize biases or distortions when dealing with missing values. We often cannot simply throw away missing values because we need to have enough data for meaningful analysis. The pattern of missing values can sometimes carry important information and be highly predictive. Many software packages provide a wide range of options for handling missing values. It is important to understand and choose the right options, since how missing values are treated can have a dramatic impact on the modeling outcome.
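To put a rough number on the point about throwing away data, here is a small sketch. It assumes, purely for illustration, that every cell is missing independently with the same probability; real data rarely behaves so neatly, and the function name is hypothetical.

```python
def expected_complete_fraction(p_missing: float, n_columns: int) -> float:
    """Expected share of rows with no missing cell.

    Assumes each cell is missing independently with probability p_missing,
    so a row is complete with probability (1 - p_missing) ** n_columns.
    """
    if not 0.0 <= p_missing <= 1.0:
        raise ValueError("p_missing must be between 0 and 1")
    if n_columns < 0:
        raise ValueError("n_columns must be non-negative")
    return (1.0 - p_missing) ** n_columns


if __name__ == "__main__":
    # Share of rows that would survive removing every incomplete record,
    # with 5% of values missing in each column.
    for n_columns in (1, 10, 50):
        share = expected_complete_fraction(0.05, n_columns)
        print(f"{n_columns:>3} columns: {share:.1%} of rows kept")
```

Under these assumptions, with 5% of values missing in each column, roughly 60% of rows stay complete with 10 columns and only about 8% with 50 columns, which is why removing incomplete records becomes so costly for wide data sets.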