All right guys, let's get started with our last example for this module. It's going to be yummy, because it deals with the chocolate sales forecasting example. As always, let's start by setting our working directory to the folder where we've downloaded the chocolate data set. Once that is done, we clean up the memory of our current R session, and we are ready to load our data set with the read.table function and the usual arguments that we use.
Now, to get familiar with our data set, we use the str function, and we call it on the data set that we've named data. What we find out is that we have 120 observations in this data set and four variables. First, the time variable, which goes from 1 to 120 and which essentially means that we have ten years of monthly data. Then the sales variable, which corresponds to sales in thousands of units. Then the year variable, which tells us the year of the particular data point. And the month variable, which tells us the month of the particular data point.
Well, until now we've always called the summary function on the entire data set. Here it does not really make sense to have the mean value for the time, the year, or the month variables. So what we're going to do is call the summary function on the sales variable itself by typing summary(data$sales). We find out that the minimum sale was about 37,000 units, the maximum was over a million units, and on average 216,000 units were sold.
Let's plot the sales as a function of time to get a better sense of our data. To do so we use the plot function. Time goes first because it will be on the horizontal axis, and sales second because it will be on the vertical axis. Then we add a main title with the main argument and axis labels with xlab and ylab. We use the ylim argument to give the y axis a limit wider than the default, so we multiply the maximum value of sales by 1.2. And our last argument here is type, the type of plot that we want to draw.
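The exploration steps described so far can be sketched roughly as follows. This is only an illustration: the real chocolate file and the read.table arguments are not given in the transcript, so a made-up data frame with the same four columns stands in for it. The sales values, the years, the plot labels, and the lower y limit of 0 are all assumptions.

```r
# Synthetic stand-in for the chocolate data set. The real file and the
# read.table() arguments are not shown in the transcript, so every value
# below is made up purely for illustration.
set.seed(1)
seasonal <- c(100, 250, 120, 150, 110, 60, 50, 70, 90, 160, 400, 800)
data <- data.frame(
  time  = 1:120,
  sales = rep(seasonal, times = 10) * runif(120, min = 0.8, max = 1.2),
  year  = rep(2001:2010, each = 12),  # hypothetical years
  month = factor(rep(month.name, times = 10), levels = month.name)
)

str(data)            # structure: 120 observations of 4 variables
summary(data$sales)  # summary of the sales variable only

# Sales over time as a line plot. The upper y limit is 1.2 times the maximum;
# the lower limit of 0 is an assumption, the transcript only covers the upper one.
plot(data$time, data$sales,
     main = "Chocolate sales over time",
     xlab = "Time (months)",
     ylab = "Sales (thousands of units)",
     ylim = c(0, max(data$sales) * 1.2),
     type = "l")
```

With the real data set, only the data frame construction would change; the str, summary, and plot calls would stay the same as long as the column names match.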
We set type equal to "l" in quotes, which stands for line. Let's run the line. What do we see? We see that the data shows seasonality and, most importantly, that it is regular over time. It looks like there is a peak somewhere around, let's say, month 12, and here again at what looks like month 24, and so on.
Now, as in the lecture, let's build a simple linear regression model with sales as the dependent variable and month as the independent variable. The data that we use to build the model is our data set, which is called data. If we look at the summary of the model, what do we see? We find out that the most statistically significant predictors of sales are February, August, November and December with three stars, and then June, July, and September with two stars. You remember that the star system relies on the p-value. For a p-value below 0.1, you get a dot. Below 0.05, which is usually deemed statistically significant, you get one star. Below 0.01, you get two stars. And below 0.001 you get three stars, which is extremely statistically significant. Now if we look at the t value and the sign of the coefficient, we find that December, followed by November and February, have the strongest positive effect on sales. As explained by the professor, this mostly makes sense, since a lot of people in the Northern Hemisphere buy chocolate for the end-of-year holidays.
Another interesting piece of information here would be the distribution of sales for each month. To see that, you can create a box plot. Because the month variable is a factor, R will automatically represent the sales per month using box plots when we pass month and sales to the plot function. We then add a main title and axis labels, and use the ylim argument like we did before. And we get this plot, which shows you, for each month, the distribution of all the sales that happened in January, no matter the year, and again for February, March, April, and so on.
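As a rough sketch of the regression and the box plot just described, the code below fits sales on month and then plots sales per month. It runs on made-up data, since the real file is not shown in the transcript. That means the coefficients and stars it prints will not match the ones discussed in the video. The month variable is built explicitly as a factor, because recent versions of R no longer convert text columns to factors when reading a file.

```r
# Synthetic stand-in for the chocolate data set (made-up values, illustration only).
set.seed(1)
seasonal <- c(100, 250, 120, 150, 110, 60, 50, 70, 90, 160, 400, 800)
data <- data.frame(
  time  = 1:120,
  sales = rep(seasonal, times = 10) * runif(120, min = 0.8, max = 1.2),
  year  = rep(2001:2010, each = 12),  # hypothetical years
  month = factor(rep(month.name, times = 10), levels = month.name)
)

# Linear regression of sales on month. January, the first factor level,
# is the baseline, so the other 11 coefficients are differences from January.
model <- lm(sales ~ month, data = data)

# The coefficient table ends with R's significance legend:
# 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model)

# Because month is a factor, plot() draws one box plot of sales per month.
plot(data$month, data$sales,
     main = "Chocolate sales by month",
     xlab = "Month",
     ylab = "Sales (thousands of units)",
     ylim = c(0, max(data$sales) * 1.2))
```

With month as the only predictor, the fitted value for each observation is simply the average sales of its month, which is why this model captures the seasonal pattern.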
Now, a box plot is a very interesting visual way to understand if there's a lot of variation in the sales of a particular month. For example, in December you see that there is a lot of variation, while in July the data is very similar from year to year.
Now how does the model perform on past data? Let's plot the actual sales as a function of time and add the sales as predicted by the model. To do so, we use the plot function with time on the horizontal axis and sales on the vertical axis. And again we add titles and axis labels that you can explore in your own time. In order to add the fitted values to the same plot, we use the lines function, where time is on the horizontal axis and the fitted values are on the vertical axis. Then we set the type argument equal to "l" in quotes because we want a line, and the col argument equal to "blue" in quotes because we want this line to be blue. The lty argument allows you to specify the type of line that you want to use, and 2 gives you a dashed line. Let's run the line. Now we see that the model does really well, and it also allows you to reduce a lot of the uncertainty for stock management.
Now, to make your plot more explicit, it would be a good idea to add a legend by using the legend function. The first argument here, set equal to "topleft" in quotes, lets R know that you want your legend to be at the top left corner of the plot. We then use the c function to set the text that we want to use to describe our lines, then again to set the type of each line, and we maintain the same order for all the arguments. So the actual sales have an lty value of 1 because it's a solid line, while the sales by the model have an lty value of 2 because it's a dashed line. And then again for the col argument, we set the actual sales equal to black and the sales by the model equal to blue. Let's run this line. We do have a legend here. That's it for our fourth and last example of this module.
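The comparison plot and its legend, as described above, might look like the sketch below. The data are again synthetic, and the titles and legend texts are assumptions. The plot, lines, and legend calls follow the arguments named in the walkthrough.

```r
# Synthetic stand-in for the chocolate data set (made-up values, illustration only).
set.seed(1)
seasonal <- c(100, 250, 120, 150, 110, 60, 50, 70, 90, 160, 400, 800)
data <- data.frame(
  time  = 1:120,
  sales = rep(seasonal, times = 10) * runif(120, min = 0.8, max = 1.2),
  year  = rep(2001:2010, each = 12),  # hypothetical years
  month = factor(rep(month.name, times = 10), levels = month.name)
)
model <- lm(sales ~ month, data = data)

# Actual sales over time as a solid black line (the defaults for col and lty).
plot(data$time, data$sales,
     main = "Actual sales and sales by the model",
     xlab = "Time (months)",
     ylab = "Sales (thousands of units)",
     ylim = c(0, max(data$sales) * 1.2),
     type = "l")

# Fitted values added to the same plot as a dashed blue line.
lines(data$time, model$fitted.values, type = "l", col = "blue", lty = 2)

# Legend in the top left corner; text, lty, and col are given in the same order.
legend("topleft",
       c("Actual sales", "Sales by the model"),
       lty = c(1, 2),
       col = c("black", "blue"))
```

Keeping the text, lty, and col vectors in the same order is what makes each legend entry match the right line on the plot.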
I hope you had a good time and learned a lot. And now it's time to go and eat some chocolate for real.