Hello everyone. In this lecture, we will use the tools we have acquired so far and try to fit an ARIMA model to a real-life dataset. The dataset is called Daily Total Female Births in California in 1959. We're going to look at the time series for the whole year, with an observation for every day, so it's a daily time series: the number of female births on each day in California in 1959. Our objectives are to fit an ARIMA model to a real-world dataset, to use and compare various fitting tools, such as the ACF, the PACF, and the AIC, and to examine the Ljung-Box Q-statistic for testing whether there is autocorrelation in a time series. As you remember, this is basically our guideline for modeling a time series. We're going to check whether there is a trend in the data, whether the variation changes over time, and whether we need a transformation or not. Then we're going to look at the ACF, the PACF, the Akaike information criterion, the sum of squared errors, and the Ljung-Box Q-statistic. And at the end of the day, we're going to use R to estimate the parameters of the model. Okay, so this data comes from the Time Series Data Library (TSDL), created by Rob Hyndman, Professor of Statistics at Monash University, Australia. There's a link here that takes you directly to the library. The category I chose was Demography, so this dataset, daily total female births in California, is under the category Demography. It is a daily time series from January 1, 1959, to December 31, 1959, for 365 days. There's a direct link here that will take you straight to the time series. Once we are there, we're going to export the data as a CSV file. We're going to open the file and clean up the bottom row, because there might be extra rows at the bottom that we don't need. Then we either put this file into our working directory in R or read it directly from wherever it was downloaded, using its path.
If we use that link and go directly to Daily Total Female Births in California 1959, there's an Export button here; if I click Export, I can export the data in various formats. I'm going to use CSV, and when I do, it automatically downloads the daily total female births in 1959. Once you open the file, it will have a date column and a column for the daily total female births in California in 1959: these are the dates, and these are the numbers of births. If you go to the bottom of the file, there's an extra row that we don't need, so we might want to delete it and save the file. Now we are ready to read this dataset into R and start modeling the time series. So I have RStudio up and running, and I have written a little script called Birth_California_1959, which is provided to you. It starts with the information we just talked about, and it loads a library that some of you might need, so you can run that first. Now we're going to read the time series we downloaded in the CSV file. I put that file into my working directory, so I'm going to read it directly into a variable called birthdata. From birthdata I'm going to pull out the column that contains the number of births and call it number_of_births, and I'm going to format the date as month, day, and year. Then let's plot our dataset. When we plot it, you see Daily Total Female Births in California: this is our time series, with Date on one axis and Number of Births on the other. As you can see, this is definitely not a stationary time series; there seems to be some kind of trend. So maybe we will need to remove the trend later on, but let's first check the Q-statistic. In other words, is there autocorrelation among different lags of this time series? We're going to use Box.test, as we said.
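The reading and plotting steps described above can be sketched as follows. This is a minimal sketch, not the provided script: the file name, column names, and date format are assumptions, so adjust them to match the CSV you actually exported. A small stand-in file is written first so the block runs on its own.

```r
# Minimal sketch of reading the exported CSV and plotting it.
# The real analysis would read the downloaded file instead of the
# stand-in written below (file and column names are assumptions).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Date   = format(seq(as.Date("1959-01-01"),
                                         by = "day", length.out = 14),
                                     "%m/%d/%Y"),
                     Births = c(35, 32, 30, 31, 44, 29, 45,
                                43, 38, 27, 33, 55, 47, 45)),
          tmp, row.names = FALSE)

birthdata        <- read.csv(tmp)             # read the CSV
number_of_births <- birthdata$Births          # pull out the birth counts
dates            <- as.Date(birthdata$Date, format = "%m/%d/%Y")

plot(dates, number_of_births, type = "l",
     xlab = "Date", ylab = "Number of births",
     main = "Daily total female births in California, 1959")
```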
In Box.test, the first argument is the time series, and for the lag, as we said, we're going to use the logarithm of the length of the time series. If we run this Ljung-Box test, we obtain a p-value that is very, very small, smaller than any significance level you would choose, like 0.05 or 0.01. So there are definitely autocorrelations, and we can try to fit a model here, maybe an ARIMA model, some linear model, and pull out the linear part. Then we're going to look at its residuals again: maybe there's autocorrelation there, maybe there's not. We hope that we will be left with white noise. Okay, so, as we can see, there's maybe some kind of trend going on here. In order to remove the trend, we're going to use the differencing operator, which is diff in R. So let's plot our differenced time series. This is the differenced series against the date, and we get something that looks like a stationary time series. There is a little peak on September 2nd, which is roughly nine months after December 31st. We can attribute this to randomness, so we can say, okay, maybe we can ignore this outlier. Then we basically have a stationary time series. Now let's look at the autocorrelations; again we're going to use Box.test, this time on the differenced time series. If we run this, we obtain, again, a very, very small p-value, so there are definitely some autocorrelations. Let's look at the ACF; I told you before that the ACF will also suggest whether there is autocorrelation. This is the ACF of the differenced data. Looking at it, we see a significant spike at lag 1, and then one peak at maybe lag 21. Maybe we can ignore that one and attribute it to randomness; maybe it is a random peak. If we ignore it, then we have two significant peaks.
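The Ljung-Box test and the differencing step above can be sketched like this. The series is simulated here so the block runs on its own; for the real analysis, replace x with number_of_births.

```r
# Ljung-Box test and first differencing, sketched on simulated data
# (a trending stand-in for the birth series).
set.seed(1959)
x <- 42 + cumsum(rnorm(365, mean = 0.05, sd = 1))

# Q-test with lag = log(length of the series), as in the lecture
Box.test(x, lag = log(length(x)), type = "Ljung-Box")

# Remove the trend with the differencing operator diff(),
# then re-test and inspect the ACF of the differenced series
dx <- diff(x)
Box.test(dx, lag = log(length(dx)), type = "Ljung-Box")
acf(dx, main = "ACF of the differenced data")
```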
This is lag zero, which is always one, and we have a significant spike at lag one here. The autocorrelation function suggests the order of the moving average part. If I look at the PACF, the partial autocorrelation function, we obtain the following: even if we ignore lag 21, we can see that there are a lot of significant lags up to lag 7. Okay, so we have various competing models we're going to try. First we're going to fit an ARIMA model to number_of_births with order (0,1,1); all of the models will have d = 1, because we have differenced the data. So this first model is basically MA(1) for the differenced data. The next is MA(2) for the differenced data. Then we have ARIMA(7,1,1), and we also look at ARIMA(7,1,2), because the PACF tells us that the order of the autoregressive part might be 7, while the ACF tells us that the order of the MA part might be one or two. Okay, so model 1 is our ARIMA(0,1,1). I'm going to compute the sum of squared errors: in other words, I'm going to take the residuals, square them, and sum them up, and we're going to call this SSE1. We want this to be as small as possible. We're also going to look at the residuals themselves: once we model our time series with, say, ARIMA(0,1,1), we have residuals, and we hope not to have any autocorrelation in them; we would like the residuals to be white noise. So we're going to take the model 1 residuals, run Box.test on them, call the result the model 1 test, and look at its p-value. We continue this process for the different models: ARIMA(0,1,1), ARIMA(0,1,2), ARIMA(7,1,1), ARIMA(7,1,2). For each we have the sum of squared errors, and we also have the Ljung-Box test. I'm going to define a data frame df; it's going to have the row names AIC, which is the Akaike information criterion, the sum of squared errors, and the p-value.
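The model comparison described above can be sketched like this. The series is simulated so the block runs on its own (in the lecture, number_of_births comes from the CSV), and the helper fit_and_summarise is a hypothetical name, not from the provided script.

```r
# Fit the four competing models and collect AIC, SSE, and the
# Ljung-Box p-value of the residuals for each (simulated stand-in data).
set.seed(123)
sim <- arima.sim(list(ma = -0.8), n = 365, sd = 7)
number_of_births <- round(42 + cumsum(sim))

fit_and_summarise <- function(order) {
  m   <- arima(number_of_births, order = order, method = "ML")
  res <- residuals(m)
  c(AIC = AIC(m),
    SSE = sum(res^2),
    `p-value` = unname(Box.test(res, lag = log(length(res)),
                                type = "Ljung-Box")$p.value))
}

orders <- list(c(0, 1, 1), c(0, 1, 2), c(7, 1, 1), c(7, 1, 2))
df <- data.frame(sapply(orders, fit_and_summarise))
colnames(df) <- c("Arima(0,1,1)", "Arima(0,1,2)",
                  "Arima(7,1,1)", "Arima(7,1,2)")
format(df, scientific = FALSE)   # print without scientific notation
```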
Then I'm going to take, for example, model 1's AIC value, SSE1, the sum of squared errors for model 1, and the p-value of the model 1 test, and we're going to do this for every single model. So let's go back here and run these commands: model 1, model 2, model 3, model 4, and now we have our data frame. I'm going to set the column names to Arima(0,1,1), Arima(0,1,2), and so forth, then format it without scientific notation, and at this point I get my results. As you can see, in my data frame without scientific notation I have the AIC values, the sums of squared errors, and the p-values. Remember what the p-values are for: they test whether there is autocorrelation in the residuals. As you can see from all of the p-values, none of the residuals show significant autocorrelation, so we cannot reject the null hypothesis that there is no autocorrelation in the residuals. Let's look at the AIC values. The AIC values, as you can see, are 2462, 2459, 2464, and 2466; the minimum is 2459. So if we go with the minimum AIC value, our model is going to be Arima(0,1,2). If we look instead at the smallest SSE values, the smallest sum of squared errors is actually 17,574, which would point to the Arima(7,1,2) model. But with the Arima(7,1,2) model, 7 plus 1 plus 2, you have 10 parameters in the model. Since we are trying to get the simplest model we can, in this example we're going to use AIC as our criterion and go with the Arima(0,1,2) model. Now that we have decided to fit an Arima(0,1,2) model to our dataset, I'm going to use the sarima routine from the astsa package. Once we have loaded the astsa package, we can call sarima; the S stands for seasonal autoregressive integrated moving average models.
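The lecture's tool here is sarima() from the astsa package, which prints the coefficient table with standard errors and p-values. As a stand-in that runs with base R alone (astsa may not be installed), arima() produces the same ARIMA(0,1,2) fit; the simulated series is an assumption. Note one difference: base arima() drops the constant when d = 1, while sarima() reports one.

```r
# With astsa loaded, the lecture's call would be:
#   sarima(number_of_births, 0, 1, 2)   # p = 0, d = 1, q = 2
# Base-R equivalent of the fit itself, on a simulated stand-in series:
set.seed(123)
number_of_births <- 42 + cumsum(arima.sim(list(ma = c(-0.8, 0.1)),
                                          n = 365, sd = 7))

fit <- arima(number_of_births, order = c(0, 1, 2))
coef(fit)                  # ma1 and ma2 estimates
sqrt(diag(fit$var.coef))   # their standard errors
fit$sigma2                 # estimated variance of the noise Z_t
```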
That is going to be our subject in week five. In sarima, you pass number_of_births, and then 0, 1, 2, which are p, d, q; for now let's just ignore the last three parameters, 0, 0, 0. If I run this command, it gives me a lot of output. For example, it gives me my call and my coefficients, along with their standard errors. The ma1 coefficient is about negative 0.85, this is the ma2 coefficient, and this is our constant. Looking at the p-values, the ma1 and ma2 coefficients are significant at the 0.05 level; maybe the constant is not significant, but we can keep it in the model. Now that we have fit the ARIMA(0,1,2) model to our dataset, we have the following. We have (1 - B)Xt, where 1 - B is the differencing, because d = 1. And we have two moving average terms, which are the Zt-1 and Zt-2 terms. We have no autoregressive terms; this is the constant, and Zt is the current noise. The numbers shown underneath the coefficients are their standard errors. Thus, if I move Xt-1 to the right-hand side, we have the following ARIMA(0,1,2) model: Xt equals Xt-1, which comes from the differencing, plus a constant, plus the noise at the current time, plus dependence on the previous two noise terms, Zt-1 and Zt-2. From the sarima routine we have also seen that Zt, which is assumed to be normally distributed, has expectation 0 and variance 49.08. So what have we learned? We have learned how to fit an ARIMA model to a real-world dataset using various fitting tools such as the ACF, the PACF, and the AIC. And we have learned to use the Ljung-Box test for testing for autocorrelation in a time series.
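To collect the fitted model in one place, here is a sketch of the two forms discussed in the lecture, in standard ARIMA notation (only the values actually reported above, the ma1 coefficient and the noise variance, are filled in):

```latex
% ARIMA(0,1,2): first differencing plus two moving-average terms
(1 - B)X_t = c + Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2}
% Moving X_{t-1} to the right-hand side:
X_t = X_{t-1} + c + Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2},
\qquad Z_t \sim N(0,\ 49.08), \quad \theta_1 \approx -0.85
```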