0:11
Welcome to the introduction to statistical forecasting module.
During this section of the course, we will explore fundamental statistical methods
which are useful in using data to develop forward expectations or forecasts.
Course participants are assumed to have had some previous exposure to statistics,
though we will provide reference materials for the concepts presented in this module.
We will explain the statistical concepts employed in the following analyses,
but we encourage participants to further their study independently.
We will spend a bit longer introducing concepts in this module than we have in
the others.
Given the statistical nature of these concepts, it's important that participants
spend time to understand the statistical methods employed in this module,
as they are powerful, but nuanced.
In order to use data to produce a statistical forecast,
we need to understand regression analysis.
Regression analysis is one of the most commonly used statistical methods to
produce data-driven forecasts.
Simple linear regression analysis uses one variable,
the independent variable, to explain another variable, the dependent variable.
For example, you might use a person's height to explain and
predict a person's weight.
A thorough explanation of linear regression is beyond the scope of
this module, but we encourage course participants to study this powerful
statistical method, or to take one of the many related courses on Coursera.
There are a few concepts that you should understand, at least at a high level
before we proceed to their application in our Excel problem sets.
We encourage you to spend time studying these concepts independently
if you are unfamiliar.
Standard deviation is a measurement of the average dispersion of values in
a data set around their average value.
That is, how spread out the data are from their average value.
This is related to variance, but
standard deviation is more frequently used to describe the average dispersion of a data set.
It follows that a higher standard deviation should
imply lower confidence in the outputs of a statistical forecast using the data.
2:27
Variance, as previously mentioned, also measures how far,
on average, a set of data values are spread out from their average, or mean.
Higher variance in your data should result in you being less confident in
the accuracy of your prediction, because your data are so
widely spread out around their average value.
Again, variance has the same implication as standard deviation;
it is simply the standard deviation squared, which amplifies dispersion from the mean value.
Covariance is a measure of how two variables change together.
2:59
Covariance is not normalized, meaning that there's no meaningful way to compare
covariances across different variables.
We need another measurement to compare the way two variables change together
in order to draw meaningful conclusions.
3:13
Correlation provides us this normalized measure of covariance, that is,
of how two variables change together.
Normalization of covariance results in a measure which we can use to meaningfully
compare how two variables move together.
3:28
This is a normalized measure, and so it results in a value between -1 and
1, giving an objective indication of the strength of the relationship between two variables.
It also tells us the direction of that relationship, positive or negative.
Values close to zero indicate that the relationship is not very strong.
R-squared, or
the coefficient of determination, is a number that indicates the proportion
of the variance in one variable that is predictable from the other variable.
A higher R-squared value indicates a better fit of our statistical
measurement of the relationship between the variables to the data itself.
The definition above is important to note:
R-squared has a similar interpretation to correlation,
though since it is squared, the direction of the relationship cannot be determined.
4:23
Let's have a high level overview now of linear regression.
Using linear regression we can quantify the relationship
between changes in the independent, or input, variable and
changes in the dependent, or outcome variable.
For example, let's look at the relationship between the variables Y,
X, m, and B, as shown on the slide below.
We see here Y = mX + B. This relationship could be read as:
Y is equal to m multiplied by X, plus B.
You may recognize this as the classic slope intercept form of an equation for
a straight line, as we are all taught in algebra.
Simple linear regression analysis of a dataset
may result in a similar quantified relationship for that dataset.
For example, our regression analysis may find that m = 3 and B = 100.
Using this quantified relationship,
we can input values of X to predict values of Y, which we don't have in our dataset.
For example, if we input a value of 100 for X,
we can use this quantified relationship to produce a value of 400 for Y.
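The slide's hypothetical fitted relationship can be written as a one-line prediction function (m = 3 and B = 100 are the example values from the slide, not results of any real analysis):

```python
# Hypothetical fitted parameters from the slide's example.
m = 3
B = 100

def predict_y(x):
    """Predict Y from X using the fitted line Y = m*X + B."""
    return m * x + B

# Inputting X = 100 reproduces the slide's predicted Y of 400.
print(predict_y(100))  # 3 * 100 + 100 = 400
```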
We will explore regression analysis in a simplified analysis.
First though, we must develop a thesis, or hypothesis, for
our forecasting relationship.
A few additional statistical forecasting concepts that are important to
understand include the Y-intercept.
Which is the point where the graph of a function, or in this case,
the graph of our relationship between our two variables, intersects the Y-axis.
This is the value which cannot be explained by our regression analysis and
is constant despite our measured relationship between two variables.
This also represents the value of our dependent variable
when your independent variable is equal to zero.
6:45
We show two examples of data sets with different standard
deviation, variance, correlation, and R-squared values.
Notice that higher standard deviation and variance result in
a much more dispersed, or spread out, data set around the mean value.
While lower standard deviation and
variance result in a more tightly dispersed data set around the mean.
Let's discuss our specific example.
It will be simple: we will explore the use of regression analysis to predict
visits to a website based on the number of social media mentions of the site.
Our thesis here is that it's reasonable to think that as mentions
of a website on social media increase,
the number of people who visit the site will also increase,
resulting in increased web traffic, or page hits, to the site.
Starting with a thesis like this is fundamental to regression analysis.
7:45
This statistical method can be used to determine if
there is a correlative relationship between two variables.
In our case, this is the relationship between the number of social media mentions,
the independent variable, and visits to our website, the dependent variable.
We have measures of the strength of this relationship, which we will discuss later.
If these measures indicate a strong relationship we can conclude
that social media mentions and visits to our website are related.
A strong relationship detected via linear regression analysis does not,
however, imply a causal relationship between the two variables.
Said differently, it does not imply that traffic
increased to our website because of social media mentions,
but rather that web traffic to our website
tends to increase alongside increases in social media mentions.
This is an important distinction,
though causal analysis is beyond the scope of this module.
In order to complete the exercises in this week's problem set
you'll need to enable the Analysis ToolPak in Excel.
We've included a link to instructions on how to do this in the reference materials.
Please take a moment to ensure that you have the ToolPak enabled
before continuing.
You'll know that you've successfully enabled the ToolPak if you see it on
the Data tab of the ribbon, as in our image below.