Once you’ve collected and interpreted data, what do you do with it? In this module, you’ll learn how to take the next step: how to use data about actions in the past to make predictions about actions in the future. You’ll examine the main tools used to predict behavior, and learn how to determine which tool is right for which decision purposes. Additionally, you’ll learn the language and the frameworks for making predictions of future behavior. At the end of this module, you’ll be able to determine what kinds of predictions you can make to create future strategies, understand the most powerful techniques for predictive models including regression analysis, and be prepared to take full advantage of analytics to create effective data-driven business decisions.

Professor of Marketing, Statistics, and Education, Chairperson, Wharton Marketing Department, Vice Dean and Director, Wharton Doctoral Program, Co-Director, Wharton Customer Analytics Initiative The Wharton School

Peter Fader

Professor of Marketing and Co-Director of the Wharton Customer Analytics Initiative The Wharton School

Raghu Iyengar

Associate Professor of Marketing The Wharton School

Ron Berman

Assistant Professor of Marketing The Wharton School

Welcome to Customer Analytics.

So, as Pete mentioned,

there are broadly two ways in which we can think about quantifying data.

One is making predictions one period ahead.

The other is making predictions more than two periods ahead.

So, in this module, we'll talk about the first one,

making predictions one period ahead. How do we do that?

It's done through regression analysis.

So, what we're going to do in this module is to talk about a simple example,

show how regression can be done,

show what predictions it can make,

and then we'll hand it off to Pete, who will talk more about two periods ahead.

So, let's start with regression analysis.

What is regression all about?

It's about quantifying the relationship between two or more variables.

Let's take a simple example.

Suppose you're looking at demand data, that is, data on people purchasing,

and you know how prices were changing.

What you'd like to do is

start thinking about how price is changing demand.

In other words, put some numbers behind it.

Let's look at some jargon of regression.

What we are trying to do is to explain a dependent variable, in this case,

sales or demand, as a function of independent variables,

in this case, price.

So, in other words,

all we're trying to do in regression is try to make

predictions of what would be the demand at different prices.

Regression is a technique that uses

a simple linear additive model to make these kinds of predictions.

It'll become clear by taking a simple example.

Let's imagine this is the demand data for a particular firm at different prices.

What this firm was trying to do is to try and

understand how their prices might change demand.

So, they ended up changing the prices,

and they observed the demand.

The very first thing you should do when you start thinking about

quantifying the relationship is just plot the data.

So, let's plot it. Here's what the plot looks like.

What do we see here?

On the horizontal axis, we have price.

On the vertical axis, we have sales.

What we see here, which is what intuitively you would expect to see,

is that as prices go up, sales come down.

On the one hand, it's intuitive,

it makes a lot of sense,

and this is what you would call a demand curve.

Price is going up. Sales are coming down.

Where does regression come in?

Regression comes in to give some hard numbers.

You can eyeball it and see that as you increase price,

sales do come down.

But we would like to say specifically by how much.

In other words, we'd like to answer the following question.

If I increase price by one dollar,

how much do sales come down?

That's where regression comes in.

What does regression do?

It tries to fit a straight line to the data that we see

here and tries to put formal numbers behind this demand curve.
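
As a rough illustration of fitting a straight line to price and sales data, here is a minimal Python sketch. It assumes ordinary least squares, which is the standard way to fit a regression line but is not named in the lecture. The `prices` and `sales` values and the `fit_line` helper are hypothetical and are not the data shown on the slide.

```python
# Hypothetical price/demand observations (NOT the data from the lecture slide).
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [9.0, 8.5, 7.0, 6.5, 6.0, 4.5]


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(prices, sales)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

With downward-sloping data like this, the fitted slope `b` comes out negative, which is the "hard number" behind the visual impression that sales fall as price rises.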

Broadly, what we're going to talk about in a simple example is demand analysis.

This is a specific example for regression.

You can think about doing it for many other types of data.

What we're doing here is sales as a function of price.

You can think about sales as a function of advertising.

You can think about a variety of

different variables that you'd like to see if they're connected together.

So, the first thing to do before we start analyzing any data is to plot it.

Look at it visually. This is what this plot is trying to do here.

On the vertical axis,

we have demand or quantity,

and on the horizontal axis, we have price.

As expected, what you see here is that as price goes up, demand comes down.

This is the standard demand curve.

In other words, at lower prices,

there is higher demand;

and at higher prices,

there is lower demand.

So, this is visually looking at the data.

But at some point, managers have to start thinking

about how price sensitive their consumers are:

what would be the change in demand if they changed their prices?

One can't just look at it visually.

One has to put a more structured pattern on it.

That's what we're going to do with regression.

So, in the next two slides,

what you will see is we're going to take this data and run a regression analysis on

it to take a more quantitative approach

to how price and demand are connected to each other.

In the equation that you see above,

sales is a function of price.

So, the sales, which is on the left-hand side,

is the dependent variable.

This is what we're trying to explain.

Things on the right-hand side of the equation form the independent variables.

In this example, price is the only independent variable that we're using.
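
The slide itself is not reproduced in this transcript. Based on the description given here, the equation presumably has the following form, where a, b, and e are the symbols the lecture uses for the intercept, the price coefficient, and the error term:

```latex
\mathrm{Sales} = a + b \cdot \mathrm{Price} + e
```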

Now, if you focus on the right-hand side,

we see a few coefficients floating around. What are these?

The coefficient in front of the price variable,

b, is termed the price sensitivity.

It basically captures how sensitive your demand is to price.

So, for instance, as a manager,

if you'd like to answer the following question: if I change my price by one dollar,

how much would my demand change?

The coefficient b would be very helpful in determining that.

The next one is the coefficient, a.

This is termed the intercept.

This basically captures the baseline level of demand.

Finally, we have the error term, e,

which is on the right-hand side. What does this capture?

This basically captures the idea that,

of course, demand might be varying,

might be going up and down for many other reasons besides just price.

For example, there could be promotions,

advertising, competitive actions, and so on.

If you notice, in this particular equation,

we are only capturing price.

So, the error term or e basically

comprises all the things that you've left out of the model.

Now, of course, if the model is good,

the error term or the things that you've not put into the model should be small.

If unfortunately the model is not very good,

the error term might be quite large.

So, in the next slide, I'll talk about a metric called the R

squared that captures this idea of how good the model is.

Now, if you focus on the equation at the bottom of the slide,

you will see the more general form of what we just discussed.

This is termed simple regression.

The idea here is that the dependent variable, y,

which is on the left-hand side,

is related to the independent variable,

x, which is on the right-hand side.

So, the equation that we just discussed,

which is sales as a function of price,

is one instance of the simple regression where there is only one independent variable.

So, how does regression work?

Basically, the idea is,

regression tries to fit a straight line, in this case,

the demand equation, to the data that we just saw.

Here's what it looks like. Once we do the statistical analysis behind the scenes,

in terms of fitting the straight line to the plot that we just saw,

what you will notice is the regression line,

which is plotted on top of the data points.

What do we immediately see here?

The first thing is that it passes a sanity check.

In other words, it has face validity.

The regression line is downward-sloping,

which is very similar to what we had found earlier, which is,

at higher prices, the demand should be low;

and at lower prices, the demand should be high.

The regression line captures

this general trend that there is a downward-sloping demand curve.

Next, what we see are actually values of a and b.

So, a is 10.13 and b is minus 0.9.

Let me first discuss what the price sensitivity is.

If prices are measured in dollars,

which is the case here,

and the demand is in units,

what the coefficient b here says is that

if I increase my price by one dollar,

my demand will come down by 0.9 units.

So, that's the idea of price sensitivity, which is captured by b.
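
To make the interpretation of b concrete, here is a small Python sketch that plugs the quoted values (a = 10.13, b = -0.9) into the demand equation. The `predicted_demand` helper and the $3 and $4 price points are illustrative choices, not part of the lecture.

```python
# Fitted values quoted in the lecture.
A = 10.13   # intercept a: baseline level of demand, in units
B = -0.9    # price sensitivity b: change in demand per one-dollar price change


def predicted_demand(price):
    """Predicted demand in units at a given price in dollars (demand = a + b * price)."""
    return A + B * price


# The $3 and $4 price points are arbitrary illustration values.
at_3 = predicted_demand(3.0)
at_4 = predicted_demand(4.0)
print(f"predicted demand at $3: {at_3:.2f} units")
print(f"predicted demand at $4: {at_4:.2f} units")
print(f"change for a $1 increase: {at_4 - at_3:.2f} units")
```

Because the model is a straight line, the predicted change for a one-dollar increase is the same 0.9-unit drop no matter which starting price is used.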

Finally, you see that the R squared is 87 percent or 0.87. What does that mean?

That basically means the following: R

squared is a metric of how good your regression model is.

It basically varies from 0 to 1,

and higher numbers are better in the sense that the regression

is able to capture a lot of the variation in sales.

So, in this case, in words,

what it says is 87 percent of the variation in sales or demand is captured by prices.

What is the remaining 13 percent?

That's where the error term comes in.

So, in this case, 87 percent of

the overall variation is captured by the model, which is pretty good,

and 13 percent is unaccounted for, maybe

because of a variety of reasons like promotions and other kinds of things going on.
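
For readers who want to see where such a number comes from, here is a minimal Python sketch of the usual R squared calculation: one minus the unexplained (residual) variation divided by the total variation in sales. The data and the helper names `fit_line` and `r_squared` are hypothetical; this is not the data behind the 0.87 figure on the slide.

```python
# Hypothetical price/demand observations (NOT the data from the lecture slide).
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [9.0, 8.5, 7.0, 6.5, 6.0, 4.5]


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(x, y, a, b):
    """Share of the variation in y captured by the line y = a + b * x."""
    mean_y = sum(y) / len(y)
    # Total variation of y around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    # Variation left over after the line's predictions (the error term).
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_residual / ss_total


a, b = fit_line(prices, sales)
r2 = r_squared(prices, sales, a, b)
print(f"R squared = {r2:.2f}; unexplained share = {1 - r2:.2f}")
```

The unexplained share printed at the end plays the same role as the 13 percent in the lecture: it is the part of the variation in sales that price alone does not account for.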

What is the typical threshold for R squared to determine whether it's a good model?

Of course, it varies from one context to another.

The typical threshold is about 70 percent,

or R squared equal to 0.7.

So, in this slide, what do we see?

We see what the regression equation is,

we see what the R squared is,

and it basically tells us how good is

the regression line in terms of how well it fits the data.