0:00
In this video on visualizing numerical data, we will discuss scatter plots for
paired data and other visualizations for
describing distributions of numerical variables.
The data come from Gapminder, which pulls this information from a variety of data
sources.
We will be working with two numerical variables:
income per person, in US dollars, and life expectancy, in years, for
the year 2012.
Each observation in this data set is a country.
The data set contains data from most but
not all countries, since this information wasn't available for certain countries.
A common tool for
visualizing the relationship between two numerical variables is a scatter plot.
To identify the explanatory variable in a pair of variables, we identify which of
the two is suspected of affecting the other and plan an appropriate analysis.
Since we might suspect that the economic wealth of a country affects
the average life expectancy of its people, we have set up our analysis with
income as the explanatory variable and life expectancy as the response variable.
Generally, in a scatter plot, we place the explanatory variable on the x axis and
the response variable on the y axis.
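As a rough sketch of that convention, the Python snippet below puts income on the x axis and life expectancy on the y axis. The country records are made-up placeholders, not Gapminder values, the function names `to_axes` and `draw_scatter` are our own, and the plotting step assumes the matplotlib library is installed.

```python
# Hypothetical (income in US dollars, life expectancy in years) pairs.
# These are illustrative placeholders, NOT real Gapminder values.
countries = {
    "Country A": (1_200, 58.0),
    "Country B": (4_500, 66.5),
    "Country C": (12_000, 74.0),
    "Country D": (35_000, 80.5),
    "Country E": (60_000, 82.0),
}


def to_axes(records):
    """Split records into x (explanatory) and y (response) lists."""
    xs = [income for income, _ in records.values()]
    ys = [life for _, life in records.values()]
    return xs, ys


def draw_scatter(records):
    """Draw the scatter plot: explanatory on x, response on y."""
    # Imported here so that to_axes can be used without matplotlib.
    import matplotlib.pyplot as plt

    xs, ys = to_axes(records)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("Income per person (US dollars)")
    ax.set_ylabel("Life expectancy (years)")
    return fig
```

Calling `draw_scatter(countries).savefig("scatter.png")` would write the figure to a file; the file name is just an example.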
It's very important to note that labeling variables as explanatory and response
does not guarantee that the relationship between the two is actually causal,
even if an association between the two variables is identified.
We use these labels only to keep track of which variable we suspect
affects the other.
In fact, since these data are observational and
do not come from a randomized controlled experiment, we know that we can only talk
about correlation and not causation between the two variables.
So what is the relationship between these two variables?
The best way to answer this question is to visualize a line or
a curve going through the cloud of data.
So here I'm drawing a curve that first shows
an increase in life expectancy as income increases, and
then the relationship levels off such that countries with income levels above
a certain point still have roughly 80 to 85 years of average life expectancy.
2:39
The shape of the relationship.
Is it linear, or does it follow some other form?
The strength of the relationship.
Is the relationship strong,
indicated by little scatter,
or weak, indicated by lots of scatter?
And any potential outliers.
3:08
Let's take a closer look at the outliers.
Some of them have pretty high income levels:
Luxembourg, a rich country with a small population and
hence a high income per person level;
Macao, a special administrative region of China; and
Qatar, a country with a small population and lots of oil.
Another potential outlier is Nepal, where the life expectancy is considerably
higher than what would be expected for its low income level compared to others.
These are countries that we would indeed expect to behave differently than
the majority of the countries.
So it's not surprising that they stand out from the rest.
One naive way of dealing with outliers in data analysis is to immediately
exclude them.
But we're calling that approach naive because it's often not the right approach.
This is a good example of a case where the outliers might be very
interesting,
and handling them with careful consideration of the research question and
other associated variables is important.
Now, let's take a look at the distributions of the variables,
individually.
One good way of visualizing the distribution of a numerical variable
is a histogram.
In a histogram, data are binned into intervals and
the heights of the bars represent the number of cases that fall into each interval.
In other words, a histogram provides a view of the data density:
higher bars represent where data are relatively more common.
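To make the binning step concrete, here is a minimal Python sketch that counts cases per interval and prints a text histogram. The life expectancy values are made up for illustration, and the function name `bin_counts` and the choice of 10-year bins starting at 45 are our own assumptions.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each interval [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1
    return counts


# Hypothetical life expectancies in years (made up for illustration).
life_expectancy = [48, 52, 57, 61, 63, 66, 68, 70, 71, 72,
                   73, 74, 74, 75, 76, 78, 79, 80, 81, 82]

counts = bin_counts(life_expectancy, start=45, width=10, n_bins=4)

# Each '#' is one case, so longer rows mark where data are more common.
for i, count in enumerate(counts):
    lo = 45 + 10 * i
    print(f"{lo}-{lo + 10}: {'#' * count}")
```

With these made-up values the longest rows are the 65-75 and 75-85 bins, which is the "higher bars" idea in text form.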
For example, we can see that the majority of the countries have average life
expectancies between 65 and 85 years.
Histograms are also very useful for identifying shapes of distributions.
In this case, the distribution of life expectancies
appears to be left skewed, which is expected
due to the leveling off of life expectancies we identified earlier.
There's a physiological limit to how long people live.
And in most countries, people live up to around that age, but
there are some countries with much lower life expectancies, and fewer and
fewer countries with lower and lower life expectancies,
resulting in a long left tail.
The distribution of income, on the other hand, is right skewed.
Incomes can't be negative, so we have a natural boundary at zero, but
there is no real upper limit to how high incomes can go.
However, as we go higher and higher, we have fewer and fewer countries
with such high levels of personal income, resulting in a long right tail.
A shared characteristic between these two distributions
is that they're both unimodal.
Let's focus on these statements on skewness and modality for a bit.
5:38
First off, skewness.
Distributions are said to be skewed to the side of the long tail.
In a left skewed distribution, the longer tail is on the left, the negative end.
If no skewness is apparent, then the distribution is said to be symmetric.
And in a right skewed distribution,
the longer tail is on the right, the positive end.
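The video judges skewness by eye. As a supplementary numeric check, one common summary is the moment-based skewness coefficient, which is negative for a long left tail and positive for a long right tail. The sketch below is our own illustration: the data are made up, and the function names and the 0.1 tolerance are arbitrary assumptions.

```python
from statistics import mean


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def describe_skew(values, tolerance=0.1):
    """Label the direction of the longer tail (tolerance is arbitrary)."""
    g = skewness(values)
    if g > tolerance:
        return "right skewed"
    if g < -tolerance:
        return "left skewed"
    return "roughly symmetric"


# Hypothetical data, made up for illustration.
life_years = [45, 60, 68, 72, 75, 77, 78, 79, 80, 81]  # long left tail
income_thousands = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]    # long right tail

print(describe_skew(life_years))
print(describe_skew(income_thousands))
print(describe_skew([1, 2, 3, 4, 5]))
```

A number like this supplements the picture rather than replacing it; looking at the histogram remains the primary check.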
As you can see, the best way to assess the shape of distributions is to step back and
imagine a smooth curve outlining the distribution,
instead of focusing on the jagged edges of the bars in the histogram.
6:30
The distribution that you will work with most closely
in an introductory statistics course is unimodal: the normal distribution,
which you may also know as the bell curve.
A bimodal distribution might indicate that there are two distinct groups
in your data.
For example here's a distribution of heights of individuals at a preschool.
The first peak might be the kids and the second might be the teachers.
A uniform distribution means there's no apparent trend in the data:
high and low values of the variable are equally likely to occur.
Here's a distribution of the last digits of a random sample of people's social
security numbers.
As expected, the data show no trend: you are just as likely to have a social security
number that ends with a zero as a six or a nine.
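A quick simulation can illustrate why last digits look uniform. The sketch below does not use real social security numbers; as a stand-in it draws random nine-digit integers (an assumption of ours) and tallies their final digits. The function name and the seed are arbitrary.

```python
import random
from collections import Counter


def last_digit_counts(n_people, seed=0):
    """Tally the last digit of n_people randomly generated ID numbers."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    # Hypothetical stand-in for real ID numbers: random 9-digit integers.
    numbers = [rng.randrange(10**9) for _ in range(n_people)]
    return Counter(number % 10 for number in numbers)


counts = last_digit_counts(10_000)
for digit in range(10):
    print(digit, counts[digit])
```

Each digit should come up roughly a tenth of the time, so the bars of the corresponding histogram would all be about the same height.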
7:15
Assessing modality, like shape, is also
best done by imagining a smooth curve outlining the distribution.
Here is a trick: think of the bars of the histogram as wooden blocks,
imagine dropping a limp piece of spaghetti over them, and try to picture how it
would fall over and between the wooden blocks.
Peaks that are far from each other will likely result
in distinct, prominent peaks, while
peaks that are close to each other, like the ones around zero and two, may not.
Identifying the number of modes is not an exact science, and
not one that you should dwell on too much.
Usually all you need to do is determine whether the distribution is uniform,
unimodal, or something else.
7:57
We should also note that the chosen bin width of the histogram can
alter the story the histogram is telling.
When the bin width is too wide, we might lose interesting details.
When the bin width is too narrow, it might be difficult to get an overall picture of
the distribution.
The ideal bin width depends on the data you're working with.
So you should try playing with it until you're satisfied with the visualization.
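To see the effect of bin width, the sketch below bins the same data three ways. The heights (in centimeters) are made up to mimic a group of children and a group of adults, and the function name `histogram` is our own; note that it simply omits empty bins.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower edge to its count (empty bins are omitted)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical heights in centimeters (made up for illustration).
heights = [95, 98, 100, 102, 103, 105, 108, 110,
           160, 165, 168, 170, 172, 175]

for width in (100, 10, 1):
    print(f"bin width {width}: {histogram(heights, width)}")
```

With a width of 100 nearly everything lands in one bin and the two groups are hidden; with a width of 10 the two clusters are visible; with a width of 1 every bar has height one and the overall shape is hard to read.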
8:38
Yet another visualization technique that is especially useful for
highlighting outliers is a box plot.
A box plot also readily displays the median,
the midpoint of the distribution (this is the thick line inside the box), and
the interquartile range, the width of the box.
According to this box plot, the median life expectancy is roughly 73 years, and
the middle 50% of countries have average life expectancies between 65 and
77 years.
In addition, countries with life expectancies below 48 years
are considered to have unusually low life expectancies.
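The video does not say how the cutoff for "unusually low" is chosen. A common box plot convention, assumed here, is to flag points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the rounded quartiles quoted above (65 and 77); the function names and the test values are our own.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Quartiles as read (approximately) off the life expectancy box plot.
low, high = iqr_fences(65, 77)
print(low, high)  # 47.0 95.0

# Hypothetical life expectancies to check against the fences.
print(flag_outliers([46, 50, 73, 84], 65, 77))  # [46]
```

The lower fence of 47 is close to the roughly 48 years quoted from the plot; the small gap is what we would expect from using rounded quartiles.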
A box plot of the income distribution
shows the same right skewed distribution we've identified before.
And the outlying countries with unusually high per person income levels stand out
in this visualization as well.
9:28
One way of determining the skewness of a distribution from a box plot
is to imagine what the histogram would look like.
The peak of the distribution will be roughly around the median, and
the tails will extend out as far as the whiskers and outlying points in the box plot.
There's one more visualization method that we will discuss in this video:
an intensity map.
For certain types of data, like the ones we've been working with in this video,
it might be useful to view the spatial distribution.
These displays reveal trends in the data that many of the others did not.
For example, we can see that both income and
life expectancy are lower in Africa, but higher in North America and Europe.