In this module, we're going to introduce exploratory data analysis.

Exploratory data analysis is an approach to analyzing

and summarizing data often using visual forms.

And so, the main objective of this module is to describe exploratory data analysis.

Exploratory data analysis analyzes data sets to

summarize their main characteristics, and its elements include data visualization,

residual analysis, how we transform and aggregate data,

and how we improve the resistance of our procedures.

In visual analytics, this is often referred to as

detect the expected and discover the unexpected.

We're trying to come up with different means and representations to aggregate data,

turn this into visual forms,

and display this for users to understand,

explore, and reason with.

Tukey often referred to this as playing data detective, and so

what we're going to learn in this module are some basic tools,

or your bag of tricks for the data detective work.

And one of the primary things I want to talk about is data visualization.

Data visualization is a key tool for

exploratory data analysis where we're going to facilitate advanced data analytics,

help users spot outliers by visualizing the data,

allow people to use their natural vision processing to discriminate clusters,

check different distributional assumptions,

compare means and differences,

examine relationships, and even observe time-based processes.

So, thinking about how we can create pie charts,

bar charts, line charts,

and others as well.

And what we're really interested in thinking about is data distributions.

The type of data distribution is going to affect

really how we're going to analyze and visualize in the data space.

In previous lectures, we talked about data forms with nominal,

ordinal, interval, and ratio data, and really,

the key step is thinking about how we can pre-condition the data to allow

people to explore this and visualize this in the most effective manner.

And perhaps, the most common type of distribution that people are familiar

with is the normal distribution or the Gaussian distribution.

And oftentimes, this is the type of distribution that we're trying to transform our data

into, to allow for more powerful exploratory data analysis.

We can always take any data set and create a pie chart,

create a bar chart, create a line chart,

and do these sorts of things,

but a lot of the methods we're going to learn and a lot of

the visual properties often depend on what the underlying distribution of the data is.

And if we can somehow transform the data into a normal distribution,

then we can fully characterize the data with two parameters, the mean and the standard deviation.

The probability is going to be determined by knowing the distance from the mean in units of standard deviation,

and then a lot of the different statistical measures and tests are

designed directly for working with the normal distribution.
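
To make the two-parameter idea concrete, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The helper name `prob_within` and the example parameters are hypothetical, chosen only for illustration. It shows that the probability of falling within k standard deviations of the mean is the same for any normal distribution, whatever its mean and standard deviation happen to be.

```python
from statistics import NormalDist


def prob_within(mu: float, sigma: float, k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# Two very different normal distributions (hypothetical parameters).
wide = prob_within(mu=170.0, sigma=10.0, k=1.0)
standard = prob_within(mu=0.0, sigma=1.0, k=1.0)

# Both give roughly 0.68: only the distance from the mean,
# measured in standard deviations, matters.
print(round(wide, 4), round(standard, 4))
```

So once the mean and standard deviation are known, a statement like "within one standard deviation" carries the same probability for every normally distributed data set.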

And so, for a normal distribution,

we just need to calculate the mean and the standard deviation.

So given some population X,

we're going to define the mean as the sum of all the values in

that population divided by the total number of elements in the population,

and the standard deviation is very similar.

We're going to sum across all the samples.

So, x sub i is a sample from our data set, and we take that minus the mean. We're going to square that.

So, take the sum of the squares,

divide it by the total number in the population,

and take the square root.
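
The two calculations just described can be sketched in a few lines of plain Python. This is a minimal illustration that follows the spoken definitions (dividing by the total number of elements, that is, the population form); the function names and the small data set are made up for the example.

```python
import math


def population_mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


def population_std(values):
    """Square root of the average squared deviation from the mean."""
    mu = population_mean(values)
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / len(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical example values
print(population_mean(data))  # 5.0
print(population_std(data))   # 2.0
```

Python's standard library offers `statistics.mean` and `statistics.pstdev` for the same population quantities; the hand-written versions are shown only to mirror the steps in the definition.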

And ideally, what happens is when we do this,

we get that nice normal distribution curve we showed earlier.

But a lot of times, our data sets will have different skewness.

We can have positive skew, a negative skew,

and so we want to measure things like the asymmetry of the probability distribution.

And so basically, the mean, standard deviation, and skewness

are called moments of

the data, and really, when thinking about exploratory data analysis,

what we really want you to do is take a moment to think about the data and

then take a moment with the statistical analysis to start exploring,

analyzing, and summarizing the data.

And things like mean, standard deviation,

and skewness are often good metrics for you to begin talking about

and discussing sort of common properties within your data set.

So for skewed data, notice we have a power three here in the calculation.

We call the mean the first moment.

It's to the power of one.

The standard deviation comes from the second moment, which is to the power of two,

and skewness comes from the third moment, to the power of three.

And so for a sample of N values,

the sample skewness is again taking the sample minus the mean.

We take the sum of the cubes and divide by the total number of samples.

And then we divide by the standard deviation cubed to complete the equation.
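
Here is a minimal sketch of a sample skewness calculation. The lecture's slide equation is not reproduced in this transcript, so the code assumes one common convention: the average cubed deviation from the mean divided by the cube of the standard deviation (computed with N in the denominator). The function name and the example data are hypothetical.

```python
def sample_skewness(values):
    """Third central moment divided by the standard deviation cubed.

    Assumes the values are not all identical; otherwise the second
    moment is zero and the ratio is undefined.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]      # long tail to the right: positive skew
left_tailed = [-x for x in right_tailed]  # mirror image: negative skew

for name, sample in [("symmetric", symmetric),
                     ("right_tailed", right_tailed),
                     ("left_tailed", left_tailed)]:
    print(name, sample_skewness(sample))
```

A value near zero suggests a roughly symmetric sample, a positive value a longer right tail, and a negative value a longer left tail.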

And so with all of these, this allows us to, again,

summarize the data, thinking about what the underlying properties are,

what the mean, standard deviation, and skewness are,

how we can describe the data mathematically,

and how we can think about transforming this data to fit perhaps the normal distribution.
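
As one illustration of transforming data toward a more symmetric shape, the sketch below applies a logarithm to a right-skewed sample and compares the skewness before and after. The log transform is an assumption here, one common choice for positive, right-skewed values and not a method the lecture names; the data and helper name are also hypothetical.

```python
import math


def skewness(values):
    """Third central moment divided by the standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: mostly small values plus a few large ones.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# The logarithm compresses large values more than small ones,
# which pulls in the long right tail. It requires strictly positive data.
logged = [math.log(x) for x in raw]

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

The transformed values are still not perfectly normal, but the skewness moves closer to zero, which is the kind of check to run before relying on methods that assume a normal distribution.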

If it's not normal, we want to begin thinking about

what different visualizations and visual methods we have,

and begin our journey into visualizing and exploring the data process.