This course answers the questions, What is data visualization and What is the power of visualization? It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

So, how many do we select?

Do we select just one?

Do we select just two?

Do we select just three?

Or do we basically take the industry's image object out to present each

of its pixels as a different vector basis?

And we have hundreds and thousands of vectors. How many do we need?

Now, this is a difficult question.

It's a very important question but it's a difficult question.

Number one, it is a difficult question because depending on the application that we need,

depending on what we are interested in during our exploration,

we might need to have small number or large number of vectors.

For example, if I am interested in the shapes of the images through my data exploration,

I need to count the shapes,

I need to identify and count the shapes.

If I am interested in colors,

then I need to somehow count the colors in the image.

This is actually means that the question is somewhat imposed

because the number of features that you need

is application dependent, application specific.

While the question, how many features do we need to represent the data?

Is an application dependent question, in general,

we can state that the fewer dimensions we use, the better.

This we call dimensionality curse.

Dimensionality curse essentially means that the more dimensions you

need or the more basis vectors you need to represent a vector space,

the less efficient will be your data structures and the less

effective will be your search and analysis tools.

It looks like a very general statement and at the first look,

it is not clear why this should be the case

but getting the insight is actually fairly straightforward.

Let's start from efficiency.

Why would it be less efficient if we have more features in the vector space?

But this essentially depends on the data structure that

you develop or you use to implement your search algorithms.

Usually, search algorithms are implemented with data structures like trees,

where if you are searching for a particular data element in your database,

you start from the root of your data structure.

You navigate down in the tree where at every step,

the tree is guiding you towards the data element that you are looking for and eventually,

after navigating on this tree from root to the leaf,

you locate the data object that you want.

Now, it turns out that these data structures such as trees

turned out to be more inefficient as you add more dimensions to your vector space.

Again, in this lecture,

I'm not going to go over why that is the case.

For now, take it as given but we discussed some of

these concepts like efficient search in other units in this course,

as well as other courses in the core sequence.

Now, there is a second aspect to this aside from efficiency which is effectiveness.

It turns out that the more dimensions that you have,

it also becomes more difficult to analyze the data aside from efficiency.

And why would that be the case?

Why having more dimensions make it harder to analyze the data?

Well, the reason for this is very simple.

If you have, say only one vector and say you have two objects in this vector,

and in this case these two objects are two flowers.

It might be easy to find a separator that separates

the two objects that you have from all the other objects in your database.

You have two objects, they are similar to each other.

There are other objects,

they are different from each other and it may be quite easy to find a separator.

However, as you increase the number of dimensions you have in your space,

this becomes more and more difficult because as you add more dimensions to your space,

you are adding more flexibility,

so the separation line in this case that separates your flowers from rest of the objects,

it may be more difficult to find that separation line.

We call this overfitting.

So, if you have more and more dimensions,

it becomes easier to overfit to the samples that you have.

So, to prevent overfitting,

you need more and more and more examples.

If I have only two dimensions,

I might be able to prevent overfitting,

say with five example images.

But if I have 5,000 dimensions,

I might need many more images to

properly find the separating boundary between in this example,

flowers and the rest of the objects.

Again, in this lecture,

we will not spend time on getting details about

efficiency of dimensionality curse and effectiveness impact of dimensionality curse.

However, as I mentioned,

we have other lectures and other videos in the sequence that gets into details of this,

and we provide more detailed examples and we also discuss solutions to these problems.

What we will discuss next is

the second aspect of vector spaces which is the distance and similarity functions.

So far, we have discussed how to take a set of

objects in the database and how do we map them into a vector space.

We know that we need to identify a set of vectors that we will use as basis vectors.

And we know that these basis vectors need to be

complete and we also know that they need to be non-redundant.

The second question, however,

is how do we use this basis vector to support search or to support exploration?

And to support search and to support exploration,

we need to define what we call distance and similarity measures on the vector space.

Now, it turns out that this is again a difficult problem.

It is difficult because the distance or

similarity functions that we need to use is also application dependent.

Having said that, there are

some commonly used distance functions or similarity functions.

And in the rest of this lecture,

we will go over them and we'll basically see what are

the key properties of

this similarity functions and why we use them and in which context we use them.

So, what's the distance between two vectors in a vector space?

Well, it sounds quite intuitive to define the distance between two vectors.

I can sort of given a vector space, in this case,

a vector space consisting of basis vectors X and Y and two input vectors V1 and V2,

I can intuitively say that the distance

between these two vectors can be measured essentially by finding

the difference vector between

these two input vectors and then measuring its length.

Sounds intuitive. And indeed,

this is often what we do.

Often, we measure distances between the vectors by

finding the difference vector and then measuring its length.