Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Hi. Welcome back.

This is Selcuk Candan,

a professor of computer science and engineering at Arizona State University,

and we are learning about sequences and time series.

If you remember, so far we discussed what they mean by a sequence.

We learned about how to implement prefix search.

We learned about how to implement subsequence search.

We have learned about how to compute sequence similarity,

but we've also seen that the computation of subsequent similarity can be very costly.

Very expensive.

In this lecture,

we will learn several techniques W-Grams and

other approaches to implementing the sequence similarity search fast.

Why do we care about fast?

Because remember our goal is to use these algorithms in data exploration

which means that these operations need to be implemented close to real time,

so that they can be provided during an explanatory process,

they can be provided to the user efficiently and fast.

So, let's quickly revisit how do we compute sequence similarity.

If you remember the last lecture,

we learned about an algorithm called Edit distance.

In this algorithm, we were given two sequences,

the sequence p of size n,

and another sequence q of size M. We were

also given a possible sequence of edit operations,

that were converting sequence P to the sequence Q.

And we had said that we can compute the edit distance

from the first String P to the second string Q,

by finding the sequence of

edit operations that convert the first string to the second string,

with the minimum total cost.

Thought this was the definition of edit string,

we have also learned an algorithm to compute that.

The problem however, was that this algorithm could be very costly.

Exact computation of edit distance as a quarter the cost.

That is, if the length of the first string is say 1,000,

and the length of the second string is also 1,000,

the total cost of the edit operation could be million comparisons.