Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

So far basically, we now have to use

this finite set automata idea to be able to match exact patterns.

But what about approximate matches?

We have seen at the beginning of this video that in general,

in real world, event sequence may not exactly match.

Sometimes it might be okay to have a few symbols in errors,

and I might still want to be able to identify

whether the match is approximately correct or not.

For example, if our query pattern is CANDAN,

I might be okay if the match is CNDAN.

It is not an exact match,

but there's only one simple mistake,

where A is dropped between the data pattern and the query pattern.

Or I may have CAANDAN.

Note that in this case, I, again,

have one errors, and the error is insertion of a symbol,

rather than a drop of a symbol.

So, then the question is how should we come, well,

can I match a data pattern to a query pattern if I have some errors?

Can I use this idea of finite set automata even in those conditions?

Well, it turns out that we can do that and let's see how we can do.

Let's take a look at our example,

query pattern is CANDAN.

And as we have seen given the query pattern CANDAN,

we can basically easily construct a finite set automata that looks like this.

Note that this is going to match CANDAN only if my data pattern,

data sequence is exactly CANDAN.

Now, let's assume that we want to allow one insertion.

That is basically if the data is CAANDAN where there is one insertions,

I still want to be able to match that.

Well, what we can do, is we can basically create

a non-deterministic finite automata that supports that one insertion.

So, in this case,

my non-deterministic finite state automata looks like this.

It is very similar to the original finite state automata,

but it has one more layer.

And these layers are connected with a star edge,

well, it's actually a star edge matches any symbol in my alphabet.

So, how can this be used?

Let's see our data sequence CAANDN.

So, how will this match?

Well, lets start from C. So,

C from the beginning is going to bring us to this state.

At that position, we have the next symbol A.

So, the next symbol A, in this case,

is a non-deterministic finite state automata.

It's not going to bring us only to this position where we have seen CA,

but also bring us to this position where we have seen CA.

So, for the non-deterministic finite state automata with the prefix CA,

I am at two states,

not only one state.

Then I see another A.

Here it fails because if to you follow at that layer,

the next symbol should have been N, so it fails.

This means that I go back to the start.

However, at this point position,

I can actually follow A and I can come to the state where I have seen CAA.

And note that if you follow over the next sequences,

N is going to bring you this state,

D is going to bring you this state,

the next A is going to be in this state,

and the final N that you will consume will bring you to the success state.

Note that, what happens that I have matched CAANDAN.

If the input sequence was CANDAAN,

I could also match that. How would that work?

In that case I would've started at C,

then I would have consumed an A,

then I would've consumed an N,

then I would have consumed a D,

then I would have consumed an A,

then I would have consumed another A,

and then finally, I would consume N and I will match that.

So, this two-layer non-deterministic finite set automata,

is actually is going to match any one inversion exactly.

Note that I can also support replacements.

Replacement means, let's assume that instead of CANDAN,

my sequence is CBANDAN,

that is the A is replaced with a B,

so it's not an exact match,

but there is one symbol that is replaced with another symbol.

Can I match that? Well, it turns out that yes you can if you add to

your finite state automata, these replacement edges.

So, let's see how that would work.

So, we receive a C,

C bring us to this state.

Then we receive a B, B doesn't go this way,

B also doesn't go this way.

But following this star edge,

I can come to this state here.

At that point, I have basically seen C, A and B.

Then I receive a N,

then I receive a D,

then I receive another A and then finally I receive another N. Well, I matched CBNDAN.

So, this extended non-deterministic finite state automata

essentially matches not only one insertion,

but it also matches one replacement.

What about deletions? Can I do deletions?

That is if my sequence is CADAN,

that is N has been removed or deleted. Can I match that?

Well, yes. You can do that if you also introduce

these extra N edges that allow state change without consuming any input symbol.

So, let's see the example.

So, once again we start with a C,

then we receive an A.

So, we are at this state.

Then we receive a D. This doesn't go this way.

But the extra N edge can take us to this state.

We have not really seen N,

but since this N it's on edge,

it brings us to the state where we have either seen CAN.

And then, I consume a D,

and then I consume an A,

and then I consume a N. And I have matched CANDAN with one deletion.

So, CANDAN has been found with one deletion.

That is this revised non-deterministic finite state machine

enables us to match our query symbol,

query sequence CANDAN with one insertion,

one replacement or one deletion.

What for? But what if basically I also allow up to two insertions,

or two replacements, or two deletions,

or one insertion and one replacement, can I do that?

Well, you can do that very easily by creating another layer of your data structure.

So, the first layer here,

the initial layer essentially matches

any sequence which doesn't include any insertion, replacement or deletion.

The second layer here allows you to match sequences with one insertion,

one replacement or one deletion.

The third layer here enables you match any sequence that has two errors.

Either two insertions, two replacement,

two deletions or one insertion and one replacement or one insertion and one deletion.

And depending on whether you match at this state,

which correspond to no errors,

or at this state which correspond to one error,

or you match at this state which is two errors,

you also know how many errors you had during the matching.

So, this will be

a very high trigger position because you match the sequence without any errors.

This may be a four-four trigger condition because you matched a sequence but not exactly.

There was one error.

So, it may or it may not correspond to your trigger pattern.

And this state is going to be less important, because,

yes you matched the sequence,

but you must matched with two errors.

So, there is higher chance that it may not actually correspond to your trigger pattern.

So, this is essentially will give you the priority of visualization.

Any sequences that match at this level need to be visualized with high priority,

any sequence that match at this level should be visualized with

a lower priority because it had more errors in its match.

Great. So, let's basically have a summary.

In this segment, we focused on sequence data.

We looked at sequence exploration,

and we have learned trie data structure to support prefix basic exploration.

We learned suffix trees and suffix arrays

for subsequent search and subsequent exploration.

We have learned finite automata to support pattern,

or trigger matching to support real time data exploration.