0:03

Now, first we're going to look at the process of simulating the operation of an

NFA. And in order to do that, we have to look at the representation. so for state

names we're just going to use the integers from zero to N and those are just indexes

into the regular expression strain with one extra one for the except state. So, if

RE has M characters, we've got M plus one states. and then, we forget the

match-transitions. so for with this, we'll just keep the regular expression in an

array. and then for characters that are not metacharacters that are just

characters from the alphabet there's match-transition just to the next element

just to add one, so that's an implicit representation of the match-transitions.

and then the next thing is the epsilon transitions. So, since there might be

multiple edges leaving from any given state and it is directed, it always says,

go from this state to another one we'll use directed graph. So we just have a

bunch of edges with given vertex names. And that's convenient. That's what we use

for our graph processing algorithms, was indexes for vertex names. so, this is

totally well-suited to building a digraph using the basic digraph processing methods

that we talked about when doing graph processing. so, that's our representation

and then given that representation it's very, you know, straightforward to find

out for any state what are the possible next states and if it's got a character in

a possible next state is to match a character in angle up one otherwise it's

the vertices that they are pointed to by the edges and its adjacency list does have

epsilon transitions. So, let's take a look at just given that basic idea, that we can

always get to the transition from any given state how we're going to do a

simulation? and so, the idea is summarized by this diagram here. what, what we're

going to do is at every step each text character that the machine does read,

we're going to keep track of the, the set of all possible states where the NFA could

be after reading in the i, I text ch aracter.

So, say, we've done that and, you know, there's this is just a schematic. So,

there's a bunch of states that we could get to after reading i characters. Now in

some of those states they, they are labeled with characters. and if that

character matches our text character, they can take us to another state. So, say, in

this case, there is three of them that matches and the other one has some other

character or different character and it couldn't take us to another state. So,

that gives us a all the possible states that you could be in just after reading

the i plus first symbol. And now, what we do is take a look at the possible null

transitions that we could go to from those states. And that might take us to lots of

other states. but that's all the machine could do. It could read a character. And

then it can take a bunch of null transitions. and but that's all it knows

how to do. And that will give us all the possible states that it could be in after

reading i1 plus one symbols. And then, we just continue in that from every one of

those states. some of them match the character match the character, and then,

look at all the epsilon transitions. So, that's the basic idea is to simply keep

track of all possible states the NFA could be in after reading the first i text

characters. so the only thing that's complicated is how do we get all the

states that we could reach by epsilon transition. and we'll look at that in just

a second. It's actually very straightforward. But let's look at a demo

first. so this is our machine for A<i>B or A, or AC followed by D. and let's suppose</i>

it has this input. So, let's do this demo. Okay. So, we're going to, to check if the

input is matching the pattern. we're going to start the machine in state zero and so

the zero is one of the states, you could reach from the start. and then now we want

to get to all the states that you could reach by epsilon transitions from the

start. and so there's a bunch of them so we could go to one and from one, we could

go to either two or six. from two, we could go to three. From three, we could go

to two or four. so we could get to all those places from the start without

scanning a single character. So, zero, one, two, three, four, and six the machine

could be in any one of those states before it scans a single character. so, that's

where it could be after zero characters. But what about now after the first

character? Well, out of those states only two and six involve matching an A. So the

only thing that could happen next to read, the machine could do next, is to read the

A in either state two or state six. I know there's going to be match-transitions. So

in the first case, it goes to three and, and the second case, it goes to seven. So,

the stead of states, it can be reachable just after matching the A, is just three

and seven. But now from those states it might get somewhere with epsilon

transitions. We have to look at the epsilon transition graphs and what states

could it go to from epsilon transitions from these two. well, from three, seven

has nowhere to go, and from three it could go either to two or four. from two, it

could go back to three so the total number states that it could be after matching A

is two, three, four, and seven. and so that's one step. We've matched one

character, and we have kept track of all possible states the machine could be in

after matching that character. Now, what about the next character? You can see,

these four states take us all different ways. You could have an A, a B, or a C

next, and it could get somewhere. But it happens, we have an A, and the only one of

these states that matches an A is two. So, after matching the second A the only place

it could get to is three. and so that now, we have only one state to work with. But

where could we get with via epsilon transitions, from three? and so, well, you

can go from three to four or two. Well, we did this before. And then, from two back

to three. So, we could be in two, three, or four after matching two As. so that's

all the possible states we could be in after mat ch, after matching the two As.

now, what's next is B, only state four matches to B, so the only place we could

be right after matching the B is in state five. And that's after matching the B, now

what about epsilon transitions? Well, state five has an epsilon transition to

state eight. so we could take that one and eight has got one to nine. So it could be

an either five, eight, or nine after matching AAB and then following all of the

epsilon transitions. but its really important to keep in mind, there's no

other state the machine could be and it doesn't have any other way to get to after

matching AAB it couldn't get state seven or six or two. those are the only possible

states it could be in. since we're, each time, progressing through the input we're

8:37

making progress to the end for sure. So now to finish up this example the only

state out of those three that matches D is state nine. and so, that's a

match-transition that reads the D. and the only place you could be after matching

AABD is state ten. and then now, we follow epsilon transitions. and that Epsilon

transition could take us to eleven. so, the only place that machine could be after

matching ABD is ten or eleven. Now, our condition on whether we accept the string

or not is whether we could get to the accept state, and in this case, we could.

it is possible for the machine to get from zero to the accept state and read all the

input characters so that simulation is approved, that the machine accepts the

input AABD when simulated its operation, all possible ways and we managed to find

the accept state. Of course, if we had tried some input that the machine doesn't

recognize, we'd get stuck somewhere and either not get through the input or have

no possible states it could be in. and that would be a proof that it does not

accept since we try all possibilities. So, the only thing that's complicated in this

computation is reachability. but actually from our study of digraphs this is the,

one of the simplest problems that we discussed. what we really discussed was

single source reachability. That is given the source, can you find all vertices that

are reachable from that source? And that was Depth Firts Search. so we have very

simple depth first search but also in that API we put vertices reachable from a set

of sources, so an iterable of sources and so can we get the, the, what we really

need is all vertices reachable from a given source from its source set of

vertices. and it's easy using DFS. You just run it from each of the sources and

you don't unmark the ones that you get to and that stops DFS from revisiting any

vertices and it marks all the vertices that you could get to from that set. So,

its just a simple extension of our DFS implementation. it's going to run in time

proportional to the number of edges plus number of vertices in the digraph and it's

a very simple computation digraph reachability. so given that capability

which we discussed in graph processing the implementation of in a, in the NFA

stimulation is very straightforward. so this is a, a data type that implements the

NFA. so, we have a constructor that takes a regular expression as its argument. and

it's going to build the NFA and its got a, a method to build the digraph that we'll

talk about later but, but its also got a, a client method recognizes where the

client can after the NFA is constructed, it can that they could text string and

return true or false by simulating the operation. so we'll take a look at the

building the digraph in a second but the one that we're talking about now is

simulating the operation of the NFA once it's built for a given text string. and

this is the complete code and it's expresses in code what we talked about in

English during the demo. it's amazingly compact implementation of this idea of

simulating the operation of an NFA. So we keep a, a bag of integers or set of

integers called PC, that's kind of like program counter. So, that's the set of all

possible states that the NFA could be in. so and we build a DFS from our epsilon

transition graph. so the first thing that we do is do a, a, a for we put on to the,

the PC all the states that you could get to from state zero. so that's build a DFS

that marks all the states you could get to from state zero and then go through and

put all of the states on, on to the PC. So, that's our starting point, is all the

places that you could get to via epsilon transitions from state zero. So now during

the execution of this for loop PC is the thing that we have all the states that you

could reach after scanning past the ith character, so we initialize for the 0th

character. And then all we do is for everyone one of those states well, first

thing we do is test if we reach the accept state. If we reach the accept state we're

going to have nothing, we're, we have nothing left to do. if we have a match,

that is, if the a character in the RE if that state position matches our ith text

character, then we're going to keep another set of possible states called

match. So, those are the ones you could get to just after matching a text

character. And so, we'll just add V. plus one. So, if we find a matching character

at V, we just go to V plus one, that's the implicit match-transition. and here, we'll

throw in, don't care, cuz it's so easy to do if the regular, regular expressions

says, I don't care what character matches then throw that one to. so this just adds

don't care to the implementation. and so we do that for all states in our PC in the

states we can, can be in just before, just while looking at the ith character. If we

have a match then we put the next states into a match. So now, what we have to do

is follow the, epsilon transitions from match. So now, we build another DFS which

gives us marks all the states that you could reach by starting with any of the

states in match. and now when we build a new PC, we'll just set PC to a new, new

bag, and put all the marked states for that search. So, those are all the ones

that you could get to via epsilon transitions right after a match, and put

that in the PC. Now. we have a new PC. and we've r ead the ith, character and we go

to the i plus first character, and simply continue. so, when we're done with all of

the text then we can test if we made it to the accept state. that is, if any of the

states in our current set of states is the accept state, we return true. and if we

didn't get to the accept state after all of that, we return false. A very compact

code that takes advantage of our implementation of reachability for

digraphs in a, a reasonable way to simulate the operation of an NFA by

keeping track of all reachable states. so, what about the N analysis? Well it's easy

to see that the not difficult to see that if our text is N-character and we have an

M-character pattern. the worst that can happen is that we take time proportional

to M times N. and that's just because for everyone of the characters we have a set

of states, a set of all possible states. And we iterate through that set of

possible states a few times. we even run DFS on it, but that's linear. and there's

the extra point that the number of edges that we have is less than 3M. No node has

more than three edges leaving it. So the total amount of time for each character is

proportional to M and there's N-characters, so there are any

proportional to M times N. So, that's the, the cost of the simulation. Of course a

lot of times, the size of the set of states is much smaller than that. so in

many real world problems, it's closer to linear. that's NFA simulation.