0:00
Hi. In this lecture you will finally learn how
to build suffix tree of a string, given its suffix array and the LCP array.
And we will do everything on this example.
We have our string, we have our sorted suffixes in the order, and
we start building the tree from just the root vertex.
We consider the first suffix, and
we create a leaf in the tree corresponding to the suffix.
And we connect it with edge to the root node.
0:27
When we go to the next suffix, we see that the longest common prefix of
the suffix with the previous one Is empty, we know that from the array.
And so, the only common part of the path from root to the new leaf and
to the previous leaf is the root note.
So we don't need to create any new nodes, other than the leaf node for
the new suffix, and we create an edge from root directly to this new node.
And we write down the corresponding suffix on this edge.
When we go to the next suffix however,
there is already a common prefix of length one, which is a, so
we'll need to create a new node in the middle of this last created edge.
And we'll divide it into edges with letter a and with letter $.
And this is what happens.
So now we have a new node which is connected with letter a to the root and
from which, there are two outgoing edges, one with the letter $,
and another with string a$, corresponding to the last considered suffix.
Now when we consider the next suffix,
the longest common prefix with the previous one is again just a.
So we don't need to create a new node other than the leaf node.
So we create a leaf node for the new suffix abaa$ and
we connect it with the yellow node which corresponds to the longest
common suffix prefix with the previous suffix, with an edge where we write baa$,
everything that is left in the suffix.
2:06
When we consider the next suffix the longest common prefix with the current one
is of length 3, so we'll need to subdivide the edge again.
And this is what we get.
The yellow node is the node corresponding to the longest
common prefix aba of this suffix and the previous one.
And then we create a new leaf node for the suffix ababaa$ and
the node with number 4 corresponding to position 4 in the suffix array.
Then we can consider the suffix baa$, it doesn't have any common prefix with
the previous ones, so it starts from the root and goes into the new node.
And then the next one has a common prefix with it, ba.
So we create a new node after ba, subdivide the edge, and
create a new leaf node for the last suffix with edge baa$.
So this is basically what is going to happen.
How do we implement this creation of new nodes and subdividing of edges?
3:08
The following way.
When we build an edge to the leaf for some suffix, we go and
sit in that leaf node and when we consider the next suffix,
we go up from that node using the pointer to the parent node in the tree.
Until we are high enough, so that the longest common prefix is below us.
As soon as we jumped to the longest common prefix or higher, we stop.
How do we whether we're higher than the longest common prefix or not?
We need to also store the depth in the nodes and
the depth is the number of characters on the path from the root to this node.
This is easy to keep during the suffix tree construction,
so I'll just assume we have it.
So we go up from the leaf until we are in the longest common prefix or above it.
If we're exactly in the longest common prefix with the previous suffix,
we don't need to build any new nodes.
We just build a leaf for the new suffix, and connect it with the current node.
However, if we are higher than the longest common prefix,
then we'll need to create a new node in the middle of the current edge,
going down from our node in the direction of the longest common prefix.
So we divide our edge in the middle.
We create a new node which is corresponding
to the longest common prefix with the previous suffix.
And we create a new leaf node as usual for the suffix and
connect it with its new note.
So this is the whole algorithm.
4:39
To repeat.
To build the suffix tree from scratch, we first build suffix array and
then build LCP array from that.
We start building the tree from only root vertex.
We grow the first edge for the first suffix just from the root.
And then for each next suffix, we're sitting in the leaf we just built for
the previous suffix,
we go up from the leaf until we jump higher than LCP with previous suffix.
And then we build a new edge and a new leaf for the new suffix and maybe,
depending on where we are in the tree, we need to subdivide the current edge.
I state that this algorithm runs in linear time and that is pretty easy to see.
We know that the total number of edges in the suffix tree is linear from
the previous modules.
And during this process for each edge we go at most once down when we do this edge,
and then we go at most once up when we go to find the l sub e.
And then we go down, we don't go through the same edge again.
Because we've already been there.
So, for each edge we go through this edge at most twice,
disregarding maybe additional one per iteration,
and we have only linear number of iterations of the whole algorithm.
And the time to create a new edge or a new leaf or subdivide an existing
edge's constant, so in total this algorithm works in linear time.
6:10
So now you are fully equipped to build suffix structures,
such as suffix array and suffix tree.
And you can do that pretty fast in time SlogS where S is the length of the string
and that's cool, because you can solve very complex problems using
these data structures which are basically impossible to solve without them.
And some of them will be in the programming assignment.
So see you there.