0:00

Hi, in this lesson, you will learn how to build suffix tree of a string given its

suffix array in linear time.

At first we'll explore some connections between suffix array and suffix tree, and

then we'll learn to compute some additional information to

the suffix array.

And then finally we will use suffix array and

the traditional information called LCP array to build a suffix tree.

0:22

First recall the problem.

It's very simple.

You're given a string S and you need to compute its suffix tree.

And you already know how to do that actually.

But the algorithm you know works in square time, and so it will work only for

short strings, maybe up to 1,000 or 10,000 characters.

And if you want to build suffix tree for strings of length of millions or

billions, you will need a much faster algorithm.

And after you learn this lesson, you will know how to build suffix tree in time,

length of string times logarithm of these lengths,

because you can build suffix array in this time and

then construct suffix tree from the suffix array in linear time.

So the general plan is to construct suffix array in time as log S,

then compute some additional information called LCP array in the linear time.

And then given both suffix array and this additional information,

construct the suffix tree in linear time.

First, let's explore how suffix array and suffix tree are connected.

Here we have a string S, ababaa$.

And again, we insert $ in the nth which is smaller than any of the characters

of the string both to build suffix array and then to build suffix tree from it.

And on the left, we have in the column.

All the suffixes of the string S sorted in lexicographic order.

So that is basically the suffix array.

1:54

And on the right, we have the fully built suffix tree of the string,

which is already compressed so that you see that on the edges we have not

single letters but whole sub strings of string S.

And by the way, interesting question is how do we store suffix tree?

We shouldn't, of course,

store the sub strings that are written on the edges directly because that

could lead to quadratic memory usage and we want linear memory usage.

So instead of storing the sub strings themselves, we just store the index of

the start, and index of the end index of the corresponding sub string.

So for each edge, we store two indexes,

the start of that edge in the string and end of that edge in the string.

And to store the nodes, we just store, for example,

an array of pointers to the children nodes.

And that array is indexed by the first character of the edge

outgoing from this node into the child.

3:10

The important thing is that you shouldn't store edges as substrings.

So what corresponds in the suffix tree to the suffix array elements?

Let's take the first element of the suffix array.

Actually, it is corresponding to suffix in the string S and

that is corresponding to a leaf in the suffix tree and also to the path

from a root vertex to the corresponding leaf vertex in the tree.

So the first element of the suffix array corresponds to this route

highlighted in blue, and then if we go to the next element of the suffix array,

we get another route from route vertex to the leaf number 1.

And then if we go to the next element,

we get route from the route vertex to the leaf number 2.

And note that the indexes of the leaves, and

the indexes of the suffixes are just in the sorted order.

So, those are not positions in the string S.

Those are numbers of the suffixes in the increasing order,

from 0 to number of the suffixes minus 1.

So, each of the elements of the suffix array corresponds to some

path from root to leaf in the suffix tree, that is what we know.

That is unfortunately not yet sufficient to build the tree from the suffix array

because there are many ways to create some paths from root to different nodes.

Which corresponds to suffixes of the suffix array.

So we will need some additional properties.

And this additional property we will need is call longest common prefix, or

often it is just said as lcp.

5:05

So lcp of two strings S and T is the longest such string u,

that it is both a prefix of S and of T.

And we denote by big LCP(S, T),

the function which returns the length of the lcp of strings S and T.

For example, LCP("ababc" and

"abc") is 2 because their longest common prefix is ab, and it's length is 2.

And LCP("a","b") = 0 because their longest common prefix is empty.

5:39

Now let's look again at the suffix array and suffix tree and

also take into account LCP between the neighboring elements of the suffix array.

So when we look at the first element of the suffix array,

we just have an edge corresponding to it in the suffix tree.

And when we have the next element, we have a path from root to another vertex.

But if we compute the longest common prefix of this element of the suffix array

with the previous element of the suffix array,

we'll see that this longest common prefix is empty.

And that corresponds to empty intersection,

between the previous path and the new path.

The only common node is the root node,

and they don't have any edges in the intersection.

However, if we proceed to the next suffix,

6:35

it has a common prefix of length 1 with the previous suffix.

And it corresponds to the common path in the tree highlighted in yellow,

starting in the root node and going through edge a to another node which

is still a common node for the current suffix and the previous one.

So this is how LCP corresponds to the tree.

If you go to the next suffix,

it again has the same longest common prefix with the previous suffix.

So we have the same common path from root to the next node by edge a and then

the part of the path is different from the current suffix and from the previous one.

7:28

consisting of three nodes and two edges.

A root node, next node by edge a and next node by edge ba.

And the rest of the path is unique to the current suffix.

If we go to the next suffix, it again doesn't have any common prefix

with the previous one so the only common in the path is the root node.

And the next suffix has longest common prefix of ba,

and that's why we see this path from root to another node via edge ba.

And this is the common part of the path for

the current suffix and the previous one.

So we see that basically all the nodes but the leaves are corresponding

to the longest common prefix of the neighboring suffixes in the suffix array.

And this is how we can actually build the suffix tree by first computing

the longest common prefixes of the neighboring elements in the suffix array.

And then building those internal nodes.

And then, in the way of that, we will also build the leaves as the ending

points of the paths corresponding to the suffixes from the suffix array.

So this is the plan of what we'll do.

But first, we'll need to compute those longest common prefixes for

the elements of the suffix array.