0:15

We want to know is a given string an element of a given set of strings or

is it in a language, a given language?

So here's an example from computational biochemistry.

One way to represent an amino acid is with one of

the characters CABLI that are listed there.

And each one of those is a standard representation of an amino acid.

A protein is a string of amino acids, so

it's just a string consisting of those letters.

So there's a particular type of protein that is referred

as a C2H2 type zinc finger domain signature.

And that's well defined by these specific rules.

It's a C followed by 2, 3, or 4 amino acids, that could be any one of them,

followed by another C followed by exactly three amino acids,

followed one of these 9 specific ones L, I, V, M, F, Y,

W, C, or X, followed by 8 more amino acids,

followed by H then followed by 3, 4 or 5 amino acids followed by H.

It's a pretty complicated set of rules, but, well, life is complicated.

1:31

So we might have the question how a given protein is.

Does it follow these rules?

Is it in the C2-H2 type zinc finger domain?

So there's a protein, so let's see if it actually follows these rules.

Whether it actually follows these rules or not.

Is it in the language defined by those rules?

Is it in the set of strings defined by those rules?

And the answer in this case is yes.

Starts with a C, then there is 3 amino acids, so that is the first bullet point.

Then there is another C, and then exactly 3 amino acids.

And then the next letter is Y, so that's one of the,

makes the last bullet, that's one of the ones listed.

And then there is 8 amino acids.

And then the next one is H, and then there's 3.

That's okay the last bullet point 3, 4, or 5.

And then finally an H.

So in this case this string is in the language.

And clearly we want to be able to answer this type of question for

a broad variety of types of proteins.

2:44

Here's another example from commercial computing.

Suppose you want to define what an email address is.

It's a sequence of letters followed by an @,

followed by another non-empty sequence of letters followed by a dot.

May be any number of occurrences of that.

And then ending up with edu or com or gov or some other legal suffix,

where you would expect to find in the end of an e-mail address.

So you can ask then, which of the following are e-mail addresses?

So this one's got a sequence of letter, at sign, some more letters a dot,

some more letters a dot, ending in edu.

So that one is in the language defined by these rules.

So this phrase, not an email address, it's not an email address.

It doesn't have the at sign or any dots.

3:35

There's another one that fits.

This one doesn't end in edu or com, so it's not an email address.

This one doesn't fit according to those rules, but

maybe we need to fix the description to allow for some numbers, say.

In actually specifying precisely what's a legal email address and

what is not a legal email address can get pretty complicated.

So this is just to indicate that it's not always so easy to specify the rules.

So the challenge is we want to develop a precise description of the set of

strings in question.

And that's true for many of these types of problems.

And we'll look at a way to do that soon.

4:28

So nucleic acid is one of the letters A, C, T, or G.

And we've seen examples before with this representation.

A genome is a string of nucleic acids.

So there's a thing called a fragile X syndrome pattern.

And so that's a genome that is defined by the following rules.

It's got gcg somewhere inside, followed by any

number of cgg or agg triplets, followed by ctg.

And, in fact, this is important because the number of triplets correlates

with this fragile X Syndrome, which is a common cause of mental retardation.

So if you have a genome,

you might want to know this in order to deal with this illness.

6:02

So regular expression, an RE, is either the empty set or an empty string,

or it's a single character or a wildcard symbol that represents all characters.

Or it's an RE enclosed in parenthesis, or it's the concatenation of two or more REs.

So here's examples of these rules.

So aabaab, that's a concatenation of a bunch of

single characters that specifies the set consisting of that one string.

So only that string, aabaab is in the set specified by that regular expression, or

in the language defined by that regular expression.

And every other string is not in that set.

And if we use a wildcard character, then we can specify multiple strings.

So we say .u.u.u., that's all those seven letter words with

alternating u's, starting at the second character.

And there maybe lots of words in the dictionary,

this are the ones in the dictionary but

of course it specifies any string of several letters with those use in there.

So those are examples of strength that are in the sets specified by that

regular expression.

And then if you don't have the u strictly alternating like that,

then it's not matching.

So that's simple examples of regular expressions using concatenation and

wild card.

And then we can have more complicated operations for

the union of two or more regular expressions.

That's just one or the other.

And we represent that with a vertical bar.

So in this simple example, only the two strings aa and baab are in the language

defined by that regular expression, and every other string is not in the language.

And then there's what's called the closure.

And that's any number of occurrences of BRE.

So ab*a means a followed by zero or

more occurrences of b followed by a.

So aa, abbba, a followed by any number of b's followed

by a is in the set defined by that string, that regular expression.

But these other strings are not in the set defined by ab*a.

First one doesn't end in an a, the second one's got an a in the middle there,

that's not allowed by the definition.

Just interesting to note that the closure operation already takes us to infinity,

that's specifying an infinite set of strings.

The number of b's is not specified, could go over that bound.

And then we can use parenthesization to make up complicated

expressions describing specific sets.

So for example, a followed by parenthesized a over b, aab.

Again, there's two strings that are in that set, aaaab and abaab.

And every other string is not in the set defined by that regular expression.

Or if we say ab*a, that's a way of saying alternating ab's ending in H.

And if you have a two a's in a row or two b's in a row anywhere,

it says not in the set defined by that regular expression.

Again, an infinite number of strings.

10:03

So the raspberry and

crispbread are in the language defined by that regular expression, but

these other ones do not have spb in that order contiguous anywhere inside.

And the .* just allows matching of any number of characters,

so wherever the spb is, that would be a match,

as long as it's somewhere in the string.

Here's a much more complicated one, a* or

(a*ba*ba*ba*)*, whole thing starred.

10:54

aaa, we can take the star on the right to be zero occurrences.

So it's got zero b's, we'll count that a multiple of three.

But if you have a multiple of three b's anywhere in there,

you can match it to that regular expression.

It's in the language defined by that regular expression.

If you don't have a multiple of three b's, then it's not in the set.

Already we're starting to see a little bit of computing

just with this simple notation.

12:37

We can specify a class of characters by putting a range.

So little a to little z would be lowercase characters.

Capital A to capital Z is capitalized characters.

And inside the brackets it says, take a character from that sequence.

And with that we can check capitalization, or

we can specify languages that have proper capitalization.

Rather than + or *, you might want to say exactly a certain number of occurrences.

So that's within curly braces.

So this says take a five-digit number, followed by a four-digit number.

So maybe that specifies whether the string is a plus-four code.

And there's all kinds of other strings that are not in the language, but

at least you can test whether it's legal according to the rule.

Or maybe between j and k occurrences.

So that's {2,4}, that says two, three or four occurrences of dot,

which would mean a wildcard character.

So that's a and b separated by two, three or four characters.

14:08

So a typical convention is to have \s match any whitespace character like space,

tab, newline, and other things like that.

And that's useful when processing text.

Now, it's important to note that in applications we're going to want to

use these things all the time.

From the point of view of the theory, they're just shorthand.

They're not essential to understanding the fundamental theoretical points that

we're going to talk about.

So you can always replace one of these shorthand notations by a regular

expression that follows the rules, using parentheses or star concatenation only.

15:08

developing the generalized RE for that is pretty straightforward.

So we have a C and then we have a .2 through 4, and

we have the proviso that, in this case,

only the allowed letters representing amino acids are in the alphabet.

So that's 2 to 4 of them and then a C followed by exactly 3.

And then something from one of the characters L, I, V, M, F, Y, W, C, X.

16:01

Here's a real world application having to do with proteins, and

actually scientists, biologists use this website as an integral part of their work.

And searching in there has to do with typing a regular expression in

that window and getting the result.

And more and more other search type tasks are involving regular expressions,

because it's an easy to use and compact notation for

specifying what it is that you want to see.

16:42

And so this is, again, a generalized regular expression,

it says at least one lowercase character followed by @ sign.

Followed by a sequence of at least one string of lowercase

characters, followed by a dot followed by an edu or com.

And again, as an exercise you can figure out how to add other types of extensions,

or numbers, or whatever else.

17:10

Again, a very expressive way of specifying a set of strings

that you want to consider to be legal, or

a language that you want to know is a given string in that language.

So now what we want to do is, given that set up, now we have a formal and

specific way to specify sets of strings or languages.

And now we want to know how we can determine whether or

not a given string is in the language, or matches the given regular expression.

That's going to be our next task.

17:46

Here's just a couple of pop quizzes, and again, worthwhile to study these

types of questions to make sure that you understand the basic idea.

So in any kind of quiz or exam on this material,

you might be asked which of the following strings match a*bb(ab|ba)*?

Well, you have to figure out a way to get through this question by just

working through.

In this case, what do you think about abb?

Well, looking at the RE, the star says take any number of a's.

I can certainly just take one, and then I take the bb, and

then the rest of it, I'll take zero occurences of.

So that one certainly is in the set it describes.

What about the second one?

Second one doesn't work because every string in that language

has to have two consecutive b's in it.

What about the third one?

Again, after this one,

after you get the bb there is no way to get just a single a at the end.

You have to have ab or ba.

So that one doesn't work, that one does.

No a's, two b's, then take a ba and

an ab and that works.

There's no c anywhere, so number 5 is easy.

And this one is more complicated reasoning to figure out why that string is not in

there, don't have an even number of characters after bb.

So you can see immediately, not so

easy to know whether given string matches a given regular expression.

Okay, here's another one type of question that you would have to answer

in any kind of quiz on this material.

19:48

So think about how you would create a regular expression that implements

those rules, describe the set of strings that satisfy those rules.

Here is the answer, at the beginning there has to be an atg.

At the end there has to be one of the three codons, tag, taa, or ttg.

In the middle you have a, or c, or t, or g, three times any number of occurrences.