So now let's take a look at regular expressions. Regular expressions are relevant to an important problem known as the pattern matching problem. We want to know is a given string an element of a given set of strings or is it in a language, a given language? So here's an example from computational biochemistry. One way to represent an amino acid is with one of the characters CABLI that are listed there. And each one of those is a standard representation of an amino acid. A protein is a string of amino acids, so it's just a string consisting of those letters. So there's a particular type of protein that is referred as a C2H2 type zinc finger domain signature. And that's well defined by these specific rules. It's a C followed by 2, 3, or 4 amino acids, that could be any one of them, followed by another C followed by exactly three amino acids, followed one of these 9 specific ones L, I, V, M, F, Y, W, C, or X, followed by 8 more amino acids, followed by H then followed by 3, 4 or 5 amino acids followed by H. It's a pretty complicated set of rules, but, well, life is complicated. So we might have the question how a given protein is. Does it follow these rules? Is it in the C2-H2 type zinc finger domain? So there's a protein, so let's see if it actually follows these rules. Whether it actually follows these rules or not. Is it in the language defined by those rules? Is it in the set of strings defined by those rules? And the answer in this case is yes. Starts with a C, then there is 3 amino acids, so that is the first bullet point. Then there is another C, and then exactly 3 amino acids. And then the next letter is Y, so that's one of the, makes the last bullet, that's one of the ones listed. And then there is 8 amino acids. And then the next one is H, and then there's 3. That's okay the last bullet point 3, 4, or 5. And then finally an H. So in this case this string is in the language. And clearly we want to be able to answer this type of question for a broad variety of types of proteins. That's example one. Here's another example from commercial computing. Suppose you want to define what an email address is. It's a sequence of letters followed by an @, followed by another non-empty sequence of letters followed by a dot. May be any number of occurrences of that. And then ending up with edu or com or gov or some other legal suffix, where you would expect to find in the end of an e-mail address. So you can ask then, which of the following are e-mail addresses? So this one's got a sequence of letter, at sign, some more letters a dot, some more letters a dot, ending in edu. So that one is in the language defined by these rules. So this phrase, not an email address, it's not an email address. It doesn't have the at sign or any dots. There's another one that fits. This one doesn't end in edu or com, so it's not an email address. This one doesn't fit according to those rules, but maybe we need to fix the description to allow for some numbers, say. In actually specifying precisely what's a legal email address and what is not a legal email address can get pretty complicated. So this is just to indicate that it's not always so easy to specify the rules. So the challenge is we want to develop a precise description of the set of strings in question. And that's true for many of these types of problems. And we'll look at a way to do that soon. Here's the third example from genomics. So nucleic acid is one of the letters A, C, T, or G. And we've seen examples before with this representation. A genome is a string of nucleic acids. So there's a thing called a fragile X syndrome pattern. And so that's a genome that is defined by the following rules. It's got gcg somewhere inside, followed by any number of cgg or agg triplets, followed by ctg. And, in fact, this is important because the number of triplets correlates with this fragile X Syndrome, which is a common cause of mental retardation. So if you have a genome, you might want to know this in order to deal with this illness. So given the genome because it have the pattern. So what about this one, does it have this pattern? This example is to Illustrate the challenge of the pattern matching problem. The answer in this case is, yes, after awhile you can find a gcg. And then there's three cgg and agg triplets, and then it ends with ctg. So that's three examples of fundamental computational problem with all kinds of applications. So the tool that we use to address these types of problems is called regular expressions. It's a notation for specifying a set of strings otherwise known as a formal language. So regular expression, an RE, is either the empty set or an empty string, or it's a single character or a wildcard symbol that represents all characters. Or it's an RE enclosed in parenthesis, or it's the concatenation of two or more REs. So here's examples of these rules. So aabaab, that's a concatenation of a bunch of single characters that specifies the set consisting of that one string. So only that string, aabaab is in the set specified by that regular expression, or in the language defined by that regular expression. And every other string is not in that set. And if we use a wildcard character, then we can specify multiple strings. So we say .u.u.u., that's all those seven letter words with alternating u's, starting at the second character. And there maybe lots of words in the dictionary, this are the ones in the dictionary but of course it specifies any string of several letters with those use in there. So those are examples of strength that are in the sets specified by that regular expression. And then if you don't have the u strictly alternating like that, then it's not matching. So that's simple examples of regular expressions using concatenation and wild card. And then we can have more complicated operations for the union of two or more regular expressions. That's just one or the other. And we represent that with a vertical bar. So in this simple example, only the two strings aa and baab are in the language defined by that regular expression, and every other string is not in the language. And then there's what's called the closure. And that's any number of occurrences of BRE. So ab*a means a followed by zero or more occurrences of b followed by a. So aa, abbba, a followed by any number of b's followed by a is in the set defined by that string, that regular expression. But these other strings are not in the set defined by ab*a. First one doesn't end in an a, the second one's got an a in the middle there, that's not allowed by the definition. Just interesting to note that the closure operation already takes us to infinity, that's specifying an infinite set of strings. The number of b's is not specified, could go over that bound. And then we can use parenthesization to make up complicated expressions describing specific sets. So for example, a followed by parenthesized a over b, aab. Again, there's two strings that are in that set, aaaab and abaab. And every other string is not in the set defined by that regular expression. Or if we say ab*a, that's a way of saying alternating ab's ending in H. And if you have a two a's in a row or two b's in a row anywhere, it says not in the set defined by that regular expression. Again, an infinite number of strings. These are simple rules, easy to understand but, turns out to be a very expressive notation for specifying sets of strings. So for example, .*spb.*, that's all strings that have spb inside them somewhere. This type of thing is useful for solving a crossword puzzle, you're often interested in answering questions of this sort. And we could also add the condition that it must be in the dictionary and describe our set that way. So the raspberry and crispbread are in the language defined by that regular expression, but these other ones do not have spb in that order contiguous anywhere inside. And the .* just allows matching of any number of characters, so wherever the spb is, that would be a match, as long as it's somewhere in the string. Here's a much more complicated one, a* or (a*ba*ba*ba*)*, whole thing starred. With a little thought you can convince yourself that this specifies all the strings consisting of just a's and b's where the number of b's is a multiple of three. So bbb, just take all the stars to mean zero occurrences. aaa, we can take the star on the right to be zero occurrences. So it's got zero b's, we'll count that a multiple of three. But if you have a multiple of three b's anywhere in there, you can match it to that regular expression. It's in the language defined by that regular expression. If you don't have a multiple of three b's, then it's not in the set. Already we're starting to see a little bit of computing just with this simple notation. So here's another one just using .*0...., that just matches all strings that fifth to last digit is 0. And here's one for our fragile X syndrome pattern. .* and then we put that gcg followed by any number of occurrences of cgg or agg*, ending with ctg and then finish with .*. So if the genome has the fragile X syndrome pattern inside it, that's the way to specify that language. And the ones who don't, does not match. So again, very expressive. And actually, it's so useful that many extensions have been developed. Just additional operations that make regular expressions even more useful. Because sets of strings is something that actually we wind up wanting to specify in all sorts of applications. So for example, rather than *, we can use +. And that just excludes the possibility of zero occurrences. So one or more occurrences. We can specify a class of characters by putting a range. So little a to little z would be lowercase characters. Capital A to capital Z is capitalized characters. And inside the brackets it says, take a character from that sequence. And with that we can check capitalization, or we can specify languages that have proper capitalization. Rather than + or *, you might want to say exactly a certain number of occurrences. So that's within curly braces. So this says take a five-digit number, followed by a four-digit number. So maybe that specifies whether the string is a plus-four code. And there's all kinds of other strings that are not in the language, but at least you can test whether it's legal according to the rule. Or maybe between j and k occurrences. So that's {2,4}, that says two, three or four occurrences of dot, which would mean a wildcard character. So that's a and b separated by two, three or four characters. Negation, so that's specify a set of characters and say, anything but those set of characters. So this generalized regular expression says six characters that are not vowels, and that's a word in the dictionary. Six character word in the dictionary without a vowel. And then there's special characters, like whitespace. So a typical convention is to have \s match any whitespace character like space, tab, newline, and other things like that. And that's useful when processing text. Now, it's important to note that in applications we're going to want to use these things all the time. From the point of view of the theory, they're just shorthand. They're not essential to understanding the fundamental theoretical points that we're going to talk about. So you can always replace one of these shorthand notations by a regular expression that follows the rules, using parentheses or star concatenation only. So here's just an example of the applications that we considered. How to write a generalized expression that specifies these rules? Well, it's pretty straightforward, these rules are quite specific and with the generalized notations that we've talked about, developing the generalized RE for that is pretty straightforward. So we have a C and then we have a .2 through 4, and we have the proviso that, in this case, only the allowed letters representing amino acids are in the alphabet. So that's 2 to 4 of them and then a C followed by exactly 3. And then something from one of the characters L, I, V, M, F, Y, W, C, X. Followed by 8 of them, followed by an H, followed by 3, 4 or 5 of them, followed by another H. A pretty compact notation of those rules that is a useful to specify it. Not hard to learn these rules, and is something that, everyone who's going to use computers effectively, learns to some extent eventually. Here's a real world application having to do with proteins, and actually scientists, biologists use this website as an integral part of their work. And searching in there has to do with typing a regular expression in that window and getting the result. And more and more other search type tasks are involving regular expressions, because it's an easy to use and compact notation for specifying what it is that you want to see. So here's just another example with our email address idea. And so this is, again, a generalized regular expression, it says at least one lowercase character followed by @ sign. Followed by a sequence of at least one string of lowercase characters, followed by a dot followed by an edu or com. And again, as an exercise you can figure out how to add other types of extensions, or numbers, or whatever else. Again, a very expressive way of specifying a set of strings that you want to consider to be legal, or a language that you want to know is a given string in that language. So now what we want to do is, given that set up, now we have a formal and specific way to specify sets of strings or languages. And now we want to know how we can determine whether or not a given string is in the language, or matches the given regular expression. That's going to be our next task. Here's just a couple of pop quizzes, and again, worthwhile to study these types of questions to make sure that you understand the basic idea. So in any kind of quiz or exam on this material, you might be asked which of the following strings match a*bb(ab|ba)*? Well, you have to figure out a way to get through this question by just working through. In this case, what do you think about abb? Well, looking at the RE, the star says take any number of a's. I can certainly just take one, and then I take the bb, and then the rest of it, I'll take zero occurences of. So that one certainly is in the set it describes. What about the second one? Second one doesn't work because every string in that language has to have two consecutive b's in it. What about the third one? Again, after this one, after you get the bb there is no way to get just a single a at the end. You have to have ab or ba. So that one doesn't work, that one does. No a's, two b's, then take a ba and an ab and that works. There's no c anywhere, so number 5 is easy. And this one is more complicated reasoning to figure out why that string is not in there, don't have an even number of characters after bb. So you can see immediately, not so easy to know whether given string matches a given regular expression. Okay, here's another one type of question that you would have to answer in any kind of quiz on this material. Given RE for genes, and genes are described this way, characters are a, c, t or g. Every gene starts with atg, the length is a multiple of 3. And it ends with either tag, taa, or ttg. So think about how you would create a regular expression that implements those rules, describe the set of strings that satisfy those rules. Here is the answer, at the beginning there has to be an atg. At the end there has to be one of the three codons, tag, taa, or ttg. In the middle you have a, or c, or t, or g, three times any number of occurrences. We could have used the .modify.notation saying just anything from the alphabet if we had wanted, ...star. So that's a second example, and it's worthwhile to go to the book site and practice on other examples of regular expressions and whether or not strings match them.