0:03
Next we're going to talk about loops.
Loops are probably one of the most fundamental operations in any
programming language.
Loops are what allow us to write a little bit of code, and have it execute over,
and over, and over again on a large set of data.
This is particularly important for
a lot of DNA sequence data sets which are very large.
And where we want to execute fairly repetitive operations over and
over again, on the same data or on different data, as you read the data in.
So you can think of loop operations as what you might do if you read in a file
one line at a time.
And on each line, you did some computation, or some analysis.
So loops generally have this overall structure, where we test for
some condition.
And if the condition is true, we execute the loop.
And the loop is a whole block of code, could be very large.
And if it's false, then we're done.
And we move on to the next thing.
So there's two kinds of loops I want to talk about next.
While loops, and for loops.
Let's first talk about while loops.
Most programming languages have a while loop.
A while loop basically says, the condition is something we want to test.
And only execute while the condition remains true.
That's why we use the word while.
So here's a problem.
Given a DNA sequence,
find the positions of all canonical donor splice site candidates in the sequence.
1:15
So we'll look at a little program to do that.
We'll ask the user to input the sequence.
Canonical donor splice sites, by the way, start with the letters gt.
So every gt in the sequence is a candidate donor site.
So we'll use the method find on our string DNA,
to find the position of a donor splice site, a gt.
1:36
And then we want to continue to go through our string, and find the next gt.
And the next gt, and the next gt.
And spit them out, print them out as we go.
So here's a little while loop that does that.
It says while the variable pos is greater than negative one,
that is, while pos has some value that is zero or bigger,
we're going to print out the position of the next gt candidate.
2:01
So our first call to the find function gave us our first position.
And by the way,
if there were no gt's in the sequence, we'd get back the value of -1.
So this while loop would never even begin.
But if we got any value at all, then position would be greater than -1.
And we print, with this print statement,
we print the position that was returned, which is in the variable pos, P-O-S.
And then we move on and we would call the find method again on our DNA sequence.
Starting one position later, position pos + 1.
Again dna.find is going to give us negative one if we don't find the gt.
And it will give us a legal position if we do.
So if it gives us a value of -1, then the while test will fail.
And we'll leave our while loop.
But as long as we keep finding more gts this while loop will keep executing.
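As a rough sketch of the while loop just described, the code might look something like this. The sequence here is a made-up example hard-coded for illustration (the lecture reads it from the user instead), and the positions list is an extra added only so the result can be inspected afterwards.

```python
# Hypothetical example sequence; the lecture asks the user to input one.
dna = "aagtcgtacgtgtt"

positions = []            # not in the lecture: keeps the hits for inspection
pos = dna.find("gt")      # position of the first gt, or -1 if there is none
while pos > -1:
    print("Donor splice site candidate at position", pos)
    positions.append(pos)
    # Search again, starting one position after the previous hit.
    pos = dna.find("gt", pos + 1)
```

If the sequence contains no gt at all, the first find returns -1 and the body of the loop never runs.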
2:50
So notice by the way that all the statements after the while condition
itself are indented.
It's important in this construction, as in many other parts of Python,
that we indent all of our statements the same amount.
If you don't, then Python's going to have a problem.
And your code won't work.
3:07
The for loop is the other major type of loop.
It's very similar to a while loop in that we have a test at the top.
And we have a block of code, all indented,
that we're going to execute if this test is true.
But usually we use a for
loop when we want to iterate over the items in a sequence or a set of numbers.
3:23
So now let's start off with an example.
We're going to iterate through a list of items.
So here's a list of motifs, of DNA sequences called motifs.
So, we'll give it three items.
And we can use the in operator to iterate through that list.
And each time we iterate through,
we're going to assign the next element to the variable m.
So in the for statement we'll say for m in motifs.
And that will cycle through this list, successively assigning the variable m
the next value in the list.
So we're just going to print out the value that's getting assigned, and its length.
4:05
So the whole block of code is simply one print statement here.
And you can see here that what it prints out is the first motif and
its length which is seven.
The second motif which is length 13.
And the third motif, which is length five.
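A minimal sketch of that for loop could look like the following. The three motif strings are assumptions, picked only so that their lengths come out to 7, 13, and 5 as described.

```python
# Hypothetical motifs, chosen to have lengths 7, 13 and 5.
motifs = ["attccgt", "agggggtttttcg", "gtagc"]

for m in motifs:
    # m is assigned each element of the list in turn.
    print(m, len(m))
```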
Now sometimes you want to iterate over a sequence of numbers.
There's a special built in function called range to do that.
So we want to iterate, make some variable equal to say, zero, one, two, three.
We could assign those guys by calling a range function with a number like four.
And range will give us back all values less than four, starting with zero.
So if I say, for i in range(4): print(i),
it'll simply print out zero, one, two, three.
You will of course do much more interesting things than that.
But this is how you set up an iterator, or a count in this explicit way.
There's other ways too we'll talk about another time.
If we want to count through these values by more than one,
we can also specify with the range function where the range starts,
where it ends, and how much to count each time.
So we could say for i in range(1, 10, 2): print(i).
And what that will do is first initialize the value of i to 1.
And then each time through the loop, it will add 2 to that value.
And it will keep going as long as the value is less than 10.
So it will not let i get to be 10, but it will stop at 9.
So in this case, it will print out 1, 3, 5, 7, 9.
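The two uses of range described here can be sketched as follows, first with a single argument and then with a start, an end, and a step.

```python
# range(4) yields every value less than 4, starting from 0.
for i in range(4):
    print(i)

# range(1, 10, 2) starts at 1, adds 2 each time, and stops before reaching 10.
for i in range(1, 10, 2):
    print(i)
```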
So let's look at another problem.
5:38
Let's try to find all the characters in a given protein sequence, and
see if they're valid amino acids.
So here's our pseudocode for this problem: for
each character in our protein sequence.
Now remember, a protein sequence is composed of letters from a 20-letter alphabet,
because there are 20 amino acids.
So we're going to check each of the characters to make sure it's one of
those 20 letters.
And if it's not one of those 20 letters, we're going to print out the invalid
character, and we want to print out where we found it.
So let's suppose that someone input a protein sequence like the following.
I'm not going to read it out, but some long 30 or 40 character protein sequence.
Now we're going to iterate through that.
The way we do that is we call our range function, and
the range function gets as its upper bound the length of the protein.
So we'll just take our variable string protein.
We get its length, and we make that be the range.
So now we're getting a range from zero to the end of the protein sequence.
And i is going to get successively assigned to the value
of each position in this string called protein.
So then I want to do this if test, so if protein[i].
So the way we access a particular
position in our string is we just use i as the index on protein.
So protein with the letter i in square brackets gives us the ith position.
Remember starting from zero always.
So we're testing to see if protein[i] is not in this alphabet of legal letters.
So I'm going to explicitly provide the 20 letters that are allowed.
So if it's not in those, then we want to do something.
We want to print out where we are, because that's where the invalid amino acid is.
So we'll print out a message saying protein contains invalid amino acids, and
we'll print the amino acid at that spot.
And then give the position.
So we give that print function the value and
the position in our string, protein of i and i.
So if we ran it on this particular input,
we'd see it print out: protein contains invalid amino acid U at position 8.
Another U at position 9.
And J at position 20.
Those are not allowed as amino acid codes.
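Here is one way that check might be written. The protein string is a hypothetical stand-in, built so that the invalid characters fall at positions 8, 9, and 20, and the 20-letter alphabet is the standard set of one-letter amino acid codes. The invalid list is an extra, added only to record what was found.

```python
# Hypothetical sequence with U at positions 8 and 9 and J at position 20.
protein = "SDVIHRYKUUPAKSHGWYVCJRSRFTWMVW"
# The 20 standard one-letter amino acid codes.
valid_letters = "ACDEFGHIKLMNPQRSTVWY"

invalid = []  # not in the lecture: records (character, position) pairs
for i in range(len(protein)):
    if protein[i] not in valid_letters:
        print("protein contains invalid amino acid", protein[i], "at position", i)
        invalid.append((protein[i], i))
```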
7:36
Suppose we're only interested in finding out if a protein sequence is valid.
Not where all the invalid characters are.
So then we can break out of the loop prematurely if we like.
So let's look at our code again.
We'll take the same protein with some invalid characters,
the first one being that first U.
And now we want to iterate through our protein and
looking to see if we found any invalid character at all.
So again, we'll iterate for
i in range from zero to the length of the protein using the range function.
Then we'll test if protein[i],
that is, the amino acid at position i, is not in our list of legal letters.
Then we want to print out just simply a statement saying this is not
a valid protein sequence and we want to stop right there.
So you can stop by using the special keyword break, which says terminate
the nearest enclosing loop, which is a for loop, or a while loop.
So that's going to leave our loop entirely and
won't continue going through our amino acids sequence anymore.
So I'll only print this statement once,
even though there are several invalid amino acids in our sequence.
So, if you run this, you'll see that it prints out just: this is not
a valid protein sequence.
And even though there are three illegal characters in the sequence,
as we saw in the previous example, it's only going to print it once.
Because as soon as it finds an illegal character, it executes a break.
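A sketch of the loop with break might look like this. The protein string is again a made-up example with three invalid characters, and the counter is an extra that shows the message is printed only once.

```python
# Hypothetical sequence with invalid characters at positions 8, 9 and 20.
protein = "SDVIHRYKUUPAKSHGWYVCJRSRFTWMVW"
valid_letters = "ACDEFGHIKLMNPQRSTVWY"

times_printed = 0  # not in the lecture: counts how often the message appears
for i in range(len(protein)):
    if protein[i] not in valid_letters:
        print("this is not a valid protein sequence")
        times_printed += 1
        break  # leave the for loop at the first invalid character
```

After the loop, i is still 8, the position of the first U, because the loop never looked any further.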
8:52
Another similar thing we can do, but
not quite as severe, is the continue statement.
Which causes a program to continue with the next iteration of the loop,
skipping the rest of the code within the loop, so
you don't break entirely out of it, but you just move on to the next iteration.
9:06
So, here's an example where we might want to do that.
We might want to delete all the invalid amino acid characters from
a protein sequence and print out the sequence without them.
So, we'll start with our same sequence.
Which, remember, has three invalid amino acid characters in it,
and we're going to set up a new corrected protein sequence.
Initialize that to be the empty string.
Which we can do with single quote, followed by single quote.
So we're going to call that a corrected protein.
So again, we iterate with our for loop,
from i equals zero to the length of the protein.
And we ask if the character that
we see at protein position i is not in our list of legal characters.
Then we don't want to do anything, because we're not going to print it out.
So we'll just say continue.
Continue says go back to the beginning of the for loop and iterate again.
Meaning you increase the value of i in this case to the next value.
9:55
So if the character is legal, we don't hit the continue.
So we don't skip the rest of the body of the for loop.
We just keep going.
Then what we'll do, is we'll add the next letter in our amino acid sequence
to our corrected protein sequence.
So corrected protein, which was initially the empty string,
is now going to get one more character added to the right end of it.
And that character will be protein of i, which is a legal character.
10:19
So if at the end, we print out our corrected protein sequence,
it should be the right thing.
It'll be the same sequence that we had at the beginning with the three illegal
characters removed.
And this is a new sequence which was constructed by adding all
the legal characters to the ends of this empty sequence.
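The continue version could be sketched like this, again using a hypothetical sequence with three invalid characters.

```python
# Hypothetical sequence with invalid characters at positions 8, 9 and 20.
protein = "SDVIHRYKUUPAKSHGWYVCJRSRFTWMVW"
valid_letters = "ACDEFGHIKLMNPQRSTVWY"

corrected_protein = ""  # start from the empty string
for i in range(len(protein)):
    if protein[i] not in valid_letters:
        continue  # skip the rest of the body and move on to the next i
    # Only legal characters reach this line.
    corrected_protein = corrected_protein + protein[i]

print("Corrected protein sequence is:", corrected_protein)
```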
10:37
So using the continue statement
can sometimes improve the readability of your code.
If you have some kind of nasty complicated logic, like what I'm showing here,
where you are going through a long list of numbers for
i in range(n), and you are testing some condition, and calling a function.
And then testing within that if statement, testing for another condition, and
calling another function, and then within that testing, another condition and so on.
Your code might be nested very deep, and it might be very messy.
So a way to fix that, is to use the continue statement,
which sometimes lets you clean up your code.
And this way, you can write the same type of loop without the nesting.
We can say if the first condition is not met, then just continue,
and otherwise run function one.
So we jump back up to the top of the loop
whenever the first condition is not met.
Now we don't have to continue indenting our code.
So you might want to study this example a little more to understand exactly what's
going on.
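To make that comparison concrete, here is a small sketch of the two styles side by side. The conditions and functions are hypothetical placeholders invented for illustration. All that matters is that both loops do the same work.

```python
# Hypothetical conditions and functions, stand-ins for real logic.
def condition1(i):
    return i % 2 == 0

def condition2(i):
    return i % 3 == 0

N = 7

# Nested version: each extra test pushes the code one level deeper.
nested_log = []
for i in range(N):
    if condition1(i):
        nested_log.append(("function1", i))
        if condition2(i):
            nested_log.append(("function2", i))

# Flat version: bail out early with continue, so the body stays at one level.
flat_log = []
for i in range(N):
    if not condition1(i):
        continue
    flat_log.append(("function1", i))
    if not condition2(i):
        continue
    flat_log.append(("function2", i))
```

Both loops record exactly the same calls in the same order. The second one just never has to indent further as more tests are added.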
11:32
So there's an unusual statement that Python
allows that most programming languages don't allow.
Which is an else clause that goes with loops.
Else clauses, of course, are part of if statements.
But we also have else clauses available at the end of a for loop or a while loop.
11:47
So, if you use it with a for
loop, the else statement is executed when the loop has run through all of its items.
In a while loop, the else statement is executed when the condition becomes false.
Now you might think, well,
why not just put an extra statement after my loop,
in the main line of my program? Well, the difference is that the else statement is
not executed if the loop is terminated by a break statement.
So it looks like it's going to be executed.
Looks like something you execute almost every time.
But if you have a break statement within your code it will prevent the else from
being executed.
Another way to say this, is that if you don't use a break statement inside of your
for, or while loop, there's no reason to use the else statement.
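A small sketch can show the difference between leaving a loop through break and running it to the end. The function name and the motif lists are hypothetical, made up for this illustration.

```python
def first_gt_motif(motifs):
    """Return the first motif that starts with gt, or None if there is none."""
    for m in motifs:
        if m.startswith("gt"):
            result = m
            break          # leaving through break skips the else clause
    else:
        result = None      # runs only if the loop finished without a break
    return result

print(first_gt_motif(["attccgt", "gtagc"]))
print(first_gt_motif(["attccgt"]))
```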
12:23
Now let's look at an example of using else with a for loop.
We're going to use a slightly more complicated example for this one.
So we're going to take the problem of finding all prime numbers smaller than
a given integer.
Let's let that integer be ten.
So we'll say for y in range(2, N),
which means 2 to 9 because the number 10's not included in the range.
Then we're going to iterate within that, in a nested for loop.
We're going to look at another value of x,
which is going to be in the range from (2, y).
And again, it'll go from 2 to 1 less than y.
Whatever y happens to be on this iteration.
We're going to test to see whether y can be divided by x.
We do that with the remainder operator.
So if y % x == 0.
So if that condition is true, then x is a factor of y.
So we want to print that out.
We'll say y equals x times whatever the other factor is, and then we break.
13:19
But we're going to have an else condition, in case we don't find any factors of y.
Then we'll fall through without finding a factor.
And we'll print y as a prime number.
Now that break only breaks the innermost loop.
There's two nested loops here.
So, either we're going to find a factor of y, print it out, and break,
or we're going to fail to find any factors of y and we'll print that y is prime.
And we'll do this for each y from two up to N. So for
example the first time through the loop we'll set y to be equal to two.
And then we'll go to the inner loop, and
say for x in range(2, y), but y is two.
So that will be range(2, 2), which is empty, so the inner loop won't execute at all.
So we'll stop right there and go back.
But we'll print out that two is a prime number.
And we'll go on to the next iteration.
The second time through, we'll look at y equals three, and
x is going to be in the range from two to two.
So there will be one value.
But three can't be divided by two so we'll print out that three is a prime number.
And finally when we get to that third iteration, y equals four,
when we check x equals two,
we'll find that, oh, y divided by two gives us no remainder.
So it will print out that four equals two times two, and so on.
And we'll get the output you see here where we print out the factors
of each number.
Or we print out, if it's prime, that it's a prime number.
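Putting those pieces together, the nested loops might be sketched as follows. The primes list is an extra, not part of the described program, added so the result can be checked afterwards.

```python
N = 10

primes = []  # not in the lecture: collects the primes that are found
for y in range(2, N):
    for x in range(2, y):
        if y % x == 0:
            # x divides y, so print the factorization and stop looking.
            print(y, "equals", x, "*", y // x)
            break  # only leaves the inner loop
    else:
        # The inner loop ran to the end without a break: no factor was found.
        print(y, "is a prime number")
        primes.append(y)
```

Note that the else lines up with the inner for, not with the if.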
14:34
So the pass statement is a placeholder, something that does nothing.
And you might think, well, I don't really ever need that.
And maybe you won't ever need it in your code,
but you use it when there's a statement required syntactically.
So something has to go somewhere, but
you don't want to put any command in that place.
So for example,
in an if statement you might want to test if a motif is not in some sequence DNA.
Then just say pass, and go on to the else statement and write your code there.
The most common use for
this might be that you're writing some complicated body of code.
And you know what you want to do, but you haven't decided exactly how to encode it.
So you can just put pass instead of a block of code.
And Python will let you run the rest of the code.
And you can go back and fill in what particular syntax you wanted later.
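A minimal sketch of pass as a placeholder might look like this. The sequence and the motif are made-up examples.

```python
# Hypothetical sequence and motif for illustration.
dna = "aagtcgtacgtgtt"
motif = "gtac"

position = None
if motif not in dna:
    pass  # nothing to do here yet, but the if still needs a statement
else:
    position = dna.find(motif)
    print("motif", motif, "found at position", position)
```

Without the pass, the empty if branch would be a syntax error.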
In our next lecture, Ella is going to tell you a lot about functions in Python.