0:42
So for this exercise, we're going to write the code to build a dictionary containing
all the sequences from a FASTA file.
So basically, we want to read the FASTA file in.
Where it looks like this example here,
where each of the header lines has an id, and then a space.
And then some free-form description of what that id is all about or
what that sequence is all about.
And then this sequence itself on multiple lines, could be any number of lines,
could be one line, could be a million lines.
So we don't really want to have to know how many lines that is.
We want our program to figure that out and then store with each id,
we want to store the sequence, so that's what our dictionary will contain.
It'll contain the identifier for each FASTA sequence and
then the sequence itself, which is associated with that identifier.
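As a rough illustration of the goal, here is a small made-up FASTA snippet and the dictionary we would expect to build from it. The ids, descriptions, and sequences are invented for this example and are not from the lecture's file.

```python
# Hypothetical FASTA content, held in a string so the example is self-contained.
example_fasta = """>seq1 first made-up sequence
ACGTAC
GGTA
>seq2 second made-up sequence
TTGACA
"""

# The dictionary we are aiming for: each id maps to its full sequence,
# with the sequence lines joined together and no newlines left in.
expected = {
    "seq1": "ACGTACGGTA",
    "seq2": "TTGACA",
}

print(expected["seq1"])
```

Note that the description after the id is not stored; only the id and the sequence end up in the dictionary.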
So let's go through a little pseudo-code for how we would do this.
So first we want to open our file.
1:44
If it's not a header line and we're in the middle of a sequence.
Then we want to update that sequence by adding whatever DNA sequence is on
the current line to our dictionary,
to the end of our entry for the current sequence that we're building.
And then when we get to the end of a particular DNA sequence, we want to ask,
are there any more lines in the file.
If there are, we want to continue.
So we're just going to loop through the file, reading each FASTA entry one at a time and
then storing it in our dictionary.
2:09
And then finally we'll close the file at the end.
So let's see how we implement this code.
So first we open the file.
So we are going to try opening the file and
catch any exceptions in case the file doesn't exist.
So let's write f = open("myfile.fa"), where .fa is an extension indicating a FASTA file.
And if it doesn't exist, we will catch that by saying except IOError.
Remember these are special commands in Python.
And if it doesn't exist, we'll print out an error message to the user.
Otherwise, the variable f is now pointing to
the beginning of this file.
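A minimal sketch of this file-opening step, wrapped in a hypothetical helper called open_fasta so it can be tried on any path. The helper name and the wording of the error message are assumptions; the lecture only specifies the try, the open call, and except IOError.

```python
def open_fasta(path):
    """Try to open a FASTA file; return the file object, or None if that fails."""
    try:
        f = open(path)
    except IOError:
        # In Python 3, IOError is an alias of OSError, so a missing file lands here.
        print("File " + path + " could not be opened")
        return None
    return f


# If myfile.fa is not in the current directory, this prints the message
# and f is None; otherwise f points to the beginning of the file.
f = open_fasta("myfile.fa")
```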
Okay, now I'm going to go through the main loop of our program.
So first we're going to set up our dictionary, which we call seqs, and
make it empty.
So seqs = {}, that is, curly brackets with nothing in between them,
which just initializes that dictionary to have nothing in it.
Now we're going to go through the file one line at a time and
do something different with each line, depending on what's on it.
So, we say for line in f, now that's our command for
reading the file one line at a time.
3:25
By saying line = line.rstrip() with no arguments.
That gets rid of the newline (and any other trailing whitespace) at the end of a line of text.
So now our line variable still contains the line we've just read, without
the special newline character.
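Here is a small sketch of reading line by line and stripping the newline. It uses io.StringIO as a stand-in for the open file f, which is an assumption made only so the example runs without a file on disk; the contents are made up.

```python
import io

# Stand-in for the open file: StringIO can be iterated line by line like a file.
f = io.StringIO(">seq1 demo\nACGT\nGGTA\n")

cleaned = []
for line in f:
    # Each line read from the file still ends with "\n";
    # rstrip() with no arguments removes trailing whitespace, including it.
    line = line.rstrip()
    cleaned.append(line)

print(cleaned)
```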
Now we want to figure out whether we're at a header.
So that was our first test in our code was, is this line a header
line indicating we're at the beginning of a new sequence or FASTA entry.
So the FASTA entries are always indicated by a little greater than character (>)
as the very first character of a line.
So I can look at that, since a line is a string.
So I can get the first character, which has the index zero by
4:23
I could use the special method called startswith,
which is just another way of checking the very first character of the string, and
then you give it the argument greater than.
But here I've specifically referred to the 0th position of the string line.
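The two ways of testing for a header line can be compared in a short sketch. The function names is_header and is_header_startswith are hypothetical, chosen only for this illustration.

```python
def is_header(line):
    # Index 0 is the first character of the string.
    # Note: this raises IndexError if line is empty.
    return line[0] == ">"


def is_header_startswith(line):
    # startswith performs the same check and simply returns False
    # for an empty string.
    return line.startswith(">")


print(is_header(">seq1 some description"))
print(is_header("ACGTACGT"))
print(is_header_startswith(""))
```

One practical difference: if the file contains a blank line, indexing position zero fails, while startswith does not.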
All right, so if that test is true, then we're going to execute this body of code,
which is going to set up a new dictionary entry.
So now we need to split the header line into words, which we do with words = line.split(),
which splits the header line on any whitespace characters.
So the very first element of that new list is going to be the name of our sequence.
So that's what we want to assign to the variable name.
So we do that with name = words[0][1:].
So zero is the first element of our list.
Now we also have this slice that we've taken, which is [1:],
one colon in square brackets.
And that just means we're going to slice that first element, because
we don't really want the greater than sign to be part of the name.
So this says start at index one of the string and go to the end,
meaning we ignore index zero, which is the greater than sign.
So that gets us the name from the first position in our list, where we've
stripped off that little greater than sign.
So now we have the name, and we want to now initialize a new dictionary entry for
that name.
Our dictionary, remember, is called seqs, so to initialize a new entry we say
seqs[name] = "",
that is, a quote followed by another quote with nothing in between.
That just says, okay we've got a name for
a sequence and the entry corresponding to that is empty for the moment.
And we're going to fill that in by going through the remaining lines in our
FASTA entry, which all have our DNA sequence.
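The header-handling steps can be tried on a single made-up header line. The header text is an assumption for illustration; the variable names words, name, and seqs follow the lecture.

```python
seqs = {}

# A hypothetical header line, already stripped of its newline.
line = ">seq1 a made-up description"

# split() with no arguments splits on any run of whitespace.
words = line.split()

# words[0] is ">seq1"; the slice [1:] drops the leading ">".
name = words[0][1:]

# Start this entry off as an empty string, to be filled in
# by the sequence lines that follow the header.
seqs[name] = ""

print(words)
print(name)
```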
So that's what we do if the if statement is true, meaning we're at a header line.
Else, that is if we're not at a header line, we're at the sequence.
So we write else colon and
then we write the code we're going to execute if we're not in a header line.
And for that, we're simply going to append the sequence on the current line
to the end of our FASTA entry, the end of the DNA sequence that we're building as we
read through the file.
So, we write seqs[name]; that looks up in our dictionary.
It looks up the entry for name.
And we're going to get the old entry.
We write equals, remember that's assignment, not mathematical equals.
We're going to get the current entry, seqs[name], whatever it has in it.
It could be empty, it could have some DNA sequence in it already.
And we're going to append with a plus sign,
that works on strings by appending the current line.
So line is the sequence that we've got.
Remember, without that trailing carriage return or newline character.
We don't want that to be part of our DNA sequence.
So that's all done with a simple one line command here.
seqs[name] = seqs[name] + line.
So by doing that we've now appended the current sequence
to the end of our FASTA entry.
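A short sketch of that append step, applied to two made-up sequence lines in a row. The starting dictionary and the lines are assumptions chosen for the example.

```python
# Suppose we have just seen the header for seq1, so its entry is empty.
seqs = {"seq1": ""}
name = "seq1"

# Two hypothetical sequence lines, already stripped of their newlines.
for line in ["ACGTAC", "GGTA"]:
    # + on strings concatenates: old entry followed by the current line.
    seqs[name] = seqs[name] + line

print(seqs[name])
```

The same update can also be written as seqs[name] += line, which does the same thing for strings.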
So that's it, we've now tested to see whether we're on a header line.
If so we create a new dictionary entry.
If we're not on a header line,
we then add the current line to the end of our existing FASTA entry.
And we keep reading through it until we get to the next FASTA entry.
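Putting the pieces together, the whole loop might look like the sketch below. It is one possible arrangement, not necessarily the exact code on the slide: the function name build_seq_dict is hypothetical, io.StringIO stands in for the open file, startswith is used for the header test so a blank line does not cause an error, and the check that name is not None (skipping any sequence lines before the first header) is an added safeguard.

```python
import io


def build_seq_dict(handle):
    """Return a dict mapping each FASTA id to its full sequence."""
    seqs = {}
    name = None  # no header seen yet
    for line in handle:
        line = line.rstrip()  # drop the trailing newline
        if line.startswith(">"):
            # Header line: take the id and start a new, empty entry.
            words = line.split()
            name = words[0][1:]
            seqs[name] = ""
        elif name is not None:
            # Sequence line: append it to the entry being built.
            seqs[name] = seqs[name] + line
    return seqs


# Made-up file contents standing in for myfile.fa.
f = io.StringIO(">seq1 made-up entry\nACGTAC\nGGTA\n>seq2 another one\nTTGACA\n")
seqs = build_seq_dict(f)
f.close()  # close the file once we are done reading

print(seqs)
```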
So that loop now, will go through our whole file reading all of our