Hello everyone. This lesson is about BLAST, a very well-known tool in both molecular biology and bioinformatics.
Its full name is Basic Local Alignment Search Tool.
I will mainly focus on the core idea of BLAST and on how it works.
I also want to emphasize several conceptual issues that need attention when BLAST is used in practice.
Why do we use BLAST? What makes BLAST so well known today?
To quote David Wake, "Homology is the central concept for all of biology",
which means that, in his view, homology is a core issue in biology.
If we want to study sequence homology, what kind of tool can we use today? BLAST is one such tool.
Put simply, it was developed for studying sequence homology.
When we use BLAST, we mainly search a database with a query sequence,
and then study some features of the query sequence,
such as the conservation of a nucleic acid or protein sequence and its structure and regulation,
which can be studied well by comparison with known sequences in other systems.
This means that BLAST's function is clear, basic, and powerful,
and it provides a very strong foundation for sequence analysis.
What kinds of tasks can be done with BLAST, and how?
First, I want to emphasize several concepts. The first one is identity.
Identity is the degree to which a nucleic acid or amino acid sequence exactly matches a known sequence.
Similarity is the degree to which a nucleic acid or protein sequence resembles a known sequence.
Then there is a very important concept, homology.
Homology is an evolutionary concept: two sequences are homologous if they share a common ancestor.
In fact, most homologous sequences show high similarity in primary structure,
but we cannot conclude that two sequences are homologous from similarity alone. There is a difference between the two terms.
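To make the difference between identity and similarity concrete, here is a minimal Python sketch that computes both for two already aligned protein sequences. The grouping of "similar" amino acids in SIMILAR_GROUPS is a hypothetical simplification chosen for illustration; BLAST itself counts a position as similar when the scoring matrix gives it a positive score. Neither number says anything about homology.

```python
# Hypothetical groups of chemically similar amino acids (illustration only).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Both strings must have the same length; '-' marks a gap.
    Gap columns count toward the alignment length but never as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


identity, similarity = percent_identity_and_similarity("KLVDE", "KIVEQ")
print(identity, similarity)
```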
This raises the question of which kinds of homologous sequences are conserved.
Previous lessons covered two algorithms, the Needleman-Wunsch algorithm and the Smith-Waterman algorithm,
which perform pairwise alignment with high accuracy.
Both methods are based on dynamic programming.
But if you use these algorithms to search a database, computing resources and time become a major bottleneck.
To solve this problem, FASTA and BLAST were developed.
These two tools are based on approximation algorithms.
FASTA and BLAST largely preserve accuracy while still performing well.
At the same time, the increase in search speed means that a database search can be completed in a reasonable time.
The main workflow of BLAST contains four parts.
First the query is filtered, and then comes seeding.
The core processes are seeding and extension, which form a cyclic process.
Finally, the results are evaluated.
A core concept here is the High-scoring Segment Pair (HSP), which is a very important element of a BLAST search.
BLAST begins with filtering.
To avoid producing too many seeds, BLAST filters low-complexity regions out of the sequence, for example in a nucleic acid sequence, which is composed of only four letters.
If they are not removed, the later steps will produce too many seeds,
and will pick up many signals
that are statistically significant but may have no biological meaning, so these regions must be masked first.
A nucleic acid sequence is masked with the letter "N", and an amino acid sequence is masked with the letter "X".
The BLAST program has many optional parameters for this, such as the -F parameter.
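The idea of masking can be sketched in a few lines of Python. This is not the filter that BLAST actually ships; it is a simplified stand-in that marks any window whose base composition has low Shannon entropy and replaces it with "N". The function names, the window size, and the entropy threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every position covered by a low-entropy window with mask_char.

    Simplified illustration only: window and min_entropy are assumed values.
    """
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(mask_char if m else base for base, m in zip(seq, masked))


print(mask_low_complexity("ACGTTGCA" + "A" * 20 + "GCTAGCAT"))
```

Because windows overlap, a few bases next to the repetitive run are masked as well; a real filter decides the boundaries more carefully.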
Then, the first step of BLAST is seeding.
Suppose we want to seed a sequence of length n, and we choose a seed length w.
For nucleic acids the seed length is usually 11, and for amino acids it is 3. These are the default settings.
Then a sequence of length n gives n-w+1 seeds.
In the BLAST program, the -W parameter sets the seed length, so we can also choose it ourselves.
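The seeding step can be sketched as a sliding window over the query. The function name make_seeds and the example sequence are hypothetical, used only to show where the count n-w+1 comes from.

```python
def make_seeds(sequence, w):
    """Return every overlapping word of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


query = "PQGEFG"  # made-up protein fragment, n = 6
seeds = make_seeds(query, 3)
print(seeds)
print(len(seeds) == len(query) - 3 + 1)
```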
Next, the second step is searching for word hits.
Searching for word hits means searching the database with the seeds. For this we use the scoring matrices mentioned in the last lesson.
For amino acid sequence alignment, BLAST uses the BLOSUM and PAM scoring matrices, and the default is BLOSUM62.
For nucleic acid sequence alignment, a match scores +5 and a mismatch scores -4, or alternatively +2 and -3.
While searching for word hits and scoring them, we need to set a threshold value T,
which can be changed. All word hits whose scores are above the threshold are kept.
The 1990 version of BLAST did not allow gaps.
But BLAST 2.0, released in 1997, allowed gaps. This is Gapped BLAST.
The second step is very important. Let me give an example. The seed is PQJ, and we search for this word in the database.
The highest-scoring word at a given position is the one that matches PQJ exactly.
Other words, for example one in which the middle amino acid has changed, may score lower than PQJ
but still above the threshold, which is set to 13 here, so we keep them. These are called neighbourhood words.
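Here is a minimal Python sketch of how neighbourhood words could be listed. Two assumptions are made. First, it uses the seed PQG rather than the PQJ mentioned above, because J is not one of the 20 standard residues in the matrix rows used here. Second, the three rows are BLOSUM62 values for P, Q, and G typed in by hand to keep the block self-contained, so they should be checked against a published copy of the matrix before being reused. Words are kept when their score is at least T.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G in the column order of AMINO_ACIDS.
# Typed in by hand for this sketch; verify before reuse.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighbourhood_words(seed, threshold):
    """Return {word: score} for every word scoring >= threshold against seed."""
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[residue])) for residue in seed]
    hits = {}
    for word in product(AMINO_ACIDS, repeat=len(seed)):
        score = sum(row[residue] for row, residue in zip(rows, word))
        if score >= threshold:
            hits["".join(word)] = score
    return hits


for word, score in sorted(neighbourhood_words("PQG", 13).items(), key=lambda kv: -kv[1]):
    print(word, score)
```

The exact word comes out on top, and every other word in the list differs only in the middle residue, which matches the description above.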
Then, the third step is a database scanning process, called scanning.
The paper mentions two approaches for this step.
The first is a hash table. Each word serves as a unique index, which is a direct addressing method and speeds up scanning.
The other approach is to use a deterministic finite automaton, or finite state machine.
In this method the words are converted into states and transitions of the machine, which then reads through the database to find them.
This method yields a program that runs faster. That is why BLAST is faster than the earlier local and global alignment methods.
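The hash table approach can be sketched with a Python dictionary. In this sketch the query words are indexed and a database sequence is scanned against the index; only exact words are stored, without neighbourhood words, to keep it short. The function names and the toy sequences are assumptions for illustration.

```python
from collections import defaultdict


def build_word_index(query, w):
    """Map each word of length w in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def scan_subject(subject, index, w):
    """Scan a database sequence once and return (query_pos, subject_pos) hits."""
    hits = []
    for s_pos in range(len(subject) - w + 1):
        # Dictionary lookup is the direct addressing step.
        for q_pos in index.get(subject[s_pos:s_pos + w], []):
            hits.append((q_pos, s_pos))
    return hits


index = build_word_index("ACGTAC", 3)
print(scan_subject("TTACGTT", index, 3))
```

Each database word costs one lookup, so the scan is linear in the length of the database sequence.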
The fourth step is extension.
Each word hit we just found is extended both forward and backward. Here we need to set a cutoff S.
During extension, if the score of the extended segment is lower than S, the segment is dropped. If the score is higher than S, the segment is kept,
and this aligned segment is an HSP (High-scoring Segment Pair). This is the most important concept in BLAST.
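The extension step can be sketched as follows for an ungapped nucleotide hit. The rule for when to stop extending is not spelled out above, so this sketch assumes a simple drop-off rule: stop when the running score falls more than x_drop below the best score seen so far. The name x_drop, its value, and the function names are assumptions; the +5/-4 scores and the cutoff S come from the lecture.

```python
def _extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Extend from (i, j) in one direction; return (best gain, length at best)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        gain += match if query[i] == subject[j] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # assumed drop-off rule
        i += step
        j += step
    return best_gain, best_len


def extend_hit(query, subject, q_pos, s_pos, w, cutoff_s,
               match=5, mismatch=-4, x_drop=10):
    """Extend a word hit in both directions; return an HSP dict or None."""
    seed_score = sum(
        match if query[q_pos + k] == subject[s_pos + k] else mismatch
        for k in range(w)
    )
    right_gain, right_len = _extend_one_way(
        query, subject, q_pos + w, s_pos + w, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, match, mismatch, x_drop)
    score = seed_score + left_gain + right_gain
    if score < cutoff_s:
        return None  # below the cutoff S, so not reported as an HSP
    return {
        "score": score,
        "query": query[q_pos - left_len:q_pos + w + right_len],
        "subject": subject[s_pos - left_len:s_pos + w + right_len],
    }


print(extend_hit("GGACGTACGTCC", "TTACGTACGTAA", 4, 4, 4, cutoff_s=30))
```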
Last, we evaluate the BLAST results.
Since BLAST is based on similarity statistics, it uses three levels of evaluation parameters.
The first is the raw score. In fact, raw scores have little meaning on their own, because they do not take into account the parameters of the scoring system used.
So BLAST normalizes the raw score. This normalization brings in the two parameters of the scoring system, λ and K.
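The normalization is commonly written in the Karlin-Altschul form below. The symbol S' for the normalized score (the bit score) is notation introduced here, not taken from the lecture; S is the raw score, and λ and K are the two parameters just mentioned.

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}
```

Because λ and K are absorbed into S', bit scores from searches with different scoring systems can be compared directly.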
Another measure of significance is the E-value, which is the most important parameter when analysing BLAST results.
It brings in the length of the query sequence and the size of the database, m and n.
m is the length of the query sequence, n is the length of the database, and S is the score of the HSP.
This formula is based on the number of HSPs that reach score S by chance following a Poisson distribution.
The E-value is the number of hits with a score of at least S that we expect to find by chance in a database of this size.
So the smaller the E-value, the better.
E>1 suggests that the BLAST hit is a random match. In other words, the search has returned chance hits and the result is not reliable.
When E<0.1 or E<0.05, the result is statistically significant.
If E<1e-5, the hits should be highly similar to the query sequence.
This is the meaning of the E-value. So when you get a BLAST result, you should check the E-value to make sure the result is reliable.
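The E-value calculation can be sketched in Python using the standard Karlin-Altschul expression E = K·m·n·e^(-λS), together with the Poisson probability of seeing at least one chance hit, P = 1 - e^(-E). The numerical values of lam and k below are illustrative assumptions of roughly the size reported for protein scoring matrices; the real values depend on the matrix and gap penalties, and BLAST also corrects m and n for edge effects, which this sketch ignores.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance HSPs with score >= raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Poisson probability of at least one chance HSP, given its E-value."""
    return -math.expm1(-e)  # equals 1 - exp(-e), computed accurately for small e


# Illustrative numbers only: query of 250 residues, database of 50 million residues.
for score in (40, 60, 80):
    e = e_value(score, m=250, n=50_000_000, lam=0.318, k=0.134)
    print(score, e, p_value(e))
```

Raising the score lowers E exponentially, while a larger database raises it in proportion, which is why the same alignment can be significant in a small database and not in a large one.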
Here are all the BLAST programs at NCBI. We can choose a program based on our query sequence.
Nucleotide sequences should use nucleotide BLAST, and amino acid sequences should use protein BLAST.
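The choice of program can be written down as a small lookup. The program names blastn, blastp, blastx, and tblastn are the standard NCBI names; the helper functions and the crude way of guessing the sequence type from its letters are hypothetical and only meant as a sketch.

```python
# (query type, database type) -> NCBI BLAST program.
# tblastx (translated nucleotide vs translated nucleotide) is left out for brevity.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}


def guess_sequence_type(seq):
    """Crude guess: only nucleotide letters means nucleotide, otherwise protein."""
    return "nucleotide" if set(seq.upper()) <= set("ACGTUN") else "protein"


def choose_program(query, database_type):
    """Pick a BLAST program from the query letters and the database type."""
    return BLAST_PROGRAMS[(guess_sequence_type(query), database_type)]


print(choose_program("ACGTACGTTAGC", "nucleotide"))
print(choose_program("MKVLAAGIVG", "protein"))
```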
As I mentioned, BLAST was upgraded to version 2.0 in 1997.
BLAST 2.0 mainly improved alignment sensitivity and search speed, and introduced two new functions, Gapped BLAST and PSI-BLAST.
Gapped BLAST improved alignments by handling gaps in sequence alignment,
while PSI-BLAST is an iterative, position-specific BLAST.
The innovation of PSI-BLAST is that, after the first round of BLAST, it scores every position of the highly similar hits and builds a scoring matrix.
Based on this matrix, PSI-BLAST runs a second search and rescores each position.
After several rounds, PSI-BLAST can produce results with high similarity.
PSI-BLAST is an advanced tool, mainly used for protein searches.
A highly conserved amino acid position gets a very high score in PSI-BLAST,
and unconserved positions get scores near zero.
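The idea of a position-specific scoring matrix can be sketched as log-odds scores computed column by column from aligned hits. This is a simplified illustration and not the procedure PSI-BLAST actually uses: it assumes a uniform background frequency, a fixed pseudocount, equal sequence weights, and gap-free aligned sequences of equal length. The function name build_pssm and the toy alignment are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment: column 0 and column 2 are conserved, column 1 is not.
pssm = build_pssm(["GAK", "GCK", "GDK", "GEK"])
print(round(pssm[0]["G"], 2), round(pssm[1]["A"], 2))
```

The conserved glycine column gets a clearly higher score than any residue in the variable middle column, whose scores stay closer to zero.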
Since BLAST is just a tool for biological research based on similarity statistics,
we should note that a statistical result may not represent biological truth.
When you analyse your BLAST results, always keep the specific features of your query sequence in mind.
If your query sequence can be translated into a peptide, analysing the peptide sequence with BLAST may give a better understanding of homology.
I have to emphasize that similarity is not the same as homology.
A sequence that is highly similar to the query according to BLAST is not necessarily homologous to it.
Also, "percent homology" is not a proper term in biology, and we should not use it in academic work.
Here are the references. I strongly recommend the BLAST tutorial at NCBI. The information on Dr. Luo's website is quite helpful too.
These are the members of our group and some classmates from CBI who helped a lot. Special thanks to Dr. Gao and Dr. Wei.
Thank you all!