Assistant Professor, Principal Investigator, Center for Bioinformatics, School of Life Sciences
Liping Wei 魏丽萍, Ph.D.
Professor, Director Center for Bioinformatics, School of Life Sciences
Today I am going to talk about comparative protein structure modelling of genes and genomes.
Comparative modelling is based on homology.
The secondary structure of a protein is determined by its amino acid sequence.
That is to say, secondary structure elements such as coils and β-sheets can be predicted from the amino acid sequence.
Protein folds can also be determined by their secondary structure.
The aim of comparative or homology protein structure modelling is to build a three-dimensional model for a protein of unknown
structure on the basis of sequence similarity to proteins of known structure.
The 3D structures of the prediction targets are determined by X-ray crystallography or NMR methods.
Not all proteins can be crystallized at present, and NMR, with its comparatively low resolution, can't be used in all situations.
To characterize the function of a protein, the very first thing is to find out its structure and how it reacts with other proteins and ligands,
especially for proteins with catalytic activity, because the sequence in the active center is much more conserved than the sequence around it.
To find such an active center, amino acids are mutated. Site-directed mutants with altered or destroyed binding capacity can test
hypotheses about sequence-structure-function relationships.
For example, many membrane proteins are studied by predicting their transmembrane, cytoplasmic and extracellular regions.
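One simple way to picture transmembrane-region prediction is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a threshold of 1.6, which are commonly quoted defaults. This is an illustration chosen here, not necessarily the method meant in the lecture, and the function names and the toy sequence are made up.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every window of the given length.

    Raises KeyError if the sequence contains a non-standard residue code.
    """
    if len(sequence) < window:
        return []
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(sequence) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [start for start, value in enumerate(profile) if value > threshold]


# Hypothetical toy sequence: a hydrophobic stretch flanked by charged residues.
toy_sequence = "K" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy_sequence))
```

Windows above the threshold are only candidates; a real prediction would also consider topology and loop lengths.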
The steps in comparative protein structure modelling are shown in the graph.
The starting point in comparative modelling is to identify all protein structures related to the target sequence,
which means comparing the amino acid sequence of the target with the sequences of proteins of known structure.
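The similarity comparison can be made concrete with a percent-identity calculation over two sequences that have already been aligned. The sketch below assumes gaps are written as "-" and that identity is counted only over columns where neither sequence has a gap; other definitions of the denominator exist. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip gapped columns
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical, already-aligned toy sequences (not real proteins).
target = "MKT-AYIAKQR"
template = "MKSLAYLAKQR"
print(percent_identity(target, template))
```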
The protein with the highest similarity to the target sequence most likely belongs to the same family. In PDB and other protein databases,
proteins are classified by family, which indicates what kinds of domains they contain.
This is a great help for the research that follows. In an ideal situation,
BLAST is used if a protein following the Michaelis-Menten equation with more than 50%-60% similarity to our target sequence is found.
The alignment becomes difficult when the overall sequence identity is under 40%.
In such difficult alignment cases,
it is frequently beneficial to rely on multiple structure and sequence information
such as steric hindrance, hydrophobicity and hydrophilicity.
For template selection, proteins in the same family, with high similarity to the target sequence,
a known structure and thorough prior study, are preferred.
The third class of methods is the so-called threading, in which solvent, pH and ligands are considered to determine protein structure.
There is a potential problem when distantly related proteins (less than 25% sequence identity) are used as templates.
In that case the methods will predict an unreliable model.
So, you can get acquainted with these software packages; they are all described in the literature.
The second step is to align the target sequence to the template accurately. The previous step
found the structural domains; now align the target amino acid sequence with each structural domain of the template accurately.
In general, a pairwise sequence alignment can be done correctly when the sequence identity is over 40%.
When the target-template sequence identity drops below 30%, misalignment may occur.
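To show what a pairwise alignment actually computes, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch). It assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix such as BLOSUM62, so it is a teaching sketch and not what production alignment software does. The scoring values and the example sequences are made up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Hypothetical toy sequences.
aligned_target, aligned_template, total = needleman_wunsch("MKTAYIAK", "MKAYLAK")
print(aligned_target)
print(aligned_template)
print(total)
```

With low sequence identity, many different alignments score almost equally under such a scheme, which is one way to see why misalignment becomes likely.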
If you ever listened to Dr Gao's lectures, you may remember the frequently used BLOSUM matrices, such as BLOSUM62, which is built from sequence blocks clustered at 62% identity.
This article cites a paper that presents this conclusion. The cited paper also did its evaluation with BLOSUM62
and found that the lowest sequence identity should be 40%.
I think this is easy to understand.
Analogous to BLOSUM62, the identity can also be 45% or 80%; those are BLOSUM45 and BLOSUM80.
But when establishing a score matrix, we usually use the portions that can be easily aligned, such as α-helices and β-pleated sheets,
while the lower-identity sections we usually encounter are irregular, like loops. This case may not be considered in BLOSUM45.
It needs separate consideration. For example, in a protein sequence of known structure, the loop sequences could be treated as a separate category.
I believe such software will be developed.
Then align the sequences, that is, align the target with the template accurately.
In this step, the popular software packages for multiple sequence alignment are also used.
From the third step on, we can construct a model.
First, according to their known conformations in other proteins, we place these domains on a working table, like parts.
Then comes segment matching, because the identity of some of the parts under study is lower, and some proteins have additional sequences.
In this case, we do some local deformation or do not treat them as α-helix domains. The freedom of the α-helix's chemical bonds is high,
which can change the bond angle drastically.
This can be the basis for the second step of reconstruction.
Steric hindrance should be the main consideration, as we learned in organic chemistry.
An amino acid with a hydrogen as its side chain can move freely, but with a benzene ring its mobility is limited.
Then we place the preceding domains of the template on the working table and consider these domains to be relatively fixed.
Then come the other connecting regions, whose degrees of freedom can vary greatly.
These are the relevant software packages.
There are three methods for modelling. One is ab initio methods. One is database search techniques. The last one combines the two.
Ab initio methods are used for mini-proteins, as no ab initio methods have been developed for long protein sequences.
Then comes searching the database, as I said before: for those motifs, find known structures in the database.
The combined application of both methods is popular.
The side chains, apart from the main motif, are a small and relatively specific part. When modelling the side chains,
the dihedral angles of the folded part, and the rotation of these angles, should mainly be considered. Next, I will talk about rotational isomers (rotamers).
The main potential problem is prediction accuracy. Only half of the predictions are accurate in terms of χ1 angles,
and one third for both χ1 and χ2 angles, due to the present calculating accuracy.
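A side-chain angle such as χ1 is a dihedral angle defined by four consecutive atoms, so checking a predicted rotamer comes down to computing that angle and seeing which staggered position it is closest to. The sketch below uses the standard atan2 formula for a dihedral and assigns the nearest of -60°, 60° and 180°. The function names and the coordinates are made up for illustration, and real rotamer libraries use finer, residue-specific bins.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees, in (-180, 180], defined by four points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def nearest_staggered(angle_deg):
    """Closest of the three staggered positions (-60, 60, 180 degrees)."""
    def circular_distance(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)

    return min((-60.0, 60.0, 180.0), key=lambda c: circular_distance(angle_deg, c))


# Hypothetical coordinates for four atoms in a trans arrangement.
angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(angle, nearest_staggered(angle))
```

A predicted χ1 would count as correct if it lands in the same bin as the experimental one, which is roughly how the accuracy figures above are usually scored.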
Relatively short side chains or segments cannot be computed well if they are longer than 9 residues.
As I just said, we cannot compute de novo predictions accurately, especially for long chains.
However, we can evaluate the predicted results in spite of these problems. The aspects evaluated are similar to what we just mentioned.
First of all, the accuracy of side-chain packing. Errors usually exist, especially when no templates are available.
It is not easy to find templates when the similarity is less than 30%, and the predictions are not reliable.
Of course, we may still get a wrong alignment at 30%-40% similarity.
Incorrect templates mean that errors in the templates might come from inaccurate experiments.
Next, whether the model has the correct fold. When we do the prediction, de novo predictions in particular, we examine the energy of the chemical
bonds from the viewpoint of physics or quantum chemistry, and assume that the state with the lowest energy is the most stable.
The commonly used test is the Z-score test, based on the normal distribution.
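As a rough sketch of the idea, a Z-score says how many standard deviations the energy of a model lies from the mean energy of a reference set, for example the same sequence placed on alternative or decoy folds. The reference set, the energy values and the function name below are all assumptions for illustration, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev


def z_score(model_energy, reference_energies):
    """Number of standard deviations between the model energy and the reference mean."""
    mu = mean(reference_energies)
    sigma = pstdev(reference_energies)  # population standard deviation
    if sigma == 0:
        raise ValueError("reference energies have zero spread")
    return (model_energy - mu) / sigma


# Made-up energies in arbitrary units; lower means more favourable.
decoy_energies = [-10.0, -12.0, -8.0, -11.0, -9.0]
print(z_score(-14.0, decoy_energies))
```

A strongly negative value suggests the model is much more favourable than typical alternatives, which supports the fold being correct.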
And again, sequence similarity below 30% is not a good predictor.
The third aspect is the environment. DNA adopts A, B and Z conformations in different environments, and the same is true for proteins.
For example, the hydrophilicity of a protein might differ between trafficking vesicles and the cytosol.
So in general, de novo prediction would do better in this respect.
The fourth is stereochemistry, in which we analyse whether the bond lengths and bond angles are consistent with the lowest-energy principle.
When analysing whether a bond length or bond angle is in the right range,
we refer to the ranges that previous researchers derived from the lowest-energy principle, and see whether the predictions fall within them.
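A stereochemistry check of this kind can be sketched as comparing each measured bond length with an expected value and a tolerance. The ideal lengths below are approximate textbook values for backbone bonds, the 0.05 Å tolerance is an assumption chosen for the example, and the coordinates are made up; real validation tools use larger parameter sets and statistical tolerances.

```python
import math

# Approximate ideal backbone bond lengths in angstroms (textbook values).
IDEAL_LENGTHS = {
    ("N", "CA"): 1.46,
    ("CA", "C"): 1.52,
    ("C", "O"): 1.23,
}
TOLERANCE = 0.05  # assumed tolerance in angstroms, for illustration only


def check_bond_lengths(atoms, ideal=IDEAL_LENGTHS, tolerance=TOLERANCE):
    """Return {bond: (observed_length, within_tolerance)} for one residue.

    atoms maps atom names to (x, y, z) coordinates in angstroms.
    """
    report = {}
    for (first, second), expected in ideal.items():
        observed = math.dist(atoms[first], atoms[second])
        report[(first, second)] = (observed, abs(observed - expected) <= tolerance)
    return report


# Hypothetical residue with a deliberately stretched C=O bond.
residue = {
    "N": (0.0, 0.0, 0.0),
    "CA": (1.46, 0.0, 0.0),
    "C": (1.46, 1.52, 0.0),
    "O": (1.46, 1.52, 1.40),
}
for bond, (length, ok) in check_bond_lengths(residue).items():
    print(bond, round(length, 2), "ok" if ok else "out of range")
```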
As for the many spatial features, for example, if there is a larger side chain, an angle that would have to accommodate that side chain cannot be formed.
Indeed, the best evaluation is to obtain the crystal structure, build the protein structure from real X-ray results, and compare our prediction with it.
We see many immature aspects in modelling at present. Even so, it already has a great impact on our research.
For example, in low-accuracy modelling, at about 30% sequence identity, less than 50% of the Cα atoms, that is, the α carbon atoms in the main chain,
are within 3.5 Å of their correct positions. That is, more than half of the atoms are off by more than 3.5 Å.
As for medium-accuracy modelling, that is, 30%-50% sequence identity,
85% of the Cα atoms in the peptide chain are within 3.5 Å of their correct positions.
As for high-accuracy modelling, almost all are fine.
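The accuracy measure used here, the fraction of Cα atoms within 3.5 Å of their correct positions, is simple to compute once the model and the experimental structure are available. The sketch below assumes the two structures have already been superposed and that the Cα coordinates are listed in the same residue order; the function name and the coordinates are made up.

```python
import math


def fraction_within(model_ca, reference_ca, cutoff=3.5):
    """Fraction of paired C-alpha atoms whose distance is at most cutoff (angstroms).

    Assumes both coordinate lists are superposed and in the same residue order.
    """
    if len(model_ca) != len(reference_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        raise ValueError("no atoms to compare")
    close = sum(
        1
        for model_atom, reference_atom in zip(model_ca, reference_ca)
        if math.dist(model_atom, reference_atom) <= cutoff
    )
    return close / len(model_ca)


# Made-up coordinates: two atoms are close to the reference, two are not.
model = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
reference = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(fraction_within(model, reference))
```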
Low-accuracy modelling can be used to judge roughly whether two proteins can actually interact with each other.
At this accuracy, most carbon atoms actually swing within about twice that range.
In this situation, we can judge roughly whether proteins can interact with each other. After all, a protein is made up of amino acids,
and its structure is far larger than a carbon atom. So this is acceptable.
Medium-accuracy modelling can be used for some functional predictions and to guide point mutations.
In this situation, the positions of most carbon atoms are accurate, which means that the estimate of the active center is accurate,
because an active center is never more than 100 amino acids, generally dozens.
About 20 amino acids is already relatively many.
So at this scale, we think medium-accuracy modelling can support some research on point mutations.
High-accuracy modelling means that most carbon atoms are positioned with atomic accuracy.
This shows that we could do some research on ligands as well as protein-ligand interactions.
Although the field is not old, we already have methods that are good enough to do structural prediction based on molecular
biology and biochemistry.
That is, structural modelling based on sequence homology, whether at low accuracy,
medium accuracy or high accuracy, has achieved its goals. Of course, this has been reached only to some degree, and the problem has not been finally solved.
After all, high-accuracy models are still very few. Thank you all.