Today I am going to talk about comparative protein structure modelling of genes and genomes. Comparative modelling is based on homology. The secondary structure of a protein is determined by its amino acid sequence; that is to say, secondary structure elements such as coils and β-sheets can be predicted from the amino acid sequence. Protein folds can in turn be inferred from the secondary structure. The aim of comparative, or homology, protein structure modelling is to build a three-dimensional model for a protein of unknown structure on the basis of its sequence similarity to proteins of known structure. The 3D structures of the prediction targets are otherwise determined by X-ray crystallography or NMR. Not every protein can be crystallized at present, and NMR, with its comparatively low resolution, can't be used in every situation.

To characterize the function of a protein, the very first thing is to find out its structure and how it interacts with other proteins and ligands. This matters especially for proteins with catalytic activity, because the sequence in the active center is much more conserved than the sequence around it. To locate such an active center, amino acids are mutated: site-directed mutants with altered or destroyed binding capacity can test hypotheses about sequence-structure-function relationships. For example, many membrane proteins are studied by predicting their transmembrane, cytoplasmic and extracellular regions.

The steps in comparative protein structure modelling are shown in the graph. The starting point is to identify all protein structures related to the target sequence, which means comparing the amino acid sequence of the target with the sequences of proteins of known structure. The protein with the highest similarity to the target sequence most likely belongs to the same family. In PDB and other protein databases, proteins are classified by family, which indicates what kinds of domains they contain.
It is of great help for the research that follows. In an ideal situation, BLAST is used when a protein following the Michaelis-Menten equation with more than 50-60% similarity to the target sequence can be found. The alignment becomes difficult when the overall sequence identity is under 40%. In such difficult cases, it is often beneficial to rely on multiple sources of structure and sequence information, such as steric hindrance, hydrophobicity and hydrophilicity. For template selection, a protein from the same family, with high similarity to the target sequence, a known structure and a well-studied background is preferred. The third class of methods is the so-called threading, in which the solvent, pH and ligands are considered in determining the protein structure. There is a potential problem when distantly related proteins (less than 25% sequence identity) are used as templates: the evaluation will then show the model to be unreliable. You can get acquainted with these programs; they are all described in the literature.

The second step is to align the target sequence with the template accurately. Since the previous step found the structural domains, we now align the target amino acid sequence accurately with each structural domain of the template. In general, a pairwise sequence alignment can be done correctly when the sequence identity is over 40%.
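The identity thresholds mentioned so far can be made concrete with a small sketch. The function names and the wording of the labels are hypothetical, and the definition of identity used here (identical residues divided by the aligned columns in which neither sequence has a gap) is an assumption, since several definitions are in use. The 40% and 25% cut-offs are the ones quoted in the talk.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity of two already-aligned sequences ('-' marks a gap).

    Assumption: only columns without a gap in either sequence are counted.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def alignment_outlook(identity: float) -> str:
    """Map an identity value onto the rough categories quoted in the talk."""
    if identity > 40.0:
        return "alignment usually correct"
    if identity >= 25.0:
        return "alignment difficult"
    return "distant template, model likely unreliable"


if __name__ == "__main__":
    ident = percent_identity("ACDE-FG", "ACDEKFA")
    print(round(ident, 1), alignment_outlook(ident))
```

The sketch takes an existing alignment as input; it does not produce one, so the result is only as good as the alignment program that generated the two gapped strings.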
When the target-template sequence identity drops below 30%, misalignment may occur. If you have attended Dr Gao's lectures, you may remember the frequently used BLOSUM matrices, in particular BLOSUM62, whose number refers to a sequence identity of 62%. The article cites a paper that reaches this conclusion; that paper also carried out its evaluation with BLOSUM62 and found that the lowest usable sequence identity is 40%. I think this is easy to understand. Alongside BLOSUM62, the identity can also be 45% or 80%: those are BLOSUM45 and BLOSUM80. But when a scoring matrix is built, we usually use the portions that can be aligned easily, such as α-helices and β-pleated sheets, whereas the low-identity sections we usually encounter are irregular, like loops. This case may not be covered by BLOSUM45 and needs separate consideration. For example, in a protein sequence of known structure, the loop sequences could be treated as a separate category. I believe such software will be developed. Then we align the sequences, that is, align the target with the template accurately. The popular programs for multiple sequence alignment are also used in this step.

From the third step on, we can construct a model. First, according to their known conformations in other proteins, we place these domains on a working table, as parts. Then comes segment matching. Because the identity of some of the parts under study is lower, and some proteins carry additional sequences, we apply some local deformation in these cases or do not treat them as α-helix domains. The chemical bonds of an α-helix have many degrees of freedom, so the bond angles can change drastically. This can be the basis for the second step of reconstruction.
Steric hindrance should be the main consideration, as we learned in organic chemistry. An amino acid with a hydrogen as its side chain can move freely, but with a benzene ring its mobility is limited. So we first place these domains of the template on the working table and treat them as relatively fixed; then come the other, connecting regions, whose degrees of freedom can vary widely. These are the relevant programs. There are three methods for modelling: one is ab initio methods, one is database search techniques, and the last combines the two. Ab initio methods are used for mini-proteins, as no ab initio methods have been developed for long protein sequences. Then there is database searching, as I said before: for those motifs, we find known structures in the database. The combined use of both methods is popular.

The side chains, apart from the main motif, are a small and relatively specific part. When modelling side chains, the dihedral angles of the folded parts, and the rotation about these angles, should mainly be considered. Next, I will talk about rotamers. The main potential problem is prediction accuracy. Only half of the predictions are accurate in terms of the χ1 angle, and one third for both χ1 and χ2, owing to present computational accuracy. Even relatively short segments cannot be computed well if they are longer than 9 residues. As I just said, we cannot compute de novo predictions accurately, especially for long chains.

However, we can evaluate the predicted results in spite of these problems. The aspects to evaluate are similar to what we just mentioned. First of all, the accuracy of side-chain packing. Errors are common, especially when no templates are available. It is not easy to find templates when the similarity is below 30%, and the predictions are then not reliable. Of course, we may still get a wrong alignment at 30-40% similarity.
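To make the side-chain angles mentioned above more tangible, here is a minimal sketch of how a dihedral angle such as χ1 can be computed from four atom positions (for χ1 these would typically be N, Cα, Cβ and the γ atom). All function names are hypothetical. The three canonical rotamer values (-60°, 60°, 180°) and the 40° tolerance for calling a predicted angle "accurate" are assumptions for illustration, not values taken from the talk.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees, in (-180, 180], defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def nearest_rotamer(chi):
    """Closest of the assumed canonical staggered values -60, 60, 180."""
    return min((-60, 60, 180), key=lambda r: angular_difference(chi, r))


def chi_matches(predicted, reference, tolerance=40.0):
    """Assumed criterion: a prediction counts as accurate within `tolerance`."""
    return angular_difference(predicted, reference) <= tolerance


if __name__ == "__main__":
    chi = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
    print(chi, nearest_rotamer(chi))
```

Comparing angles through `angular_difference` rather than plain subtraction matters because 170° and -170° are only 20° apart on the circle.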
Incorrect templates: errors in a template may come from inaccurate experiments. Then there is the question of whether the model has the correct fold. While doing the prediction, de novo prediction in particular, we examine the energy of the chemical bonds from the viewpoint of physics or quantum chemistry, and assume that the state with the lowest energy is the most stable. The commonly used test is the Z-score test, based on the normal distribution. And again, sequence similarity below 30% is not a good predictor. The third aspect is the environment. DNA takes its A, B and Z conformations in different environments, and the same holds for proteins. For example, the hydrophilicity of a protein might differ between trafficking vesicles and the cytosol. So in general, de novo prediction would be better in this respect. The fourth is stereochemistry, in which we analyse whether the bond lengths and bond angles are consistent with the lowest-energy principle. To judge whether a bond length or bond angle is in the right range, we refer to the ranges that previous researchers derived from the lowest-energy principle and see whether the predictions fall within them. As for spatial features, for example, if there is a large side chain, an angle that would have to accommodate that side chain cannot be formed. Indeed, the best evaluation is to obtain the crystal structure, build the protein structure from real X-ray results, and compare our prediction with it.

We see many immature aspects in modelling at present. Even so, it already has a great impact on our research. For example, in low-accuracy modelling, at 30% sequence identity, fewer than 50% of the Cα atoms, that is, the α-carbon atoms of the main chain, lie within 3.5 Å
of their correct positions. That is, more than half of them are off by more than 3.5 Å. In medium-accuracy modelling, at 30-50% sequence identity, 85% of the Cα atoms in the peptide chain are within 3.5 Å of their correct positions. In high-accuracy modelling, almost all are fine.

Low-accuracy modelling is used to judge roughly whether two proteins can actually interact with each other. At this accuracy, most carbon atoms in effect swing over about double the space. Even so, we can judge roughly whether proteins can interact. After all, a protein is made up of amino acids, and its structure is far larger than a carbon atom, so this is acceptable. Medium-accuracy modelling can be used for some functional predictions and to guide point mutations. Here the positions of most carbon atoms are accurate, which means the estimate of the active center is accurate. An active center is no more than 100 amino acids, generally a few dozen; about 20 amino acids is already relatively many. At this scale, we think medium-accuracy modelling can support research on point mutations. High-accuracy modelling means that most carbon atoms are placed with atomic accuracy, which shows that we can study ligands as well as protein-ligand interactions.

Although the field is not old, we already have methods that are sufficient for structural prediction based on molecular biology and biochemistry. That is, structural modelling based on sequence homology, whether low, medium or high accuracy, has achieved its goals. Of course, this has been reached only to some degree and the problem has not been finally solved. After all, high-accuracy models are still very few. Thank you all.
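The accuracy measure used in the talk, the fraction of Cα atoms lying within 3.5 Å of their correct positions, can be sketched as follows. The function name is hypothetical, and the sketch assumes that the model and the reference structure have already been superposed and that the two coordinate lists pair up residue by residue; neither step is shown here.

```python
import math


def fraction_within_cutoff(model_ca, reference_ca, cutoff=3.5):
    """Fraction of C-alpha atoms within `cutoff` angstroms of the reference.

    Assumptions: both lists hold (x, y, z) tuples in the same residue order,
    and the two structures are already superposed.
    """
    if len(model_ca) != len(reference_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        raise ValueError("coordinate lists must not be empty")
    close = sum(
        1
        for m, r in zip(model_ca, reference_ca)
        if math.dist(m, r) <= cutoff  # Euclidean distance, Python 3.8+
    )
    return close / len(model_ca)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0), (10.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(fraction_within_cutoff(model, reference))
```

A value below 0.5 would correspond to the low-accuracy case described above, and a value around 0.85 to the medium-accuracy case.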