Homology Modelling

Because sequences that share a sufficient level of amino acid similarity generally adopt similar 3D structures, several resources cross-link the sequence and structure databases. However, the term homology, a fundamental concept in bioinformatics, is often used incorrectly. Sequences are homologous if they are related by divergence from a common ancestor (as a first consequence, searching the sequence databases for homologues is used to obtain indications of protein function). Conversely, analogy refers to the acquisition of common structural or functional features via convergent evolution from unrelated ancestors. Homology is therefore not a measure of similarity but an absolute statement that sequences have a divergent rather than a convergent relationship. Among homologous sequences we can distinguish orthologs (proteins having the same function in different species) and paralogs (proteins performing different but related functions within one organism).

Building a model of a target structure from data extracted from homologous sequences of known structure (parents or templates) is called comparative modelling. The approach can also be extended to homologues with a low percentage of identity. All current comparative modelling methods consist of four sequential steps: 1) fold assignment and template selection; 2) template-target alignment; 3) model building; and 4) model evaluation.

STEP 1: Fold assignment

To start the modelling process, we have to identify the template and define an alignment (residue-by-residue equivalences between the target and the template sequences). In homology modelling the stretches to be built are chosen according to their sequence alignment, so this is the most crucial step of the process: any errors at this stage are usually impossible to correct later. The sequences of the fold with the highest similarity to the target sequence are taken as parents or templates. Currently, around 40% of all protein sequences can have at least one domain modelled on a related known protein structure. In particular, some proteins have very low sequence identity and yet share the same fold and a closely related function. The current theory of evolution holds that such structures, having diverged from a common ancestor, often retain some functional and sequence similarity. In addition, divergent evolution has recently been reported, on the basis of the evolution of a biochemical pathway, for some proteins with a common (βα)8 barrel fold for which no sequence similarity was detected.

Originally, searches for sequences homologous to the target were done with local alignment programs such as FASTA, SSEARCH or BLAST, which find identities shared between pairs of related sequences. With the high rate at which new sequences become available from genomic initiatives, sensitive methods for recognising distant homologies have grown in importance. Such methods are the main source of annotation; hence, very sensitive fold recognition approaches have been developed in the last decade, and they have succeeded to different degrees in identifying relationships between remote homologues. These methods include:

Moreover, any additional information about the structure can improve on recognition by sequence alone. For example, secondary structure prediction can help to validate the alignment and to identify related proteins with divergent sequences, and it increases the number of potential templates. Recent studies comparing and evaluating searching/aligning methods showed that, for an E-value set to 10, the percentage of true positives (similar 3D structure) ranged from 64.7% (SSEARCH) to 96.1% (BLAST), whereas the percentage of false positives ranged from 35.3% to 3.9%. On the other hand, the well-known position-specific alignment method PSI-BLAST succeeded in finding remote structural homologues in 21% of 246 searches. In general, PSI-BLAST correctly aligns 40% of the residues when the sequence identity is larger than 15%. Consequently, PSI-BLAST is acknowledged as one of the most powerful tools for detecting remote evolutionary relationships from sequence considerations only. Two features explain the success of profile methods. First, unlike ISS, they align the sequences with high similarity and use the resulting profile in the next search. Second, the distribution of local alignment scores of random sequences is used to determine the significance of each alignment, which is the crucial step for finding the next related sequences. Going further, Rychlewski et al. developed a profile-profile search procedure (FFAS) that, according to the authors, gave better results than PSI-BLAST: it is more sensitive and accurate because it uses the Smith-Waterman dynamic programming routine to obtain the optimal alignment.
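
The Smith-Waterman routine mentioned above can be illustrated with a minimal sketch. The scoring scheme here (a flat match/mismatch score and a linear gap penalty) is an assumption chosen for brevity; real search programs use substitution matrices and affine gap penalties, and FFAS aligns profiles rather than plain sequences. The function name is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Assumed toy scoring: flat match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; scores never drop below zero,
    # which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For two sequences that share only the segment ACDEF, such as `smith_waterman("MKTACDEFGH", "PPACDEFWW")`, the routine reports that segment and ignores the unrelated flanks, which is the behaviour that makes local alignment suitable for database searches.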


STEP 2: Template selection and alignment

For the template selection, one or more templates can be used. The use of multiple templates is not justified when the sequence spread between parents, relative to the target, is not appropriate for the level of expected model error. If the average sequence identity between target and parents is larger than 40% and the sequence spread between parents is too small, a single parent is used. The database search produces several local alignments, ranked by the score relating target and template sequences. However, the top-scoring alignment is not necessarily the best one for identifying residue correspondences and constructing the target conformation, because the search procedure was tuned to find remote homologues rather than to produce the best alignment. Therefore, although target and templates are likely to be correctly aligned if they share more than 40% identity, they need to be realigned if they fall in the "twilight zone", sharing less than 30% identity.
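
The identity thresholds above can be expressed as a small sketch. The choice of denominator (columns where neither sequence has a gap) is an assumption, since several conventions for percent identity exist, and the label given to the 30-40% range is also only illustrative. Function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_target, aligned_template)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def alignment_zone(identity):
    """Classify an alignment using the 40% and 30% thresholds from the text."""
    if identity > 40.0:
        return "likely correctly aligned"
    if identity < 30.0:
        return "twilight zone: realign"
    return "intermediate: inspect alignment"
```

A typical use would be to compute `percent_identity` on each target-template pair returned by the search and realign only those pairs that `alignment_zone` places in the twilight zone.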

The optimal alignment between homologous proteins, one of them with known 3D structure (the template), is then used to construct a model of the spatial structure of the target. However, even after superposition of the protein cores, amino acids in loop regions can be significantly displaced. At least 2/3 of comparative protein modelling cases are based on less than 40% sequence identity between target and templates. To reach a reasonable level of accuracy, the models must be based on alignments with few errors. Such alignments can usually be obtained when the sequence identity between the modelled sequence and at least one known structure is larger than 30%. A remarkable improvement is obtained by using global multiple sequence alignments plus additional structural information instead of the pairwise local alignments used in the search for likely relatives. Several alignment programs (MULTIALIGN, MULTAL, CLUSTALW) have been tested against a database of correctly aligned multiple sequences (BAliBASE). The more recent approaches have obtained extraordinarily good results: those that include local and pre-processed alignments, such as those already found by PSI-BLAST (e.g. DbClustal); those that recalculate the local alignments (e.g. with Lalign) and the pre-processed alignments of segment pairs (e.g. with Dialign2), such as T-Coffee; and those that iteratively refine the multiple alignment, such as Prrp.

Nevertheless, all these alignments lose the structural information provided by the templates of known conformation. On superimposing very similar structures, one can immediately distinguish regions of higher conservation, commonly referred to as structurally conserved regions (SCRs), whereas the regions with the largest differences in conformation are referred to as structurally variable regions (SVRs). To avoid the loss of structural information we suggest the following realignment between the target sequence and the template:

STEP 3: Model building

Methods of model building

    Two main methods are used to build the 3D structure in homology modelling; they differ in the definition of the function F that transforms sequence space into structure space. The first is based on rigid body superimposition and the second on geometric restraints, by analogy with the molecular replacement and distance geometry methodologies described for X-ray and NMR structure determination, respectively.

    Several algorithms have been developed to obtain a rigid body superimposition between sequences that are not directly related (JIG-SAW, COMPOSER, among others). SCR construction follows the original approach of Greer, using sequentially similar SCRs from homologous proteins to define the new core from a multiple alignment: 1) superimposing the known structures of homologous proteins (parents) using the SCRs to construct a framework; 2) superimposing the template sequence closest to the target sequence on the averaged main chain of the framework; 3) building the SVR main chain conformations by fitting compatible structures onto the anchored stumps of the framework (see the section on SVR modelling for the identification of the stretches to use); and 4) completing the target structure by modelling the side-chains of the target sequence.
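
The framework construction in step 1 can be sketched as follows, assuming the parents have already been superimposed and reduced to Cα coordinates at equivalent aligned positions. The 1.0 Å deviation cutoff and the minimum SCR length of three residues are illustrative assumptions, not values taken from the programs named above, and the function name is hypothetical.

```python
import math


def framework_and_scrs(parents, cutoff=1.0, min_len=3):
    """Average superimposed parents and locate structurally conserved regions.

    parents: list of structures, each a list of (x, y, z) C-alpha coordinates
             for the same aligned positions, already superimposed.
    Returns (framework, scrs): the mean coordinate per position, and a list
    of inclusive (start, end) index ranges where every parent lies within
    `cutoff` of the mean for at least `min_len` consecutive positions.
    """
    n_positions = len(parents[0])
    framework, conserved = [], []
    for k in range(n_positions):
        points = [p[k] for p in parents]
        mean = tuple(sum(pt[d] for pt in points) / len(points) for d in range(3))
        framework.append(mean)
        deviation = max(math.dist(pt, mean) for pt in points)
        conserved.append(deviation <= cutoff)

    # Collect runs of conserved positions; the sentinel False closes a final run.
    scrs, start = [], None
    for k, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = k
        elif not ok and start is not None:
            if k - start >= min_len:
                scrs.append((start, k - 1))
            start = None
    return framework, scrs
```

Positions outside the returned ranges correspond to the SVRs, whose main chain has to be built separately in step 3.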

    The methods based on the satisfaction of spatial restraints (like MODELLER) generate as many constraints (or restraints) as possible from the structural alignments of the parents and build the target structure as in NMR methods (using additional energetic restraints according to the correct stereochemistry of the protein polymer). Clearly, regions where the homologous templates cannot be structurally aligned, or where no alignment between the target and the multiple alignment of the templates is given, will have to be built with an additional function. Most of the structural changes occur in the loop regions, but occasionally secondary structures may also be involved in variable regions. In the case of multiple superimposed parents the coordinates are separated into conserved secondary structure elements and conserved loops.
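
The idea of deriving restraints from the templates and scoring a model against them can be illustrated with a minimal sketch. It uses only Cα-Cα distance restraints with a harmonic penalty; the 10 Å distance limit, the single `sigma` for all restraints and the function names are assumptions, and this is not the objective function of MODELLER, which combines many restraint types expressed as probability density functions.

```python
import math
from itertools import combinations


def derive_restraints(template, equivalences, max_dist=10.0):
    """Build C-alpha distance restraints for the target from one template.

    template:     dict template_residue -> (x, y, z)
    equivalences: dict target_residue -> template_residue (from the alignment)
    Returns a list of (target_i, target_j, d0) for template distances
    up to `max_dist`.
    """
    restraints = []
    for i, j in combinations(sorted(equivalences), 2):
        d0 = math.dist(template[equivalences[i]], template[equivalences[j]])
        if d0 <= max_dist:
            restraints.append((i, j, d0))
    return restraints


def restraint_violation(model, restraints, sigma=1.0):
    """Sum of harmonic penalties ((d - d0) / sigma)^2 over all restraints."""
    return sum(((math.dist(model[i], model[j]) - d0) / sigma) ** 2
               for i, j, d0 in restraints)
```

A model builder would then search for the coordinates that minimise `restraint_violation` together with stereochemical terms; target residues without an entry in `equivalences` receive no restraints, which is why unaligned regions need an additional function.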

Model building of SVRs

    SVR modelling can be seen as a mini protein folding problem. The methods for predicting loop conformation fall into two groups: ab initio methods, and database searching or knowledge-based approaches.

    1. Ab initio prediction is based on a conformational search guided by a scoring or energy function: (φ,ψ) space sampling; minimum perturbation random tweak method; systematic conformational search; global energy minimization; local energy minimization; molecular dynamics simulations; genetic algorithms; Monte Carlo and molecular dynamics; Monte Carlo sampling; multiple copy sampling; searching discrete conformations by dynamic programming; self-consistent field optimization; among others.

    2. The database approach to loop prediction consists of finding a segment of main chain that fits the two stem regions of a loop. The procedure has improved since the early works on modelling: in the last few years, instead of a single conformation, a number of loop conformations that are as uniformly spread as possible are selected for each gap. Hence, the remaining loops in multiple-parent modelling and all loops in single-parent modelling are modelled from searches in three different databases: 1) homologous structures; 2) a cluster database of loops; and 3) a nonredundant database of proteins with less than 25% sequence identity and resolution better than 2.5 Å.

    The chosen cluster of loop conformations must satisfy two requirements: 1) it must fit between the two bracing secondary structures, and 2) its sequence pattern must be present in the target loop to be modelled. This procedure is based on the successful work on canonical loop structures of immunoglobulin complementarity determining regions (CDRs) by Chothia et al. Nevertheless, the database search is valid only for short and medium sized loops, or for special cases where homologous proteins share some structural commonalities in loops that are still considered variable regions (as is the case for immunoglobulins). To date, classifications of long loops have failed, and it has been demonstrated that a correlation between the geometric variables describing the loop stems is needed to obtain such a classification. This correlation has only been established for short and medium sized loops.
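
The two requirements on a loop cluster can be sketched as a simple filter. This is a deliberately reduced picture: the stem geometry is summarised by a single end-to-end distance, whereas the loop classifications cited here use several geometric variables, and the sequence pattern is written as a regular expression. The dictionary keys, the 1.0 Å tolerance and the function name are all assumptions made for illustration.

```python
import re


def select_loop_clusters(clusters, target_loop_seq, target_stem_distance, tol=1.0):
    """Return names of loop clusters compatible with the target gap.

    clusters: list of dicts with hypothetical keys
        'name'          - cluster identifier
        'stem_distance' - representative distance between the stem anchors
        'pattern'       - regular expression describing the cluster sequence
    A cluster is kept when (1) its stem distance is within `tol` of the gap
    in the target framework and (2) the target loop sequence matches its
    pattern. Results are ordered by geometric fit, best first.
    """
    hits = []
    for cluster in clusters:
        misfit = abs(cluster["stem_distance"] - target_stem_distance)
        if misfit > tol:
            continue
        if not re.fullmatch(cluster["pattern"], target_loop_seq):
            continue
        hits.append((misfit, cluster["name"]))
    return [name for _, name in sorted(hits)]
```

When the filter returns several clusters, each can contribute a candidate conformation for the gap, in line with the practice of keeping a spread of loop conformations rather than a single one.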

Side-chain construction

The side chains of the template-derived components need to be changed to those of the target structure. The side-chain packing problem is that of finding the arrangement of side-chain conformations on a given fixed backbone. Vasquez reviewed various approaches to side-chain modelling; the major problem in predicting side-chain conformations is again combinatorial. The strategy for modelling side-chains is therefore to reduce the dimension of the problem by incorporating as much empirical information as possible. Heuristic procedures either forgo any attempt to solve the combinatorial problem, or conduct some degree of combinatorial optimization in a solution space that has been reduced by local analysis. For example, significant correlations are found between side-chain dihedral angles and the backbone that go beyond the dependence on secondary structure. Therefore, in homology building the conformations of the side-chains are copied from a homologous template: a single rotamer is built for each side chain that traces as far as possible the path of the original side-chain. Nevertheless, the conservation of side-chain packing decreases rapidly when the sequence identity falls below 30%, which calls for other strategies of dimensional reduction. An important piece of information is that side-chains can be grouped into representative sets of rotamers with specific distributions. Consequently, a library of rotamers taken from the database of protein structures can be used as an alternative for modelling the side-chains. First, the additional internal coordinates needed to complete the side chain are taken from a secondary structure dependent rotamer library. Second, the side-chain is chosen, by optimization procedures derived from the mean field theory approximation, from additional rotamers representing high population densities in the PDB.

Energy-based procedures rely on the assumption that lower energy values necessarily correlate with more accurate positioning. This puts the burden on the quality of the particular energy function used. The potential energy function has several limitations for structure prediction in vacuum. When modelling side-chains on the surface of the protein, their interaction with the solvent cannot be calculated, because water molecules cannot be included with the rotamers from the library. Karplus and coworkers obtained an accuracy of around 70% in the modelling of side-chains when testing the accuracy of new force fields. They demonstrated that the absence of solvent introduces an error in the hydrogen-bonding pattern of polar residues, which makes the inclusion of electrostatic and solvation effects necessary. The success in solving the rotamer-packing problem has enabled these strategies to be incorporated into docking procedures that evaluate protein-protein interactions.
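
The mean field treatment of the rotamer-packing problem can be illustrated with a toy sketch. Each residue carries a probability for each of its library rotamers; the effective energy of a rotamer is its own energy plus the probability-weighted interaction with the rotamers of every other residue, and the probabilities are updated with Boltzmann weights until they stabilise. The energy inputs, the value of `kT`, the fixed number of iterations and the function name are assumptions for illustration, and such a scheme can settle on a local rather than the global solution.

```python
import math


def mean_field_rotamers(self_energy, pair_energy, kT=0.6, iterations=50):
    """Pick one rotamer per residue by self-consistent mean field iteration.

    self_energy[i][r]       : energy of rotamer r of residue i with the backbone
    pair_energy(i, r, j, s) : interaction energy between rotamer r of residue i
                              and rotamer s of residue j
    Returns the index of the most probable rotamer for each residue.
    """
    n = len(self_energy)
    # Start from uniform probabilities over each residue's rotamers.
    probs = [[1.0 / len(e)] * len(e) for e in self_energy]

    for _ in range(iterations):
        for i in range(n):
            effective = []
            for r in range(len(self_energy[i])):
                energy = self_energy[i][r]
                for j in range(n):
                    if j != i:
                        energy += sum(probs[j][s] * pair_energy(i, r, j, s)
                                      for s in range(len(self_energy[j])))
                effective.append(energy)
            # Boltzmann weights, shifted by the minimum for numerical stability.
            lowest = min(effective)
            weights = [math.exp(-(e - lowest) / kT) for e in effective]
            total = sum(weights)
            probs[i] = [w / total for w in weights]

    return [max(range(len(p)), key=p.__getitem__) for p in probs]
```

The cost of each sweep grows with the number of rotamer pairs rather than with the number of rotamer combinations, which is how the approach sidesteps the combinatorial nature of the problem.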

STEP 4: Model evaluation

The errors in comparative modelling are mainly due to the lack of templates and to the decrease in sequence identity between the target and the templates. These errors fall into five categories:

The evaluation of a model is critical for testing and selecting the best and most accurate model or models. Additionally, the environment can have an important influence on the accuracy of the model, particularly if the protein structure is coordinated to metals or the template used is part of a complex with other molecular compounds. Two criteria are used to filter the models: 1) energetic approaches; and 2) experimental data. In the first step, the model is checked for the correct stereochemistry of a protein polymer. This is done with programs like PROCHECK, AQUA, SQUID or WHATCHECK, and violations can be fixed with optimization programs based on molecular mechanics such as CHARMM, GROMOS, AMBER, X-PLOR or WHAT IF. This implies a final refinement step that has to be taken cautiously, mainly because the optimization is done in the wrong environment (i.e. with no solvation, no ions and not necessarily meaningful side-chain conformations). The refinement is meant simply to remove drastic local clashes and consists of a few cycles (100-1000) of steepest descent or conjugate gradient minimization, run until convergence. The next step of the evaluation is the assessment of the fold, which covers the order and length of the secondary structure elements and the use of energy profiles based on statistical criteria extracted from the structure domain classifications. The structure is thus assigned a Z-score, calculated by means of fold prediction methodologies, that indicates the regions wrongly modelled (in statistical terms). The programs VERIFY3D, PROSAII, HARMONY and ANOLEA are among those implementing this approach. In summary, these methods compare the modelled conformation with the expected or standard conformation observed in protein structures solved by X-ray crystallography.

This approach has drawn some criticism: it is reasonable to expect the individual contribution of each residue to the overall energy to vary widely, so there need not be a correlation between wrongly modelled regions and the value of the mean force potential in the region. Still, some applications have validated the method in combination with additional information (secondary structure) to refine the models. The work of Aloy et al. is a clear example in which mean force potentials detect wrongly modelled regions; it suggests a method to improve model building by: 1) distinguishing the wrongly modelled regions; 2) selecting the best model among several candidates; and 3) selecting a candidate refined structure after inclusion of additional information (i.e. secondary structure).
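
The two quantities discussed above, a global Z-score and a per-residue energy profile, can be sketched as follows. The reference energies are assumed to come from the same sequence placed on alternative conformations, and the profile is smoothed with a sliding window because single-residue energies vary widely; the window size, the threshold and the function names are illustrative assumptions rather than the settings of any of the programs named above.

```python
from statistics import mean, pstdev


def z_score(model_energy, reference_energies):
    """Number of standard deviations between the model energy and the mean
    of a set of reference energies (more negative is better)."""
    return (model_energy - mean(reference_energies)) / pstdev(reference_energies)


def flag_regions(residue_energies, window=5, threshold=0.0):
    """Return residue indices whose window-averaged energy exceeds `threshold`.

    Averaging over a window centred on each residue damps the large
    residue-to-residue variation of the raw energies; windows are truncated
    at the chain ends.
    """
    half = window // 2
    flagged = []
    for k in range(len(residue_energies)):
        lo = max(0, k - half)
        hi = min(len(residue_energies), k + half + 1)
        if mean(residue_energies[lo:hi]) > threshold:
            flagged.append(k)
    return flagged
```

Stretches of consecutive flagged residues are candidates for rebuilding, and the Z-score can be used to rank alternative models of the same target.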

Finally, the recent work of Lazaridis and Karplus shows that including solvation (environmental) terms improves the classical molecular mechanics calculation of the energy for detecting wrongly modelled regions. Consequently, the criticism of the potential of mean force cannot be applied to this approach, which performed as well as statistical functions in discriminating correct and misfolded models.

The experimental evaluation of the model can only be done by site-directed mutagenesis or with additional information that is not commonly available. One way to avoid the experiment is to use the knowledge obtained from a widely spread multiple alignment of related sequences, introducing the following conditions:



N. Alexandrov and R. Luethy. (1998). Alignment algorithm for homology modeling and threading. Protein Sci 7, 254-258.

B. Al-Lazikani, A. Lesk and C. Chothia. (1997). Standard conformations for the canonical structures of immunoglobulins. J. Mol. Biol. 273, 927-948.

P. Aloy, J. Mas, M. Martí-Renom, E. Querol, F. Avilés and B. Oliva. (2000). Refinement of modelled structures by knowledge based energy profiles and secondary structure prediction: Application to the Human Procarboxypeptidase A2. J Comput-Aided Molec. Des. 14, 83-92.

S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. Lipman. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

T. Attwood. (2000). The Babel of Bioinformatics. Science 290, 471-473.

A. Bairoch and R. Apweiler. (1997). The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res. 25, 31-36.

G. Barton and M. Sternberg. (1987). A strategy for the rapid multiple alignment of protein sequences; confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337.

A. Bateman, E. Birney, R. Durbin, S. Eddy, K. Howe and E. Sonnhammer. (2000). The Pfam protein family database. Nucleic Acids Res. 28, 263-266.

P. Bates and M. Sternberg. (1998). From Sequence to Structure. Protein Structure Prediction: A practical approach (M. Sternberg, Ed.), Oxford Univ. Press, Oxford, UK.

P. A. Bates and M. Sternberg. (1999). Model building by comparison at CASP3: Using expert knowledge and computer automation. Proteins: Struct., Func. and Gene. Suppl. 3, 47-54.

J. U. Bowie, R. Lüthy and D. Eisenberg. (1991). A method to identify protein sequences that fold into a known 3D structure. Science 253, 164-170.

B. Brooks, R. Bruccoleri, B. Olafson, D. States, S. Swaminathan and M. Karplus. (1983). CHARMM: a program for macromolecular energy minimization and dynamics calculations. J. Comp. Chem. 4, 187-217.

R. Bruccoleri and M. Karplus. (1987). Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers 26, 137-138.

A. Brünger. (1992). X-PLOR: A system for X-ray crystallography and NMR. Yale University Press, New Haven.

V. Collura, J. Higo and J. Garnier. (1993). Modeling of protein loops by simulated annealing. Protein Sci. 2, 1502-1510.

R. Copley and P. Bork. (2000). Homology among (βα)8 barrels: implications for the evolution of metabolic pathways. J. Mol. Biol. 303, 627-640.

C. Chothia, A. Lesk, A. Tramontano, M. Levitt, S. Smith-Gill, G. Air, S. Sheriff, E. Padlan, D. Davies, W. Tulip, P. Colman, S. Spinelli, P. Alzari and R. Poljak. (1989). Conformations of Immunoglobulin Hypervariable Regions. Nature 342, 877-883.

S. Chung and S. Subbiah. (1996). A structural explanation for the twilight zone of protein sequence homology. Structure 4, 1123-1127.

C. Deane, Q. Kaas and T. Blundell. (2001). SCORE: predicting the core of protein models. Bioinformatics 17, 541-550.

R. Dima, J. Banavar and A. Maritan. (2000). Scoring functions in protein folding and design. Protein Sci. 9, 812-819.

F. S. Domingues, W. A. Koppensteiner, M. Jaritz, A. Prlic, C. Weichenberger, M. Wiederstein, H. Floeckner, P. Lackner and M. Sippl. (1999). Sustained performance of knowledge-based potentials in fold recognition. Proteins: Struct., Func. & Gene. Suppl. 3, 112-120.

L. Donate, S. Rufino, L. Canard and T. Blundell. (1996). Conformational analysis and clustering of short and medium size loops connecting regular secondary structures. A database for modelling and prediction. Protein Sci. 5, 2600-2616.

M. Dudeck, K. Ramnarayan and J. Ponder. (1998). Protein structure prediction using a combination of sequence homology and global energy minimization: II. Energy functions. J. Comp. Chem. 19, 548-573.

S. Eddy. (1998). Profile hidden Markov models. Bioinformatics 14, 755-763.

K. Fidelis, P. Stern, D. Bacon and J. Moult. (1994). Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng. 7, 953-960.

D. Fischer and D. Eisenberg. (1996). Protein fold recognition using sequence-derived predictions. Protein Science 5, 947-955.

A. Fiser, R. Do and A. Sali. (2000). Modeling of loops in protein structures. Protein Sci. 9, 1753-1773.

I. Friedberg, T. Kaplan and H. Margalit. (2000). Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci. 9, 2278-2284.

D. W. Gatchell, S. Dennis and S. Vajda. (2000). Discrimination of Near-native Protein Structures from Misfolded Models by Empirical Free Energy Functions. Proteins: Struct., Func. & Gene. 41, 518-534.

C. Geourjon, C. Combet, C. Blanchet and G. Deleague. (2001). Identification of related proteins with weak sequence identity using secondary structure information. Protein Sci. 10, 788-797.

O. Gotoh. (1996). Significant improvement in accuracy of multiple sequence alignments by iterative refinements assessed by reference to structural alignments. J. Mol. Biol. 264, 823-838.

J. Greer. (1990). Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins: Struc. Func. and Gene. 7, 317-334.

W. v. Gunsteren, S. Billeter, A. Eising, P. Hünenberger, P. Früger, A. Mark, W. Scott and I. Tironi. (1996). Biomolecular Simulation: The GROMOS96 Manual and User Guide. Verlag der Fachvereine, Zürich.

R. Hooft, G. Vriend and C. Sander. (1996). Verification of protein structures: side-chain planarity. J. Appl. Crystallogr. 29, 714-716.

X. Huang and W. Miller. (1991). A time-efficient linear-space local similarity algorithm. Advan. Appl. Math. 12, 337-357.

J. Irving, J. Whisstock and A. Lesk. (2001). Protein structural alignments and functional genomics. Proteins: Struc. Func. and Gene. 42, 378-382.

L. Jaroszewski, L. Rychlewski and A. Godzik. (2000). Improving the quality of twilight-zone alignments. Protein Sci. 9, 1487-1496.

A. Jennings, C. Edge and M. Sternberg. (2001). An approach to improving multiple alignments of protein sequences using predicted secondary structure. Protein Eng. 14, 227-231.

D. Jones. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.

T. A. Jones and S. Thirup. (1986). Using known substructures in protein model building and crystallography. EMBO J. 5, 819-822.

K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate and R. Hughey. (1999). Predicting protein structure using only sequence information. Proteins: Struc. Func. and Gene. Suppl 3, 121-125.

L. A. Kelley, R. M. MacCallum and M. Sternberg. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520.

A. Kidera. (1995). Enhanced conformational sampling in Monte Carlo simulations of proteins: Applications to a constrained peptide. Proc. Natl. Acad. Sci. USA 92, 9886-9889.

P. Koehl and M. Delarue. (1995). A self-consistent mean field approach to simultaneous gap closure and side-chain positioning in protein homology modeling. Nat. Struct. Biol. 2, 163-170.

P. Koehl and M. Delarue. (1996). Mean-field minimization methods for biological macromolecules. Curr. Opin. Struct. Biol. 6, 222-226.

R. Laskowski, M. MacArthur and J. Thornton. (1998). Validation of Protein models derived from experiment. Curr. Opin. Struct. Biol. 5, 631-639.

T. Lazaridis and M. Karplus. (1999). Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288, 477-487.

R. Lüthy, J. U. Bowie and D. Eisenberg. (1992). Assessment of protein models with three dimensional profiles. Nature 356, 83-85.

A. Martin, J. Cheetham and A. Rees. (1989). Modeling antibody hypervariable loops: a combined algorithm. Proc. Natl. Acad. Sci. USA 86, 9268-9272.

A. Martin and J. Thornton. (1996). Structural Families in Loops of Homologous Proteins: Automatic Classification, Modelling and Application to Antibodies. J. Mol. Biol. 263, 800-815.

M. Martí-Renom, J. Mas, P. Aloy, E. Querol, F. Aviles and B. Oliva. (1998). Statistical Analysis of the loop-geometry on a non-redundant database of proteins. J Mol. Mod. 4, 347-354.

M. A. Martí-Renom, A. Stuart, A. Fisher, R. Sánchez, F. Melo and A. Sali. (2000). Comparative protein structure modeling of genes and genomes. Ann. Rev. Biophys. Biomolec. Struc. 29, 291-325.

C. Mattos, G. Petsko and M. Karplus. (1994). Analysis of two residue turns in proteins. J. Mol. Biol. 238, 733-747.

M. McGregor, S. Islam and M. Sternberg. (1987). Analysis of the relationship between side-chain conformation and secondary structure in globular proteins. J. Mol. Biol. 198, 295-310.

F. Melo and E. Feytmans. (1997). Novel knowledge-based mean force potential at atomic level. J. Mol. Biol. 267, 207-222.

F. Melo and E. Feytmans. (1998). Assessing protein structures with a non local atomic interaction energy. J. Mol. Biol. 277, 1141-1152.

V. Morea, A. Tramontano, M. Rustici, C. Chothia and A. Lesk. (1998). Conformations of the third hypervariable region in the VH domain of immunoglobulins. J. Mol. Biol. 275, 265-294.

B. Morgenstern. (1999). Dialign2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211-218.

J. Moult and M. James. (1986). An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins: Struc. Func. and Gene. 1, 156-163.

N. Nakajima, J. Higo and A. Kidera. (2000). Free energy landscapes of peptides by enhanced conformational sampling. J. Mol Biol. 296, 197-216.

C. Notredame, D. Higgins and J. Heringa. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205-217.

T. Oldfield. (1992). Squid: a program for the analysis and display of data from crystallography and molecular dynamics. J. Mol. Graph. 10, 247-252.

B. Oliva, P. Bates, E. Querol, F. Avilés and M. Sternberg. (1997). An automatic Classification of the structure of protein loops. J. Mol. Biol. 266, 814-830.

B. Oliva, P. Bates, E. Querol, F. Avilés and M. Sternberg. (1998). Automated Classification of Antibody Complementarity Determining Region 3 of the Heavy Chain (H3) Loops into Canonical Forms and Its Application to Protein Structure Prediction. J. Mol. Biol. 279, 1193-1210.

O. Olmea, B. Rost and A. Valencia. (1999). Effective use of sequence correlation and conservation in fold recognition. J. Mol. Biol. 293, 1221-1239.

A. Panchenko, A. Marchler-Bauer and S. H. Bryant. (2000). Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. 296, 1319-1331.

K. Pawlowski, A. Bierzynski and A. Godzik. (1996). Structural diversity in a family of homologous proteins. J. Mol. Biol. 258, 349-366.

W. Pearson. (1996). Effective protein sequence comparison. Meth. Enz. 266, 227-258.

W. Pearson and D. Lipman. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

R. Petrella, T. Lazaridis and M. Karplus. (1998). Protein sidechain conformer prediction: a test of the energy function. Folding and Design 3, 353-377.

C. Rapp and R. Friesner. (1999). Prediction of loop geometries using a generalized Born model of solvation effect. Proteins: Struc., Func. and Gene. 35, 173-183.

C. Ring and F. Cohen. (1994). Conformational sampling of loop structures using genetic algorithm. Isr. J. Chem. 34, 245-252.

D. Rosenbach and R. Rosenfeld. (1995). Simultaneous modeling of multiple loops in proteins. Protein Sci. 4, 496-505.

B. Rost. (1999). Twilight zone of proteins sequence alignments. Protein Eng. 12, 85-94.

S. Rufino, L. Donate, L. Canard and T. Blundell. (1997). Predicting the Conformational Class of Short and Medium Size Loops Connecting Regular Secondary Structures: Application to Comparative Modelling. J. Mol. Biol. 267, 352-367.

R. Russell, M. Saqi, R. Sayle, P. Bates and M. Sternberg. (1997). Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. 269, 423-439.

R. Russell, P. Sasieni and M. Sternberg. (1998). Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282, 903-918.

L. Rychlewski, L. Jaroszewski, L. Weizhong and A. Godzik. (2000). Comparison of sequence profiles. Structural prediction with no structure information. Protein Sci. 8, 232-241.

G. Salem, E. Hutchinson, C. Orengo and J. Thornton. (1999). Correlation of observed Fold frequency with the occurrence of local structural motifs. J. Mol. Biol. 287, 969-981.

A. Sali and T. Blundell. (1993). Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.

R. Sánchez, U. Pieper, F. Melo, N. Eswar, M. Martí-Renom, M. Madhusudhan, N. Mirkovic and A. Sali. (2000). Protein Structure Modeling for Structural Genomics. Nature Struct. Biol. Suppl. November, 986-990.

R. Sánchez and A. Sali. (1997). Advances in comparative protein structure modeling. Curr. Opin. Struct. Biol. 7, 206-214.

R. Sánchez and A. Sali. (1997). Evaluation of comparative protein structure modeling by MODELLER-3. Proteins: Struc. Func. and Gene. Suppl 1, 50-58.

M. Saqi, R. Russell and M. Sternberg. (1999). Misleading local sequence alignment: implications for comparative modelling. Protein Eng. 11, 627-630.

J. Sauder, J. Arthur and R. Dunbrack. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Struc. Func. and Gene. 40, 6-22.

P. Shenkin, D. Yarmush, R. Fine, H. Wang and C. Levinthal. (1987). Predicting antibody hypervariable loop conformation: I. Ensembles of random conformations for ring-like structures. Biopolymers 26, 2053-2085.

H. Shirai, A. Kidera and H. Nakamura. (1999). H3-rules: identification of CDR-H3 structures in antibodies. FEBS Letters 455, 188-197.

M. Sippl. (1993). Recognition of errors in three-dimensional structures of proteins. Proteins: Struc. Func. and Gene. 17, 355-362.

K. Smith and B. Honig. (1994). Evaluation of the conformational free energies of loops in proteins. Proteins: Struc. Func. and Gene. 18, 119-132.

T. Smith and M. Waterman. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

M. Sternberg, P. Bates, L. Kelley and R. MacCallum. (1999). Progress in protein structure prediction: assessment of CASP3. Curr. Opin. Struct. Biol. 9, 368-373.

M. Sutcliffe, F. Hayes and T. Blundell. (1987). Knowledge-based modeling of homologous proteins, part II: rules for the conformations of substituted side-chains. Protein Eng. 1, 385-392.

M. Sutcliffe, F. Hayes, D. Carney and T. Blundell. (1987). Knowledge-based modeling of homologous proteins, part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng. 1, 377-384.

W. Taylor. (1988). A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161-169.

S. Teichmann, C. Chothia, G. Church and J. Park. (2000). Fast assignments of protein structures to sequences using the intermediate sequence library. Bioinformatics 16, 117-124.

J. Thompson, D. Higgins and T. Gibson. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.

J. Thompson, F. Plewniak and O. Poch. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87-88.

J. Thompson, F. Plewniak and O. Poch. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27, 2682-2690.

J. Thompson, F. Plewniak, J. Thierry and O. Poch. (2000). DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28, 2919-2926.

C. Topham, N. Srinivasan, C. Thorpe, J. Overington and N. Kalsheker. (1994). Comparative modeling of major house dust mite allergen Der p I: structure validation using an extended environmental amino acid propensity table. Protein Eng. 7, 869-894.

A. Torda. (1997). Perspectives in protein fold recognition. Curr. Opin. Struct. Biol. 7, 200-205.

A. Tramontano, C. Chothia and A. Lesk. (1989). Structural determinants of the conformations of medium sized loops in proteins. Proteins: Struc. Func. and Gene. 6, 382-394.

S. Vajda and C. DeLisi. (1990). Determining minimum energy conformations of polypeptides by dynamic programming. Biopolymers 29, 1755-1772.

M. Vasquez. (1996). Modeling side-chain conformation. Curr. Opin. Struct. Biol. 6, 217-221.

H. van Vlijmen and M. Karplus. (1997). PDB-based protein loop prediction: parameters for selection and methods for optimization. J. Mol. Biol. 267, 975-1001.

J. Wojcik, J. Mornon and J. Chomilier. (1999). New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. J. Mol. Biol. 289, 1469-1490.