Homology Modelling

Because proteins whose amino acid sequences are homologous above a given level generally adopt similar 3D structures, several resources cross-link the databases of sequences and structures. However, the term homology, a fundamental concept in bioinformatics, is often used incorrectly. Sequences are homologous if they are related by divergence from a common ancestor (as a first consequence, homology searches in the sequence databases are used to obtain indications of protein function). Conversely, analogy refers to the acquisition of common structural or functional features via convergent evolution from unrelated ancestors. Homology is not a measure of similarity, but rather an absolute statement that sequences have a divergent rather than a convergent relationship. Among homologous sequences we can distinguish orthologs (homologs separated by speciation, which typically have the same function in different species) and paralogs (homologs arising by gene duplication, which typically perform different but related functions within one organism).

Building a model of a target structure by comparison with data extracted from homologous sequences of known structure (parents or templates) is called comparative modelling. The approach can be extended to homologs with a low percentage of identity. All current comparative modelling methods consist of four sequential steps: 1) fold assignment and template selection; 2) template-target alignment; 3) model building; and 4) model evaluation.

STEP 1: Fold assignment

To start the modelling process, we have to identify the template and define an alignment (residue-by-residue equivalences between the target and the template sequences). In homology modelling the stretches to be built are chosen according to their sequence alignment; consequently, this is the most crucial step in the modelling process, and any errors at this stage are usually impossible to correct later. The sequences of the fold with the highest similarity to the target sequence are taken as parents or templates. Currently, around 40% of all protein sequences can have at least one domain modelled on a related known protein structure. In particular, some proteins can have very low sequence identity and yet share the same fold and a closely related function. The current theory of evolution would hold that such structures, having diverged from a common ancestor, often retain some functional and sequence similarity. In addition, divergent evolution has recently been reported, on the basis of the evolution of a biochemical pathway, for some proteins with a common (βα)₈-barrel fold for which no sequence similarity was detected.
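The residue-by-residue equivalences defined by an alignment are what percentages of sequence identity are computed from. The following is a minimal sketch, assuming two already-aligned strings of equal length with "-" as the gap character; the function name and the choice of denominator (columns where both sequences have a residue) are illustrative assumptions, and other conventions exist.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where both sequences have a residue.

    The denominator convention (gap-free columns only) is an assumption of this
    sketch; some tools divide by the shorter sequence or the full alignment length.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns that define a residue-residue equivalence.
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy aligned pair: 8 gap-free columns, 6 of them identical.
identity = percent_identity("MKT-AYIAK", "MKSQAYLAK")
```

With this convention the toy pair gives 75% identity; the value would change if the gapped column were counted in the denominator.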

Originally, searches for sequences homologous to the target were done with local alignment programs such as FASTA, SSEARCH or BLAST, which find identities shared between pairs of related sequences. With the high rate at which new sequences become available from genomic initiatives, the importance of sensitive methods for recognizing distant homologies has increased. Such methods are the main source of annotation, hence in the last decade very sensitive fold recognition approaches have been developed. They have succeeded to different degrees in identifying relationships between remote homologues. These methods include:

1) Threading approaches, which evaluate the compatibility between the target sequence and a given structural template.

2) Advanced sequence comparison procedures that take multiple sequence alignments into account through a position-specific scoring system, provided either by a coherent theory for profile methods using machine-learning probabilistic models (Hidden Markov Models); by position-specific iterated BLAST (PSI-BLAST); or by searching in sequence space using intermediate sequences (ISS). These methods were shown to give better results than simple threading.

3) Finally, new approaches that combine sequence profiles with knowledge-based threading potentials, improving the recognition of remote homologues.

Moreover, any additional information about the structure can improve on recognition by sequence alone. For example, secondary structure prediction can help to validate the alignment and to identify related proteins with divergent sequences, and it increases the number of potential templates. Recent studies comparing and evaluating searching/aligning methods showed that, with the E-value threshold set to 10, the percentage of true positives (similar 3D structure) ranged from 64.7% (SSEARCH) to 96.1% (BLAST), whereas the percentage of false positives ranged correspondingly from 35.3% to 3.9%. On the other hand, the well-known position-specific alignment method PSI-BLAST found remote structural homologues in 21% of 246 searches. In general, PSI-BLAST correctly aligns 40% of the residues when the sequence identity is larger than 15%. Consequently, PSI-BLAST is acknowledged as one of the most powerful tools for detecting remote evolutionary relationships from sequence considerations alone. The reasons for the success of the profile methods are the following:

1) They use multiple alignment information, and hence more information than a single sequence provides. The procedure rests on the hypothesis that sequences related by a common ancestor must preserve the residues that are important for the function, for the fold, or for both. These specific residues should therefore be shared by all sequences at the same position in the multiple alignment of the related members. Consequently, position-specific residues are given a high weight in the alignment scoring, which is obtained by means of a matrix of weights derived from the sequences found most likely to be related to the query.

2) They exploit the transitivity of homology, as the intermediate sequence search does: a query sequence is aligned against a database (e.g. SWISS-PROT); then all aligned sequences with highly significant similarity (E-value < 0.001) are used as new seeds, and this is iterated until no new sequences are found. This procedure covers a larger search space than a single-sequence search.
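The transitive search described in point 2 can be sketched as a simple loop over seeds. This is only an illustration under stated assumptions: `evalue` stands for any pairwise scoring routine that returns an E-value (in practice a BLAST-like search), and `toy_evalue` is a hypothetical stand-in that thresholds the fraction of identical positions so that the example runs on its own.

```python
from typing import Callable, Dict, Set


def intermediate_sequence_search(query: str,
                                 database: Dict[str, str],
                                 evalue: Callable[[str, str], float],
                                 threshold: float = 1e-3,
                                 max_rounds: int = 10) -> Set[str]:
    """Collect database entries reachable from the query through chains of
    significant hits, using each round's hits as seeds for the next round."""
    found: Set[str] = set()
    seeds = [query]
    for _ in range(max_rounds):
        new_hits: Set[str] = set()
        for seed in seeds:
            for name, sequence in database.items():
                if name in found or name in new_hits:
                    continue
                if evalue(seed, sequence) < threshold:
                    new_hits.add(name)
        if not new_hits:          # iterate until no new sequences are found
            break
        found |= new_hits
        seeds = [database[name] for name in new_hits]
    return found


def toy_evalue(a: str, b: str) -> float:
    """Hypothetical stand-in for a real E-value: 'significant' if at least
    70% of the positions are identical (equal-length toy sequences)."""
    identity = sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))
    return 1e-5 if identity >= 0.7 else 1.0


database = {
    "s1": "AAAAAAABBB",   # close to the query
    "s2": "AAAABBBBBB",   # close to s1, but not to the query
    "s3": "CCCCCCCCCC",   # unrelated
}
hits = intermediate_sequence_search("AAAAAAAAAA", database, toy_evalue)
```

In this toy database `s2` is not significantly similar to the query itself and is only reached through the intermediate `s1`, which is the point of the transitive search.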

The difference between profile methods and ISS is that the sequences with high similarity are aligned and the resulting profile is used in the next search. The distribution of local alignment scores of random sequences is used to determine the significance of the alignment, which is the crucial step in finding the next related sequences. Going further, Rychlewski et al. developed a new procedure based on profile-profile searches (FFAS) that, according to the authors, gave better results than PSI-BLAST, being more sensitive and accurate because it uses a Smith-Waterman dynamic programming routine to obtain the optimal alignment.
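For reference, the Smith-Waterman dynamic programming routine mentioned above can be sketched on plain sequences as follows. The scoring scheme (match, mismatch and a linear gap penalty) is an assumption chosen for brevity; real applications use substitution matrices or profile scores and usually affine gap penalties.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned a, aligned b) for the optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("WWACDEFWW", "KKACDEFKK")
```

The two toy sequences share only the segment ACDEF, so the local alignment recovers exactly that segment and ignores the unrelated flanks.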

STEP 2: Template selection and alignment

For the template selection, one or more templates can be used. The use of multiple templates is not justified when the sequence spread between parents, relative to the target, is not appropriate for the level of expected model error. If the average level of sequence identity between target and parents is larger than 40% and the sequence spread between parents is too small, then a single parent is used. The database search produces several local alignments, ranked by the score relating the target and template sequences. However, the top-scoring alignment is not necessarily the best one for identifying residue correspondences and constructing the target protein conformation, because the procedure was tuned to find remote homologues and not the best alignment. Therefore, although target and templates are likely to be correctly aligned if they share more than 40% identity, they need to be realigned if they are in the "twilight zone", sharing less than 30% identity.

The optimal alignment between homologous proteins, one of them with known 3D structure (the template), is then used to construct a model of the spatial structure of the target. However, after superposition of the protein cores, amino acids from loop regions can be significantly displaced. At least 2/3 of comparative protein modelling cases are based on less than 40% sequence identity between target and templates. To obtain a reasonable level of accuracy, the models must be based on alignments with few errors. Such alignments can usually be obtained when the sequence identity between the modelled sequence and at least one known structure is larger than 30%. A remarkable improvement is obtained by using global multiple sequence alignments plus additional structural information instead of the pairwise local alignments used in the search for likely relatives. Several alignment programs (MULTIALIGN, MULTAL, CLUSTALW) have been tested against a database of correctly aligned multiple sequences (BAliBASE). Extraordinarily good results have been obtained by recent approaches that include local and pre-processed alignments, such as those already found with PSI-BLAST (e.g. DbClustal); that recalculate the local alignments (e.g. with Lalign) and pre-processed alignments of segment pairs (e.g. with Dialign2), as the program T-Coffee does; or that iteratively refine the multiple alignment, as the program Prrp does.

Nevertheless, all these alignments lose the structural information provided by the templates whose conformation is known. On superimposing very similar structures upon one another, one can immediately distinguish regions of higher conservation; these are commonly referred to as structurally conserved regions (SCRs), whilst the regions that show the largest differences in conformation are referred to as structurally variable regions (SVRs). To avoid the loss of structural information, we suggest the following realignment between the target sequence and the templates:

1) Obtain a multiple structural alignment of the templates with known structure.

2) From the sequence alignment thus obtained for these templates, calculate a hidden Markov profile and align the target sequence to this HMM profile. Alternatively, use some of the following steps instead:

3') Align the related sequences (target, templates and extra related sequences) with DbClustal, T-Coffee, Prrp, etc.; check which result is closest to the structural alignment and manually refine the alignment of the target sequence.

3'') Use all the different alignments obtained in steps 2 and/or 3' to build several models, and evaluate them by other means (see "STEP 4: Model evaluation") to choose the best model.

3''') Align all the related sequences as in 3'; obtain hidden Markov profiles from these alignments and align the two hidden Markov profiles, the one obtained structurally (from step 2) and the one obtained sequentially (as in 3'). Several alignments of the target sequence with the templates of known 3D structure are extracted from the final alignments. These alignments are used to build several conformations of the target sequence as in 3'', and the resulting models are also evaluated by other methods in order to choose the best model.

STEP 3: Model building

Methods of model building

Two main methods are used to build the 3D structure in homology modelling; they differ in the definition of the function F that maps sequence space onto structure space. The first method is based on rigid-body superimposition and the second on geometric restraints, by analogy with the molecular replacement and distance geometry methodologies described for X-ray and NMR structure determination, respectively.

Several algorithms have been developed to obtain a rigid-body superimposition between sequences that are not directly related (JIG-SAW, COMPOSER, among others). SCR construction follows the original approach of Greer, using sequentially similar SCRs from homologous proteins to define the new core from a multiple alignment: 1) superimposing the known structures of homologous proteins (parents) using the SCRs to construct a framework; 2) superimposing the template closest in sequence to the target onto the averaged main chain of the framework; 3) building the main-chain conformations of the SVRs by fitting compatible structures onto the anchored stumps of the framework (see the section on SVR modelling for the identification of the stretches to use); and 4) completing the target structure by modelling the side-chains of the target sequence.
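The averaged main-chain framework used in steps 1 and 2 can be illustrated with a few lines of code. This is a minimal sketch, assuming the parents have already been superimposed and are given as equal-length lists of (x, y, z) coordinates for the equivalent SCR positions; the superposition itself (for example a least-squares fit) is not shown, and the function names are illustrative.

```python
from math import sqrt


def average_framework(parents):
    """Mean coordinate at each SCR position over already-superimposed parents."""
    n = len(parents[0])
    if any(len(parent) != n for parent in parents):
        raise ValueError("all parents must cover the same SCR positions")
    return [tuple(sum(parent[pos][k] for parent in parents) / len(parents)
                  for k in range(3))
            for pos in range(n)]


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    squared = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return sqrt(squared / len(a))


# Two toy parents with two SCR positions each.
parent_1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
parent_2 = [(0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
framework = average_framework([parent_1, parent_2])
spread = [rmsd(framework, parent) for parent in (parent_1, parent_2)]
```

The deviation of each parent from the framework gives a rough idea of how well conserved the chosen SCRs are; large values suggest that a position belongs to a variable region rather than to the core.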

The methods based on the satisfaction of spatial restraints (like MODELLER) generate as many restraints as possible from the structural alignments of the parents and build the target structure as in the NMR methods (using additional energetic restraints that enforce the correct stereochemistry of the protein polymer). Clearly, regions where the homologous templates cannot be structurally aligned, or where no alignment between the target and the multiple alignment of the templates is available, will have to be built with an additional function. Most of the structural changes occur in loop regions, but occasionally secondary structures may also be involved in variable regions. In the case of multiple superimposed parents, the coordinates are separated into conserved secondary structure elements and conserved loops.
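The idea of deriving restraints from a template and scoring a model against them can be sketched as follows. This is a deliberately simplified illustration, not the MODELLER implementation: it assumes one coordinate per residue (for example Cα atoms), an alignment given as (target index, template index) pairs, and a plain sum of squared distance violations as the objective; all names and the 10 Å cutoff are assumptions of the sketch.

```python
from math import dist  # Python 3.8+


def derive_restraints(template_coords, alignment, max_distance=10.0):
    """Turn template distances between aligned positions into target restraints.

    Returns a list of (target_i, target_j, distance) tuples.
    """
    restraints = []
    for a in range(len(alignment)):
        for b in range(a + 1, len(alignment)):
            (target_i, template_i), (target_j, template_j) = alignment[a], alignment[b]
            d = dist(template_coords[template_i], template_coords[template_j])
            if d <= max_distance:     # keep only short-range restraints
                restraints.append((target_i, target_j, d))
    return restraints


def restraint_violation(model_coords, restraints):
    """Sum of squared deviations between model distances and restraints."""
    return sum((dist(model_coords[i], model_coords[j]) - d) ** 2
               for i, j, d in restraints)


template = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
alignment = [(0, 0), (1, 1), (2, 2)]
restraints = derive_restraints(template, alignment)

good_model = list(template)
bad_model = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 5.0, 0.0)]
```

A model that reproduces the template distances has zero violation, while displacing one residue raises the score; a model-building program would minimize such an objective together with stereochemical terms.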

Model building of SVRs

SVR modelling can be seen as a mini protein folding problem. Methods for predicting loop conformation are of two kinds: ab initio methods, and database searching techniques or knowledge-based approaches.

1. The ab initio prediction is based on a conformational search guided by a scoring or energy function: (φ,ψ) space sampling; minimum perturbation random tweak method; systematic conformational search; global energy minimization; local energy minimization; molecular dynamics simulations; genetic algorithms; Monte Carlo and molecular dynamics; Monte Carlo sampling; multiple copy sampling; searching discrete conformations by dynamic programming; self-consistent field optimization; among others.

2. The database approach to loop prediction consists of finding a segment of main chain that fits the two stem regions of a loop. The procedure has improved since the early work on modelling: in the last few years, instead of a single conformation, a number of loop conformations, spread as uniformly as possible, are selected for each gap. Hence, the remaining loops in multiple-parent modelling and all loops in single-parent modelling are modelled from searches in three different databases: 1) homologous structures; 2) a cluster database of loops; and 3) a non-redundant database of proteins with less than 25% sequence identity and resolution better than 2.5 Å.
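The database approach of item 2 can be sketched as a scan over known structures for segments whose end points match the stems of the gap. The sketch below is a strong simplification: it assumes one coordinate per residue and compares only the distance between the two anchor residues, whereas real methods fit several stem residues on each side and rank candidates by RMSD. The database contents and function name are hypothetical.

```python
from math import dist  # Python 3.8+


def find_loop_candidates(anchor_start, anchor_end, loop_length, database, tolerance=1.0):
    """Find database segments whose anchor-to-anchor distance matches the gap.

    `database` maps a structure name to a list of per-residue coordinates.
    A candidate starting at index i spans anchors i and i + loop_length + 1.
    Returns (deviation, name, start_index) tuples sorted by deviation.
    """
    target_span = dist(anchor_start, anchor_end)
    hits = []
    for name, coords in database.items():
        for i in range(len(coords) - loop_length - 1):
            j = i + loop_length + 1
            deviation = abs(dist(coords[i], coords[j]) - target_span)
            if deviation <= tolerance:
                hits.append((deviation, name, i))
    return sorted(hits)


# Toy database: one extended chain and one tightly bent chain.
database = {
    "p1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    "p2": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
}
candidates = find_loop_candidates((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), 2, database)
```

Returning several candidates per gap, rather than the single best fit, mirrors the practice described above of keeping a spread of loop conformations for later evaluation.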

The chosen cluster of loop conformations must meet two requirements: 1) it must fit between the two bracing secondary structures, and 2) its sequence pattern must be present in the target loop to be modelled. This procedure is based on the successful work on canonical loop structures of immunoglobulin complementarity determining regions (CDRs) by Chothia et al. Nevertheless, the database search is valid only for short and medium-sized loops, or for special cases where homologous proteins share some structural features in the loops even though these are still considered variable regions (as is the case for immunoglobulins). To date, classifications of long loops have failed, and it has been demonstrated that a correlation between the geometric variables describing the loop stems is needed in order to obtain such a classification. This has only been established for short and medium-sized loops.

Side-chain construction. The side chains of the assembled components need to be changed to those of the target sequence. The side-chain packing problem is concerned with finding the arrangement of side-chain conformations on a given fixed backbone. Vasquez reviewed various approaches to side-chain modelling; the major problem in predicting side-chain conformations is again of a combinatorial nature. The strategy for modelling side-chains is therefore to reduce the dimension of the problem by incorporating as much empirical information as possible. Heuristic procedures either forgo any attempt to solve the combinatorial problem, or conduct some degree of combinatorial optimization in a solution space that has been reduced by local analysis. For example, significant correlations are found between side-chain dihedral angles and the backbone that go beyond the dependence on secondary structure. Therefore, in homology building the conformation of the side-chains is copied from a homologous template: a single rotamer is built for each side chain that traces as far as possible the path of the original side-chain. Nevertheless, the conservation of side-chain packing decreases rapidly when the sequence identity falls below 30%, which implies the need for other strategies of dimensional reduction. An important piece of information is that side-chains can be grouped into representative sets of rotamers with specific distributions. Consequently, a library of rotamers taken from the database of protein structures can be used as an alternative for modelling the side-chains. First, the additional internal coordinates needed to complete the side chain are taken from a secondary-structure-dependent rotamer library. Second, the side-chain is chosen, by optimization procedures derived from the mean field theory approximation, from additional rotamers representing high population densities in the PDB.

Energy-based procedures rely on the assumption that lower energy values necessarily correlate with more accurate positioning. This puts the burden on the quality of the particular energy function used. Potential energy functions for structure prediction in vacuum have several limitations. When modelling side-chains on the surface of the protein, their interaction with the solvent cannot be calculated, because water molecules cannot be included with the rotamers from the library. Karplus and co-workers obtained an accuracy of around 70% in the modelling of side-chains when testing the accuracy of new force fields. They demonstrated that the absence of solvent introduces an error in the hydrogen-bonding pattern of polar residues, making the inclusion of electrostatic and solvation effects necessary. The success in solving the rotamer-packing problem has made it possible to incorporate strategies that solve it into docking procedures that evaluate protein-protein interactions.
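The combinatorial nature of the rotamer-packing problem can be made concrete with a brute-force search over a small rotamer library. This is an illustrative sketch only: the energy functions and the two-rotamer library are invented toy values, and exhaustive enumeration is shown precisely because it stops being feasible for real proteins, which is why mean-field and heuristic methods are used instead.

```python
from itertools import product


def pack_side_chains(rotamers, self_energy, pair_energy):
    """Exhaustively search rotamer combinations on a fixed backbone.

    rotamers[i] lists the allowed rotamers of residue i; the number of
    combinations is the product of the list lengths.
    """
    best_energy, best_choice = float("inf"), None
    for choice in product(*rotamers):
        energy = sum(self_energy(i, r) for i, r in enumerate(choice))
        energy += sum(pair_energy(i, choice[i], j, choice[j])
                      for i in range(len(choice))
                      for j in range(i + 1, len(choice)))
        if energy < best_energy:
            best_energy, best_choice = energy, choice
    return best_choice, best_energy


# Toy energies (hypothetical values): "t" is the preferred rotamer of each
# residue on its own, but two "t" rotamers clash with each other.
SELF = {(0, "t"): 0.0, (0, "g+"): 1.0, (1, "t"): 0.0, (1, "g+"): 2.0}


def toy_self_energy(i, r):
    return SELF[(i, r)]


def toy_pair_energy(i, ri, j, rj):
    return 5.0 if ri == "t" and rj == "t" else 0.0


choice, energy = pack_side_chains([["g+", "t"], ["g+", "t"]],
                                  toy_self_energy, toy_pair_energy)
```

In the toy case the individually best rotamers clash, so the optimal packing moves the first residue to its second-best rotamer; this coupling between residues is what makes the problem combinatorial.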

STEP 4: Model evaluation

Errors in comparative modelling arise mainly from the lack of templates and from the decrease in sequence identity between the target and the templates. These errors fall into five categories:

1) Errors in side-chain packing. They are mainly due to the divergence of sequences and are critical when they occur in regions involved in the protein function.
2) Shifts of correctly aligned residues. These are also produced by the divergence of the templates, where the overall fold is maintained but the scaffold has been locally displaced.
3) Regions without template. These arise in local regions where the target sequence cannot be aligned to any of the parents with known structure. These regions belong to the SVRs, and their structure is derived from general databases, which increases the conformational diversity and leads to the largest errors in the model.
4) Errors due to misalignments. These are produced by a shift in the alignment between the target sequence and the templates and are the worst source of errors, because to date they are difficult to detect. One way to detect them is to use multiple alignments that include sequences without known structure. However, if the misalignment affects only the target sequence, the multiple alignment is useless. The best way to detect these errors is to check the final model and refine it further.
5) Errors produced by incorrect templates. This problem appears when using distantly related sequences (templates with less than 25% identity). Although such errors are clearly detected, they remain a difficult problem for models for which no other homologous template can be used. Unfortunately, it is difficult to distinguish errors in a model based on an incorrect alignment with the correct template (category 4) from errors in a model based on an incorrect template.

The evaluation of a model is critical for testing and proposing the best and most accurate model or models. Additionally, the environment can have an important influence on the accuracy of the model, particularly if the protein structure is coordinated to metals or the template used is part of a complex with other molecular compounds. Two criteria are used to filter the models: 1) energetic approaches; and 2) experimental data. In the first step, the model is checked for the correct stereochemistry of a protein polymer. This is done with programs like PROCHECK, AQUA, SQUID or WHATCHECK, and problems can be fixed with optimization programs based on molecular mechanics like CHARMM, GROMOS, AMBER, X-PLOR or WHAT IF. This implies a final refinement step in the modelling that has to be taken cautiously, mainly because the optimization is done in the wrong environment (i.e. with no solvation, no ions and not necessarily meaningful side-chain conformations). This refinement is simply meant to remove drastic local clashes and is done with a few cycles (100-1000) of steepest descent or conjugate gradient minimization, until convergence is achieved. The next step in the evaluation is the assessment of the fold, which includes the order and length of the secondary structure elements and the use of energy profiles based on statistical criteria extracted from the structural domain classifications. The structure thus receives a Z-score, calculated by means of fold prediction methodologies, that indicates the regions that are wrongly modelled (in statistical terms). The programs VERIFY3D, PROSAII, HARMONY and ANOLEA are among those implementing this approach. In summary, these methods compare the modelled conformation with what is expected from protein structures solved by X-ray crystallography.

These methods have drawn some criticism: it is reasonable to expect the individual contribution of each residue to the overall energy to vary widely, so a wrongly modelled region need not correlate with the value of the mean force potential in that region. Still, some applications have demonstrated the usefulness of this method when combined with additional information (secondary structure) to refine the models. The work of Aloy et al. is a clear example in which mean force potentials detect wrongly modelled regions, and it suggests a method to improve the model building by: 1) identifying the wrongly modelled regions; 2) selecting the best model among several candidates; and 3) selecting a candidate refined structure after the inclusion of additional information (i.e. secondary structure).
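The two quantities used in this kind of assessment, a Z-score for the whole model and a smoothed per-residue energy profile, can be sketched as follows. The sketch assumes that the energies have already been computed by some knowledge-based potential; the reference set of energies (for example from alternative conformations of the same sequence) and the window size are assumptions of the example, not the procedure of any particular program.

```python
from statistics import mean, pstdev


def energy_z_score(model_energy, reference_energies):
    """Number of standard deviations between the model energy and the mean
    of a reference distribution; strongly negative values are favourable."""
    sigma = pstdev(reference_energies)
    if sigma == 0:
        raise ValueError("reference energies must not all be identical")
    return (model_energy - mean(reference_energies)) / sigma


def smoothed_profile(per_residue_energy, window=3):
    """Moving average of per-residue energies; peaks flag suspicious regions."""
    half = window // 2
    profile = []
    for i in range(len(per_residue_energy)):
        chunk = per_residue_energy[max(0, i - half): i + half + 1]
        profile.append(sum(chunk) / len(chunk))
    return profile


z = energy_z_score(-12.0, [-10.0, -8.0, -6.0, -4.0, -2.0])
profile = smoothed_profile([0.0, 0.0, 3.0, 0.0, 0.0], window=3)
```

Smoothing over a window reflects the caveat discussed above: single-residue energies vary widely, so only stretches with consistently unfavourable values are treated as candidates for remodelling.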

Finally, the recent work of Lazaridis and Karplus shows how the classical molecular mechanics calculation of the energy improves when solvation (environmental) terms are included to detect wrongly modelled regions. Consequently, the criticism of the potential of mean force cannot be applied to this approach, which performed as well as statistical functions in discriminating between correct and misfolded models.

The experimental evaluation of the model can only be done by site-directed mutagenesis or with additional information that is not commonly available. One way to avoid the experiment is to use the knowledge obtained from a widely spread multiple alignment of related sequences, applying the following conditions:

1) The conserved regions shared by all sequences are identified from such a multiple alignment, which reduces the length of the SCRs. The common structural regions that survive an extensive refinement of the structural alignment often include the active sites plus additional core secondary structure elements that appear to lend structural support to the binding site. These regions conserve the stereo-specific interactions involved in ligand binding and catalysis. The structural knowledge of these regions, as well as the support for their presence, grows in importance with the development of structural genomics and provides a mechanism to evaluate the modelled structure, which has to agree with these findings.
2) Residues that are not conserved but are mutually involved in a 3D interaction or a function will show a clear correlation in their mutations. Correlated mutations have been studied in depth because of their use in ab initio folding and fold prediction, and in cases where the sequences have diverged enough it is possible to use the correlation to evaluate whether the target conformation has been modelled correctly.
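Correlated mutations between two alignment columns can be quantified in several ways. The sketch below uses mutual information as a simple stand-in, which is an assumption of the example rather than the measure used in the studies referred to above; it takes a multiple alignment given as equal-length strings and ignores issues such as gaps, sequence weighting and phylogenetic bias.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    return sum((c / n) * log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
               for (a, b), c in count_ij.items())


# Toy alignment: column 0 is conserved, columns 1 and 2 change together.
msa = ["AKD", "AKD", "AEK", "AEK"]
covarying = mutual_information(msa, 1, 2)
conserved = mutual_information(msa, 0, 1)
```

Pairs of columns with high values are candidates for residues in contact; a model in which such pairs lie far apart in space would be suspect.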

REFERENCES

N. Alexandrov and R. Luethy. (1998). Alignment algorithm for homology modeling and threading. Protein Sci 7, 254-258.

B. Al-Lazikani, A. Lesk and C. Chothia. (1997). Standard conformations for the canonical structures of immunoglobulins. J. Mol. Biol. 273, 927-948.

P. Aloy, J. Mas, M. Martí-Renom, E. Querol, F. Avilés and B. Oliva. (2000). Refinement of modelled structures by knowledge based energy profiles and secondary structure prediction: Application to the Human Procarboxypeptidase A2. J Comput-Aided Molec. Des. 14, 83-92.

S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. Lipman. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

T. Attwood. (2000). The Babel of Bioinformatics. Science 290, 471-473.

A. Bairoch and R. Apweiler. (1997). The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acid Res. 25, 31-36.

G. Barton and M. Sternberg. (1987). A strategy for the rapid multiple alignment of protein sequences; confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337.

A. Bateman, E. Birney, R. Durbin, S. Eddy, K. Howe and E. Sonnhammer. (2000). The Pfam protein family database. Nucleic Acid Res. 28, 263-266.

P. Bates and M. Sternberg. (1998). From Sequence to Structure. Protein Structure Prediction: A practical approach (M. Sternberg, Ed.), Oxford Univ. Press, Oxford, UK.

P. A. Bates and M. Sternberg. (1999). Model building by comparison at CASP3: Using expert knowledge and computer automation. Proteins: Struct., Func. and Gene. Suppl. 3, 47-54.

D. Bowie, J. U. Luthy and D. Eisenberg. (1991). A method to identify protein sequences that fold into a known-3D structure. Science 253, 164-170.

B. Brooks, R. Bruccoleri, B. Olafson, D. States, S. Swaminathan and M. Karplus. (1983). CHARMM: a program for macromolecular energy minimization and dynamics calculations. J. Comp. Chem. 4, 187-217.

R. Bruccoleri and M. Karplus. (1987). Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers 26, 137-138.

A. Brünger. (1992). X-PLOR: A system for X-ray crystallography and NMR. Yale University Press, New Haven.

V. Collura, J. Higo and J. Garnier. (1993). Modeling of protein loops by simulated annealing. Protein Sci. 2, 1502-1510.

R. Copley and P. Bork. (2000). Homology among (βα)8 barrels: implications for the evolution of metabolic pathways. J. Mol. Biol. 303, 627-640.

C. Chothia, A. Lesk, A. Tramontano, M. Levitt, S. Smith-Gill, G. Air, S. Sheriff, E. Padlan, D. Davies, W. Tulip, P. Colman, S. Spinelli, P. Alzari and R. Poljak. (1989). Conformations of Immunoglobulin Hypervariable Regions. Nature 342, 877-883.

S. Chung and S. Subbiah. (1996). A structural explanation for the twilight zone of protein sequence homology. Structure 4, 1123-1127.

C. Deane, Q. Kaas and T. Blundell. (2001). SCORE: predicting the core of protein models. Bioinformatics 17, 541-550.

R. Dima, J. Banavar and A. Maritan. (2000). Scoring functions in protein folding and design. Protein Sci. 9, 812-819.

F. S. Domingues, W. A. Koppensteiner, M. Jaritz, A. Prlic, C. Weichenberger, M. Wiederstein, H. Floeckner, P. Lackner and M. Sippl. (1999). Sustained performance of knowledge-based potentials in fold recognition. Proteins: Struct., Func. & Gene. Suppl. 3, 112-120.

L. Donate, S. Rufino, L. Canard and T. Blundell. (1996). Conformational analysis and clustering of short and medium size loops connecting regular secondary structures. A database for modelling and prediction. Protein Sci. 5, 2600-2616.

M. Dudek, K. Ramnarayan and J. Ponder. (1998). Protein structure prediction using a combination of sequence homology and global energy minimization: II. Energy functions. J. Comp. Chem. 19, 548-573.

S. Eddy. (1998). Profile hidden markov models. Bioinformatics 14, 755-763.

K. Fidelis, P. Stern, D. Bacon and J. Moult. (1994). Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng. 7, 953-960.

D. Fischer and D. Eisenberg. (1996). Protein fold recognition using sequence-derived predictions. Protein Science 5, 947-955.

A. Fiser, R. Do and A. Sali. (2000). Modeling of loops in protein structures. Protein Sci. 9, 1753-1773.

I. Friedberg, T. Kaplan and H. Margalit. (2000). Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci. 9, 2278-2284.

D. W. Gatchell, S. Dennis and S. Vajda. (2000). Discrimination of Near-native Protein Structures from Misfolded Models by Empirical Free Energy Functions. Proteins: Struct., Func. & Gene. 41, 518-534.

C. Geourjon, C. Combet, C. Blanchet and G. Deléage. (2001). Identification of related proteins with weak sequence identity using secondary structure information. Protein Sci. 10, 788-797.

O. Gotoh. (1996). Significant improvement in accuracy of multiple sequence alignments by iterative refinements assessed by reference to structural alignments. J. Mol. Biol. 264, 823-838.

J. Greer. (1990). Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins: Struc. Func. and Gene. 7, 317-334.

W. v. Gunsteren, S. Billeter, A. Eising, P. Hünenberger, P. Krüger, A. Mark, W. Scott and I. Tironi. (1996). Biomolecular Simulation: The GROMOS96 Manual and User Guide. Verlag der Fachvereine, Zürich.

R. Hooft, G. Vriend and C. Sander. (1996). Verification of protein structures: side-chain planarity. J. Appl. Crystallogr. 29, 714-716.

X. Huang and W. Miller. (1991). A time-efficient linear-space local similarity algorithm. Advan. Appl. Math. 12, 337-357.

J. Irving, J. Whisstock and A. Lesk. (2001). Protein structural alignments and functional genomics. Proteins: struc. Func and Gene. 42, 378-382.

L. Jaroszewski, L. Rychlewski and A. Godzik. (2000). Improving the quality of twilight-zone alignments. Protein Sci. 9, 1487-1496.

A. Jennings, C. Edge and M. Sternberg. (2001). An approach to improving multiple alignments of protein sequences using predicted secondary structure. Protein Eng. 14, 227-231.

D. Jones. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.

T. A. Jones and S. Thirup. (1986). Using known substructures in protein model building and crystallography. EMBO J. 5, 819-822.

K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate and R. Hughey. (1999). Predicting protein structure using only sequence information. Proteins: Struc. Func. and Gene. Suppl 3, 121-125.

L. A. Kelley, R. M. MacCallum and M. Sternberg. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520.

A. Kidera. (1995). Enhanced conformational sampling in Monte Carlo simulations of proteins: Applications to a constrained peptide. Proc. Natl. Acad. Sci. USA 92, 9886-9889.

P. Koehl and M. Delarue. (1995). A self-consistent mean field approach to simultaneous gap closure and side-chain positioning in protein homology modeling. Nat. Struct. Biol. 2, 163-170.

P. Koehl and M. Delarue. (1996). Mean-field minimization methods for biological macromolecules. Curr. Opin. Struct. Biol. 6, 222-226.

R. Laskowski, M. MacArthur and J. Thornton. (1998). Validation of protein models derived from experiment. Curr. Opin. Struct. Biol. 5, 631-639.

T. Lazaridis and M. Karplus. (1999). Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288, 477-487.

R. Lüthy, J. U. Bowie and D. Eisenberg. (1992). Assessment of protein models with three-dimensional profiles. Nature 356, 83-85.

A. Martin, J. Cheetham and A. Rees. (1989). Modeling antibody hypervariable loops: a combined algorithm. Proc. Natl. Acad. Sci. USA 86, 9268-9272.

A. Martin and J. Thornton. (1996). Structural Families in Loops of Homologous Proteins: Automatic Classification, Modelling and Application to Antibodies. J. Mol. Biol. 263, 800-815.

M. Martí-Renom, J. Mas, P. Aloy, E. Querol, F. Aviles and B. Oliva. (1998). Statistical Analysis of the loop-geometry on a non-redundant database of proteins. J. Mol. Mod. 4, 347-354.

M. A. Martí-Renom, A. Stuart, A. Fisher, R. Sánchez, F. Melo and A. Sali. (2000). Comparative protein structure modeling of genes and genomes. Ann. Rev. Biophys. Biomolec. Struc. 29, 291-325.

C. Mattos, G. Petsko and M. Karplus. (1994). Analysis of two-residue turns in proteins. J. Mol. Biol. 238, 733-747.

M. McGregor, S. Islam and M. Sternberg. (1987). Analysis of the relationship between side-chain conformation and secondary structure in globular proteins. J. Mol. Biol. 198, 295-310.

F. Melo and E. Feytmans. (1997). Novel knowledge-based mean force potential at atomic level. J. Mol. Biol. 267, 207-222.

F. Melo and E. Feytmans. (1998). Assessing protein structures with a non local atomic interaction energy. J. Mol. Biol. 277, 1141-1152.

V. Morea, A. Tramontano, M. Rustici, C. Chothia and A. Lesk. (1998). Conformations of the third hypervariable region in the VH domain of immunoglobulins. J. Mol. Biol. 275, 265-294.

B. Morgenstern. (1999). Dialign2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211-218.

J. Moult and M. James. (1986). An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins: Struc. Func. and Gene. 1, 156-163.

N. Nakajima, J. Higo and A. Kidera. (2000). Free energy landscapes of peptides by enhanced conformational sampling. J. Mol. Biol. 296, 197-216.

C. Notredame, D. Higgins and J. Heringa. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205-217.

T. Oldfield. (1992). Squid: a program for the analysis and display of data from crystallography and molecular dynamics. J. Mol. Graph. 10, 247-252.

B. Oliva, P. Bates, E. Querol, F. Avilés and M. Sternberg. (1997). An automatic Classification of the structure of protein loops. J. Mol. Biol. 266, 814-830.

B. Oliva, P. Bates, E. Querol, F. Avilés and M. Sternberg. (1998). Automated Classification of Antibody Complementarity Determining Region 3 of the Heavy Chain (H3) Loops into Canonical Forms and Its Application to Protein Structure Prediction. J. Mol. Biol. 279, 1193-1210.

O. Olmea, B. Rost and A. Valencia. (1999). Effective use of sequence correlation and conservation in fold recognition. J. Mol. Biol. 293, 1221-1239.

A. Panchenko, A. Marchler-Bauer and S. H. Bryant. (2000). Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. 296, 1319-1331.

K. Pawlowski, A. Bierzynski and A. Godzik. (1996). Structural diversity in a family of homologous proteins. J. Mol. Biol. 258, 349-366.

W. Pearson. (1996). Effective protein sequence comparison. Meth. Enz. 266, 227-258.

W. Pearson and D. Lipman. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

R. Petrella, T. Lazaridis and M. Karplus. (1998). Protein sidechain conformer prediction: a test of the energy function. Folding and Design 3, 353-377.

C. Rapp and R. Friesner. (1999). Prediction of loop geometries using a generalized Born model of solvation effect. Proteins: Struc. Func. and Gene. 35, 173-183.

C. Ring and F. Cohen. (1994). Conformational sampling of loop structures using genetic algorithm. Isr. J. Chem. 34, 245-252.

D. Rosenbach and R. Rosenfeld. (1995). Simultaneous modeling of multiple loops in proteins. Protein Sci. 4, 496-505.

B. Rost. (1999). Twilight zone of protein sequence alignments. Protein Eng. 12, 85-94.

S. Rufino, L. Donate, L. Canard and T. Blundell. (1997). Predicting the Conformational Class of Short and Medium Size Loops Connecting Regular Secondary Structures: Application to Comparative Modelling. J. Mol. Biol. 267, 352-367.

R. Russell, M. Saqi, R. Sayle, P. Bates and M. Sternberg. (1997). Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. 269, 423-439.

R. Russell, P. Sasieni and M. Sternberg. (1998). Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282, 903-918.

L. Rychlewski, L. Jaroszewski, L. Weizhong and A. Godzik. (2000). Comparison of sequence profiles. Structural prediction with no structure information. Protein Sci. 8, 232-241.

G. Salem, E. Hutchinson, C. Orengo and J. Thornton. (1999). Correlation of observed fold frequency with the occurrence of local structural motifs. J. Mol. Biol. 287, 969-981.

A. Sali and T. Blundell. (1993). Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.

R. Sánchez, U. Pieper, F. Melo, N. Eswar, M. Martí-Renom, M. Madhusudhan, N. Mirkovic and A. Sali. (2000). Protein Structure Modeling for Structural Genomics. Nature Struct. Biol. Suppl. November, 986-990.

R. Sánchez and A. Sali. (1997). Advances in comparative protein structure modeling. Curr. Opin. Struct. Biol. 7, 206-214.

R. Sánchez and A. Sali. (1997). Evaluation of comparative protein structure modeling by MODELLER-3. Proteins: Struc. Func. and Gene. Suppl 1, 50-58.

M. Saqi, R. Russell and M. Sternberg. (1999). Misleading local sequence alignment: implications for comparative modelling. Protein Eng. 11, 627-630.

J. Sauder, J. Arthur and R. Dunbrack. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Struc. Func. and Gene. 40, 6-22.

P. Shenkin, D. Yarmush, R. Fine, H. Wang and C. Levinthal. (1987). Predicting antibody hypervariable loop conformation: I. Ensembles of random conformation for ring-like structures. Biopolymers 26, 2053-2085.

H. Shirai, A. Kidera and H. Nakamura. (1999). H3-rules: identification of CDR-H3 structures in antibodies. FEBS Letters 455, 188-197.

M. Sippl. (1993). Recognition of errors in three-dimensional structures of proteins. Proteins: Struc. Func. and Gene. 17, 355-362.

K. Smith and B. Honig. (1994). Evaluation of the conformational free energies of loops in proteins. Proteins: Struc. Func. and Gene. 18, 119-132.

T. Smith and M. Waterman. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

M. Sternberg, P. Bates, L. Kelley and R. MacCallum. (1999). Progress in protein structure prediction: assessment of CASP3. Curr. Opin. Struct. Biol. 9, 368-373.

M. Sutcliffe, F. Hayes and T. Blundell. (1987). Knowledge-based modeling of homologous proteins, part II: rules for the conformations of substituted side-chains. Protein Eng. 1, 385-392.

M. Sutcliffe, F. Hayes, D. Carney and T. Blundell. (1987). Knowledge-based modeling of homologous proteins, part I. Three dimensional frameworks derived from the simultaneous superposition of multiple structure. Protein Eng. 1, 377-384.

W. Taylor. (1988). A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161-169.

S. Teichmann, C. Chothia, G. Church and J. Park. (2000). Fast assignments of protein structures to sequences using the intermediate sequence library. Bioinformatics 16, 117-124.

J. Thompson, D. Higgins and T. Gibson. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.

J. Thompson, F. Plewniak and O. Poch. (1999). Balibase: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87-88.

J. Thompson, F. Plewniak and O. Poch. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27, 2682-2690.

J. Thompson, F. Plewniak, J. Thierry and O. Poch. (2000). DbClustal: rapid and reliable global multiple alignments of protein sequence detected by database searches. Nucleic Acids Res. 28, 2919-2926.

C. Topham, N. Srinivasan, C. Thorpe, J. Overington and N. Kalsheker. (1994). Comparative modeling of major house dust mite allergen der p I: structure validation using an extended environmental amino acid propensity table. Protein Eng. 7, 869-894.

A. Torda. (1997). Perspectives in protein fold recognition. Curr. Opin. Struct. Biol. 7, 200-205.

A. Tramontano, C. Chothia and A. Lesk. (1989). Structural determinants of the conformations of medium sized loops in proteins. Proteins: Struc. Func. and Gene. 6, 382-394.

S. Vajda and C. DeLisi. (1990). Determining minimum energy conformations of polypeptides by dynamic programming. Biopolymers 29, 1755-1772.

M. Vasquez. (1996). Modeling side-chain conformation. Curr. Opin. Struct. Biol. 6, 217-221.

H. W. v. Vlijmen and M. Karplus. (1997). PDB-based protein loop prediction: parameters for selection and methods for optimization. J. Mol. Biol. 267, 975-1001.

J. Wojcik, J. Mornon and J. Chomilier. (1999). New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. J. Mol. Biol. 289, 1469-1490.