Course on structural bioinformatics

Tutorial Index

1) Principles of Protein Structure

2) Sequence Comparison

3) Structure Comparison

4) Principles of Comparative Modeling and Threading

5) Loop Model Building and Refinement

6) Evaluation of the model.

7) Advanced exercises.

Tutorial

1) Principles of Protein Structure and Protein Crystallography

a. Visualization with RasMol: Download the program RasMol and install it on your computer. Download a set of example proteins.

i. Open PDB file: load <name>

ii. Select chain A: select *A

iii. Coloring red: color red

iv. Format: ribbons

v. Remove from view: ribbons off

vi. Select residue 10 from chain A: select 10 && *A

vii. Select residue 10 and 20: select 10,20

viii. Select from 10 to 20: select 10-20

ix. Select polar atoms: select polar

b. Use the list of examples and the database of protein structures PDB for the following exercises:

i. Exercise: How many chains and domains are in the problem structure? Identify the fold type of each domain.

ii. Exercise: Identify Polar/Non-polar properties in each fold

1. Open Up&Down β-barrel

2. Remove from view chain B

3. Select chain A

4. Color white chain A

5. Select polar residues and color them in red

6. Do the same with the sheets of a Rossmann fold

7. Question 1: Why do you think the pattern in the sheet is different?

8. Question 2: Where would the active site be located on the Rossmann fold?

iii. Check the description of secondary structure given in a PDB-formatted file (e.g. code 8fab). Visualize the distribution of hydrogen bonds with RasMol (command hbonds) and compare it with the definition of secondary structures from the PDB file.

iv. Exercise: Open one of the example files with your favorite editor and understand the description of the coordinates for each atom. Identify the Cα atoms that describe the protein trace. Copy the file under another name, remove half of the atoms from the list with the editor and open it with RasMol.
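The coordinate lines inspected in exercise 1.b.iv follow the fixed-column layout of ATOM records in the PDB format. The sketch below is only an illustration of that layout (the function names are hypothetical and not part of the course material): it reads the main fields of each ATOM line, collects the C-alpha atoms and drops the second half of the atoms, which mimics the manual editing step.

```python
def parse_atom_line(line):
    """Parse the main fields of one fixed-column ATOM record of a PDB file."""
    return {
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name (e.g. CA)
        "resname": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "resseq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),         # columns 31-38: x coordinate
        "y": float(line[38:46]),         # columns 39-46: y coordinate
        "z": float(line[46:54]),         # columns 47-54: z coordinate
    }


def ca_trace(pdb_lines):
    """Return the parsed C-alpha atoms found in the ATOM records."""
    atoms = [parse_atom_line(l) for l in pdb_lines if l.startswith("ATOM")]
    return [a for a in atoms if a["name"] == "CA"]


def keep_first_half(pdb_lines):
    """Drop the second half of the ATOM records and keep every other line."""
    atom_idx = [i for i, l in enumerate(pdb_lines) if l.startswith("ATOM")]
    dropped = set(atom_idx[len(atom_idx) // 2:])
    return [l for i, l in enumerate(pdb_lines) if i not in dropped]
```

To use it, read a file with `open(name).read().splitlines()`, pass the list of lines to `keep_first_half`, write the result under a new file name and load that file in RasMol.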

c. SCOP / CATH / DALI / DBAli / HOMSTRAD

i. Exercise: Compare the folds of β-propellers

ii. Determine the folds of your favorite protein from PDB (e.g. 8fab)

2) Sequence Comparison

a. BLASTP (see tutorial)

i. Select the database of search:

1. Searching your favorite sequence(s) in UniProt/Swiss-Prot

2. Searching your favorite sequence(s) in PDB

ii. Exercise: Do the search with your favorite sequence(s).

1. Compare the results of the search in different databases

2. What results contain more information?

3. From the tutorial of BLAST try to answer the following questions:

a. What’s the meaning of the e-value?

b. What are the substitution matrices?

c. How does the e-value depend on the length of your favorite sequence(s)?
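As a starting point for the questions above, the e-value can be written in terms of the bit score S' as E = m * n * 2^(-S'), where m is the query length and n the total length of the database. The sketch below uses this simplified form and ignores the effective-length (edge effect) corrections that BLAST applies; the function names are illustrative, and lambda and K are the statistical parameters of the chosen scoring system, which BLAST reports in its output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bits)
```

Calling `e_value` with the same bit score and database size but different query lengths shows how, under this simplified formula, the e-value scales with the length of the query.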

b. PSI-BLAST (see tutorial)

i. Iterative search: do at least two iterative PSI-BLAST runs after a BLASTP search in the NR database of your favorite sequence(s).

ii. Choose a format to keep the PSSM.

iii. Use the PSSM matrix for searching on the PDB database

iv. Questions:

1. Compare the e-values of the same pair of sequences aligned with BLAST between two iterations. Why are they different?

2. When do you think we may need to do the search using a PSSM?

3. Search 4 sequences with known structures that can be comparatively aligned with each of the following sequences
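To make the idea of a PSSM concrete, the following sketch builds a position-specific scoring matrix from a small multiple alignment as log-odds of column frequencies against a background. It is a deliberate simplification and not the PSI-BLAST procedure: it assumes a uniform background, a fixed pseudocount and no sequence weighting, and all names are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r in AMINO_ACIDS]  # gaps are skipped
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

Because each column has its own scores, a residue that is conserved at one position is rewarded there and nowhere else, which a single substitution matrix cannot express.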

c. Multiple Alignment of Sequences

i. Download 4 sequences of the same family of Globins and 4 sequences of Phycocyanin-like phycobilisome proteins and obtain a multiple alignment using ClustalW and T-Coffee.

ii. Align the 4 sequences found in the previous exercise with PSI-BLAST and your favorite sequence(s) using ClustalW and T-Coffee.

d. Hidden Markov Models: PFAM / SMART / Superfamily (it may be useful to download and install on your computer the package HMMER, or run it on the web server)

i. Search the PFAM / SMART and/or Superfamily domain of your favorite sequence(s).

ii. Download the PFAM / SMART profiles where the favorite sequence(s) belongs

iii. Obtain the alignment of this sequence and the sequence with known structure and smallest e-value by PSI-BLAST using the previous PFAM / SMART profile.

iv. Obtain a multiple alignment of the 4 sequences previously aligned with ClustalW and T-Coffee plus your favorite sequence(s) using the PFAM / SMART profile and compare the alignments.

v. Search with the chosen PFAM / SMART profile in the set of sequences of PDB+NR+Swissprot+PIR.

e. Searching short FingerPrints and PROSITE patterns: ScanProsite, Motifs, PPSearch

i. Find the main motifs of your favorite sequence(s).

ii. How many motifs would you confirm are appropriate for it?

iii. What are the common motifs of your favorite sequence(s) and the 4 previously aligned sequences with known structure?

f. Sequence domains: InterPro / CDD / Prodom

i. Split this protein sequence into domains

ii. Check the function, motifs and main properties in InterPro

3) Structure Comparison

a. Understand the methods from the manual and tutorials of the programs for 3D superposition: CE / MAMMOTH / STRAP / STAMP / SUPERPOSE / MISTRAL / DBAli

b. Exercises: Check pairwise (2 proteins) and multiple (>2) alignments of known structures in the PDB and from structures uploaded from your own computer.

i. Use proteins of the families of Immunoglobulin VL-lambda and NKP44

ii. Use proteins of the superfamilies of Globins and Phycocyanin-like phycobilisome proteins

iii. Use proteins from different superfamilies of the 6-blade β-propeller fold.

c. Exercise: Download the PDB files of the 4 sequences previously aligned with your favorite sequence(s) in 2.c.ii. Obtain the multiple structure superposition and extract the multiple alignment of their sequences. Compare the multiple alignments based on the structure, ClustalW, T-Coffee, PFAM and SMART.

d. Exercise: Compare the previous alignments with the alignments in HSSP database
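One possible way to put a number on the comparison of alignments is to count, for a pair of sequences, how many residue pairs aligned in a reference alignment (for example the structure-based one) are also aligned by another method. The sketch below assumes each alignment is given as two gapped strings of the same two sequences, with "-" as the gap character; the function names are illustrative.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue index pairs that share an alignment column."""
    pairs = set()
    i = j = 0  # 0-based positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def shared_pair_fraction(reference, candidate):
    """Fraction of the residue pairs of the reference alignment found in the candidate."""
    ref = aligned_pairs(*reference)
    cand = aligned_pairs(*candidate)
    return len(ref & cand) / len(ref) if ref else 0.0
```

For a multiple alignment, the same function can be applied to every pair of rows and the fractions averaged.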

4) Principles of Comparative Modeling and Threading

e. Model building of your favorite sequence(s) in the servers of: ModWeb, Swiss-Model and 3D-Jigsaw

i. Automatic Model Building: Run the automatic modeling from the servers.

ii. Driven Model Building:

1. Use the alignments obtained in 2.c.ii and 2.d.iv to run a driven modeling of your favorite sequence(s).

2. Extract the alignment with the best template (smallest e-value and largest Id%) from the alignments in 2.c.ii and 2.d.iv to run a driven modeling of your favorite sequence(s).

3. Obtain the structural alignment of the candidate templates using your favorite superposition program (exercise 3). Generate the Hidden Markov Model with hmmbuild. Align your templates and the target with the obtained profile using hmmalign. Extract the alignment with the best template (smallest e-value and largest Id%) from the multiple alignment and run a driven modeling of your favorite sequence(s).

iii. Compare your model with the structures of the templates using the structural superposition.

iv. Compare your model with the model in ModBase using the structural superposition.

v. Compare driven and automatic models by superimposition.
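The choice of the best template by smallest e-value and largest Id% can be sketched as follows. The percentage of identity is computed here over the columns where both sequences have a residue, which is only one of several definitions in use; the hit tuples and names are hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percentage of identical residues over the columns without gaps."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    return 100.0 * sum(a == b for a, b in aligned) / len(aligned)


def best_template(hits):
    """Pick the hit with the smallest e-value; ties go to the largest identity.

    hits: list of (template_id, e_value, percent_identity) tuples.
    """
    return min(hits, key=lambda h: (h[1], -h[2]))
```

When the smallest e-value and the largest identity point to different templates, this ordering gives priority to the e-value; building a model with each candidate and comparing them is an alternative.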

f. Assign the fold of your favorite sequence(s) using threading: 3D-PSSM/PHYRE, FUGUE, LOOPP, Threader and PredictProtein (TOPITS)

i. Compare the alignments of the related proteins with known structure and the alignments of the same sequences obtained with ClustalW, T-Coffee, PFAM and SMART.

ii. Compare by superposition the structures assigned by threading (all servers) and the structures obtained by sequence search

iii. Build a model of your favorite sequence(s) using the alignment obtained by threading: 3D-PSSM/PHYRE, FUGUE, LOOPP and PredictProtein (TOPITS).

iv. Compare by superposition the models by homology and by threading. What are the main differences and why? What models do you think are more reliable and why?

g. Predict the putative fold of these problematic sequences by threading and fold prediction: split them into domains and assign the fold of each domain using all available servers.

h. Build models of the domains of the problematic sequences with the servers ModWeb, Swiss-Model and 3D-Jigsaw, using the threading alignments.

5) Loop Model Building and Refinement

a. Getting familiar with ArchDB. Browsing the database of protein loops

i. Check loops with 4 residues between α-helices.

ii. Check the affinity of β-α loops with a length of 5 residues to bind ATP. What classes/sub-classes are the best?

iii. Check the loops of the template structures used on the model building of your favorite sequence(s)

iv. Download the structure of one of your templates and check the interval of residues of one of its loops with RasMol. Remove this loop from the structure (see exercise 1.b.iv) and upload the new coordinates to query the database by structure. How is the removed loop classified?

b. Query ArchDB with the model of your favorite structure (use any of the structures previously modeled for uploading the structure)

i. Compare the loops of the model and the loops of the templates (check if all belong to the same classes and sub-classes). Check the loops with different conformation between the model and the template.

ii. Check the classes and subclasses assigned (if any) to the loops of the target that could not be aligned with the sequence of the template(s).

iii. Compare the putative conformations of the loops of the model that were previously checked in 5.b.i: download the protein structures that contain the loops with the same geometry (disposition of secondary structures), extract the coordinates of the particular loop (see exercise 1.b.iv) and superpose them with the loop of the model (identical procedure for using these coordinates alone).
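Removing a loop from a structure (5.a.iv) or extracting its coordinates (5.b.iii) can also be done with a short script instead of a text editor. The sketch below selects ATOM and HETATM lines by chain identifier (column 22) and residue number (columns 23-26) of the PDB format; the function names are illustrative and insertion codes are ignored.

```python
def in_loop(line, chain, first, last):
    """True if the line is an atom record of the given chain and residue range."""
    if not line.startswith(("ATOM", "HETATM")):
        return False
    return line[21] == chain and first <= int(line[22:26]) <= last


def extract_loop(pdb_lines, chain, first, last):
    """Keep only the atom records of residues first..last of the chain."""
    return [l for l in pdb_lines if in_loop(l, chain, first, last)]


def remove_loop(pdb_lines, chain, first, last):
    """Keep every line except the atom records of residues first..last of the chain."""
    return [l for l in pdb_lines if not in_loop(l, chain, first, last)]
```

The residue interval is the one identified with RasMol; the returned lines can be written to a new file and used for the database query or for the superposition.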

c. Build models of the conflicting loops with the servers ArchPred and ModLoop

i. Compare the previous and last model of the loops using ArchDB.

6) Evaluation of the model.

a. Download and install the program Prosa 2003 on your computer.

i. Run the tutorial examples (sessions 1, 2) of the manual by downloading from the PDB the files 2aat, 3aat, 1aaw and 1spa.

b. Evaluate the pseudo-energy of your model(s), according to statistical potentials, with ANOLEA, Verify3D and with Prosa 2003.

i. Compare both graphs of energy (for the results of ANOLEA you can use your favorite graphics program, e.g. Excel).

ii. Identify the peaks of positive energy as the regions where the model is likely wrong, and select the best model among the ones you have built.
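If the per-residue energies are exported as a list of numbers, the regions of positive energy can be located with a short script. The sketch below averages the profile over a sliding window and reports the stretches that stay above a threshold; the window size and the threshold of zero are assumptions to adjust, and the function names are illustrative.

```python
def smooth(values, window=5):
    """Average each value with its neighbours inside a centred window."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]  # shorter at both ends
        out.append(sum(chunk) / len(chunk))
    return out


def positive_regions(energies, window=5, threshold=0.0):
    """Return (start, end) index ranges, 0-based and inclusive, above the threshold."""
    regions = []
    start = None
    for i, e in enumerate(smooth(energies, window)):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(energies) - 1))
    return regions
```

Running it on the profiles of several models gives a quick comparison of how many residues fall in positive-energy stretches in each of them.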

c. Run the prediction of secondary structure of your target sequence with PSIPRED, JPRED, PROF and PredictProtein (PHD)

i. Compare the predicted secondary structure and the secondary structure of the model in the regions where the model is likely wrong.

ii. If the model and the prediction differ, check the accuracy of the prediction. Modify the model accordingly by extending or shortening the secondary structure elements.

iii. Reconstruct the model, check the loops and evaluate the new energy. Calculate the pseudo-energy of the new models and compare it with the previous models.
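The comparison between the predicted secondary structure and that of the model can be summarized per residue. The sketch below assumes both are given as strings of equal length in a three-state alphabet (H for helix, E for strand, C for coil), which usually requires reducing the assignment of the model to three states first; the function names are illustrative.

```python
def three_state_agreement(predicted, modelled):
    """Percentage of residues with the same state in both strings."""
    if len(predicted) != len(modelled):
        raise ValueError("both strings must cover the same residues")
    if not predicted:
        return 0.0
    matches = sum(p == m for p, m in zip(predicted, modelled))
    return 100.0 * matches / len(predicted)


def mismatch_positions(predicted, modelled):
    """1-based residue positions where the two assignments differ."""
    return [i + 1 for i, (p, m) in enumerate(zip(predicted, modelled)) if p != m]
```

The mismatch positions can then be checked against the regions of positive energy to decide which secondary structure elements to revise.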

7) Advanced exercises:

a. Model the following sequence

b. We only know the sequence of a protein. Can you tell us what its function should be and whether it can be performed?

c. We have the coordinates of a protein Cα trace. We wish to evaluate the difference between the pseudo-energy distribution along its sequence calculated with Prosa 2003 and with statistical potentials for Cα atoms and for Cβ atoms.

d. Detect the errors in the following model and fix them.