Introduction

Protein-protein interaction (PPI) is one of the main processes by which proteins perform their cellular functions. Thus, correctly identifying the PPI network (or protein interactome) of a given organism would be useful both to understand the key molecular mechanisms behind a biological function and to assign a function to an unknown protein based on its interactions. The iLoops Server uses the loop classification defined in ArchDB and/or the SCOP classification of domains to predict whether or not a pair of proteins interact. The method uses data both from known interactions and from putative non-interacting pairs (NIPs), identified as co-localized, non-redundant, random pairs of proteins not similar to known PPIs, to define interacting (positive) or non-interacting (negative) relations between groups of similar structural features of different proteins. Several parameters extracted from the positive and negative interaction signatures (such as the ratio of the number of signatures or the ratio of their best p-values) are submitted to a random forest classifier to decide whether or not there is an interaction.

The method


Main iLoops Server schema

a) Assignment of protein signatures

First, structural features (ArchDB loops and SCOP domains) are assigned to each given sequence through sequence homology using BLAST1. The hits used for the annotation must satisfy a minimum percentage of identity according to the length of the alignment (above the twilight-zone curve, as described by Rost2). Second, we require a minimum sequence coverage of the structure in the structural feature region (100% for loops and 75% for domains). For each protein, protein signatures are built as groups of up to 3 similar structural features. We consider three different types of structural features: groups of ArchDB loops, groups of SCOP domains, and groups of ArchDB loops located in the same SCOP domain.
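
The grouping step above can be sketched as follows. This is a minimal illustration, and `protein_signatures` is a hypothetical helper name: the similarity-based clustering the method uses to group features is omitted, and plain combinations stand in for it.

```python
from itertools import combinations

def protein_signatures(features, max_size=3):
    """Enumerate groups of up to `max_size` structural features for one
    protein. `features` is a list of feature labels (e.g. ArchDB loop
    subclasses or SCOP domain identifiers) assigned to the sequence.
    Note: plain combinations are an illustrative stand-in for the
    similarity-based grouping used by the actual method."""
    sigs = []
    for size in range(1, max_size + 1):
        sigs.extend(combinations(sorted(features), size))
    return sigs

# Three assigned features yield 3 singletons + 3 pairs + 1 triplet = 7 groups.
print(len(protein_signatures(["loop_1.1", "loop_2.4", "d.58.7"])))  # → 7
```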

b) Evaluation of interaction signatures

For a pair of proteins (A, B), an interaction signature is defined as a pair of protein signatures of the same type, one from protein A and the other from protein B. Each interaction signature is given a positive and a negative score (if any) according to the positive (M+, built from known PPIs) and negative (M-, built from known NIPs) scoring matrices.
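
The scoring step can be sketched as below, assuming the M+ and M- matrices are modelled as dictionaries mapping a pair of protein signatures to its p-value (the real storage format is an assumption of this sketch, not server code).

```python
def score_interaction_signatures(sigs_a, sigs_b, m_pos, m_neg):
    """Score every cross-protein signature pair for proteins A and B.
    `m_pos` (M+) and `m_neg` (M-) are modelled as dicts mapping a
    frozenset of two protein signatures to a p-value."""
    scored = []
    for sa in sigs_a:
        for sb in sigs_b:
            key = frozenset((sa, sb))
            pos = m_pos.get(key)  # None if never seen among known PPIs
            neg = m_neg.get(key)  # None if never seen among known NIPs
            if pos is not None or neg is not None:
                scored.append((sa, sb, pos, neg))
    return scored
```

Interaction signatures that were never observed in either matrix are dropped, matching the N/A behaviour described in the Output section.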


Building the positive scoring matrix from known PPI

c) Predicting PPI

The decision on whether or not a protein pair is a PPI is performed through a random forest classifier. The parameters included as data for the random forest are:

  • S+: Total number of positive signatures.
  • TOP10 pV+: Top ten best scoring positive p-values.
  • min(pV+), max(pV+), mean(pV+), Q1(pV+), Q2(pV+), Q3(pV+): parameters of the positive p-values distribution.
  • AA+: Total number of amino acids belonging to a positive signature.
  • %AA+: Coverage of amino acids belonging to a positive signature considering the size of the protein.
  • S-: Total number of negative signatures.
  • TOP10 pV-: Top ten best scoring negative p-values.
  • min(pV-), max(pV-), mean(pV-), Q1(pV-), Q2(pV-), Q3(pV-): parameters of the negative p-values distribution.
  • AA-: Total number of amino acids belonging to a negative signature.
  • %AA-: Coverage of amino acids belonging to a negative signature considering the size of the protein.
  • LpVR: Log ratio of the best positive p-value against the best negative p-value.
  • LSR: Log ratio of the total number of positive interaction signatures against the total number of negative interaction signatures.
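
The p-value-derived parameters above can be computed as in this sketch. The amino-acid coverage terms (AA, %AA) and the TOP10 lists are omitted for brevity, and the log base and the handling of empty signature sets are assumptions of the illustration.

```python
import math
import statistics

def rf_features(pos_pvals, neg_pvals, eps=1e-300):
    """Sketch of the p-value-derived part of the classifier's feature
    vector, from the lists of positive and negative signature p-values."""
    def dist_params(pvals):
        if not pvals:
            return {}
        if len(pvals) > 1:
            q1, q2, q3 = statistics.quantiles(pvals, n=4)
        else:
            q1 = q2 = q3 = pvals[0]
        return {"min": min(pvals), "max": max(pvals),
                "mean": statistics.mean(pvals),
                "Q1": q1, "Q2": q2, "Q3": q3}

    feats = {"S+": len(pos_pvals), "S-": len(neg_pvals)}
    feats.update({f"{k}(pV+)": v for k, v in dist_params(pos_pvals).items()})
    feats.update({f"{k}(pV-)": v for k, v in dist_params(neg_pvals).items()})
    if pos_pvals and neg_pvals:
        # Log ratios of best p-values (LpVR) and signature counts (LSR);
        # log base 2 is an assumption here.
        feats["LpVR"] = math.log2(min(pos_pvals) / max(min(neg_pvals), eps))
        feats["LSR"] = math.log2(len(pos_pvals) / len(neg_pvals))
    return feats
```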

Several random forest classifiers were built with WEKA4 considering different relative costs (RC) of false predictions (penalty for predicting as PPI a pair that is not) and expected unbalance ratios (UR) (expected proportion of PPI vs. NIPs in the set).
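
The original classifiers were trained with WEKA; the effect of the relative cost on the final decision can nevertheless be illustrated with a standard cost-sensitive decision rule, in which predicting an interaction pays off only when the classifier's vote fraction exceeds a cost-dependent threshold. This is a generic illustration, not the exact WEKA procedure.

```python
def cost_sensitive_decision(vote_fraction, relative_cost):
    """Decide PPI vs. non-PPI from a random forest vote fraction under a
    relative cost RC for false positive predictions.

    With cost RC for a false positive and cost 1 for a false negative,
    predicting "YES" minimises the expected cost only when the estimated
    probability p satisfies p > RC / (RC + 1), so raising RC raises the
    decision threshold (fewer, but more precise, "YES" predictions)."""
    threshold = relative_cost / (relative_cost + 1.0)
    return "YES" if vote_fraction > threshold else "NO"

print(cost_sensitive_decision(0.6, 1.0))  # → YES (threshold 0.5)
print(cost_sensitive_decision(0.6, 9.0))  # → NO (threshold 0.9)
```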

More detailed information on the construction of the random forest classifiers can be found in the supplementary note 8 of the method’s main article3.

Validation

Different datasets were compiled to validate the results of the random forest classifiers (see supplementary note 8 and tables S10 and S11 of the method's main article3 for details and numeric figures). The datasets were built using different unbalance ratios (the ones available in the server) and were tested using the previously trained classifiers, which consider different relative costs for false predictions. Given the heuristic, non-deterministic nature of random forest classifiers, ten replicas of each test were made to ensure the robustness of the predictions. Hence, for each relative cost and unbalance ratio assessed, we repeated 10 times n different tests, where n depends on the available data and ranges from 40 for the combination of loops and domains to 100 for the loops-based predictor. Positive predictive value (PPV, precision) and true positive rate (TPR, sensitivity) were computed as results of the validation. These values are given as the mean and standard error of the joint set of tests and replicas used for each tested unbalance ratio and relative cost.
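
PPV and TPR are computed from the confusion counts of each test in the usual way:

```python
def ppv_tpr(tp, fp, fn):
    """Positive predictive value (precision) and true positive rate
    (sensitivity) from the confusion-matrix counts of one validation
    test: true positives, false positives, and false negatives."""
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, tpr

# e.g. 40 true PPIs recovered, 10 false alarms, 60 interactions missed:
print(ppv_tpr(40, 10, 60))  # → (0.8, 0.4)
```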


Prediction of PPIs using random forest classifiers

Averaged PPV (blue lines) and TPR (yellow lines) are shown as functions of the relative cost applied in the WEKA package using interaction signatures {L} (a), {LD} (b), {D} (c), and joining signatures {L}, {LD}, and {D} (d). Note that TPR values were similar for all the unbalance ratios tested and are shown only for the most unfavorable one (1:50), since it encompasses the largest error. Standard deviations are shown as error bars. Unbalance ratios of PPIs versus NIPs are shown in different hues of blue: 1:1 (dark), 1:10 (navy), 1:20 (light), and 1:50 (cyan).


The validation of the random forest classification showed that the combination of both types of structural features performed far better than the others and better than random even in the most difficult circumstances (unbalance ratio = 1:50).

More detailed information on the random forest classifier evaluation can be found in the method’s main article3.

References

  1. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25(17), 3389–3402.
  2. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng., 12(2), 85–94.
  3. Planas-Iglesias, J., Bonet, J., Garcia-Garcia, J., Marín-López, M.A., Feliu, E. & Oliva, B. (2013). Understanding protein-protein interactions using local structural features. J. Mol. Biol., 425(7), 1210–1224.
  4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11(1), 10–18.

Submission

Mandatory input:

  • SEQUENCES BOX: A set of sequences (in standard FASTA format). The first word of the header line of each sequence (characters between “>” and the first whitespace) is used as the identifier code of the protein.
  • INTERACTIONS BOX: A list of the protein pairs to test (up to a maximum of 25). The pairs are indicated by the protein identifiers separated by a double colon “::” (see example).
IMPORTANT: Due to the importance of using the specific sequence provided by the user, the iLoops server only processes identifiers given in the SEQUENCES BOX. Hence, it does not retrieve sequences from external databases.
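
The input checks described above can be sketched as follows. This is a minimal illustration, not server code, and `validate_submission` is a hypothetical helper name.

```python
def validate_submission(fasta_text, interactions_text, max_pairs=25):
    """Check a submission: identifiers are the first word of each FASTA
    header, pairs are separated by '::', at most `max_pairs` pairs are
    allowed, and every identifier must have a sequence."""
    # Identifier = characters between '>' and the first whitespace.
    ids = {line[1:].split()[0] for line in fasta_text.splitlines()
           if line.startswith(">") and line[1:].strip()}
    pairs = []
    for line in interactions_text.splitlines():
        if not line.strip():
            continue
        parts = line.strip().split("::")
        if len(parts) != 2:
            raise ValueError(f"Not valid interaction format: {line!r}")
        pairs.append(tuple(parts))
    if len(pairs) > max_pairs:
        raise ValueError(f"The number of interactions exceeds the limit ({max_pairs})")
    # Integrity check: every protein in a pair needs a sequence.
    missing = {p for pair in pairs for p in pair if p not in ids}
    if missing:
        raise ValueError(f"Integrity failure: no sequence for {sorted(missing)}")
    return pairs
```

The error messages mirror those listed in the troubleshooting section below.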

Additional parameters are:

  • Domain mapping: PFAM or CATH domains can be mapped on the given sequences.
  • iLoops analysis to run: The user can select the structural feature (loops, domains, or a combination of both) used in the calculation. By default, the calculation is performed using the combination of loops and domains as the structural feature of interest. Note that this combination has a lower applicability than the other analyses; if you do not obtain iLoops results for your query proteins, try a non-default analysis.

After submitting a job, the user will be given a code to retrieve the results linked to a specific web address.

The server contains pre-calculated sample data to clarify the input and output formats (check the "Try sample data" box).

Output

To retrieve the results the user can:

  • Use the code provided after the submission in the Results tab
  • Save or bookmark the link assigned to the provided code

While the prediction is running, the Results page reloads automatically until it finishes.

Results for each submission are stored for 7 days before being erased completely. During that time the user can download the results as a compressed XML file.

Interpretation of results

Basic options in the Result Summary page allow the user to set the expected unbalance ratio in the input data. The unbalance ratio reflects the expected proportion of PPIs vs. NIPs in a certain experimental condition. According to the selected unbalance ratio, the server sets the most convenient relative cost of false predictions. This parameter, which refers to the random forest classifier, can also be set manually using the advanced options. Increasing the relative cost of false predictions will improve the positive predictive value (PPV or precision) of the prediction, but will tend to decrease its true positive rate (TPR or sensitivity), as can be seen in the Validation section of this help.

According to the selected unbalance ratio and relative cost, the Result Summary page lists all the tested interactions. It provides for each one:

  • Name of the pair of proteins
  • Interaction prediction (YES/NO)
  • Score given by the random forest classifier on the prediction of interaction or non-interaction (it ranges from 0.5 to 1 both for positive and negative predictions)
  • Inferred precision, calculated as detailed below
  • Interaction details (new page)
NOTE that the selected unbalance ratio affects the inferred precision of the prediction, while the relative cost may affect the prediction itself and its score (i.e. under two different relative costs, the same protein pair may achieve different random forest scores, or the prediction class may even change from “YES” to “NO” or vice versa).

The Prediction column specifies the final decision for the protein pair:

  • YES: If the iLoops Server predicts that the pair of proteins does interact
  • NO: If the iLoops Server predicts that the pair of proteins does not interact
  • N/A: If the iLoops Server cannot apply the method to that specific pair. This can happen because:
    • No structural feature (loop or domain) can be assigned to either or both proteins
    • Despite the assignment of structural features, none of the resulting interaction signatures is described as either favouring or disfavouring the interaction in our database

The inferred precision is derived from the validation of the method, and it has been calculated separately for each of the two prediction classes, “YES” and “NO”. For each prediction class, relative cost and unbalance ratio tested, the validated data (10 replicas of n different datasets) were split into bins according to the obtained random forest score, each bin spanning 0.050 points of score. The inferred precision was computed as the average PPV of all data within a score bin. If the prediction is “YES”, the precision corresponds to the number of correct “YES” predictions within the bin over the total number of predictions within the bin. Conversely, if the prediction is “NO”, the precision represents the number of correct “NO” predictions within the bin over the total number of predictions within the bin. The inferred precision is given with an error range, which corresponds to the standard error of the computed PPV mean.
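
The binning procedure can be sketched as below, assuming the validated data are kept as (score, predicted class, true class) triples; this bookkeeping is an assumption of the sketch, not the server's actual implementation.

```python
import math

def inferred_precision(validated, query_score, query_label, bin_width=0.05):
    """`validated` holds (rf_score, predicted_label, true_label) triples
    from the validation replicas. The inferred precision for a query is
    the fraction of same-class predictions in the query's 0.050-wide
    score bin whose predicted class matched the true class."""
    lo = math.floor(query_score / bin_width) * bin_width  # lower bin edge
    in_bin = [(pred, true) for s, pred, true in validated
              if pred == query_label and lo <= s < lo + bin_width]
    if not in_bin:
        return None  # this score bin was never observed during validation
    return sum(pred == true for pred, true in in_bin) / len(in_bin)
```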

The Interaction details page provides additional details for each query-pair of the submission:

  • Prediction details displays the number of positive and negative interaction signatures, the log2 ratio of signatures and the log2 of the p-value.
  • Loops/Domains assigned to ... lists the structural features assigned to each member of the protein pair. Each structural feature is linked to its description in the corresponding database and to the homolog through which that feature has been assigned. The segment that the structural feature covers in the sequence is shown, so that the user can map those features onto their sequence, and the alignment itself can be explored through its link. In the case of a prediction through loops, the domain to which each loop is assigned is also shown. If the loop falls outside any region recognized as a SCOP domain, an “out of domain” message is displayed in the column. Furthermore, if the user has requested so, the PFAM and CATH domains matching the structural features are displayed.
  • Positive and negative interaction signatures are displayed separately and sorted from most to least reliable p-value.

Detailed result relevance

Apart from the interaction/non-interaction prediction, the specific data on each partner and each participating structural feature can be quite useful. Although the iLoops Server DOES NOT PREDICT THE INTERACTING REGION, knowing the segments of the protein most relevant to promoting or avoiding the interaction (those regions contained in a structural feature with a high participation in defining the interaction signatures, both positive and negative) can be used, for example, to tweak a possible interaction through directed mutation.

For any question or suggestion about the iLoops Server, please contact us: sbi 'plus' iloops 'at' upf 'dot' edu.

Input: troubleshooting errors

Why do I get a “sequence/interactions field required” error?

This error is shown when the mandatory inputs (SEQUENCES BOX and/or INTERACTIONS BOX) are empty. Fill in the corresponding field with the information needed to solve this error.

Why do I get an “At least one FASTA sequence is required” error?

This error is displayed when the FASTA field is not empty but no FASTA header could be found. It can happen when the user provides sequences without a header, which is necessary to identify each protein in the INTERACTIONS BOX. To resolve this error, enter a header line (starting with a “>” character) for each of your FASTA sequences.

Why do I get a “Not valid FASTA sequences […]” error?

Protein FASTA sequences are represented using single-letter codes (standard amino acid abbreviations). Thus, neither numeric digits nor special characters (e.g. +, -, %, $) are allowed. Remove the forbidden characters to obtain correct FASTA sequences.

Why do I get a “The number of interactions exceed the limit (25 interactions)” error?

The iLoops Server allows the user to test up to 25 protein pairs simultaneously. Please remove the exceeding pairs. If you need to run a larger test, contact us: iLoopsSBI 'at' gmail 'dot' com.

Why do I get a “Not valid interaction format […]” error?

Each protein pair to evaluate is represented on a new line with their respective FASTA headers separated by double colons “::”. Follow this format to introduce correct protein pairs.

Why do I get an “Integrity failure […]” error?

Each protein needs to have its sequence entered in the SEQUENCES BOX. A yellow warning box shows the identifiers of the proteins for which no sequence has been found. Fill in the missing sequences in order to run the evaluation.

Input: restrictions

Do the query proteins require homologous known protein structures?

Yes, at least to some extent. Particularly, to be amenable for interaction prediction, each protein in the query pair must have:

  • A certain sequence similarity between the input sequence and a protein sequence with known structure stored in ArchDB (for loops) or in SCOP (for domains). The sequence alignment must be above the twilight zone1.
  • A high coverage between the input sequence and the protein with known structure in the alignment of a specific feature region (100% for loops and 75% for domains)

Furthermore, the prediction for the query pair is unfeasible if none of the resulting interaction signatures was previously observed in the fractions of the Positive and Negative Evaluation Sets used to score interaction signatures2 (details concerning this issue are explained in supplementary note 8 and supplementary table S10).

Do the protein pairs have to be co-located?

Not necessarily. The server evaluates any pair of proteins provided if a) structural features can be assigned to both members of the pair and b) at least one of the interaction signatures of the pair is recorded and scored in the iLoops database (see the method’s main article2).

However, results provided by iLoops must be interpreted with caution if non-co-located pairs are given as input. This is because the underlying funnel-like intermolecular energy landscape model of protein-protein interactions3,4 assumes co-evolution of the interacting pair. To learn characteristic patterns of non-interacting structural features, iLoops makes a parallel assumption for non-interacting pairs, restricting the learning set to pairs of co-located non-interacting proteins. This assumption implies that a pair of non-interacting proteins will only develop non-interacting signals if the interaction between them is possible. If such a pair is not co-located, there would be no need to develop the signals preventing the molecular binding.

Hence, the server will likely fail to detect signals that enhance or prevent the interaction if you use pairs of non-co-located proteins. However, if there exists a pair of homologous sequences that are co-localized, or if the ancestors of the queried pair were co-localized in the past, the iLoops server may still provide useful predictions for your proteins.

What species is this tool expected to be useful for?

The iLoops server may be used to test interactions in proteins of any species. However, in order to obtain more accurate predictions, it is useful to query protein pairs from each species separately. This is because 99% of our test set consisted of pairs of proteins of the same organism (the remaining 1% consisted of inter-species protein pairs). If you are interested in testing inter-species protein pairs, the main requirement for enhancing the accuracy of the obtained predictions is to know the unbalance ratio between PPIs and NIPs of the co-located inter-species protein pairs.

Interpreting the results

What unbalance ratio should I set to get the best predictions possible?

It depends on your experiment. This parameter should reflect the actual unbalance ratio between protein-protein interactions and non-interacting pairs in your experimental system. Normally, the best unbalance ratio to use is the one naturally occurring in co-localized proteins of the species you are studying (e.g. 1:50 in human). However, different circumstances may call for a different unbalance ratio:

Previous knowledge

If you have additional knowledge about your dataset, it can be used to better set the unbalance ratio. For instance, if you are testing a list of pre-filtered candidate interactors, you should reduce the unbalance ratio in the proportion determined by your filter.

Let's suppose you want to know the interactors of the human protein X and, using any method you trust, you have compiled a list of candidates that represents 20% (1 over 5) of all the proteins that co-localize with X. Assuming that your filter (or prior knowledge) is correct, you have rejected a huge number of non-interactors. If the normal unbalance ratio to use for co-localized proteins in human is 1:50, your previous knowledge has reduced this unbalance ratio by a factor of 5: (1/50) / (1/5) = 5/50 = 1/10. Hence, the best unbalance ratio to use for this example would be 1:10.
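
The arithmetic of this example can be written out explicitly (`adjusted_unbalance_ratio` is a hypothetical helper for illustration):

```python
from fractions import Fraction

def adjusted_unbalance_ratio(natural_ratio, filter_fraction):
    """Adjust the naturally occurring unbalance ratio by the fraction of
    candidates that a pre-filter keeps. Both arguments are Fractions,
    e.g. Fraction(1, 50) for a 1:50 ratio of PPIs to NIPs."""
    return natural_ratio / filter_fraction

# 1:50 natural ratio, filter keeping 1 in 5 candidates:
print(adjusted_unbalance_ratio(Fraction(1, 50), Fraction(1, 5)))  # → 1/10
```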

Predict interacting domains in interacting proteins

The problem of elucidating which domains interact (at least one pair) in a multi-domain protein-protein interaction is similar to identifying the actual protein-protein interaction in an unbalanced set of protein pairs. In the multi-domain case, the unbalance ratio to apply depends on the maximum number of possible domain-domain interactions.

For instance, suppose you know that two multi-domain proteins interact with each other. Protein A has 5 domains and protein B has 4 domains. Hence, the number of possible domain-domain interactions is 5*4=20. If you consider that only one of these possibilities represents the real domain-domain interaction and you want iLoops to predict it, you would set an unbalance ratio of 1:20 for this experiment.

Elucidate interacting proteins in a Tandem Affinity Purification (TAP) experiment

A typical situation in which the unbalance ratio should not represent the naturally occurring in vivo unbalance ratio is when you are requesting predictions on the results of an experiment. In fact, such experiment represents an a priori filter, previous knowledge you have gained on your data beforehand (see above).

In this example we consider the results of a TAP experiment in which, for one of your bait proteins, you eluted twenty possible interactors (prey). Although bait and prey can form a large complex, you know (or suspect) that only one of the possible interactors actually interacts with the bait protein. In this case, the best unbalance ratio to use is 1:20, since this reflects the expectation derived from your previous knowledge and/or experiment.

OK. I’ve chosen the unbalance ratio that best represents my experimental conditions. Now, what relative cost should I choose?

Once the expected unbalance ratio is set, the choice of relative cost depends on the expectations of the particular experiment. If you want to improve the coverage of your experiment at the expense of the quality of the predictions, use a low relative cost. On the contrary, if you want to obtain an accurate set of predictions at the expense of coverage, use a high relative cost. This is shown in the graphics presented in the About iLoops section of this help. Default values are set as those that maximize the sensitivity and the precision of the predictions for the selected unbalance ratio according to those graphics.

Why did increasing the unbalance ratio increase the inferred precision in the results, when one would expect the opposite effect?

This is because changes in the unbalance ratio affect the relative cost of false positive predictions (if you are using only the basic options of the server). In turn, using different relative costs for false predictions in the random forest classifiers may lead to different prediction outcomes. That is, the prediction for your queried protein pair probably changed from a weak “YES” to a strong “NO” upon increasing the expected unbalance ratio of your input data, which is in agreement with the increase of the unbalance ratio.


References

  1. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng. 12(2), 85–94.
  2. Planas-Iglesias, J., Bonet, J., Garcia-Garcia, J., Marín-López, M.A., Feliu, E. & Oliva, B. (2013). Understanding protein-protein interactions using local structural features. J. Mol. Biol., 425(7), 1210–1224.
  3. McCammon, J.A. (1998) Theory of biomolecular recognition. Curr. Opin. Struct. Biol. 8 (2), 245-249
  4. Tsai, H.H., Reches, M., Tsai, C.J., Gunasekaran, K., Gazit, E. & Nussinov, R. (2005). Energy landscape of amyloidogenic peptide oligomerization by parallel-tempering molecular dynamics simulation: significant role of Asn ladder. Proc. Natl. Acad. Sci. U.S.A., 102(23), 8174–8179.