--------------------- README.piana_examples ---------------------

This file describes a few examples of how to execute piana for different purposes.

For all these examples there is an associated piana configuration file
(which you can find under piana/code/execs/conf_files/).

To create your own configuration file (which should contain the parameters and
commands you need), follow the instructions in
piana/code/execs/conf_files/general_template.piana_conf

To try these examples on your machine, make sure that all installation
instructions have been followed: piana/README.piana_installation

---------------------------------------------------------------------

A generalized example of using piana would be:

  1. follow the instructions in piana/code/execs/conf_files/general_template.piana_conf
     to define your parameters and execution commands

  2. execute piana.py on the command line, giving as argument the configuration file you wrote

       $> python2.3 piana.py --configuration-file=your_configuration_file

     [alternatively, you can write a configuration file that leaves some parameters 'blank',
      and set their values through the command line:

       $> python2.3 piana.py --configuration-file=your_configuration_file --piana-dbname=pianaDB_limited --piana-dbhost=piana_server ]

  3.
     analyze results from output files

----------------------------------------------------------------------
EXAMPLES OF PIANA EXECUTION
----------------------------------------------------------------------

--> all examples are shown for a mysql database pianaDB_limited located on the same
    machine as the code (ie localhost), with a mysql server that does not require a password

    --> if your mysql server is on a machine different from the one where your code is,
        write the name of that machine instead of localhost

    --> if your mysql server requires a password, you need to add to all command lines
        the arguments: --piana-dbuser=username and --piana-dbpass=password

DON'T FORGET TO set the correct piana-dbname and piana-dbhost in the commands described below!!!
(unless your database name is pianaDB_limited and your mysql server is localhost)

Attention! The protein lists provided as examples DO NOT correspond to real examples
(except for example 8): therefore, biological interpretations of the examples described
below are not advised... They are presented here just to give you a clearer idea of
how to use PIANA

Attention! The results described in this file or in directory piana/code/execs/dummy_files/output
are not always updated to show the latest developments of PIANA. Therefore, some details
such as format or information might differ from the results you will get when following
the instructions of the examples. For an updated description of PIANA formats and information,
look at file piana/code/execs/conf_files/general_template.piana_conf

********************************************************************
EXAMPLE 1 ==> Your first PIANA example
********************************************************************

Situation: we want to create the interaction network for a single protein and view the
           information associated with the proteins in that network

1.1. go to the piana execution directory

       $> cd piana/code/execs

1.2.
     create a file with one protein per line, using your preferred type of protein identifier

     --> we have placed one example protein in file piana/code/execs/dummy_files/input/example_protein.txt

1.3. write a configuration file to obtain results for this protein:

     --> we have written for you the configuration file: piana/code/execs/conf_files/first_example.piana_conf

         - look at this configuration file to better understand how this file has been written
         - descriptions of all PIANA parameters and commands are provided in the template for
           configuration files: piana/code/execs/conf_files/general_template.piana_conf

1.4. execute piana with this configuration file to get the interaction network and table

       $> python piana.py --configuration-file=conf_files/first_example.piana_conf

     --> Attention! If you look into first_example.piana_conf you'll see that we have asked PIANA
         to print results in two format modes: html and txt. This means that if you have executed
         the command as described above, you'll have results files '.txt' and '.html'. Both format
         modes show more or less the same information (txt is always more complete) but have to be
         visualized differently: txt files are easily parseable (or visualized using any text
         editor) and html files have to be visualized with a web browser.

         - 'txt' mode is mainly thought to be used for quick visualization or for being parsed afterwards by the user.
         - 'html' mode is thought to be used to visually interpret the results

         In most of the following examples, for speeding up things when looking at results, we are
         going to use 'txt' format mode. In case you are interested in visualizing your results in
         a web browser, just change the format-mode argument of the piana command of your
         configuration file from txt to html

     --> Attention! If you look into first_example.piana_conf you'll see that we have asked PIANA
         to print the network in DOT format (format-mode=dot). This can be changed to other formats (eg.
         SIF format, if you want to visualize the network using cytoscape).

     the previous command created the following files:

       - example_results.all.print-network.dot              -> DOT file of the network (to be converted into an image as explained below in 1.5)
       - example_results.all.print-table.html               -> HTML file with the table of protein interactions
       - example_results.all.print-table.txt                -> Text file with the table of protein interactions
       - example_results.compact.print-all-prots-info.html  -> HTML file with information about the proteins in the network
       - example_results.compact.print-all-prots-info.txt   -> Text file with information about the proteins in the network

1.5. to visualize the network you can use any software that reads .dot format
     (eg. neato from Graphviz, see README.piana_requirements):

       $> neato -Tgif -o example_results.gif example_results.all.print-network.dot
       $> xview example_results.gif

     -> the image you just produced, 'example_results.gif', should be identical to
        piana/code/execs/dummy_files/output/example1/example_results.gif
        (unless your database contains interactions different from those in the
        pianaDB_limited version provided on our website)
     -> more instructions on network visualization can be read at piana/code/execs/README.visualize_piana_network
     -> color codes used in the network are explained in command print-network of file
        piana/code/execs/conf_files/general_template.piana_conf
        ---> color meanings of the network are also shown in this image: piana/docs/documentation/network_colors.gif

1.6. look at the interaction table in html format in your web browser, using option 'open file'
     and then searching in the directory for 'example_results.all.print-table.html'

1.7. look at the information about the network proteins in html format in your browser, using
     option 'open file' and then searching in the directory for
     'example_results.compact.print-all-prots-info.html'

Attention! Most PIANA commands and parameters can also be given to piana interactively. Just do...

       $piana/code/execs> python piana.py

...
and PIANA will show you the possibilities of the interactive mode. The interactive mode also
accepts a configuration file where you can set the execution parameters: just set the exec-mode
parameter of the configuration file to interactive. The following examples are all shown for
batch mode, but they could also have been achieved using the interactive mode.

  --> in interactive mode, the commands section of the configuration file is ignored
  --> some parameters and commands are not available in interactive mode

***********************************************************************
EXAMPLE 2 ==> Getting standard results (interactions, network, proteins that
              connect root nodes, etc) for a list of proteins
***********************************************************************

Situation: we want to get all standard results for the list of proteins in
           piana/code/execs/dummy_files/input/liver_cancer_proteins.txt

  --> this file contains one protein of interest per line (we are supposedly studying
      proteins related to liver cancer)
      -> in PIANA, the proteins that are used to build a network (in this case, proteins
         from file liver_cancer_proteins.txt) are called root proteins
      -> these proteins come from any type of wet lab experiment where they have been
         found to be somehow related to liver cancer

  --> we want to analyze the protein interaction network formed by these proteins and
      their interaction partners
      -> by visualizing the network
      -> by listing the information associated with the proteins in the network
      -> by identifying other relevant proteins with a high probability of being related
         to liver cancer
         --> for example, linker proteins (proteins that connect more than one root protein)
             are of special interest because they connect two proteins that we know are
             related to liver cancer.

2.0.
     the input file liver_cancer_proteins.txt contains one protein per line, where the code
     used for proteins is uniprot accession number

     --> the first important thing to do is to find out which is the 'piana identifier name'
         for uniprot accession numbers

         -> you can get a list of all 'piana names for identifier types' by doing:

              $piana/code/execs> python piana.py --print-reference-card --piana-dbname=pianaDB_limited --piana-dbhost=localhost

         -> in the case of uniprot accession numbers, the 'piana name' is 'uniacc'
         -> other piana names are 'unientry' (for uniprot entries), 'gi' (for ncbi GenBank gi),
            'geneName' (for gene names), 'geneID' (for ncbi Gene ID), ...

2.1. In order to get the results for these proteins, we need to create a configuration file
     that sets the parameters and executes the commands needed for getting standard results.
     To create this configuration file, we use general_template.piana_conf as a guide to
     creating our own configuration file

     -> the result of modifying general_template.piana_conf can be seen in
        piana/code/execs/conf_files/get_example_results.piana_conf

     -> instead of setting all the parameters in the configuration file itself, we leave some
        of them to 'blank'. This implies that parameters set to 'blank' have to be set in the
        command line when calling PIANA.

        --> This is done this way so we can use one configuration file for many experiments.
            For example, in this case we might want to use get_example_results.piana_conf for
            other proteins, and therefore, parameters 'input-file', 'input-id-type',
            'output-id-type' and 'results-prefix' of the configuration file have been left to 'blank'.
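As a rough illustration of how 'blank' parameters and command-line values fit together, here is a minimal Python sketch. It is not PIANA code: the function resolve_parameters is hypothetical, and raising an error for a parameter that is still 'blank' is an assumption made for the sketch (this README only says that such parameters have to be set on the command line). The rule that command-line values take precedence over the configuration file is the one stated in this README.

```python
def resolve_parameters(config_params, cli_params):
    """Merge configuration-file parameters with command-line ones.

    Hypothetical helper (not part of PIANA): command-line values win,
    and any parameter still set to 'blank' afterwards is reported.
    """
    resolved = dict(config_params)
    resolved.update(cli_params)  # command line overrides configuration file
    # Assumption of this sketch: a leftover 'blank' is treated as an error.
    missing = sorted(name for name, value in resolved.items() if value == "blank")
    if missing:
        raise ValueError("parameters still 'blank': " + ", ".join(missing))
    return resolved


# Parameters left 'blank' in get_example_results.piana_conf, as described in this README
config_params = {
    "piana-dbname": "blank",
    "piana-dbhost": "blank",
    "input-file": "blank",
    "input-id-type": "blank",
    "output-id-type": "blank",
    "results-prefix": "blank",
}

# Values given on the command line in step 2.2
cli_params = {
    "piana-dbname": "pianaDB_limited",
    "piana-dbhost": "localhost",
    "input-file": "dummy_files/input/liver_cancer_proteins.txt",
    "input-id-type": "uniacc",
    "output-id-type": "uniacc",
    "results-prefix": "liver_cancer_results",
}

resolved = resolve_parameters(config_params, cli_params)
print(resolved["results-prefix"])
```

Keeping the merge in one place makes it easy to reuse a single configuration file for many runs, which is the reason given in this README for leaving parameters 'blank'.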
     As you saw in example 1, if you usually repeat the same PIANA run, you can fix all these
     parameters in the configuration file and call PIANA with just the argument
     --configuration-file=your_configuration_file.piana_conf

     In get_example_results.piana_conf, we have also left 'piana-dbname' and 'piana-dbhost' as
     'blank', because we want to run PIANA using different piana databases. However, if you
     only have one piana database, you can fix the parameters 'piana-dbname' and 'piana-dbhost'
     in all your configuration files, so you don't have to write them each time on the command line.

     Note: parameters set through the command line override parameters set in the configuration
     file. Therefore, you can have a configuration file with the database name you normally use
     and, for those cases in which you want to use another database, set it through the command line.

2.2. Execute PIANA, giving the configuration file as argument, as well as the parameters that
     we left as 'blank' in the configuration file

       # go to the piana execution directory
       $> cd piana/code/execs

       # execute piana.py
       $> python piana.py --configuration-file=conf_files/get_example_results.piana_conf --piana-dbname=pianaDB_limited --piana-dbhost=localhost --input-file=dummy_files/input/liver_cancer_proteins.txt --input-id-type=uniacc --output-id-type=uniacc --results-prefix=liver_cancer_results

2.3.
     In the directory where you executed piana (in the configuration file you can change the
     directory where results are printed), you'll find the results files:

       liver_cancer_results.all.print-network.dot                  --> the network in .dot format that you can convert into a network image using neato
       liver_cancer_results.all.print-table.txt                    --> the text table with all the interactions
       liver_cancer_results.compact.print-all-prots-info.txt       --> text information for all proteins in your network
       liver_cancer_results.compact.print-connect-prots-info.txt   --> text information for proteins that connect your root proteins (linker proteins)
       liver_cancer_results.all.print-table.html                   --> the html table with all the interactions (use a browser to visualize)
       liver_cancer_results.compact.print-all-prots-info.html      --> html table with information for all proteins in your network (use a browser to visualize)
       liver_cancer_results.compact.print-connect-prots-info.html  --> html table with information for linker proteins (use a browser to visualize)

     ( we have placed the files we obtained under
       piana/code/execs/dummy_files/output/liver_cancer_results/liver_cancer_results.*
       Of course, if you are using a database different from the one we provide on our website,
       the content of the files will be different. )

2.4. Analyze the results you obtained

2.4.1 -> to visualize the network you can use any software that reads .dot format
         (eg. neato from Graphviz, see README.piana_requirements):

           $> neato -Tgif -o liver_cancer_results.gif liver_cancer_results.all.print-network.dot
           $> xview liver_cancer_results.gif

         -> you can see that there is only one interaction coming from DIP (in red) and the rest
            are predictions by structural similarity (in green)
         -> the proteins that were used to build the network are in yellow (root proteins).
            Their interaction partners are in blue.
         -> as you see, no interactions were found for root protein P53985
         -> the image we have obtained can be found in piana/code/execs/dummy_files/output/liver_cancer_results.gif
         -> more instructions on network visualization can be read in piana/code/execs/README.visualize_piana_network
         -> color codes used in the network are explained in command print-network of file general_template.piana_conf
            ---> color meanings of the network are also shown in this image: piana/docs/documentation/network_colors.gif

2.4.2 -> You can use results file liver_cancer_results.compact.print-all-prots-info.txt to search
         for specific protein information you are interested in (it may be easier to read in
         its html version)

         --> However, for parsing, txt mode is more convenient: the format followed by output txt
             files is explained in file piana/code/execs/conf_files/general_template.piana_conf

         --> Other searches can be manual: for example, if we were interested in chaperones, we
             could do a manual search on this file to see if there are any chaperones in the network
             (If you are using the pianaDB_limited version provided on our website, you won't
             observe this result, as the description has been deleted from the database due to
             copyright and database size)

           $> grep "chaperone" liver_cancer_results.compact.print-all-prots-info.txt
----------------------------------------------------------------------------------------------------------
Q9NU22 ['MDN1, midasin homolog (Yeast).', 'Midasin (MIDAS-containing protein).', 'midasin', 'MDN1, midasin homolog (yeast)'] ['May function as a nuclear chaperone and be involved in the assembly/disassembly of macromolecular complexes in the nucleus.'] root=0 expression=None fitness=no emblpid:CAI13203 emblpid:CA
...............................
...............................
----------------------------------------------------------------------------------------------------------

2.4.3 -> As you saw in the image liver_cancer_results.gif, there is one linker protein: P56715,
         connecting root proteins P43304 and P04792

         Working with experimentalist collaborators, we have found that 'linker proteins' are
         usually of great interest to them. Why? Very simple: if we have produced a network from
         a list of 'interesting proteins' and in the network there are proteins connecting these
         'interesting proteins', it is very likely that the proteins that act as connectors are
         also 'interesting'. Therefore, P56715 is probably also involved in liver cancer, and it
         should be the subject of further studies.

         (remember this is a dummy example... don't try to submit an article to Nature saying
         that you have found a new protein involved in liver cancer...)

         In this case it was easy to visually identify the linker protein. However, in more
         complex networks it isn't that easy: but don't worry! PIANA does it for you.
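To make the idea of a linker protein concrete, here is a minimal Python sketch. It is not PIANA's implementation: the function find_linkers is hypothetical, and it assumes that a linker is a non-root protein interacting directly with at least two root proteins. The root proteins and the two P56715 interactions come from this example; the partner named HYPOTHETICAL_1 is made up for illustration.

```python
from collections import defaultdict


def find_linkers(interactions, roots):
    """Return {linker: sorted list of root partners}.

    Assumption of this sketch: a linker is a non-root protein that
    interacts directly with two or more root proteins.
    """
    roots = set(roots)
    root_partners = defaultdict(set)
    for protein_a, protein_b in interactions:
        if protein_a in roots and protein_b not in roots:
            root_partners[protein_b].add(protein_a)
        elif protein_b in roots and protein_a not in roots:
            root_partners[protein_a].add(protein_b)
    return {
        protein: sorted(partners)
        for protein, partners in root_partners.items()
        if len(partners) >= 2
    }


roots = ["P43304", "P04792", "P53985"]
interactions = [
    ("P56715", "P43304"),
    ("P56715", "P04792"),
    ("P04792", "HYPOTHETICAL_1"),  # made-up partner, touches only one root
]

linkers = find_linkers(interactions, roots)
print(linkers)  # {'P56715': ['P04792', 'P43304']}
```

A protein that touches only one root (like the made-up partner above) is not reported, which matches the definition of linker proteins given at the start of this example.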
         You can see the list of linker proteins of the network (with additional info about them)
         in the text results file 'liver_cancer_results.compact.print-connect-prots-info.txt'
         or as an html table in 'liver_cancer_results.compact.print-connect-prots-info.html'

         In addition to this information about linker proteins, if you have created a local GO
         database (see README.populate_piana_db to learn how to create a local GO database) you
         can also produce an html table describing the linkers with their GO terms (optionally
         fixing the level of the GO hierarchy from which you want to get the GO term)

         To do this, we use a separate parser located in piana/code/evaluation/tests

         --> the parser takes as input files *.print-connect-prots-info.txt
             -> therefore, if you want to get GO information, you should get the linker proteins in 'txt'
             -> as you saw before, you can also get the linker proteins in html, but this file is
                not parseable by parse_linkers.py

           $> cd piana/code/evaluation/tests
           $> python parse_linkers.py --input-file=../../execs/liver_cancer_results.compact.print-connect-prots-info.txt --input-id-type=uniacc --piana-dbname=pianaDB_limited --piana-dbhost=localhost --results-prefix=liver_cancer_linkers.go --output-format=html --print-go-info --go-dbname=goDB --go-dbhost=localhost --go-level=-1 --label-size=all

         (do '$> python parse_linkers.py --help' for more parsing options, such as changing the
         GO level or highlighting keywords)

         This command will print results to files:

           liver_cancer_linkers.go.linkers_table.html --> HTML with a table where each linker is described
           liver_cancer_linkers.go.dot                --> DOT file for the linkers and roots (to be visualized using neato) with their GO terms

         to create the image of the network using GO terms:

           $> neato -Tgif -o liver_cancer_linkers.go.gif liver_cancer_linkers.go.dot

         ( you can see these files under dummy_files/output/liver_cancer_linkers.*, and decide
           whether it is worth your while to create a local GO database or not.)

2.5. Now....
     imagine that you have a set of keywords that you want to use to check whether the proteins
     in your network are involved in certain processes

     For example, for liver cancer we could check whether the keywords cancer, stress, carcinoma,
     tumor, apoptosis or death appear in the network.

     Note: this example does not work with the current version of the PIANA database on the
     website, as the description and function information has been removed. If you are
     interested in using this information, the database must be created from scratch.

     PIANA does this automatically for you by coloring the nodes in red and adding labels to the
     tables whenever the keywords appear in the protein function, name or description.

     All you have to do is make a small change to the configuration file
     get_example_results.piana_conf (or create a new configuration file):

       where it says    list-keywords=blank
       you should write list-keywords=cancer:stress:carcinoma:tumor:apoptosis:death

     -> Then, repeat step 2.2, setting the command line argument '--results-prefix' to
        'liver_cancer_results.keywords' (this ensures that results are written to different files)

     -> Then, repeat step 2.4.1.
        to visualize the network, using neato to convert
        liver_cancer_results.keywords.all.print-network.dot into liver_cancer_results.keywords.gif

     If you do '$> xview liver_cancer_results.keywords.gif' you'll see that protein P04792 is now
     orange, which means that it is a root protein that contains a keyword

     -> you can find the new image we obtained in dummy_files/output/liver_cancer_results.keywords.gif
     -> more instructions on network visualization can be read in piana/code/execs/README.visualize_piana_network
     -> color codes used in the network are explained in command print-network of file general_template.piana_conf
        ---> color meanings of the network are also shown in this image: piana/docs/documentation/network_colors.gif

     If you look at the other results files, you'll see that now, in the html tables, the
     proteins that contain a keyword are highlighted in red and underlined. In the text results
     files, the keywords that were found in a protein are written in tokens 'user_keyword=word'

     -> you can see the new files with keywords highlighted in dummy_files/output/liver_cancer_results.keywords.*
     -> for example, if you open liver_cancer_results.keywords.all.print-table.html with your
        browser, you can see that 'P04792' is highlighted in the three interactions where it appears.

**********************************************************************
EXAMPLE 3 ==> Getting protein code equivalences between uniprot
              accession numbers and uniprot entry identifiers
**********************************************************************

Situation: we need the uniprot accession equivalents of proteins that we have in a
           different type of identifier (eg. uniprot entry names)

3.0. our input file piana/code/execs/dummy_files/input/proteosome.uniprot_entries contains
     one uniprot entry per line

3.1.
     the configuration file to be used is piana/code/execs/conf_files/protein_code_2_protein_code.piana_conf

     -> we have left input-id-type and output-id-type as blank so this configuration file can
        be used to translate between any types of protein identifiers
     -> in this file, we have written just one piana command: protein-code-2-protein-code
        (as always, look at this file to better understand how PIANA works. We have written the
        same command twice: once with format-mode txt and once with format-mode html)

3.2. execute PIANA to get the equivalences

       # go to the piana execution directory
       $> cd piana/code/execs

     ( before executing, find out the piana type names for 'uniprot accession numbers' and
       'uniprot entry names'

       -> you can get a list of all 'piana names for identifier types' by doing

            $piana/code/execs> python piana.py --print-reference-card --piana-dbname=pianaDB_limited --piana-dbhost=localhost

       -> in the case of uniprot accession numbers, the 'piana name' is 'uniacc'
       -> for uniprot entry identifiers, the 'piana name' is 'unientry' )

       $> python piana.py --piana-dbname=pianaDB_limited --piana-dbhost=localhost --configuration-file=conf_files/protein_code_2_protein_code.piana_conf --input-file=dummy_files/input/proteosome.uniprot_entries --input-id-type=unientry --output-id-type=uniacc --results-prefix=proteosome_translation

     -> this creates a file "proteosome_translation.protein-code-2-protein-code.unientry2uniacc.txt"
        looking like this:

-------------------------------
PSB2_YEAST P22141
PSA6_YEAST P21243 P15708
PSA1_YEAST P40302
PSB5_YEAST P30656
PSA2_YEAST P23639
PSB4_YEAST P30657
..............................
..............................
------------------------------

     If this operation (ie. translating from uniprot entry to accession) becomes routine in your
     work, you can create a new configuration file uniprot_entry2uniacc.piana_conf that sets
     most of the parameters in the configuration file itself.
     For example, if you are always going to translate from uniprot entries to uniprot accession
     numbers, in your configuration file you would change:

       input-id-type=blank   to   input-id-type=unientry
       output-id-type=blank  to   output-id-type=uniacc

     Then, if you always use the same piana database, in your configuration file you would change:

       piana-dbname=blank    to   piana-dbname=pianaDB_limited
       piana-dbhost=blank    to   piana-dbhost=localhost

     If these parameters are set in the configuration file, the command line only needs the
     configuration file, the input file and the results prefix.

     Remember that if you are using gene names, it is advisable to set input-proteins-species
     and output-proteins-species in the configuration file, to avoid using gene names that
     belong to a species different from the one being analyzed

********************************************************************
EXAMPLE 4 ==> Doing interaction predictions for proteins in an input file
********************************************************************

Situation: we want to get interaction predictions (eg expansion by COG) for proteins in
           piana/code/execs/dummy_files/input/liver_cancer_proteins.txt

--------------------------------------------------------------------
"expansion by COG" is a prediction based on interologs:

Each expand-interactions piana command does the following:

  For each protein in the network:

    1. find interactions of this protein in the current network
    2. find proteins in the database that share a certain characteristic with this protein (e.g. cog code)
    3.
       for each protein that shares that characteristic:
         - find interactions for protein that shares the characteristic in the database
         - find interactions for protein that shares the characteristic in the network
         - assign to protein being processed all interactions of protein that shares the characteristic
         - assign to protein that shares that characteristic all interactions of protein being processed

This process can be repeated more than once, to reach more distant deductions

For example, if the root protein is A, and if we know that C and D (yeast) interact,
and that A =cog= C and B =cog= D
( X =cog= Y means that X and Y have the same COG code)

  - simple expansion will predict that A interacts with D
  - double expansion will predict that A interacts with D and that A interacts with B
    (ie double expansion predicts interactions from a previous prediction)
    (this is achieved by executing two consecutive expand-interactions piana commands)

---------------------------------------------------------------------

For this example, we will use the user interface run_multiple_pianas.py instead of piana.py

  - Why? Because instead of building a complete network with all the proteins in the input
    file, we just build the network for one protein at a time and then do the predictions
    based on that network. This is faster and uses less memory, and the results are the same.

  - Attention! In run_multiple_pianas.py you cannot set hub-threshold in the configuration
    file: it must be set through the command line
    (refer to general_template.piana_conf if you do not know what hub-threshold is for.)
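The A, B, C, D example of expansion by COG described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the expand-interactions command: the function expand_once and the COG codes COG_X and COG_Y are hypothetical, and the sketch transfers interactions for every protein that has a COG code rather than only for proteins of a network built from a root.

```python
from collections import defaultdict


def expand_once(interactions, cog_of):
    """Return the input interactions plus those transferred between
    proteins that share a COG code (one expansion round).

    interactions: set of frozenset({protein_1, protein_2})
    cog_of: dict mapping protein -> COG code (hypothetical codes here)
    """
    # Group proteins by COG code
    members = defaultdict(set)
    for protein, cog in cog_of.items():
        members[cog].add(protein)

    # Index the current interaction partners of each protein
    partners = defaultdict(set)
    for pair in interactions:
        first, second = tuple(pair)
        partners[first].add(second)
        partners[second].add(first)

    # Give each protein the interactions of the proteins sharing its COG
    predicted = set(interactions)
    for protein, cog in cog_of.items():
        for twin in members[cog] - {protein}:
            for partner in partners[twin]:
                if partner != protein:
                    predicted.add(frozenset((protein, partner)))
    return predicted


# A =cog= C and B =cog= D; C and D are known to interact
cog_of = {"A": "COG_X", "C": "COG_X", "B": "COG_Y", "D": "COG_Y"}
known = {frozenset(("C", "D"))}

simple = expand_once(known, cog_of)   # one expansion
double = expand_once(simple, cog_of)  # expansion of the expansion

print(frozenset(("A", "D")) in simple)  # True
print(frozenset(("A", "B")) in simple)  # False
print(frozenset(("A", "B")) in double)  # True
```

Running the function twice mirrors executing two consecutive expand-interactions commands: the A-B interaction only appears in the second round because it is deduced from the A-D and B-C predictions of the first one.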
       $> cd piana/code/execs
       $> python run_multiple_pianas.py --input-file=dummy_files/input/liver_cancer_proteins.txt --input-id-type=uniacc --output-id-type=uniacc --piana-dbname=pianaDB_limited --piana-dbhost=localhost --results-prefix=liver_cancer_predictions --configuration-file=conf_files/get_double_cog_expansions.piana_conf --hub-threshold=0

  --> This will produce files 'protein_name'.liver_cancer_predictions.expand-interactions.cog.root
      where each line is a protein interaction prediction.

       $> ls -lh *.liver_cancer_predictions.*
--------------------------------------------------------------------------
 39K Jul 6 17:46 P04792.liver_cancer_predictions.one_protein_file.txt.liver_cancer_predictions.expand-interactions.cog_thres0.root
   0 Jul 6 17:41 P43304.liver_cancer_predictions.one_protein_file.txt.liver_cancer_predictions.expand-interactions.cog_thres0.roo
------------------------------------------------------------------------

For P43304 no predictions were made. For the other protein, you may use the predictions as
you wish or, if you prefer, you can insert these predictions into your piana database using
parser piana/code/dbParsers/expansionParser/expansion2piana.py by doing:

(Attention! These are real predictions made by PIANA, but they are not related in any way
to liver cancer)

       $> cd piana/code/dbParsers/expansionParser
       $> python expansion2piana.py --piana-dbname=pianaDB_limited --piana-dbhost=localhost --expansion-file=../../execs/P04792.liver_cancer_predictions.one_protein_file.txt.liver_cancer_predictions.expand-interactions.cog_thres0.root --num-expansions=2 --input-id-type=uniacc --verbose --database-name="expansion"

As you can see in the verbose output of the parser (if you enabled it), only 178 interactions
of 735 were inserted into pianaDB_limited, because most of the predictions were made between
proteins of different species.
If you add flag --no-species to the previous command, all interactions will be inserted into
the database regardless of the species of the proteins

--> if you now repeat example 2, you'll see that there are new interactions in the network,
    shown in orange.

--> Expansions introduced into a piana database using expansion2piana are labeled with the
    name you specify in the "database-name" parameter. In this case, it is 'expansion'.

    Predictions not added to the piana database will not appear in subsequent piana
    executions. However, you can add the predictions to the network and continue working
    with it, by setting argument exp-output-mode to add in command expand-interactions.
    For a detailed explanation of the options available when doing predictions, read the
    description of command expand-interactions in piana/code/execs/conf_files/general_template.piana_conf

--> when doing expansions that are going to be inserted into a PIANA database, we recommend
    using proteinPiana as the type of output identifier (ie. output-id-type=proteinPiana)

    Since the interactions are going to be inserted into the database, it is better not to do
    code translations between the two steps.

    In any case, never use geneName as output type! It will introduce a lot of noise in your
    predictions, because gene names are ambiguous even within a species.

--> You can also do predictions based on SCOP and InterPro codes (ie proteins of the same SCOP
    family will tend to interact with the same proteins). For those predictions, you can
    create a configuration file similar to conf_files/get_double_cog_expansions.piana_conf
    but changing the parameters as explained in command 'expand-interactions' of
    conf_files/general_template.piana_conf

--> We do not recommend doing predictions based on predictions: ie. we do not recommend
    executing command expand-interactions on networks that were built from a database with predictions.
    What we do in our lab is keep one piana database that only contains experimental data
    (DIP, MIPS, HPRD, BIND, ...) and another database with all interaction data (DIP, MIPS,
    HPRD, BIND, STRING, expansions, ....). Then, when we want to get predictions, we use the
    experimental database. The predictions made by PIANA are then inserted into the database
    that contains all interaction data. In this way, we avoid predictions that are based on
    predictions.

    -> having two separate piana databases is not strictly necessary, since PIANA allows you
       to choose which databases are used in each analysis through parameter list-source-dbs.
       But it is more convenient to separate experimental interactions from predictions, since
       introducing restrictions has a side effect: it slows down the creation of the network.
       Therefore, if you do not have disk space problems, it is easier to have two
       (synchronized) piana databases: one with only experimental interactions and the other
       one with all interactions.

***************************************************************
EXAMPLE 5 ==> Matching proteins in the network to spots
              in a 2D electrophoresis gel
***************************************************************

Situation: we have spot ids from a 2D electrophoresis gel, with their molecular weights (MW)
and isoelectric points (IP). Some of those spots were identified by mass spectrometry (that
was how we obtained the list of proteins in liver_cancer_proteins.txt) but other spots were
unassigned. We can use PIANA to identify some of those unassigned spots, by comparing the MW
and IP of the spots with the MW and IP of the proteins in the network. This is based on the
fact that it is very likely that the proteins in the 2D gel also appear in the network, since
the network has been built from the list of proteins of the gel that could be identified.
And, since all proteins in the gel are related, the proteins of the spots will probably
appear in the network.
For example, in the liver cancer experiment, only 4 proteins could be assigned by mass spectrometry to the 2D gel spots. Using PIANA, by comparing MW and IP of all spots in the 2D gel with MW and IP of all proteins in the network built from the root proteins (ie. those 4 proteins that could be identified by mass spectrometry), we can make some predictions on the correspondences between spots and proteins. Then, these predictions can be validated in the wet lab.

Note: This example won't work with the current PIANA database on the web, as the sequence IP information is not available.

5.0. we have a text file with spot ids, MW and IP, formatted as indicated in command 'match-proteins-to-spots' of general_template.piana_conf

--> format description as extracted from piana/code/execs/conf_files/general_template.piana_conf:

#
# - spots-file-name is a file name following the structure (one spot per line): spot_id<TAB>molecular_weight<TAB>isoelectric_point
# -> where decimals are expressed with "."

--> We will use a dummy file with spot ids, MW and IP for a 2D electrophoresis with proteins involved in liver cancer:
-> you can see this file in piana/code/execs/dummy_files/input/formatted_spots_liver_cancer.txt

5.1. we create a configuration file that builds a network from an input file, and then executes the command that matches proteins to spots
-> you can see what this file looks like in piana/code/execs/conf_files/match_proteins_to_spots.piana_conf

5.2.
execute piana with this configuration file and the command line parameters required (because we left them to blank in the configuration file) $> cd piana/code/execs $> python piana.py --configuration-file=conf_files/match_proteins_to_spots.piana_conf --piana-dbname=pianaDB_limited --piana-dbhost=localhost --input-file=dummy_files/input/liver_cancer_proteins.txt --input-id-type=uniacc --output-id-type=uniacc --results-prefix=matching_proteins_spot_cancer --depth=1 --spots-file-name=dummy_files/input/formatted_spots_liver_cancer.txt 5.3. the result of the command has been written to the results file 'matching_proteins_spot_cancer.match-proteins-to-spots.txt' (you can also see the results file in html format using a browser: 'matching_proteins_spot_cancer.match-proteins-to-spots.html') -> we have placed in piana/code/execs/dummy_files/output/ -> it looks something like this: $> more matching_proteins_spot_cancer.match-proteins-to-spots.txt -------------------------------------------------------------------------------- error level 6 (mw_error 0.1 - ip_error 0.1) spot_id 8201 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 1305 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 1306 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 1307 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 1301 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 1303 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 6904 matches protein P00488 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 6903 matches protein P00488 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 303 matches protein P04792 error level 7 (mw_error 0.2 - ip_error 0.2) spot_id 8104 matches protein P61771 error level 8 (mw_error 0.3 - ip_error 0.3) spot_id 1304 matches protein P04792 ................................................................................. 
.................................................................................
.................................................................................
---------------------------------------------------------------------------------

where mw_error 0.1 - ip_error 0.1 means allowing 10% error for MW and IP when searching for matches

Attention! Correspondences that appear in a given error level will not be shown in higher error levels. For example, "spot_id 8201 matches protein P04792" does not appear in error level 7, although a match found at 10% error obviously also holds at 20% error.

Attention! One spot can be assigned to several proteins, and vice versa. This just means that the MW and IP of the spots are within a short range of each other, and therefore several assignments can be made.

**************************************************************
EXAMPLE 6 ==> Clustering proteins by their molecular function
**************************************************************

Situation: we have a list of proteins for which we want to build the interaction network and then analyze their relationship in terms of molecular function.

==> to do this we are going to use configuration file piana/code/execs/conf_files/get_clustered_go_network.piana_conf.
- The parameters that tell the clustering when to stop are detailed in the configuration file.
- Depending on how specific or general you want the network to be, you can play with these parameters.

==> we are going to perform the clustering for proteins of the proteasome 'piana/code/execs/dummy_files/input/proteosome.uniprot_entries'

Attention! In order to do the clustering, you must have information on the distances between GO terms in your piana database (pianaDB_limited only has it for GO terms involved in this example). If you do not have GO information in your piana database, the clustering will have no criterion for grouping proteins.
Parsing GO takes a long time if you want to calculate the distances between all the GO terms. Therefore, if you do not have that time but you still want to do the clustering, there is the option of calculating the distances only between specific GO terms. -> How do you do that? Read piana/code/dbParsers/goParser/README.limiting_parsing_to_specific_gos (it has already been done for pianaDB_limited) 6.1 run piana with the configuration file described above: get_clustered_go_network.piana_conf $> cd piana/code/execs $> python piana.py --configuration-file=conf_files/get_clustered_go_network.piana_conf --input-file=dummy_files/input/proteosome.uniprot_entries --input-id-type=unientry --input-proteins-species=yeast --results-prefix=clustering_proteosome --piana-dbname=pianaDB_limited --piana-dbhost=localhost 6.2 visualize the clustered network $> neato -Tgif -o clustering_proteosome.0.2.molecular_function.1.min.3.cluster-by-go-terms.gif clustering_proteosome.0.2.molecular_function.1.min.3.cluster-by-go-terms $> xview clustering_proteosome.0.2.molecular_function.1.min.3.cluster-by-go-terms.gif (you can see the result of the clustering in piana/code/execs/dummy_files/output/clustering_proteosome.0.2.molecular_function.1.min.3.cluster-by-go-terms.gif ) ==> the interpretation of this network is not always straightforward... However, in some cases it is very helpful to visualize the network from this perspective. ==> the clustering can be also performed in terms of biological process and cellular location (using GO terms). - read description of command 'cluster-by-go-terms' in piana/code/execs/conf_files/general_template.piana_conf to learn more about changing the molecular_function to biological_process and other functionalities of the clustering - for this example, we have set level-threshold to 1, which has as a consequence that we have very general terms in the network ==> the clustering that is implemented right now in PIANA is far from optimal. 
We are working on it to make it faster and more relevant to biological problems. We are also working on providing the user with more information retrieved during the clustering, such as which proteins belong to each cluster. ******************************************************************* EXAMPLE 7 ==> Getting a flavor of (almost) all the formats in which PIANA can produce outputs ******************************************************************* This example is just for showing you the different formats in which PIANA can print results. --> using configuration file get_all_formats_summary_results.piana_conf --> using the dummy file piana/code/execs/dummy_files/input/liver_cancer_proteins.txt run: $> cd piana/code/execs $> python piana.py --configuration-file=conf_files/get_all_formats_summary_results.piana_conf --piana-dbname=pianaDB_limited --piana-dbhost=localhost --input-file=dummy_files/input/liver_cancer_proteins.txt --input-id-type=uniacc --input-proteins-species=all --results-prefix=trying_all_formats --output-id-type=uniacc Now, take a look to the files trying_all_formats.* trying_all_formats.all.print-all-prots-info.html --> complete info for all proteins in html format (use browser to visualize) trying_all_formats.compact.print-all-prots-info.html --> limited info for all proteins in an html table (use browser to visualize) trying_all_formats.all.print-connect-prots-info.html --> complete info for linker proteins in html format (use browser to visualize) trying_all_formats.compact.print-connect-prots-info.html --> limited info for linker proteins in html format (use browser to visualize) trying_all_formats.all.print-all-prots-info.txt --> complete info for all proteins in text format trying_all_formats.compact.print-all-prots-info.txt --> limited info for all proteins in text format trying_all_formats.all.print-connect-prots-info.txt --> complete info for linker proteins in text format trying_all_formats.compact.print-connect-prots-info.txt --> 
limited info for linker proteins in text format trying_all_formats.all.print-network.dot --> DOT file with all interactions in network (use neato to visualize) trying_all_formats.connecting.print-network.dot --> DOT file with interactions for root proteins and linker proteins (use neato to visualize) trying_all_formats.all_root.print-network.dot --> DOT file for interactions with at least one root involved (use neato to visualize) trying_all_formats.only_root.print-network.dot --> DOT file for interactions between root proteins (use neato to visualize) trying_all_formats.all.print-table.html --> html table with all interactions in the network (use browser to visualize) trying_all_formats.connecting.print-table.html --> html table with interactions for root and linker proteins (use browser to visualize) trying_all_formats.all_root.print-table.html --> html table with interactions with at least one root involved (use browser to visualize) trying_all_formats.only_root.print-table.html --> html table with interactions between root proteins (none in this case) trying_all_formats.all.print-table.txt --> text table with all interactions in the network trying_all_formats.connecting.print-table.txt --> text table with interactions for root and linker proteins trying_all_formats.all_root.print-table.txt --> text table with interactions with at least one root involved trying_all_formats.only_root.print-table.txt --> text table with interactions between root proteins (none in this case) For a full description of the information contained in these files, as well as which are the parameters needed for each kind of input, read commands descriptions in piana/code/execs/conf_files/general_template.piana_conf ***************************************************************** EXAMPLE 8 ==> Finally! A real example! 
*****************************************************************

Situation: All examples shown up to this point used lists of proteins unrelated to the problem we said we were studying... Now, let's look at a real example. We have used genes that mediate breast cancer metastasis to lung, discovered by the team of J. Massague and published in Nature some time ago:

Minn AJ, Gupta GP, Siegel PM, Bos PD, Shu W, Giri DD, Viale A, Olshen AB, Gerald WL, Massague J. Genes that mediate breast cancer metastasis to lung. Nature. 2005 Jul 28;436(7050):518-24.

The list of genes can be found in piana/projects/metastasis/data/metastasis_gene_names.txt

Starting from this list of genes (hereafter referred to as root proteins), we are going to do the following:
(genes are not proteins, of course, but we are going to work with their products: PIANA does it automatically for you)

8.1 - create configuration files for the different analyses that we want to perform:
(to create each configuration file, we follow the instructions in piana/code/execs/conf_files/general_template.piana_conf)

(a) -> print the interaction table to obtain a complete description of the interactions in which these root proteins are involved
-> highlighting proteins that contain a keyword in their description, name or function
-> the list of keywords we are going to use is: cancer:carcinoma:tumor:metastasis:apoptosis:death

(b) -> print the interaction network, highlighting proteins in the network that contain keywords related to cancer
-> the list of keywords we are going to use is: cancer:carcinoma:tumor:metastasis:apoptosis:death
-> this will highlight other proteins that interact with "metastasis proteins" and are known to be involved in the disease being studied

(c) -> print all the information associated with the proteins in the network
-> this file can be used to do manual searches for specific information we are interested in

(d) -> identify linkers, proteins that connect at least two root nodes
-> these linker
proteins must be looked at very carefully, since it is very likely that they are also involved in the mediation of breast cancer metastasis to lung.

(e) -> print a network with only experimental interactions, not taking into account the predictions by structural similarity
-> to do so, just change the parameter list-source-dbs in the configuration file
-> the network obtained will contain less information, but it will be more reliable than the network that also uses predictions

(f) -> predict new interactions for these genes using interologs
-> these predictions might be useful for better understanding the pathways related to the root proteins, since interactions of these gene products have been detected in orthologous proteins

===> The configuration file that executes commands (a) to (d) for metastasis_gene_names.txt is piana/code/execs/conf_files/metastasis.piana_conf
-> Read this configuration file to better understand how PIANA is going to perform the analyses

===> For interaction predictions (f) we are going to use the configuration file get_double_cog_expansions.piana_conf and the interface to PIANA 'run_multiple_pianas.py' (read explanation in example 4)

===> Just to show you another way of doing predictions and visualizing interactions, we have created the configuration file piana/code/execs/conf_files/metastasis_only_dip.piana_conf
-> This configuration file shows the process from just having a network with dip interactions (e) to adding predictions by interologs to the network and then printing out the network again

8.2 - execute PIANA with the configuration files detailed above

8.2.1 commands (a) to (d)
$> cd piana/code/execs
$> python piana.py --configuration-file=conf_files/metastasis.piana_conf --piana-dbname=pianaDB_limited --piana-dbhost=localhost

8.2.2 Now, get the PIANA predictions for these proteins: (f)
$> python run_multiple_pianas.py --input-file=../../projects/metastasis/data/metastasis_gene_names.txt
--input-id-type=geneName --output-id-type=geneName --piana-dbname=pianaDB_limited --piana-dbhost=localhost --results-prefix=metastasis_predictions --configuration-file=conf_files/get_double_cog_expansions.piana_conf --hub-threshold=0 8.2.3 print network with dip interactions (e), add interologs to the network and print the new network $> python piana.py --configuration-file=conf_files/metastasis_only_dip.piana_conf --piana-dbname=pianaDB_limited --piana-dbhost=localhost 8.3 - analyze the results 8.3.0 -> these are the files that contain the results: - from 8.2.1: metastasis_results.all.print-network.dot --> the network in DOT format for all interactions in the database metastasis_results.connecting.print-network.dot --> the network in DOT format for roots and linkers metastasis_results.all.print-table.html --> the table with all interactions (HTML format) metastasis_results.compact.print-all-prots-info.html --> information for all proteins in the network (HTML format) metastasis_results.compact.print-connect-prots-info.txt --> information about linker proteins (proteins that connect root nodes) metastasis_results.compact.print-connect-prots-info.html --> information about linker proteins (HTML format) - from 8.2.2: 'protein_name'.metastasis_predictions.expand-interactions.cog_thres0.root --> interaction predictions for protein_name - from 8.2.3: metastasis_results.only_dip.all.print-network.dot --> network in DOT format for DIP interactions in the database metastasis_results.only_dip.expanded_network.dot --> network in DOT format for DIP interactions and interologs predictions 8.3.1 -> visualize the networks: - network with all interactions in the database: $> neato -Tgif -o metastasis_results.all.gif metastasis_results.all.print-network.dot $> xview metastasis_results.all.gif Since it is quite a big network you might want to play with the parameters of the DOT file metastasis_results.all.print-network.dot - removing proteins and interactions that do not look 
interesting - removing overlap=scale and increasing the len of the edges to 10. PIANA has what we think are optimal DOT parameters, but in some cases, the user has to manually modify the DOT file to optimize the image for that particular case. In the network you can identify proteins that contain keywords (in red and orange). If you find interactions that are of particular interest, you can look at file metastasis_results.all.print-table.html for a more detailed description of the interaction - network for roots and linkers $> neato -Tgif -o metastasis_results.connecting.gif metastasis_results.connecting.print-network.dot $> xview metastasis_results.connecting.gif This network is very useful for looking at the root proteins that are connected directly or via another protein. In this case, you can see that there are many roots that are not connected to the others, and some others that belong to the same graph component. - network with only dip interactions $> neato -Tgif -o metastasis_results.only_dip.gif metastasis_results.only_dip.all.print-network.dot $> xview metastasis_results.only_dip.gif - network with dip interactions and interologs This network is too big to be visualized with the standard PIANA parameters for DOT files. Therefore, you must edit file metastasis_results.only_dip.expanded_network.dot and: - remove this from the header line: ', pack=true, overlap=scale' - do a 'replace all' of 'len=1' to 'len=4' Then, you can create the network image (although it is not that helpful, due to the large number of interactions that have been added when doing the prediction). $> neato -Tgif -o metastasis_results.only_dip.expanded_network.gif metastasis_results.only_dip.expanded_network.dot $> xview metastasis_results.only_dip.expanded_network.gif A more practical way to see the predictions would be to add the interactions to the database and then visualize the network for your root proteins. 
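The manual edit of the expanded DOT file described above (removing ', pack=true, overlap=scale' from the header and replacing 'len=1' with 'len=4') can also be scripted. The following is a minimal standalone sketch, not part of PIANA: the function name is hypothetical, and the sample DOT text is invented for illustration, since the exact header that PIANA writes may differ. A regular expression is used so that values such as len=10 are left untouched.

```python
# Illustrative helper, not part of PIANA. The function name and the
# sample DOT text are hypothetical.
import re


def relax_dot_layout(dot_text, new_len=4):
    """Apply the two manual DOT edits described in 8.3.1 to a DOT string."""
    # 1. remove the packing/overlap options from the header line
    text = dot_text.replace(", pack=true, overlap=scale", "")
    # 2. replace every len=1 with the new length, leaving len=10, len=1.5, ... alone
    text = re.sub(r"\blen=1(?![\d.])", "len=%d" % new_len, text)
    return text


# Invented sample, only to show the effect of the helper
sample_dot = (
    "graph G { graph [bgcolor=white, pack=true, overlap=scale];\n"
    "a -- b [len=1];\n"
    "c -- d [len=10];\n"
    "}\n"
)

print(relax_dot_layout(sample_dot))
```

To use it on a real file, read metastasis_results.only_dip.expanded_network.dot into a string, pass it through relax_dot_layout, write the result to a new file and run neato on that file.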
In the previous visualization, you were seeing all (double cog) predictions for root proteins, all (single cog) predictions for all proteins in the initial network, and interactions for other proteins in the database. If you add the predictions to your piana database and then visualize the network for your root proteins, you'll only see (double cog) predictions for your input proteins. See comments on 8.3.3.

--> color codes are described in PianaGlobals.py and in piana/docs/documentation/network_colors.gif

8.3.2 -> analyze linker proteins:

Proteins that connect root proteins to each other are probably also involved in the mediation of breast cancer metastasis to lung. PIANA identifies these linker proteins and produces an output that the biologist can analyze to try to detect functions or biological processes that might have a role in his/her problem of interest.

- open file metastasis_results.compact.print-connect-prots-info.html in your web browser to see the list of linker proteins. Linker proteins that have the cancer keywords in their name, description or function appear in red and underlined.
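To make the idea of a linker concrete, here is a minimal standalone sketch. It is not the PIANA implementation: the function name and the toy identifiers are hypothetical, and it assumes that a linker is a non-root protein that interacts directly with at least two different root proteins.

```python
# Illustrative sketch only, NOT the PIANA implementation.
# Assumption: a linker is a non-root protein with direct interactions
# to at least two different root proteins.
from collections import defaultdict


def find_linkers(interactions, roots):
    """Return a dict: linker protein -> sorted list of the roots it connects.

    interactions is an iterable of (protein_a, protein_b) pairs;
    roots is an iterable of root protein identifiers.
    """
    roots = set(roots)
    roots_touched = defaultdict(set)
    for protein_a, protein_b in interactions:
        if protein_a in roots and protein_b not in roots:
            roots_touched[protein_b].add(protein_a)
        if protein_b in roots and protein_a not in roots:
            roots_touched[protein_a].add(protein_b)
    return dict(
        (protein, sorted(connected))
        for protein, connected in roots_touched.items()
        if len(connected) >= 2
    )


# Hypothetical toy network: X links R1 and R2, Y only touches R3
toy_interactions = [("R1", "X"), ("X", "R2"), ("R3", "Y"), ("R1", "R2")]
toy_roots = ["R1", "R2", "R3"]

print(find_linkers(toy_interactions, toy_roots))
```

In this toy network, Y is not reported because it touches a single root, and the direct R1-R2 interaction does not turn either root into a linker.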
If you have created a local GO database (see README.populate_piana_db) you can also produce an html table describing the linkers with their GO terms (see comments on example 2.4.3): $> cd piana/code/evaluation/tests $> python parse_linkers.py --input-file=../../execs/metastasis_results.compact.print-connect-prots-info.txt --input-id-type=geneName --piana-dbname=pianaDB_limited --piana-dbhost=localhost --results-prefix=metastasis_linkers.go --output-format=html --print-go-info --go-dbname=goDB --go-dbhost=localhost --go-level=-1 --label-size=all The results of this command are printed to files: metastasis_linkers.go.dot --> network for roots and linkers using GO terms (use neato to visualize) metastasis_linkers.go.linkers_table.html --> html table with linkers and their GO terms ( one node of the metastasis_linkers.go.dot contains a lot of information and makes the GIF image difficult to visualize. You can edit metastasis_linkers.go.dot to remove irrelevant information from the nodes and then use neato again to create the new image ) 8.3.3 -> analyze predictions of interactions made by PIANA results files 'protein_name'.metastasis_predictions.expand-interactions.cog_thres0.root contain predictions of interactions for each root protein. You can insert these interactions into PIANA as explained in example 4, or just analyze these interactions separately. 
If you are going to insert predictions into your database, then instead of parsing each file separately with expansion2piana.py, you can merge all *.root files into a single file and do the parsing just once:

$> cat *.metastasis_predictions.expand-interactions.cog_thres0.root > all.metastasis_predictions.expand-interactions.cog_thres0
$> cd piana/code/dbParsers/expansionParser
$> python expansion2piana.py --piana-dbname=pianaDB_limited --piana-dbhost=localhost --expansion-file=../../execs/all.metastasis_predictions.expand-interactions.cog_thres0 --num-expansions=2 --code-type-name=geneName

Attention! all files generated by following this example can be seen in piana/projects/metastasis/results/ These are the results obtained when using interactions from DIP and from predictions based on sequence/structure distant patterns (these predictions are labeled internally as 'ori'). However, if you had populated your database with other interaction databases as described in README.populate_piana_db, the analysis would have been far more complete.

--> TODO!!!!
- example over/under expressed
- example special proteins
- explain command create-report
- example classify-network-proteins

**********************************************************
==> Other PIANA commands: read general_template.piana_conf
**********************************************************

Now that you are an expert in using PIANA, just by looking at conf_files/general_template.piana_conf you should be able to find out what other things can be done with PIANA. Take a look at all the piana commands listed in that file and decide which ones you want to use.

We have included a configuration file that tests most of the PIANA commands using dummy proteins.
You can take a look at it to see at work some PIANA commands we haven't used in the examples above: piana/code/execs/conf_files/test_all_commands.piana_conf

We have placed some comments in this configuration file to guide the user through the different PIANA possibilities. To execute it, you can do:

$piana/code/execs> python piana.py --configuration-file=conf_files/test_all_commands.piana_conf

Some things (apart from the ones you've seen in the examples above) that can be done using PIANA:

--> ignoring all unreliable interactions (using parameter ignore-unreliable)
--> doing the intersection of different protein interaction databases
--> using files with infra/over-expressed genes (eg. from a microarray experiment) to visualize in your network which proteins are infra/over expressed
--> building the protein interaction network for a given species
--> limiting the network to contain interactions that were detected by a given method (ie. y2h)
--> creating several networks using just one configuration file
--> avoiding adding to the network proteins that have too many interactions
--> getting a list of proteins that are at distance X from another protein
--> creating a network from a text file with interaction pairs, with no need to have the interactions in the database.
--> deleting interactions from the piana database using piana/code/dbModification/delete_interactions_from_db.py
--> for example, if you want to delete predictions made by expansion (ie. interactions labeled 'expansion'), you can do:
$> python2.3 delete_interactions_from_db.py --piana-dbname=pianaDB_limited --piana-dbhost=localhost --db-to-delete=expansion
--> attention! doing a direct delete over the PIANA database (ie. via sql commands) is very dangerous, because you might lose the correspondences between the different tables, or you might delete an interaction from expansion and, at the same time, delete the same interaction that was also present in another database...
CONCLUSION: use the script described above to delete interactions from the database (ie. never manipulate the interaction tables directly)

--> finding the shortest route between two given proteins: finds the minimal path that goes from one protein in the network to another protein in the network
--> Attention: this command requires changing your PIANA mode to 'advanced' or 'developer'. Read general_template.piana_conf for a complete description of this command
--> there is an example configuration file in conf_files/test_shortest_route.piana_conf
--> ................. ................ ................ ........

If there is something you would like to do with PIANA but you can't find a piana command for it, there are two possibilities:

a - send us an email (boliva at imim.es) explaining that 'something' (and wait until we do it, which can take some time...)
b - modify the code so PIANA does that 'something'. It is much easier than you might think: you do not need to know SQL or graph theory... just a little python and an attentive reading of PianaApi.py and piana.py

If you find bugs or have suggestions about PIANA, please send us an email to boliva at imim.es

If you develop code based on PIANA that you think might be useful to other people, please send us an email and we will include your code in the next release.

**********************************************************
==> Using PIANA as a framework? Programming based on PIANA
**********************************************************

PIANA has been designed so that it is easy to use as a library to develop your own protein interaction network code. You can use PIANA at different levels:

- as a user: examples shown above in this file

- as a library: use PianaApi methods
Take a look at piana/code/execs/piana.py It is basically a script that reads arguments and then makes calls to PianaApi with those arguments.
For more information read PianaApi documentation: piana/docs/documentation/pydoc_docs/PianaApi.html - as a developer of new tools: use PIANA classes and methods All the code developed for this project is an example on how to use PIANA. For example, the Clustering uses the class Graph to create a ClusterGraph class. For more information, read piana/README.piana_developers and piana documentation piana/docs/documentation/piana_documentation.html