--------------------- README.piana_tutorial ---------------------

Read this file and follow the instructions provided.


### =========================================================== ###
###                       INTRODUCTION                          ###
### =========================================================== ###

PIANA - Proteins Interactions And Network Analysis

PIANA is a framework for (i) integrating data from multiple sources
about proteins and their interactions; (ii) working with this data in
an easy-to-use manner; and (iii) performing automatic analyses of
protein interaction networks. It has been designed mainly for use in
batch mode, but it also features a simple user interface.

Main help files you should read to learn how to use PIANA:

  - piana/README.piana_tutorial
      -- describes PIANA and how to use it

  - piana/README.piana_installation
      -- explains how to install PIANA

  - piana/README.piana_examples
      -- examples of how to use PIANA

  - piana/code/execs/conf_files/general_template.piana_conf
      -- description of all piana commands and parameters. It also
         describes the format followed by PIANA outputs

  - PIANA reference card
      -- the reference card you should always have next to you (read
         section "PIANA reference card" to learn how to create it for
         your PIANA database)

  - our website: http://sbi.imim.es/piana
      -- check that you have the latest version of PIANA. You can also
         subscribe to our PIANA groups to receive an email when we
         release a new version.
      -- for any question that you cannot solve by reading the
         documentation, you can always ask at the PIANA discussion
         group

If you are 100% new to PIANA (i.e. you need to install it) read section
"Preparing PIANA". If you are already a user of PIANA, or you already
have access to a PIANA installation and/or PIANA database, you can skip
that section.

If you are one of those individuals who cannot wait to use a new toy,
read README.read_me_first and maybe you will be lucky enough to learn
how to use PIANA in just a few minutes...
then, you should try the examples in README.piana_examples


### =========================================================== ###
###                     PREPARING PIANA                         ###
### =========================================================== ###

The PIANA code must be installed somewhere in order to be used (no web
server services are currently provided). Apart from the code, you will
also need to either create and populate your own piana database or have
access to an external one (a team of people can have PIANA installed on
several computers but keep all data centralized in a single mysql
server). You can also use the database provided on our web page.

--> if you need to install PIANA on your computer, follow the steps in
    file piana/README.piana_installation

--> the piana database can be either on your computer or on a PIANA
    host (i.e. a machine with a mysql server):

    - if you will be using an external piana database, find out the
      name of the database you'll be using and the server where it is
      located

    - if you have to use your own piana database, follow the steps in
      file piana/README.populate_piana_db before you continue reading.
      You can also use pianaDB_limited, the database we provide along
      with the code. Read piana/README.pianaDB_limited for more info
      on this subject.


### =========================================================== ###
###                   PIANA REFERENCE CARD                      ###
### =========================================================== ###

On many occasions in this text and in other README files, you will be
referred to the PIANA reference card. This card contains (almost) all
the information you will need to have at hand when you become an expert
on PIANA, as well as other useful information that will help you speed
up your PIANA usage.

For example, as you might already know, PIANA accepts multiple protein
identifier types for your input and output. PIANA can, for instance,
receive gene names as input and produce the output using NCBI GeneID.
Although this is transparent to you, there is one thing you must know
when asking PIANA to do the work: how to tell PIANA that your input
identifiers are gene names. And... how do you do that? Usually, by
setting the command line argument --input-id-type=geneName, or by
setting the parameter input-id-type in your configuration file. The
same goes for your output: --output-id-type=geneid. How did I know that
PIANA calls NCBI GeneID geneid? I read the PIANA reference card: among
other information, it tells you the PIANA name for each type of
identifier that it accepts.

I hope you are convinced that you need to print the PIANA reference
card before you continue reading this tutorial. To do so:

  $> cd piana/code/execs
  $> python piana.py --piana-dbname=your_pianaDB --piana-dbhost=your_host --print-reference-card > PIANA_reference_card.txt


### =========================================================== ###
###                  PIANA MAIN COMPONENTS                      ###
### =========================================================== ###

How does PIANA work? This is the standard procedure for using PIANA:

  1. Create a configuration file as described in the template for
     configuration files:
     piana/code/execs/conf_files/general_template.piana_conf
     (there are several examples of PIANA configuration files under
     directory piana/code/execs/conf_files)

  2. Run piana:

     $> python piana.py --configuration-file=your_configuration_file

     (you can also set some parameters through the command line, see
     below for more details)

  3. Analyze your results

The following is a short explanation of the different modules of PIANA
(if you can't wait to use piana, README.piana_examples is a
step-by-step guide for using (and learning) PIANA):

-------------
- piana.py  -
-------------

To use PIANA as a tool, you must call piana.py with the command line
arguments that tell PIANA what you want it to do.
  - piana.py is found under directory piana/code/execs

  - piana.py has two execution modes, interactive and batch

  - in interactive mode, the user is asked for the parameters
    interactively (some specific parameters are not asked for and, if
    no configuration file is provided, defaults are used)

  - in batch mode, all parameters and commands are written in a
    configuration file.

  - in interactive execution mode, you can set the parameters through
    a configuration file, but not the commands

  - use piana/code/execs/conf_files/general_template.piana_conf to
    create your own configuration file

  - see examples of configuration files under
    piana/code/execs/conf_files/

  - the main parameters can also be set on the command line (instead of
    in the configuration file) for quick and easy changes of parameter
    values. Some specific parameters cannot be set through the command
    line (e.g. list-source-methods). A command line argument overrides
    whichever value was set in the configuration file.
    This is useful because you can have a configuration file with most
    parameters set to default values and then tune your PIANA execution
    by overriding the defaults with new values set on the command line

  - to get a list of command line options (remember that you also have
    the PIANA reference card):

    $> python piana.py --help

  - typical piana actions are described in
    piana/code/execs/README.piana_examples

  - one typical example would be:

    # go to the piana execution directory
    $> cd piana/code/execs

    # create a file with one protein per line
    $> cat > example_protein.txt
    BAXA_HUMAN
    (Ctrl-D to exit the cat command)

    # execute piana with the configuration file to get
    # network and table
    $> python piana.py --input-file=example_protein.txt \
         --piana-dbname=pianaDB_limited --piana-dbhost=localhost \
         --depth=1 --input-id-type=unientry --output-id-type=unientry \
         --results-prefix=example_results \
         --configuration-file=conf_files/get_example_results.piana_conf

    # visualize the network
    $> neato -Tgif -o example_results.gif example_results.all.print-network
    $> xview example_results.gif

    If you take a look at the configuration file that was used
    (conf_files/get_example_results.piana_conf) you will see why you
    obtained those results files (i.e. you executed several different
    commands). Read file
    piana/code/execs/conf_files/general_template.piana_conf for a full
    explanation of each command and a description of the format
    followed by the different PIANA outputs.
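
When the same run has to be launched from a script, the command line of
the example above can be assembled programmatically. The following is
only a sketch: build_piana_command is a hypothetical helper written for
this tutorial (it is not part of PIANA), it is written for a current
Python 3 interpreter, and it only uses the options that appear in the
example above.

```python
import shlex


def build_piana_command(config_file, overrides, python_exe="python"):
    """Return the argument list for one piana.py batch run.

    config_file -- path to a .piana_conf file
    overrides   -- dict {parameter name: value}; each entry becomes a
                   --name=value argument, which takes precedence over
                   the value found in the configuration file
    """
    argv = [python_exe, "piana.py", "--configuration-file=" + config_file]
    for name in sorted(overrides):  # sorted only to get a stable order
        argv.append("--%s=%s" % (name, overrides[name]))
    return argv


command = build_piana_command(
    "conf_files/get_example_results.piana_conf",
    {
        "input-file": "example_protein.txt",
        "piana-dbname": "pianaDB_limited",
        "piana-dbhost": "localhost",
        "depth": 1,
        "input-id-type": "unientry",
        "output-id-type": "unientry",
        "results-prefix": "example_results",
    },
)

# Show the command as it would be typed in a shell
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be handed to subprocess.call(command,
cwd="piana/code/execs") to actually run PIANA, provided that the
interpreter named in python_exe is the one required by your PIANA
installation.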
Note:

  - piana.py works with all proteins provided in the input file at the
    same time: the network is built from all proteins, by retrieving
    known interactions for each of the proteins in your input file

There is another way of running PIANA that lets you perform multiple
piana runs with just one command, whether you want to run piana.py for
all protein files in a given directory or you want to create an
independent network for each protein in an input file:
run_multiple_pianas.py

  - run_multiple_pianas.py has two modes:

    mode 1) you set an input_file_name, a separate network is built for
            each protein in the file, and your commands are executed on
            that network. When PIANA finishes processing that network
            and writing results, it goes to the next protein and
            repeats the same operations.

            --> PIANA works separately with each protein: one network
                is built at a time for each protein, and analyses are
                done on this individual network. In cases where you do
                not need the network to contain all your input proteins
                (e.g. predicting interactions) it is faster to use
                run_piana_protein_by_protein.py

    mode 2) you set an input_dir that contains files with proteins (all
            of them must use the same type of identifier) and piana.py
            is run for each of those files independently. This is
            useful when you want to apply the same commands (with the
            same parameters) to a number of input protein files

File README.piana_examples has more examples of how to use piana.py and
run_multiple_pianas.py.


-----------------------------------------------------------
- PIANA databases contain all the information PIANA needs -
-----------------------------------------------------------

  - If you are a PIANA user not very interested in how it works, you
    can safely skip this section. However, you will be even safer if
    you read it: there are some details that might be important when
    deciding how you are going to use PIANA.
  - PIANA databases follow the structure described in
    piana/code/dbCreation/create_piana_tables.sql

  - each time you work with PIANA, you must specify which PIANA
    database it should use by setting the parameters piana-dbname and
    piana-dbhost and, only if your mysql server requires
    authentication, piana-dbuser and piana-dbpass

  - the results PIANA produces are those extracted from the database
    you told PIANA to work with. If you use a dummy database, the
    results will also be dummy. If you use a good database, the results
    will be good. To create your own PIANA database and populate it
    with information from UniProt, NCBI, IntAct, DIP, etc., read
    README.populate_piana_db

  - PIANA uses a long integer to uniquely identify each protein. This
    identifier is called proteinPiana. Each identifier corresponds to
    one unique combination of protein sequence and protein taxonomy id
    (i.e. different [sequence, tax id] = different proteinPiana).
    Therefore, if two proteins in a given species have the same
    sequence, they are considered to be the same protein, i.e. they
    have the same proteinPiana. We use tax ids from NCBI.

    From the PIANA user perspective, this is transparent: you can work
    with PIANA for years and never read a proteinPiana in your output
    files, since PIANA identifies proteins using your preferred type of
    identifier (e.g. NCBI GeneID)

    Many tables in the PIANA database contain correspondences between
    proteinPianas and third-party identifiers. For example, table
    geneName contains correspondences between proteinPianas and gene
    names. PIANA uses these tables to translate between identifiers.
    All operations done internally by PIANA use proteinPiana as the
    identifier, and only at the last step does it translate the
    identifier to the type requested by the user.
    The types of identifiers that PIANA accepts and the names PIANA
    uses to refer to them can be read on the PIANA reference card (see
    section "PIANA reference card" on how to print it).

    Attention! proteinPiana identifiers are not maintained across PIANA
    databases (unless you have synchronized DBs, see below): you cannot
    use proteinPiana as an identifier for your protein of interest if
    you are going to use several PIANA databases. Even if you only use
    one database, you will update it some day, and your proteinPianas
    will not be consistent between the updated version and the one you
    are using now. For example, MOT1_YEAST might be 11111 in one piana
    database and 22222 in the next version created: it all depends on
    the order in which the proteins are inserted. However, there are
    some cases in which it can be useful to keep your results using
    proteinPianas (e.g. predicting interactions), but keep in mind that
    they are only valid for that specific database (or a synchronized
    one).

    When analysing your results, you might face the following problem:
    two proteins that in reality are "the same" appear in the results
    as two separate entities. This occurs when two similar sequences
    were obtained from two different databases and there was no
    cross-reference between them. Different techniques have been
    implemented to avoid this problem (see section "PIANA and protein
    identifiers") but it is not 100% solved, because it is impossible
    to create a perfect database from far-from-perfect databases (and,
    let's admit it: no single biological database is perfect).

  - if you have access to a pianaDB (i.e.
    somebody else already created it in your lab) then all you need is
    access to the machine where it is installed (and, if required, a
    mysql user name and password)

  - if you don't have access to a pianaDB, you will either need to
    create one (read README.populate_piana_db for that) or use the one
    we provide on our website (read README.pianaDB_limited for that)

  - Although PIANA has parameters that let you set which third-party
    databases are used for building the networks, these restrictions
    slow down the program. To avoid introducing restrictions (e.g. use
    only interactions detected experimentally), on our PIANA MySQL
    server we keep one piana database that only contains experimental
    data (DIP, MIPS, HPRD, BIND, ...) and another database with all
    interaction data (DIP, MIPS, HPRD, BIND, STRING, interologs,
    predictions from sequence/structure, ...). Then, each user chooses
    which database to use - only experimental or all interactions -
    just by setting the corresponding database in parameter
    piana-dbname

    Our advice is that you keep these two separate DBs (only
    experimental, and experimental + predictions) synchronized. This is
    very easy to do, and can be very useful when building your networks
    (for example, you can use parameter use-secondary-db in your piana
    configuration files).

    How do you keep two PIANA databases synchronized? Simply start
    creating a database as indicated in README.populate_piana_db. When
    you reach the point where there are interactions you do not want in
    your primary PIANA database, do a mysqldump of the database and
    then use that dump to create a new PIANA database. For example,
    your primary PIANA database can be called pianaDB_experimental; you
    do a mysqldump of it (see the documentation on the MySQL website)
    and then use that dump to create pianaDB_all_ints. Then, you simply
    continue parsing the interaction databases you wish into
    pianaDB_all_ints.
    Therefore, pianaDB_experimental and pianaDB_all_ints will use the
    same proteinPianas and will contain the same protein information.
    The only difference will be in the number of interactions they
    contain, pianaDB_experimental being a subset of pianaDB_all_ints.

    Nevertheless, this is not required, since you can always restrict
    your network to contain specific interactions by setting the
    appropriate parameters (i.e. list-source-dbs and inverse-dbs) in
    the piana configuration file for your experiment.

  - if you want to add information to a PIANA database, you can use the
    parsers described in README.populate_piana_db or create your own
    parsers.

  - what is parameter use-secondary-db in the configuration files?

    As described above, in our lab we maintain two different PIANA
    databases: one with experimental interactions and another one with
    experimental and predicted interactions. There are cases in which
    we want to retrieve interactions for a group of proteins,
    restricting those interactions to the ones obtained by experimental
    methods. But, when no interactions can be found for a given protein
    using the primary PIANA database (i.e. the one set in
    piana-dbname), sometimes we want to add predictions automatically
    to the network, so that at least something is reported for that
    protein. Parameter use-secondary-db is used in these cases: it
    tells PIANA to use the secondary PIANA database for proteins for
    which no interactions can be found in the primary database.

    Apart from setting use-secondary-db to yes in your configuration
    file, you will need to update file
    piana/code/utilities/piana_configuration_parameters.py with the
    name, host, user and pass of your secondary PIANA database.
    More information on the secondary db can be found in file
    piana/code/execs/conf_files/general_template.piana_conf


---------------------------------------------------------------
- interface to pianaDB_* : piana/code/PianaDB/PianaDBaccess.py
---------------------------------------------------------------

  - methods and classes used to access information in pianaDB

  - you don't need to use this component unless you are developing
    piana code or accessing the PIANA database from your own programs.

  - full documentation of PianaDBaccess can be found at:
    piana/docs/documentation/pydoc_docs/PianaDBaccess.html

  - if you want to use PIANA for developing your own code, please read
    README.piana_developers


-------------------------------------------------------------
- parsers for third party databases - piana/code/dbParsers/*
-------------------------------------------------------------

  - PIANA databases are populated with data from third party databases
    (e.g. UNIPROT, DIP, BIND, ...). In order to populate the PIANA
    database, we provide a number of parsers for these external
    databases.
  - all parsers can be found under piana/code/dbParsers

  - to learn how these parsers are used, read
    piana/README.populate_piana_db

  - if you want to develop your own piana parsers, please read section
    "PIANA parsers" of README.piana_developers


----------------------------------------------
- Graph Management tools - piana/code/Graph/*
----------------------------------------------

  - classes and methods used to manage networks

  - you don't need to use this component unless you are developing
    piana code

  - full documentation of classes Graph, PianaGraph and others can be
    found at: piana/docs/documentation/piana_documentation.html

  - if you want to use the graph library of PIANA, please read
    README.piana_developers


-------------------------------------------------
- PIANA library - piana/code/PianaApi/PianaApi.py
-------------------------------------------------

  - PianaApi.py is the module you have to import from your python
    script if you want to use PIANA directly from your code. PianaApi
    has all the methods related to creating, analyzing and working with
    PIANA networks. In fact, piana.py is just a user interface to
    PianaApi.

  - All PianaApi methods are documented in
    piana/docs/documentation/PianaApi.html


### =========================================================== ###
###                       USING PIANA                           ###
### =========================================================== ###

Once you have installed PIANA and you know which piana database you
will be using, you are ready to start using PIANA.

First of all, we recommend reading this whole help file
(piana/README.piana_tutorial). Then you can read some examples
(piana/README.piana_examples) of different cases where piana has been
used.
If you are going to use piana in its interactive mode (which provides
the same functionality as the batch mode, with the difference that
commands have to be executed manually one by one and that some
parameters cannot be set) you can try it right away by doing:

  $piana/code/execs> python2.3 piana.py

PIANA will ask you for some information needed for execution (database
and mysql server) and will then show you a menu with all execution
options. You should start by building a network using commands
add-protein, add-proteins-file or add-interactions-file. Alternatively,
if you gave an argument --input-file on the command line, the network
will be built automatically before the menu is presented. In our lab,
we never use PIANA in the interactive mode and, although every effort
is made to ensure it works, we all know that things that are not
frequently used are more likely to contain errors.

If you are going to use PIANA in its batch mode (which is the mode for
which piana has been designed; we strongly suggest that you use this
mode) then read README.piana_examples to learn more about it.

For a complete description of all piana commands and parameters, and to
interpret the outputs of PIANA commands, please read the descriptions
given in piana/code/execs/conf_files/general_template.piana_conf


PIANA types of users
--------------------

PIANA has three types of users: "developer", "advanced_user" and
"simple_user".

Most users will be "simple_user". Unless you are going to use PIANA for
analyzing the topology of your network (connectivity, adjacency matrix,
...), you shouldn't worry about the types of users: the default version
of PIANA is set to "simple_user". If you believe you are not a
"simple_user", you can modify your profile in file
piana/code/utilities/piana_configuration_parameters.py

Types of users have been introduced so that standard users of PIANA do
not need to install all the external modules required for doing more
complicated network operations.
If you change your profile to "developer" or "advanced_user", you will
also need to meet the extra requirements listed in
README.piana_requirements for non-standard users of PIANA.

This section might not be the best place to mention the following, but
I'll do it anyway: PIANA is not very good at analyzing interaction
networks from a mathematical network perspective: we currently do not
provide many internal methods that calculate things such as clustering
coefficients, betweenness or scale-freeness. However, if you are
interested in keeping an integrated repository of interactions coming
from multiple sources, or you want to perform biological analyses of
your networks, then PIANA is the perfect tool for you (and I might add,
the best on the market ;-) )


PIANA memory usage
------------------

PIANA works in connection with a MySQL database. To make PIANA faster,
most of the information that has been queried is stored in memory in
order to avoid repeated database queries. However, this can be a
problem in large networks, as the memory fills up quickly. To be able
to manage large networks without memory problems, a parameter is
specified in configuration files: "memory-usage".

The parameter memory-usage can take the values:

  - "high": all information of the network is stored in memory.
            Information is only retrieved from the database when
            needed, but then it is stored in memory. It is slower to
            build the network, but it is faster when information is
            printed more than once.

  - "low":  uses little memory, as all the information is retrieved
            from the database when needed and is never stored in
            memory. It is faster to build the network, but it is slower
            to print and to post-process the network.

The default value for this parameter is "high". We recommend using
memory-usage=high. If the network is too big and you have memory
problems, then set memory-usage to "low".
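
The trade-off between the two values can be pictured with a small
sketch. This is not the PIANA implementation: NodeInfoStore and
fake_query are hypothetical names invented for this illustration, the
code is written for Python 3, and the database is replaced by a plain
function that counts how many times it is called.

```python
class NodeInfoStore:
    """Look up protein information, caching it or not depending on the
    memory usage mode ("high" = keep results in memory, "low" = always
    go back to the database)."""

    def __init__(self, query_db, memory_usage="high"):
        if memory_usage not in ("high", "low"):
            raise ValueError("memory_usage must be 'high' or 'low'")
        self._query_db = query_db        # function: proteinPiana -> info
        self._memory_usage = memory_usage
        self._cache = {}

    def get(self, protein_piana):
        if self._memory_usage == "low":
            # never stored: every request costs one database query
            return self._query_db(protein_piana)
        if protein_piana not in self._cache:
            # first request: query once and remember the answer
            self._cache[protein_piana] = self._query_db(protein_piana)
        return self._cache[protein_piana]


calls = []


def fake_query(protein_piana):
    """Stand-in for a database query; records each call."""
    calls.append(protein_piana)
    return {"proteinPiana": protein_piana}


high = NodeInfoStore(fake_query, "high")
high.get(11111)
high.get(11111)
high_queries = len(calls)    # the second request was served from memory

del calls[:]
low = NodeInfoStore(fake_query, "low")
low.get(11111)
low.get(11111)
low_queries = len(calls)     # both requests went to the "database"

print(high_queries, low_queries)
```

In "high" mode the dictionary grows with the network, which is exactly
what causes the memory problems described above for large networks.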
See the comments on Graph for more info on the different memory usages
available in PIANA


PIANA protein identifier types
------------------------------

Something important when using PIANA is the type of protein identifier
you use. We all know how messy the world of "protein naming" is, and
although PIANA tries to simplify this by providing a single interface
for all codes, there is still something you must do: tell PIANA which
type of identifier you want to use each time you run it.

The types of identifiers that PIANA accepts can be seen on the PIANA
reference card (to print the PIANA reference card, read section "PIANA
reference card" above).

For example, if you are using Uniprot Accessions to keep your lists of
proteins, then your id type is Uniprot Accession (which in PIANA is
referred to as 'uniacc'). Therefore, when running PIANA you will do
something like:

  $> python piana.py --input-id-type=uniacc --configuration-file=...

Most PIANA names for identifier types are self-explanatory, and you can
read their description on the PIANA reference card.

One important remark is that when you give PDB identifiers to PIANA,
you must always specify the chain of the PDB code, respecting the
format "pdbfile.chain" (for example 1b5n.a)


PIANA and protein identifiers (i.e. the name given to a protein)
----------------------------------------------------------------

Protein naming is one of the main problems faced by biologists and
bioinformaticians. Databases are not consistent, there are thousands of
name conflicts, the same sequence can have many names associated with
it and, vice versa, the same name can refer to different sequences.
PIANA tries to alleviate this problem by doing the following:

  - the user can use any type of protein identifier for his/her input
    proteins.
    The only requirement is to tell PIANA which type of identifier
    he/she is using

  - the user can set the type of identifier he/she wants to use for
    outputting information

  - PIANA uses as its internal identifier what we call a proteinPiana
    (an integer). Each proteinPiana is linked to a pair [sequence, tax
    id], so there is a unique identifier for each protein. Two proteins
    with the same sequence from different species will have different
    proteinPianas. Read more about proteinPianas in section "PIANA
    databases" of this README file.

  - PIANA has thousands of cross-references between third-party
    databases (obtained from several repositories) and all third-party
    identifiers are linked to a proteinPiana. Therefore, in any PIANA
    execution the process followed is: 1) user identifiers are
    'translated' to proteinPianas; 2) all network operations are
    performed; 3) proteinPianas are translated to the type of
    identifier requested by the user (and the final analysis is done
    based on the protein interaction network for the user identifiers)

  - PIANA uses a unique protein name in the output: PIANA "guarantees"
    that each protein name used in the output refers to a different
    [sequence, tax id] and that sequences that are in fact "the same"
    protein appear as a single node of the network. Due to the problems
    mentioned before, one might find that PIANA has considered two
    proteins to be the same in cases where this was not true, but there
    is little we can do about this. Furthermore, if the user gave a
    list of proteins to build the network with, PIANA will give
    preference to those names over other names found in the database
    for that protein. This is achieved by doing a 'name unification',
    by which all proteinPianas in the network that share at least one
    external id (e.g. gene name) are linked to the same protein name.
    The type of external id is determined by the user through the
    parameter output-id-type in his/her piana configuration file.
    If you want to learn more about this unification, you'll have to
    look into PianaGraph.py, mainly the comments on method
    _create_unified_network

  - Advice: gene names are by far the worst type of protein code that
    exists... if you use gene names, expect problems with the names
    when interpreting the results... Moreover, since PIANA joins in a
    single "node" those proteins that have the same gene name, you
    might find that many of your proteins are placed in the same node
    (for example, if all proteins belonging to the same complex have
    the same gene name). Whenever possible, PIANA gives the official
    gene name to a protein and disregards other gene names that are
    also related to that protein. However, many wet lab biologists do
    not use official gene names in their daily work, and that's why
    PIANA is also prepared to accept those names as input. To reduce
    gene name ambiguity as much as possible, PIANA uses the species
    given by the user to limit the gene names and their associated
    sequences (this is done with parameters input-proteins-species and
    output-proteins-species)

  - To get a detailed description of the names assigned to a protein in
    the databases and the name that PIANA has chosen for a given
    protein, read the results of commands print-*-prots-info in compact
    mode. These results files produced by PIANA give information on all
    the protein identifiers and sequences that are associated with the
    protein name used in the output.

  - Attention! proteinPiana identifiers are not maintained across piana
    databases: you cannot use proteinPiana as an identifier for your
    protein of interest. MOT1_HUMAN might be 11111 in one piana
    database and 22222 in the next version created: it all depends on
    the order in which the proteins are inserted. Read more about this
    in the section on PIANA databases (above).

  - When analysing your results, you might face the following problem:
    two proteins that in reality are "the same" appear in the results
    as two separate entities.
    This occurs when two similar sequences were obtained from two
    different databases and there was no cross-reference between them.
    Different techniques have been implemented to avoid this problem,
    but it is not 100% solved.


Setting which interaction databases to use
------------------------------------------

PIANA databases can contain information from many external interaction
databases. To let the user choose which interactions he/she wants to
use, there are two options: introduce restrictions on which source
databases to use (e.g. DIP and mips) and introduce restrictions on
which detection methods are accepted (e.g. y2h and tap). A user can
also set both to all, which uses all interactions in the PIANA
database.

This is achieved with parameters list-source-dbs and
list-source-methods in your PIANA configuration file. Moreover, if you
want to use all interactions except those coming from a specific
database or method, you can set inverse-dbs (or inverse-methods) to
yes, which will exclude from the network any interactions coming from
list-source-dbs (or list-source-methods). Read more about these
parameters in piana/code/execs/conf_files/general_template.piana_conf

When list-source-dbs (or list-source-methods) is all, inverse-dbs (or
inverse-methods) is not taken into account


PIANA results and output formats
--------------------------------

PIANA results are written to files that you can then read/process to
make your analysis. Each of the results files will have the results
prefix you set in the configuration file, and a file extension with the
name of the command that originated those results.
For example, if you set 'test' as results prefix and you execute
commands print-network (with format-mode dot) and print-table (with
format-mode txt), once PIANA has finished there will be two results
files:

  - test.print-network.dot (a DOT file that can be used to create an
    image of the network)

  - test.print-table.txt (where each line is one of the interactions
    found by PIANA).

Other outputs are HTML tables with the interactions (which can be
viewed by opening the local file in your favourite web browser) and SIF
files, which can be imported into programs such as Cytoscape for more
detailed visualization.

PIANA DOT result files have to be converted into an image using a
program that 'reads' DOT files (e.g. neato). To learn how to create
images of networks from DOT files, please read file
piana/code/execs/README.visualize_piana_network. To learn how to
visualize SIF files using cytoscape, please refer to the cytoscape web
page http://www.cytoscape.org

PIANA txt result files can be viewed in any text editor or parsed with
your own scripts for further processing. The format of each txt result
file is described in the documentation of the command that originated
that file. All commands are described in detail in
piana/code/execs/conf_files/general_template.piana_conf.


PIANA abbreviations (for methods, source databases, ...)
--------------------------------------------------------

PIANA uses abbreviations for describing the information associated with
interactions, for example the detection methods for which PIANA has
interactions. If you need to know the complete name for these
abbreviations, you should read the PIANA reference card.

  - Database abbreviations are set at the time of parsing by the PIANA
    administrator (they are inserted automatically when using the
    parsers we provide).

  - Method abbreviations are set in the python dictionary method_names
    of PianaGlobals.py.
    Each element of this dictionary is as follows:
    __check__ have we changed the format of this dictionary?

        "method_abbreviation" : [ name1 for method, name2 for method, ................ ]

    If you are writing your own parser, you must make sure before inserting your interactions that your method name appears in this dictionary. Otherwise, the method name for your interactions will be 'unknown', as PIANA won't know which abbreviation to use for your method name.

PIANA internals
---------------

Maybe you are wondering how PIANA works. In that case, this section is for you. If you just want to use PIANA for creating and analyzing networks and do not really care about the internals, you can skip this section.

1. proteinPianas

Always keep in mind that the unique identifier for PIANA is a proteinPiana. There is a distinct proteinPiana for each combination of (sequence, taxonomy id). Therefore, if there are two different proteins in the same species that have the same sequence, they will have the same proteinPiana. I don't know if this is entirely correct (I am a computer scientist) but I would guess that if there are two identical sequences in the same species, they might be considered the same protein, even if they are located in different organelles or perform different functions.

2. PIANA networks only know proteinPianas

Although no one uses proteinPianas as a protein identifier (except for me, but I am supposed to know what I am doing) it is important to understand that PIANA networks always identify nodes using proteinPianas. Other types of identifiers you might use (eg. uniprot entries) are preprocessed before creating the networks and before presenting the results. Therefore, when you ask PIANA to build a network, the first thing it does is 'translate' your code into proteinPianas.
Due to the wonderfully standardized world of protein identifiers, you will find all kinds of situations: one uniprot entry that has multiple proteinPianas, one proteinPiana that corresponds to multiple uniprot entries, new uniprot entries that are not known by PIANA (because your database has not been updated), etc. Therefore, even if you give PIANA just one code for building the network, PIANA can internally represent that code as two different nodes (that might have different interactions).

When you ask PIANA to give you results in a particular type of identifier, the opposite process occurs: PIANA 'translates' proteinPianas into your type of code. In that translation process, many things can happen as well: what were 5 different nodes in the PIANA network might become a single node in the output when using your type of identifier. This single node will 'inherit' the interactions and characteristics of the five nodes that were fused to form it. See section "PIANA and protein names" for more info on this.

3. Databases are full of errors

Protein and interaction databases are full of errors, typos, strange characters, wrong labels, etc. Since PIANA parses these databases, PIANA databases are also full of errors. And, let's be realistic: no parser is perfect, so you should expect PIANA to contain more errors than the databases it uses to populate its own database. Some things are corrected when parsing, but it is almost impossible to do it perfectly... That said, we think our parsers are quite good. Until the next release of a third-party database, of course... Because, let me guess: they are going to change (again) the format of the database, and we will have to develop a new (or modified) parser to read the information.

4. When in doubt, look at the MySQL database

When you see strange data or something you do not trust in PIANA results, it is always good to know a little mysql and be able to query the PIANA database.
If you go to the machine where the database is placed, run

    $> mysql

and then

    mysql> use name_of_your_pianaDB

you can then list the different tables

    mysql> show tables

or ask for the description of each table

    mysql> desc table_name

A few "select" commands can help you understand why PIANA gave you a specific result that you do not understand...

5. Third party databases change their formats

Everyone working with biological databases knows that formats change from one version to another without any notice. Moreover, some fields in the databases might be entered incorrectly, and then the PIANA parsers that worked well before do not work anymore. You can try correcting the error yourself (and send us the correction, please) or wait until somebody else does it for you. But this is going to happen from time to time...

### =========================================================== ###
### WHAT CAN I DO WITH PIANA ###
### =========================================================== ###

Before going into the specific details of things you can do with PIANA, it is important that you understand the "concept of using PIANA". Basically, it is the following (when you are using PIANA as a user, not as a developer):

1 - you've got a list of proteins that are of interest to you (hereafter referred to as root proteins)
    -> that you have obtained from mass spectrometry
    -> that your boss has asked you to study
    -> that you've read in an article are important to a certain disease
    -> that belong to a pathway you are studying
    -> .....

2 - one way of studying a list of proteins is by analysing their interactions (ie their protein interaction network)
    -> for example, if all of your root proteins appear connected in the network, that means that you are looking at proteins that are closely related
    -> for example, you could identify other proteins that connect your root proteins to each other.
       If a protein appears in the network connecting two root nodes (hereafter referred to as a linker protein) then it is probably also involved in the pathway you are studying.

3 - therefore, you feed PIANA with your root proteins, PIANA looks for the interactions and builds the network. Analyses and printouts can then be performed on this network.

4 - you can ask PIANA to print the network, print a table with the interactions, or identify linker proteins.

5 - you can do many other things, but you'll have to read piana/code/execs/conf_files/general_template.piana_conf to get a taste of all commands and parameters that PIANA accepts

piana/README.piana_examples explains in detail how to carry out each of these tasks with PIANA

From the practical point of view, this is a non-exhaustive list of things you can do with PIANA:

1. build a protein-protein interaction network
   - networks can be built from an input file, or adding protein by protein, or adding interactions from a file, or specifying a species to retrieve its interactome, or ...
   - for any input protein, piana will look for interactions in the database and add them to the network, respecting the restrictions (e.g. use only yeast two hybrid interactions) you specified in the input parameters of your piana configuration file.
   - commands and parameters for building networks are described in piana/code/execs/conf_files/general_template.piana_conf

2. do 'things' with the network
   - you can just print the network, or do expansions, or match proteins to spots, or map lists of over/under expressed proteins onto the network, or search for keywords in the description of proteins in the network, or ...
   - commands and parameters for doing things with the network are described in piana/code/execs/conf_files/general_template.piana_conf

3.
translate between protein identifiers or get information about proteins
   - some piana commands do not deal with networks but just with proteins
   - commands for retrieving protein information are described in piana/code/execs/conf_files/general_template.piana_conf

4. use the database and its interface to develop your own code
   - read README.piana_developers to learn more about this

5. use the Graph library and the PianaGraph library to develop your own code
   - read README.piana_developers to learn more about this

6. use the clustering library to perform your own clusterings
   - there is a piana command for clustering according to GO terms (see general_template.piana_conf)
   - documentation can be found in piana/docs/documentation/pydoc_docs/Clustering.html

### =========================================================== ###
### SUMMARY ###
### =========================================================== ###

Before executing piana, you should:

  - install PIANA
  - know which pianaDB you want to use (look at the database component above to decide)
  - know which are your input proteins, and of which identifier type
  - know which piana commands you want to execute
  - write a configuration file following the template piana/code/execs/conf_files/general_template.piana_conf
  - know which interface suits your needs: piana.py or run_piana_protein_by_protein.py
  - execute the program, with the required command line parameters and setting argument --configuration-file=your_configuration_file

If you are still not sure how this works...

--> Read piana/code/execs/README.piana_examples to see how to use PIANA for specific purposes.
    This file explains the procedure followed for some typical piana actions. Start from example 1 and execute example by example to better understand how PIANA works. Once you understand the examples you can start basing your own analyses on them.
--> Read piana/code/execs/conf_files/general_template.piana_conf
    This file explains how to create your own configuration file, and describes all the parameters and commands that are available in PIANA.
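
If the interplay between list-source-dbs, list-source-methods, inverse-dbs and inverse-methods (section "Setting which interaction databases to use") is still unclear, the following Python sketch may help. It is not PIANA code: the function names, the dictionary-based interaction records and the lowercase abbreviations are hypothetical, and it assumes that an interaction must satisfy both the database restriction and the method restriction to be kept. It only illustrates the rules stated in this tutorial.

```python
def passes(value, allowed, inverse=False):
    """Check one restriction (source database or detection method).

    allowed -- the string "all", or a collection of abbreviations
    inverse -- True mirrors inverse-dbs / inverse-methods set to yes
    """
    if allowed == "all":
        # With "all", the inverse flag is not taken into account.
        return True
    listed = value in allowed
    # inverse=True keeps everything EXCEPT the listed abbreviations.
    return not listed if inverse else listed


def select_interactions(interactions, source_dbs="all", source_methods="all",
                        inverse_dbs=False, inverse_methods=False):
    """Keep the interactions that satisfy both restrictions (assumption)."""
    return [
        i for i in interactions
        if passes(i["db"], source_dbs, inverse_dbs)
        and passes(i["method"], source_methods, inverse_methods)
    ]


# Hypothetical interaction records, for illustration only.
example = [
    {"a": "P1", "b": "P2", "db": "dip", "method": "y2h"},
    {"a": "P2", "b": "P3", "db": "mips", "method": "tap"},
    {"a": "P1", "b": "P3", "db": "other_db", "method": "y2h"},
]

# Everything except interactions coming from dip:
kept = select_interactions(example, source_dbs=["dip"], inverse_dbs=True)
print([i["db"] for i in kept])
```

Calling select_interactions(example) with the defaults keeps all three records, which corresponds to setting both parameters to all in the configuration file.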