------------------------ README.populate_piana_db ------------------------

This file explains how to create a PIANA database and how to maintain it.

============================================================
INDEX
============================================================

1. Introduction
2. Create a PIANA database
3. Drop a PIANA database
4. Introduce an external database into PIANA database
   4.1 Parsing databases of protein codes and information
       4.1.1. Parse taxonomy
       4.1.2. Parse Uniprot and TrembL
       4.1.3. Parse NCBI GenBank
       4.1.4. Parse NCBI BLAST NR Dataset
       4.1.5. Parse correspondences between pdb identifiers and gi identifiers
       4.1.6. Parse correspondences between uniprot identifiers and gi identifiers
       4.1.7. Parse correspondences between pdb identifiers and uniprot identifiers
       4.1.8. Parse geneID identifiers
       4.1.9. Parse geneID - geneName correspondences
       4.1.10. Parse RefSeq identifiers
       4.1.11. Parse COG and KOG (Clusters of orthologous genes) information
       4.1.12. Parse SCOP information
       4.1.13. Parse GO information
   4.2 Parsing databases of protein-protein interactions
       4.2.1. DIP
       4.2.2. MIPS
       4.2.3. HPRD
       4.2.4. BIND
       4.2.5. Intact
       4.2.6. BioGRID
       4.2.7. MINT
       4.2.8. Parse any interaction database that complies with the HUPO PSI standard (UNFINISHED)
       4.2.9. String (UNFINISHED)
       4.2.10. OriDB (UNFINISHED)
       4.2.11. Parse your own data about protein interactions (UNFINISHED)
5. Eliminate an external database previously inserted into Piana database
6. Updating a PIANA database

============================================================
1. INTRODUCTION
============================================================

PIANA requires a mysql database in order to work. This database contains
the proteins, the information related to these proteins and the
interactions between the proteins. But don't panic! You don't need to
know SQL, as PIANA makes the database's existence completely transparent
to you.
But there is something you do need to do: have this database created
before starting to work with PIANA. You've got two possibilities:

1 - Use the database we provide on our webpage.
    (THIS OPTION IS NOT AVAILABLE IN THE BETA VERSION OF v1.4)

    This database (called pianaDB_limited) is provided as a mysql dump
    and is ready to be used with PIANA. You can also use this database
    as a starting point for creating your own database, adding more data
    with the parsers we provide.

    Read piana/README.pianaDB_limited for more information on this
    database and how to "install" it on your machine. Once
    pianaDB_limited is on your machine, you can follow the instructions
    provided below to add more data to it. Take into account that
    pianaDB_limited does not contain all the information that PIANA is
    capable of containing.

2 - Create the database from scratch.

    You can easily create your own database using the parsers we
    provide. On the positive side (one must always think positively),
    you will be able to choose exactly which data to insert (and
    therefore not lose precious time while MySQL searches through
    irrelevant data), as well as get a taste of how wonderfully
    organized biological databases are. On the negative side, it will
    take 2-3 days of processing time to populate your PIANA database.

============================================================
2. CREATE A PIANA DATABASE
============================================================

If you are going to add data to an existing database (eg. pianaDB_limited)
you should skip this section
------------------------------------------------------------

If you need to create your own piana database (because you are the only
user of PIANA in your lab, because you don't have access to the main
piana database, or because you don't like the database we provide along
with the code) you'll need to follow these steps. Remember that you must
have privileges to create databases on the MySQL server.
2.1 -> create a piana database (you choose the name of the database) on
       the machine that will act as piana database server (piana code
       and piana database can be on different machines)

       2.1.1 - [mysql_machine_server]$> mysql
       2.1.2 - mysql> create database name_of_your_piana_db;

2.2 -> use script piana/code/dbCreation/create_piana_tables.sql to
       create the tables of the database

       2.2.1 - [machine with piana code]$> cd piana/code/dbCreation
               $> mysql --database=name_of_your_piana_db --host=mysql_machine_server < create_piana_tables.sql

The database has been created: now you need to populate it (section 4)

============================================================
3. DROP A PIANA DATABASE
============================================================

If you no longer want to use a PIANA database, you only have to execute
the DROP DATABASE command on your MySQL server:

   mysql> drop database name_of_your_piana_db;

But if you only want to delete all tables and maintain the name of the
database, you must run the following command:

   [machine with piana code]$> cd piana/code/dbCreation
   $> mysql --database=name_of_your_piana_db --host=mysql_machine_server < drop_piana_tables.sql

==================================================================
4. INTRODUCE INFO FROM EXTERNAL DATABASES INTO your PIANA DATABASE
==================================================================

In general, there are two steps for any external data you want to insert
into your piana database:

1 - download the data from the internet

    Each subdirectory of piana/data/externalDBs has a README file
    explaining how to obtain the data files of a particular database.
    Download the files of the databases you want to insert into pianaDB
    to the directories where the README files are.

2 - parse the data

    Then, you'll need to use the parsers under piana/code/dbParsers to
    transfer the information from these files to your PIANA database.
    The parser in directory piana/code/dbParsers/xxxxParser parses the
    data files of external database xxxx.
To get a description of each parser, do:

   $> python name_of_parser.py --help

It is important to know that all parsers take two mandatory arguments:

   --database-name: the internal database identifier for the inserted
                    database
   --database-version: the version label that identifies the version of
                       the external database being inserted into PIANA

Many parsers take another optional database-related argument:

   --database-information: indicates which kind of information this
                           database is going to insert into the PIANA
                           database. It can be one or more of the
                           following:
                              - protein sequences
                              - protein attributes
                              - identifiers cross-references
                              - protein-protein interactions

Finally, many parsers take these other optional arguments:

   --time-control: prints to standard error the progress of the parsing
                   (i.e. lines or proteins processed per unit of time)
   --verbose: prints to standard output details of the parsing process
   --log-file: prints to the file specified in this option a summary of
               the data introduced into the database.

This generalization is beautiful, but I understand you need more precise
instructions. Here they are... each of the following subsections explains
how to parse a particular type of biological data. If you don't want it
in PIANA, just skip the corresponding section.

Obviously, before inserting protein-protein interaction data (PPI data),
you need to insert information about the proteins themselves. PPI
parsers only insert an interaction if both proteins of the interaction
appear in pianaDB. To make sure you do things correctly, follow the
order detailed in this file. One good thing about using pianaDB_limited
as a base for your piana database is that the information for proteins
has already been inserted.
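The shared command-line options described above can be sketched with
Python's argparse module. This is an illustrative sketch only, not the
actual PIANA option-handling code (the real parsers define their own
arguments):

```python
import argparse

def build_common_parser():
    # Hypothetical sketch of the shared options described above;
    # the real PIANA parsers implement their own argument handling.
    p = argparse.ArgumentParser(
        description="PIANA external-database parser (sketch)")
    p.add_argument("--database-name", required=True,
                   help="internal identifier for the inserted database")
    p.add_argument("--database-version", required=True,
                   help="version label of the external database")
    p.add_argument("--database-information", default="",
                   help="comma-separated kinds of information to insert")
    p.add_argument("--time-control", action="store_true",
                   help="print parsing progress to standard error")
    p.add_argument("--verbose", action="store_true",
                   help="print details of the parsing process")
    p.add_argument("--log-file", default=None,
                   help="write a summary of inserted data to this file")
    return p

# example: the two mandatory arguments must always be present
args = build_common_parser().parse_args(
    ["--database-name", "swissprot", "--database-version", "Apr 03 2007"])
```

Note that omitting either --database-name or --database-version makes
the sketch exit with a usage error, mirroring the "mandatory arguments"
rule above.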
You are not obliged to do so, but I suggest you save the downloaded data
under the corresponding directories in piana/data/externalDBs. Once an
external database has been parsed, you can delete it from disk to save
space.

The following lines describe in detail how to populate a database called
'pianaDB_paper' on a local mysql server (ie. localhost). If your system
requires identification for the mysql server, you'll also need to add
the parameters --piana-dbuser=your_user_name and --piana-dbpass=your_pass
to each command. If you want to add information to the database we
provide along with the code (pianaDB_limited), just change pianaDB_paper
to pianaDB_limited and 'localhost' to the name of your mysql server
machine.

Depending on the computational power available (and the speed of the
mysql server) this process can take from a few days to a full week. Find
other things to do while the parsers do their job :-)

If you want to insert your own data into a PIANA database you've got
several options:

1. format your data as indicated in (section "PARSE your own data:
   protein interactions") and use the parser described in that section

2. create a new parser for your data. Creating your own parser is quite
   easy: you do not need to know SQL to do it, just a little python.
   PIANA provides its developers with an easy-to-use library to access
   and insert information into piana databases (class PianaDBaccess.py).
   There is a parser template in piana/code/dbParsers/templateParser
   that will be useful to follow, as it provides the basic schema of a
   parser.

3.
do not insert the interactions into the piana database: instead, you can
create your networks directly from a text file. The PIANA command
add-interactions-file lets you add interactions to a network from a text
file formatted as described in README.piana_interaction_data_format

However, in all cases you'll need at least a PIANA database with
information for proteins: otherwise, PIANA would not know which proteins
are involved in the interactions.

* ============================================================ *
*   4.1. PARSING DATABASES OF PROTEIN CODES AND INFORMATION    *
* ============================================================ *

Attention: all piana parsers have a flag '--help': when this flag is
written in the command line, the parser outputs information about its
usage and exits
           (eg. python taxonomy2piana.py --help)

Attention: all piana parsers have a flag '--verbose': when this flag is
written in the command line together with the other arguments, the
parser will output information about the process to your screen.
Therefore, if you wish to see what's going on, you should set this flag
           (eg. python taxonomy2piana.py --taxonomy-file=the_file --piana-dbname=pianaDB_paper --piana-dbhost=localhost --verbose)

Attention: all piana parsers have a flag '--log-file': when this flag is
written in the command line, the parser saves a file with the
information inserted by the parser (eg. number of inserted proteins,
number of inserted uniprot codes...)
           (eg. python taxonomy2piana.py --log-file="./taxonomy2piana.log")

Attention: when parsing protein data, you must respect the order
described in this file. However, when parsing protein interactions you
can follow any order you wish.

Attention: if you have limited disk space, you can delete the downloaded
files after parsing them. Once a file has been parsed, PIANA will not
use it anymore.

........................................................
4.1.1 PARSE TAXONOMY --> protein information: species information
........................................................

- download taxonomy data (file taxdump.tar.gz in
  ftp://ftp.ncbi.nih.gov/pub/taxonomy/ )

- untar the file (using tar -xzvf)
  (save it to directory piana/data/externalDBs/taxonomyDB/)

- parse taxonomy data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/taxonomyParser execute the
      following command:

      $> python taxonomy2piana.py --taxonomy-file=../../../data/externalDBs/taxonomyDB/names.dmp --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-name="ncbi_taxonomy" --database-version="Apr 03 2007" --log-file="./taxonomy2piana.log"

...............................................................
4.1.2 PARSE UNIPROT SWISSPROT AND TREMBL --> protein information:
sequences, codes and info from uniprot
...............................................................

- download swissprot data (file uniprot_sprot.dat.gz in
  ftp://ftp.expasy.org/databases/uniprot/knowledgebase/ )

- download trembl data (file uniprot_trembl.dat.gz in
  ftp://ftp.expasy.org/databases/uniprot/knowledgebase/ )

- uncompress the files (using gunzip -d)
  (save them to directory piana/data/externalDBs/uniprotDB/)

- you need to follow the instructions in
  piana/data/externalDBs/uniprotDB/README.rg_deleted before parsing the
  data. Biopython has a problem with certain fields in the data, and
  you need to get rid of those problems.
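If you prefer not to depend on sed and grep, the same cleanup can be
done with a few lines of Python. This is an illustrative sketch of the
transformation described in this section (rewrite RG and RX line codes
to RA, drop OH lines); the function names are hypothetical:

```python
def clean_uniprot_line(line):
    # Mirrors the sed/grep cleanup shown in this section:
    # drop OH lines, and rewrite RG/RX line codes to RA,
    # so that Biopython can parse the flat file.
    if line.startswith("OH"):
        return None                      # discard the line
    if line.startswith(("RG", "RX")):
        return "RA" + line[2:]           # replace the 2-char line code
    return line

def clean_uniprot_file(src, dst):
    # src/dst are paths to the original and cleaned .dat files,
    # e.g. uniprot_sprot.dat -> uniprot_sprot_rg_rx_oh_deleted.dat
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            cleaned = clean_uniprot_line(line)
            if cleaned is not None:
                fout.write(cleaned)

# example:
# clean_uniprot_line("OH   NCBI_TaxID=9606;\n")  -> None (dropped)
# clean_uniprot_line("RG   The consortium;\n")   -> "RA   The consortium;\n"
```

Run it once per file (sprot and trembl), exactly as with the sed
one-liners below.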
Do it for both files (sprot and trembl). Briefly, you just have to do:

   $> sed 's/^\(RG\|RX\)/RA/g' uniprot_sprot.dat | grep -v -P "^OH" > uniprot_sprot_rg_rx_oh_deleted.dat
   $> sed 's/^\(RG\|RX\)/RA/g' uniprot_trembl.dat | grep -v -P "^OH" > uniprot_trembl_rg_rx_oh_deleted.dat

- parse uniprot data and insert information into pianaDB_paper
  (do it for both files: first sprot and then trembl)
  --> on directory piana/code/dbParsers/uniprotParser execute the
      following commands:

      $> python uniprot2piana.py --input-file=uniprot_sprot_rg_rx_oh_deleted.dat --piana-dbname=pianaDB_paper --piana-dbhost=localhost --mode="scratch" --database-name="swissprot" --database-version="Apr 03 2007" --database-description="Uniprot manually curated database" --log-file="./uniprot_log_file.log" --database-information="protein sequences,protein attributes, identifiers cross-references" --time-control

      $> python uniprot2piana.py --input-file=../../../data/externalDBs/uniprotDB/uniprot_trembl_rg_rx_oh_deleted.dat --piana-dbname=pianaDB_paper --piana-dbhost=localhost --mode="scratch" --database-name="trembl" --database-version="Apr 03 2007" --database-description="Uniprot complete not manually curated database" --log-file="./trembl_log_file.log" --time-control --database-information="protein sequences,protein attributes, identifiers cross-references"

...............................................................
4.1.3 PARSE NCBI GenBank --> protein information: sequences and codes
from NCBI
...............................................................

- First of all, you must download a file that contains information on
  species for GenBank codes gi.
This file will be used by the parsers genpept2piana.py, nr2piana.py,
pdbaa2piana.py and swissprot2piana.py:

  - download gi_taxid_prot.dmp.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

  - uncompress the file (using gunzip -d)
    (save it to directory piana/data/externalDBs/taxonomyDB/gi_vs_tax/)

- download genpept data (file rel_xxx.fsa_aa.gz in
  ftp://ftp.ncbi.nih.gov/genbank/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/genpeptDB/)

- parse genpept data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/genpeptParser execute the
      following command:

      $> python2.3 genpept2piana.py --input-file-name=../../../data/externalDBs/genpeptDB/relXXX.fsa_aa --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file=../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp --database-name="genpept" --database-version="release 158 - 16/2/2007" --database-description="ncbi genbank database" --piana-dbuser="piana" --log-file="./genpept2piana.log" --database-information="protein sequences,protein attributes, identifiers cross-references"

......................................................................
4.1.4 PARSE NCBI NON REDUNDANT (NR) dataset --> protein information:
sequences and codes from NCBI (and some uniprot correspondences)
......................................................................
- download ncbi nr data (file nr.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/ncbi_nrDB/)

- parse nr data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/nrParser execute the following
      command:

      $> python2.3 nr2piana.py --input-file-name="../../../data/externalDBs/ncbi_nrDB/nr" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="nr" --database-version="Apr 09 2007" --database-description="non-redundant ncbi database" --database-information="protein sequences,protein attributes,identifiers cross-references" --time-control

.......................................................................
4.1.5. PARSE NCBI2PDB correspondences --> protein information:
correspondences between pdb codes and gi codes
.......................................................................

- download ncbi2pdb data (file pdbaa.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse ncbi2pdb data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 pdbaa2piana.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/pdbaa" --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="ncbi2pdb_pdbaa" --database-version="Apr 09 2007" --database-description="correspondence between pdb and gi identifiers" --database-information="identifiers cross-references" --piana-dbuser="piana" --log-file="./pdbaa2piana.log"

......................................................................
4.1.6.
PARSE NCBI2UNIPROT correspondences --> protein information:
correspondences between uniprot codes and gi codes
......................................................................

- download ncbi2uniprot data (file swissprot.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse ncbi2uniprot data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 swissprot2piana.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/swissprot" --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="ncbi2uniprot swissprot" --database-version="Apr 09 2007" --database-description="correspondences between uniprot and gi identifiers" --database-information="identifiers cross-references" --time-control --log-file="./ncbiswissprot2piana.log"

.....................................................................
4.1.7. PARSE PDB2UNIPROT correspondences --> protein information:
correspondence between pdb codes and uniprot codes
.....................................................................
- download mapping.txt (plain text file) from http://www.bioinf.org.uk/pdbsprotec/
  (save it to directory piana/data/externalDBs/pdbsprotecDB/)

- parse pdb2uniprot data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/pdbsprotecParser execute the
      following command:

      $> python2.3 pdbsprotec2piana.py --input-file-name="../../../data/externalDBs/pdbsprotecDB/mapping.txt" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="pdbsprotec" --database-version="15-Jan-2007" --database-description="correspondences between pdb and uniprot identifiers" --database-information="identifiers cross-references" --time-control

......................................................................
4.1.8. PARSE geneID correspondences --> protein information:
correspondences between gi identifiers and geneID identifiers
......................................................................

- download gene2accession data (file gene2accession.gz in
  ftp://ftp.ncbi.nih.gov/gene/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse gene2accession data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 gene2accession_parser.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/gene2accession" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="gene" --database-version="" --database-description="correspondences between ncbi accession number and geneID identifiers" --database-information="identifiers cross-references" --time-control --log-file="./gene2accession2piana.log"

.....................................................................
4.1.9.
PARSE geneID - geneName correspondences --> protein information:
correspondence between geneID identifiers and geneName identifiers
.....................................................................

- download gene_info data (file gene_info in ftp://ftp.ncbi.nih.gov/gene/ )

- uncompress the file (using gunzip -d)

- parse gene_info data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 gene_info2piana.py --input-file-name="../../../data/externalDBs/gene_info" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="gene_info" --database-version="Apr 19 2007" --database-description="Gene NCBI database - gene_info" --time-control --log-file="./gene_info2piana.log" --database-information="identifiers cross-references"

.....................................................................
4.1.10. PARSE RefSeq correspondences --> protein information:
correspondence between RefSeq identifiers and gi identifiers
.....................................................................

- download RefSeq catalog data (file RefSeq-release??.catalog.gz in
  ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/ )

- uncompress the file (using gunzip -d)

- parse RefSeq catalog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/refseqParser execute the
      following command:

      $> python2.3 refseq2piana.py --input-file-name="../../../data/externalDBs/refseq/RefSeq-release22.catalog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="refseq" --database-version="release22 April 2007" --database-description="NCBI RefSeq Database" --piana-dbuser="piana" --time-control --log-file="./refseq2piana.log" --database-information="identifiers cross-references"

..................................................................
4.1.11.
PARSE COG AND KOG --> protein information: clusters of orthologous genes
(KOG for 7 eukaryotic genomes and COG for 66 complete genomes)
.................................................................

- download cog data (files whog and myva=gb in ftp://ftp.ncbi.nih.gov/pub/COG/COG/)
- download kog data (files kog and kyva=gb in ftp://ftp.ncbi.nih.gov/pub/COG/KOG/)
  (save them to directory piana/data/externalDBs/cogDB/)

- parse cog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/cogParser execute the following
      commands:

      $> python2.3 xyva_gb2piana.py --input-file-name="../../../data/externalDBs/cogDB/myva=gb" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="COG-myva=gb" --database-version="2003" --database-description="Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="cog_myva2piana.log"

      $> python2.3 xog2piana.py --input-file-name="../../../data/externalDBs/cogDB/whog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="COG-whog" --database-version="2003" --database-description="Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="cog_xog2piana.log"

- parse kog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/cogParser execute the following
      commands:

      $> python2.3 xyva_gb2piana.py --input-file-name="../../../data/externalDBs/cogDB/kyva=gb" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="KOG-kyva=gb" --database-version="2003" --database-description="Eukaryotic Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes"

      $> python2.3 xog2piana.py --input-file-name="../../../data/externalDBs/cogDB/kog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="KOG-kog" --database-version="2003"
--database-description="Eukaryotic Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="kog_xog2piana.log"

.............................................................
4.1.12. PARSE SCOP --> protein information: structural domains
information
.............................................................

- download SCOP data (file dir.cla.scop.txt_X.XX in
  http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html )
  (save it to directory piana/data/externalDBs/scopDB/)

- parse scop data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/scopParser execute the following
      command:

      $> python2.3 scop2piana.py --input-file-name="../../../data/externalDBs/scopDB/dir.cla.scop.txt_1.71" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="SCOP" --database-version="Release 1.71" --database-description="SCOP - Structural Classification of Proteins" --database-information="protein attributes" --time-control --log-file="scop2piana.log"

..............................................................
4.1.13. PARSE GO --> protein information: Gene Ontology terms for
proteins
...............................................................

- download GO data (file go_dateYYYYMM-assocdb-tables.tar.gz in
  http://archive.godatabase.org/latest/)

- uncompress the file (using tar -xzvf)
  (save it to directory piana/data/externalDBs/goDB/)

-> Instead of inserting the information directly into the piana
   database, we first create a local database for this data and then
   transfer the data from that local database to the piana database.
It simplifies the parsing, and it is always nice to have a separate
database for a particular external database.

- create a local go database goDB on your mysql server machine

  - create the database (on some systems, this can only be done with
    root rights)

    [localhost]$> mysql
    mysql> create database name_of_your_go_db;

  - the tables of the database will be created by parseGO.py (it is
    transparent to you, don't worry about it)
    -> you'll need rights for creating tables... if not, ask the root
       to run parseGO.py for you

- insert go data into your local mysql go database

  $> cd piana/code/dbParsers/goParser

  -> Attention: if it is the first time you are creating the go
     database, this script will generate errors when trying to 'drop'
     the former tables. You can ignore these errors...
     Example of error to ignore: ERROR 1051 at line 1: Unknown table 'assoc_rel'

  $goParser> python2.3 parseGO.py --input-go-dir="./../../../data/externalDBs/GO/go_200704-assocdb-tables" --go-dbname="name_of_your_go_db" --go-dbhost="localhost"

- insert go information to pianaDB_paper

  $goParser> python2.3 go2piana.py --go-dbname="name_of_your_go_db" --go-dbhost="localhost" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="go" --database-version="April 2007" --database-description="GO. Gene Ontology" --insert-protein-go --insert-go-info --time-control --database-information="protein attributes"

If you are going to do clustering based on GO, you can add the flags
'--insert-go-distance' and '--threshold=4'. --threshold=4 means that the
distance will be set to INFINITE when the terms are separated by more
than 4. This speeds up the process of calculating distances and does not
affect the clustering, because those terms are too far away. However, be
ready for a very long parsing...
it takes several days to calculate distances for all GO terms. I suggest
instead that you parse GO without the flag '--insert-go-distance' and
then calculate the distances only for GOs associated to proteins in your
network of interest, using the flags '--input-file' and
'--input-proteins-type'.

For example:

   $goParser> (TO BE COMPLETED)

To learn more about calculating distances only for a limited number of
GO terms, read
piana/code/dbParsers/goParser/README.limiting_parsing_to_specific_gos

* =========================================== *
*    4.2 PARSING DATABASES OF INTERACTIONS    *
* =========================================== *

Populate PIANA with data from third-party interaction databases by using
the parsers provided. Remember as well that if you have interactions in
a text file, you can also use it directly instead of inserting it into
the database. Read command 'add-interactions' in
piana/code/execs/conf_files/general_template.piana_conf

......................................................
4.2.1 PARSE DIP --> protein interactions
......................................................

- download DIP tabulated data (file dip????????.txt in section
  "Files - Full" of http://dip.doe-mbi.ucla.edu ) (registration required)
  (save it to directory piana/data/externalDBs/dipDB/)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/dipDB/)

- transfer information from file to pianaDB_paper
  --> on directory piana/code/dbParsers/dipParser execute the following
      command:

      $> python2.3 diptxt2piana.py --input-file-name=../../../data/externalDBs/dipDB/dip????????.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="dip" --database-version="19.2.2007" --database-description="DIP (Database of Interacting Proteins)" --time-control --log-file="diptxt2piana.log" --database-information="protein-protein interactions"

.....................................................................
4.2.2.
PARSE MIPS --> protein interactions
.....................................................................

- download MIPS data (file mppi.gz from http://mips.gsf.de/proj/ppi/)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/mipsDB/)

-> Instead of inserting the information directly into the piana
   database, we first create a local database for this data and then
   transfer the data from that local database to the piana database.
   It simplifies the parsing, and it is always nice to have a separate
   database for a particular external database.

- create a local mips database mipsDB on your mysql server machine,
  using script piana/code/dbCreation/create_psi_tables.sql

  -> create the mips database where mips data will be inserted

     [localhost]$> mysql
     mysql> create database mipsDB;

  -> create tables
     --> on directory piana/code/dbCreation execute the following command:

     $dbCreation> mysql --database=mipsDB --host=localhost < create_psi_tables.sql

- parse MIPS data and insert information into your local mips database
  mipsDB
  --> on directory piana/code/dbParsers/psiParser execute the following
      command:

      $psiParser> python parsePSI.py --psi-dbname=mipsDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/mipsDB/mppi --mode=psiDB --psi-db=mips

- transfer information from mipsDB to pianaDB_paper
  --> on directory piana/code/dbParsers/psiParser execute the following
      command:

      $> python2.3 psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=mipsDB --psi-dbhost=localhost --psi-db=mips --database-name="mips" --database-version="March 2007" --database-description="The MIPS Mammalian Protein-Protein Interaction Database" --insert-psidb-ids --time-control --log-file="mips2piana.log" --database-information="identifiers cross-references,protein-protein interactions"

...................................................................
4.2.3. PARSE HPRD
...................................................................
- download HPRD tabulated data (file HPRD_Release_6_01012007.txt from
  http://www.hprd.org/download ) (registration required)
  (save it to directory piana/data/externalDBs/hprdDB/)

- uncompress the files (using gunzip -d)
  (save them to directory piana/data/externalDBs/hprdDB/)

- transfer information from file to pianaDB_paper
  --> on directory piana/code/dbParsers/hprdParser execute the following
      command:

      $> python2.3 hprd2piana.py --input-file-name="../../../data/externalDBs/hprdDB/HPRD_Release_6_01012007.txt" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="hprd" --database-version="Release_6_01012007" --database-description="Human Protein Reference Database" --log-file="hprd2piana.log" --time-control --database-information="identifiers cross-references,protein-protein interactions"

....................................................................
4.2.4. PARSE BIND --> protein interactions
....................................................................

Some interactions provided by BIND are 'ambiguous': only gene names or
protein descriptions are given (as opposed to cases where 'good' codes
such as uniprot are provided). PIANA checks whether there are several
sequences associated with those geneNames and, in case it finds the
geneName to be ambiguous, it inserts the interaction into the PIANA
database labeled as 'bind_c'. Then, the user can choose to ignore all
_c interactions using the configuration file parameter
ignore-unreliable

...................................................................
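The 'ambiguous geneName' rule described above can be sketched in a few
lines of Python. This is illustrative only: the function and the
name_to_sequences mapping are hypothetical, not the actual PIANA code.

```python
def label_interaction(source_db, gene_name, name_to_sequences):
    # Sketch of the ambiguity rule described above (hypothetical names).
    # name_to_sequences maps a gene name to the sequences known for it.
    # A name matching more than one sequence is ambiguous, so the
    # interaction is labeled '<db>_c' and can later be skipped via the
    # ignore-unreliable configuration parameter.
    sequences = name_to_sequences.get(gene_name, [])
    if len(sequences) > 1:
        return source_db + "_c"   # ambiguous: several sequences match
    return source_db              # unambiguous: keep the plain label

# example with made-up data:
name_to_sequences = {"rad51": ["seq1"], "p53": ["seqA", "seqB"]}
label_interaction("bind", "rad51", name_to_sequences)  # -> "bind"
label_interaction("bind", "p53", name_to_sequences)    # -> "bind_c"
```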
 - download BIND data (files *.psi.xml.gz from ftp://ftp.blueprint.org/pub/BIND/data/divisions/psi)

   --> you can download all the datasets or just those that are of interest to you

 - uncompress the files (using gunzip -d)
   (save them to directory piana/data/externalDBs/bindDB/)

   -> Instead of inserting the information directly into the piana database,
      we first create a local database for this data and then transfer it
      from the local database to the piana database. This simplifies the
      parsing, and it is always nice to have a separate database for each
      external database.

 - create a local bind database bindDB on your mysql server machine, using
   script piana/code/dbCreation/create_psi_tables.sql

   -> create the bind database where bind data will be inserted:

        [localhost]$> mysql
        mysql> create database bindDB;

   -> create tables

      --> on directory piana/code/dbCreation execute the following command:

          $dbCreation> mysql --database=bindDB --host=localhost < create_psi_tables.sql

 - parse BIND data and insert the information into your local bind database bindDB

   --> on directory piana/code/dbParsers/psiParser execute the following commands
       (one for each bind file you have downloaded):

       $psiParser> python parsePSI.py --psi-dbname=bindDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/bindDB/bindfungi.1.psi.xml --mode=psiDB --psi-db=bind
       $psiParser> python parsePSI.py --psi-dbname=bindDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/bindDB/file_whatever.psi.xml --mode=psiDB --psi-db=bind
       $psiParser> python parsePSI.py .........................................

 - transfer information from bindDB to pianaDB_paper

   --> on directory piana/code/dbParsers/psiParser execute the following command:
       $> python2.3 psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=bindDB --psi-dbhost=localhost --psi-db=bind --database-name="bind" --database-version="April 2007" --database-description="Biomolecular Interaction Network Database" --insert-psidb-ids --time-control --log-file="bind2piana.log" --database-information="identifiers cross-references,protein-protein interactions"

......................................................
4.2.5. PARSE IntAct   --> protein interactions
......................................................

 - download IntAct tabulated data (file intact.zip from ftp://ftp.ebi.ac.uk/pub/databases/intact/current)
   (save it to directory piana/data/externalDBs/intactDB/)

 - uncompress the files (using unzip)
   (save them to directory piana/data/externalDBs/intactDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/intactParser execute the following command:

       $> python2.3 intact2piana.py --input-file-name=../../../data/externalDBs/intactDB/intact.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="intact" --database-version="23 april 2007" --database-description="IntAct protein-protein interaction database" --database-information="protein-protein interactions" --time-control --log-file="./intact2piana.log"

......................................................
4.2.6. PARSE BioGRID   --> protein interactions
......................................................
 - download BioGRID data (file BIOGRID-ALL-?.?.??.tab.zip from http://www.thebiogrid.org/downloads.php)
   (save it to directory piana/data/externalDBs/biogridDB/)

 - uncompress the files (using unzip)
   (save them to directory piana/data/externalDBs/biogridDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/biogridParser execute the following command:

       $> python2.3 biogrid2piana.py --input-file=../../../data/externalDBs/biogridDB/BIOGRID-ALL-SINGLEFILE-2.0.26.tab.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-version="2.0.26" --database-description="The BioGRID's curated set of physical and genetic interactions" --database-information="protein-protein interactions" --time-control --log-file="./biogrid2piana.log" --database-name="biogrid"

......................................................
4.2.7. PARSE MINT   --> protein interactions
......................................................

 - download MINT data (file year-day-month-full.mitab25.txt) from
   ftp://mint.bio.uniroma2.it/pub/release/MITAB/ from the most recent
   directory (e.g. 2007-04-05)
   (save it to directory piana/data/externalDBs/mintDB/)

 - uncompress the files (using gunzip -d)
   (save them to directory piana/data/externalDBs/mintDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/mintParser execute the following command:

       $> python2.3 mint2piana.py --input-file-name=./../../../data/externalDBs/mintDB/2007-04-05-full.mitab25.txt --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-name=mint --database-version=2007-04-05 --database-description="the Molecular INTeraction database" --database-information="protein-protein interactions"

......................................................................
PARSE any interaction database that complies with the HUPO PSI standard 1.0
(http://psidev.sourceforge.net/mi/xml/doc/user/)
......................................................................
UNFINISHED

 - for any psi data file, follow the same steps as for the MIPS, HPRD and BIND data:

   1. create a local database xxxDB for these data, containing the tables
      listed in piana/code/dbCreation/create_psi_tables.sql

   2. use parsePSI.py to populate the local database xxxDB

      $> python parsePSI.py --psi-dbname=xxxDB --psi-dbhost=localhost --input-file=interactions.hupo_psi_standard --mode=psiDB --psi-db=flag

      (Attention! flag can be mips or hprd. You have to choose the one that
      most closely resembles the data you are parsing... it would be great
      to have a standard that is respected by everyone, but since that is
      not the case, the parser has to adapt to the specificities of the
      different databases... if you don't know which flag to use, set it to
      hprd (if it doesn't work, try mips ;-)
      (if it still doesn't work, you'll have to take a look at the data and
      modify the content handler PSIContentHandler.py.... Good luck!!! )))

   3. transfer the information from xxxDB to your pianaDB

      $> python psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=xxxDB --psi-dbhost=localhost --psi-db=datalabel

      Attention! datalabel is the source database label that will be
      associated to the parsed data... and it must appear in the
      dictionaries (or lists) source_databases, interaction_databases and
      interaction_source_databases_colors in file
      piana/code/PianaDB/PianaGlobals.py

      Attention! before transferring the information to your piana database,
      make sure that all detection methods in your psi data files appear in
      PianaGlobals.method_names. To check this, do a
      "select distinct method from interactionMethod;" once you have filled
      your local psi database. Then, check in PianaGlobals.method_names that
      each of those methods appears in one dictionary value.
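      That check can be sketched in a few lines of python. This is an
      illustrative sketch only: the function name `missing_methods` is made
      up, and the dictionary shape assumed here (key -> list of method
      strings) is an assumption; consult the real structure of method_names
      in piana/code/PianaDB/PianaGlobals.py before relying on it.

      ```python
      # Sketch: verify that every detection method found in the local psi
      # database appears in some value of a method_names-style dictionary.

      def missing_methods(db_methods, method_names):
          """Return the detection methods that appear in no dictionary value."""
          known = set()
          for values in method_names.values():
              known.update(values)
          return [m for m in db_methods if m not in known]

      # db_methods would come from:
      #   select distinct method from interactionMethod;
      db_methods = ["two hybrid", "coimmunoprecipitation", "new exotic method"]
      method_names = {
          "y2h": ["two hybrid", "2h"],           # hypothetical entries
          "coip": ["coimmunoprecipitation"],
      }
      print(missing_methods(db_methods, method_names))  # -> ['new exotic method']
      ```

      Any method reported as missing must be added to method_names before
      running psi2piana.py.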
      Do not change the keys of the method_names dictionary: if you think
      none of the listed keys corresponds to the method in your psi data,
      you must create a new dictionary entry with the key you wish to give
      to that method and the value used in your psi data.

      Attention! if this doesn't work for your PSI file (it happens...
      standards are standard, but their use is not always standard) you
      have another possibility: use the XMLFlattener provided at the HUPO
      web site (http://psidev.sourceforge.net/mi/xml/doc/user/ section
      "tools") to convert your PSI file to a PIANA psi_flat format. Then,
      use the parser provided with the code:
      piana/code/dbParsers/psi_flatParser/psi_flat2piana.py
      Read piana/code/dbParsers/psi_flatParser/README.psi_flat_format for
      more info on how to do this.

      Attention! If it still doesn't work, you always have the possibility
      of writing your own parser (in any programming language) to transform
      the PSI data file into a piana text format (see below, section
      "PARSE your own interaction data") and then using the parser
      piana_text_int2pianaDB.py (see below)

....................................................................
4.2.8. PARSE STRING   --> predictions of protein interactions and COG
       (Cluster of Orthologous Genes) information for proteins
...................................................................

UNFINISHED

....................................................................
4.2.9. PARSE predictions of interactions based on distant
       sequence/structure patterns   --> protein interactions
....................................................................

UNFINISHED

 - (this data is provided both in pianaDB_limited (http://sbi.imim.es/piana#db)
   and as a flat text file in piana/data/externalDBs/oriDB/interact.dat)

 - we have called 'ori' the set of interactions that were found as explained
   in this article: (Espadaler J., O.
   Romero-Isart, et al. (2005) "Prediction of protein-protein interactions
   using distant conservation of sequence patterns and structure
   relationships" Bioinformatics 21(16):3360-8)

 - parse 'ori' data and insert the information into your local piana
   database pianaDB_paper

   --> on directory piana/code/dbParsers/oriParser execute the following command:

       $> python ori2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-version="2007.1.16" --database-name=ori --database-description="predicted interactions from distant structure/sequence patterns" --database-information="protein-protein interactions" --log-file="ori2piana.log" --time-control

..................................................................
4.2.10. PARSE your own data about protein interactions
..................................................................

UNFINISHED

============================================================
5. ELIMINATE AN EXTERNAL DATABASE PREVIOUSLY INSERTED INTO A PIANA DATABASE
============================================================

If you want to delete from your PIANA database the information coming from
one or more specific external databases without deleting the whole database,
you must execute the following command in piana/code/dbModification:

$> python2.3 delete_info_multiple_external_db_from_piana_db.py --piana-dbname=pianaDB_limited --piana-dbhost=sefarad

To specify which databases you want to drop, open the script with a text
editor and add the database names to the "list_databases" variable, as
follows:

list_databases=["intact","biogrid","hprd","bind","mint","mips"]

If you don't remember the name assigned to the database you want to delete,
you can obtain it by executing the following command in the directory
piana/code/execs:

$> python2.3 piana.py --print-reference-card

When external databases are deleted, all information in the database is
checked, so it can take at least one hour to finish.
All information associated with the deleted database will be removed unless
another database contains the same information.

============================================================
6. UPDATING A PIANA DATABASE
============================================================

If you want to maintain and add new information to an existing PIANA
database, you can just add the new version of an external database as
explained in this document. There is only one restriction: two external
databases cannot have the same external database identifier (the names that
are given to external databases when parsing data).

If you want to check which names are currently in use, execute the following
command in piana/code/execs:

$> python2.3 piana.py --print-reference-card

and search the output for the currently used databases. You can insert
different versions of the same database, but you must assign each version a
new database identifier (database-name parameter).
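One way to avoid clashes is to build the new database-name from the base
name plus the version. The helper below is a hypothetical sketch (not part
of PIANA): `unique_database_name` and its inputs are made up, and the set of
used names would come from reading the output of
"piana.py --print-reference-card".

```python
# Sketch (hypothetical helper): pick a database-name that does not clash
# with the names already present in the PIANA database.

def unique_database_name(base, used_names, version):
    """Append the version (and a counter if still taken) until the name is free."""
    candidate = "%s_%s" % (base, version)
    suffix = 1
    while candidate in used_names:
        suffix += 1
        candidate = "%s_%s_%d" % (base, version, suffix)
    return candidate

# names already in use, e.g. taken from --print-reference-card output
used = {"mips", "hprd", "bind", "intact_2007"}
print(unique_database_name("intact", used, "2007"))    # -> intact_2007_2
print(unique_database_name("biogrid", used, "2.0.26")) # -> biogrid_2.0.26
```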