------------------------ README.populate_piana_db ------------------------

This file explains how to create a PIANA database and how to maintain it.

============================================================
INDEX
============================================================

1. Introduction
2. Create a PIANA database
3. Drop a PIANA database
4. Introduce an external database into PIANA database
   4.1 Parsing databases of protein codes and information
       4.1.1. Parse taxonomy
       4.1.2. Parse Uniprot and TrembL
       4.1.3. Parse NCBI GenBank
       4.1.4. Parse NCBI BLAST NR Dataset
       4.1.5. Parse correspondences between pdb identifiers and gi identifiers
       4.1.6. Parse correspondences between uniprot identifiers and gi identifiers
       4.1.7. Parse correspondences between pdb identifiers and uniprot identifiers
       4.1.8. Parse geneID identifiers
       4.1.9. Parse geneID - geneName correspondences
       4.1.10. Parse RefSeq identifiers
       4.1.11. Parse COG and KOG (Clusters of orthologous genes) information
       4.1.12. Parse SCOP information
       4.1.13. Parse GO information
   4.2 Parsing databases of protein-protein interactions
       4.2.1. DIP
       4.2.2. MIPS
       4.2.3. HPRD
       4.2.4. BIND
       4.2.5. Intact
       4.2.6. BioGRID
       4.2.7. MINT
       4.2.8. Parse any interaction database that complies with the HUPO PSI standard (UNFINISHED)
       4.2.9. String (UNFINISHED)
       4.2.10. OriDB (UNFINISHED)
       4.2.11. Parse your own data about protein interactions (UNFINISHED)
5. Eliminate an external database previously inserted into Piana database
6. Updating a PIANA database

============================================================
1. INTRODUCTION
============================================================

PIANA requires a mysql database in order to work. This database contains
the proteins, the information related to these proteins and the
interactions between the proteins. But don't panic! You don't need to
know SQL, as PIANA makes the database's existence completely transparent
to you.
But there is something you do need to do: have this database created
before starting to work with PIANA. You've got two possibilities:

1 - Use the database we provide on our webpage.
    (THIS OPTION IS NOT AVAILABLE IN THE BETA VERSION OF v1.4)

    This database (called pianaDB_limited) is provided as a mysql dump
    and is ready to be used with PIANA. You can also use this database
    as a starting point for creating your own database, adding more data
    with the parsers we provide.

    Read piana/README.pianaDB_limited for more information on this
    database and how to "install" it on your machine. Once
    pianaDB_limited is on your machine, you can follow the instructions
    provided below to add more data to it. Take into account that
    pianaDB_limited does not contain all the information that PIANA is
    capable of containing.

2 - Create the database from scratch.

    You can easily create your own database using the parsers we
    provide. On the positive side (one must always think positively),
    you will be able to choose exactly which data to insert (and
    therefore not lose precious time while MySQL searches through
    irrelevant data), as well as get a taste of how wonderfully
    organized biological databases are. On the negative side, it will
    take 2-3 days of processing time to populate your PIANA database.

============================================================
2. CREATE A PIANA DATABASE
============================================================

If you are going to add data to an existing database (eg. pianaDB_limited)
you should skip this section
------------------------------------------------------------

If you need to create your own piana database (because you are the only
user of PIANA in your lab, because you don't have access to the main
piana database, or because you don't like the database we provide along
with the code) you'll need to follow these steps. Remember that you must
have privileges to create databases on the MySQL server.
2.1 -> create a piana database (you choose the name of the database) on
       the machine that will act as piana database server (piana code
       and piana database can be on different machines)

       2.1.1 - [mysql_machine_server]$> mysql
       2.1.2 - mysql> create database name_of_your_piana_db;

2.2 -> use script piana/code/dbCreation/create_piana_tables.sql to
       create the tables of the database

       2.2.1 - [machine with piana code]$> cd piana/code/dbCreation
               $> mysql --database=name_of_your_piana_db --host=mysql_machine_server < create_piana_tables.sql

The database has been created: now you need to populate it (section 4)

============================================================
3. DROP A PIANA DATABASE
============================================================

If you no longer want to use a PIANA database, you only have to execute
the DROP DATABASE command on your MySQL server:

   mysql> drop database name_of_your_piana_db;

But if you only want to delete all tables and maintain the name of the
database, you must run the following command:

   [machine with piana code]$> cd piana/code/dbCreation
   $> mysql --database=name_of_your_piana_db --host=mysql_machine_server < drop_piana_tables.sql

==================================================================
4. INTRODUCE INFO FROM EXTERNAL DATABASES INTO your PIANA DATABASE
==================================================================

In general, there are two steps for any external data you want to insert
into your piana database:

1 - download the data from the internet

    Each subdirectory of piana/data/externalDBs has a README file
    explaining how to obtain the data files of a particular database.
    Download the files of the databases you want to insert into pianaDB
    to the directories where the README files are.

2 - parse the data

    Then, you'll need to use the parsers under piana/code/dbParsers to
    transfer the information from these files to your PIANA database.
    The parser in directory piana/code/dbParsers/xxxxParser parses the
    data files of external database xxxx.
To get a description of each parser, do:

   $> python name_of_parser.py --help

It is important to know that all parsers take two mandatory arguments:

   --database-name: the internal database identifier for the inserted
                    database
   --database-version: the version label that identifies the version of
                       the external database being inserted into PIANA

Many parsers take another optional database-related argument:

   --database-information: indicates which kind of information this
                           database is going to insert into the PIANA
                           database. It can be one or more of the
                           following:
                              - protein sequences
                              - protein attributes
                              - identifiers cross-references
                              - protein-protein interactions

Finally, many parsers take these other optional arguments:

   --time-control: prints to standard error the progress of the parsing
                   (i.e. lines or proteins processed per unit of time)
   --verbose: prints to standard output details of the parsing process
   --log-file: prints to the file specified in this option a summary of
               the data introduced into the database.

This generalization is beautiful, but I understand you need more precise
instructions. Here they are... each of the following subsections explains
how to parse a particular type of biological data. If you don't want it
in PIANA, just skip the corresponding section.

Obviously, before inserting protein-protein interaction data (PPI data),
you need to insert information about the proteins themselves. PPI
parsers only insert an interaction if both proteins of the interaction
appear in pianaDB. To make sure you do things correctly, follow the
order detailed in this file. One good thing about using pianaDB_limited
as a base for your piana database is that the information for proteins
has already been inserted.
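The shared command-line options described above can be sketched with
Python's argparse module. This is an illustrative sketch only, not the
actual PIANA option-handling code (the real parsers define their own
arguments):

```python
import argparse

def build_common_parser():
    # Hypothetical sketch of the shared options described above;
    # the real PIANA parsers implement their own argument handling.
    p = argparse.ArgumentParser(
        description="PIANA external-database parser (sketch)")
    p.add_argument("--database-name", required=True,
                   help="internal identifier for the inserted database")
    p.add_argument("--database-version", required=True,
                   help="version label of the external database")
    p.add_argument("--database-information", default="",
                   help="comma-separated kinds of information to insert")
    p.add_argument("--time-control", action="store_true",
                   help="print parsing progress to standard error")
    p.add_argument("--verbose", action="store_true",
                   help="print details of the parsing process")
    p.add_argument("--log-file", default=None,
                   help="write a summary of inserted data to this file")
    return p

# example: the two mandatory arguments must always be present
args = build_common_parser().parse_args(
    ["--database-name", "swissprot", "--database-version", "Apr 03 2007"])
```

Note that omitting either --database-name or --database-version makes
the sketch exit with a usage error, mirroring the "mandatory arguments"
rule above.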
You are not obliged to do so, but I suggest you save the downloaded data
under the corresponding directories in piana/data/externalDBs. Once an
external database has been parsed, you can delete it from disk to save
space.

The following lines describe in detail how to populate a database called
'pianaDB_paper' on a local mysql server (ie. localhost). If your system
requires identification for the mysql server, you'll also need to add
the parameters --piana-dbuser=your_user_name and --piana-dbpass=your_pass
to each command. If you want to add information to the database we
provide along with the code (pianaDB_limited), just change pianaDB_paper
to pianaDB_limited and 'localhost' to the name of your mysql server
machine.

Depending on the computational power available (and the speed of the
mysql server) this process can take from a few days to a full week. Find
other things to do while the parsers do their job :-)

If you want to insert your own data into a PIANA database you've got
several options:

1. format your data as indicated in (section "PARSE your own data:
   protein interactions") and use the parser described in that section

2. create a new parser for your data. Creating your own parser is quite
   easy: you do not need to know SQL to do it, just a little python.
   PIANA provides its developers with an easy-to-use library to access
   and insert information into piana databases (class PianaDBaccess.py).
   There is a parser template in piana/code/dbParsers/templateParser
   that will be useful to follow, as it provides the basic schema of a
   parser.

3.
do not insert the interactions into the piana database: instead, you can
create your networks directly from a text file. The PIANA command
add-interactions-file lets you add interactions to a network from a text
file formatted as described in README.piana_interaction_data_format

However, in all cases you'll need at least a PIANA database with
information for proteins: otherwise, PIANA would not know which proteins
are involved in the interactions.

* ============================================================ *
*   4.1. PARSING DATABASES OF PROTEIN CODES AND INFORMATION    *
* ============================================================ *

Attention: all piana parsers have a flag '--help': when this flag is
written in the command line, the parser outputs information about its
usage and exits
           (eg. python taxonomy2piana.py --help)

Attention: all piana parsers have a flag '--verbose': when this flag is
written in the command line together with the other arguments, the
parser will output information about the process to your screen.
Therefore, if you wish to see what's going on, you should set this flag
           (eg. python taxonomy2piana.py --taxonomy-file=the_file --piana-dbname=pianaDB_paper --piana-dbhost=localhost --verbose)

Attention: all piana parsers have a flag '--log-file': when this flag is
written in the command line, the parser saves a file with the
information inserted by the parser (eg. number of inserted proteins,
number of inserted uniprot codes...)
           (eg. python taxonomy2piana.py --log-file="./taxonomy2piana.log")

Attention: when parsing protein data, you must respect the order
described in this file. However, when parsing protein interactions you
can follow any order you wish.

Attention: if you have limited disk space, you can delete the downloaded
files after parsing them. Once a file has been parsed, PIANA will not
use it anymore.

........................................................
4.1.1 PARSE TAXONOMY --> protein information: species information
........................................................

- download taxonomy data (file taxdump.tar.gz in
  ftp://ftp.ncbi.nih.gov/pub/taxonomy/ )

- untar the file (using tar -xzvf)
  (save it to directory piana/data/externalDBs/taxonomyDB/)

- parse taxonomy data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/taxonomyParser execute the
      following command:

      $> python taxonomy2piana.py --taxonomy-file=../../../data/externalDBs/taxonomyDB/names.dmp --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-name="ncbi_taxonomy" --database-version="Apr 03 2007" --log-file="./taxonomy2piana.log"

...............................................................
4.1.2 PARSE UNIPROT SWISSPROT AND TREMBL --> protein information:
sequences, codes and info from uniprot
...............................................................

- download swissprot data (file uniprot_sprot.dat.gz in
  ftp://ftp.expasy.org/databases/uniprot/knowledgebase/ )

- download trembl data (file uniprot_trembl.dat.gz in
  ftp://ftp.expasy.org/databases/uniprot/knowledgebase/ )

- uncompress the files (using gunzip -d)
  (save them to directory piana/data/externalDBs/uniprotDB/)

- you need to follow the instructions in
  piana/data/externalDBs/uniprotDB/README.rg_deleted before parsing the
  data. Biopython has a problem with certain fields in the data, and
  you need to get rid of those problems.
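If you prefer not to depend on sed and grep, the same cleanup can be
done with a few lines of Python. This is an illustrative sketch of the
transformation described in this section (rewrite RG and RX line codes
to RA, drop OH lines); the function names are hypothetical:

```python
def clean_uniprot_line(line):
    # Mirrors the sed/grep cleanup shown in this section:
    # drop OH lines, and rewrite RG/RX line codes to RA,
    # so that Biopython can parse the flat file.
    if line.startswith("OH"):
        return None                      # discard the line
    if line.startswith(("RG", "RX")):
        return "RA" + line[2:]           # replace the 2-char line code
    return line

def clean_uniprot_file(src, dst):
    # src/dst are paths to the original and cleaned .dat files,
    # e.g. uniprot_sprot.dat -> uniprot_sprot_rg_rx_oh_deleted.dat
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            cleaned = clean_uniprot_line(line)
            if cleaned is not None:
                fout.write(cleaned)

# example:
# clean_uniprot_line("OH   NCBI_TaxID=9606;\n")  -> None (dropped)
# clean_uniprot_line("RG   The consortium;\n")   -> "RA   The consortium;\n"
```

Run it once per file (sprot and trembl), exactly as with the sed
one-liners below.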
Do it for both files (sprot and trembl). Briefly, you just have to do:

   $> sed 's/^\(RG\|RX\)/RA/g' uniprot_sprot.dat | grep -v -P "^OH" > uniprot_sprot_rg_rx_oh_deleted.dat
   $> sed 's/^\(RG\|RX\)/RA/g' uniprot_trembl.dat | grep -v -P "^OH" > uniprot_trembl_rg_rx_oh_deleted.dat

- parse uniprot data and insert information into pianaDB_paper
  (do it for both files: first sprot and then trembl)
  --> on directory piana/code/dbParsers/uniprotParser execute the
      following commands:

      $> python uniprot2piana.py --input-file=uniprot_sprot_rg_rx_oh_deleted.dat --piana-dbname=pianaDB_paper --piana-dbhost=localhost --mode="scratch" --database-name="swissprot" --database-version="Apr 03 2007" --database-description="Uniprot manually curated database" --log-file="./uniprot_log_file.log" --database-information="protein sequences,protein attributes, identifiers cross-references" --time-control

      $> python uniprot2piana.py --input-file=../../../data/externalDBs/uniprotDB/uniprot_trembl_rg_rx_oh_deleted.dat --piana-dbname=pianaDB_paper --piana-dbhost=localhost --mode="scratch" --database-name="trembl" --database-version="Apr 03 2007" --database-description="Uniprot complete not manually curated database" --log-file="./trembl_log_file.log" --time-control --database-information="protein sequences,protein attributes, identifiers cross-references"

...............................................................
4.1.3 PARSE NCBI GenBank --> protein information: sequences and codes
from NCBI
...............................................................

- First of all, you must download a file that contains information on
  species for GenBank codes gi.
This file will be used by the parsers genpept2piana.py, nr2piana.py,
pdbaa2piana.py and swissprot2piana.py:

  - download gi_taxid_prot.dmp.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

  - uncompress the file (using gunzip -d)
    (save it to directory piana/data/externalDBs/taxonomyDB/gi_vs_tax/)

- download genpept data (file rel_xxx.fsa_aa.gz in
  ftp://ftp.ncbi.nih.gov/genbank/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/genpeptDB/)

- parse genpept data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/genpeptParser execute the
      following command:

      $> python2.3 genpept2piana.py --input-file-name=../../../data/externalDBs/genpeptDB/relXXX.fsa_aa --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file=../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp --database-name="genpept" --database-version="release 158 - 16/2/2007" --database-description="ncbi genbank database" --piana-dbuser="piana" --log-file="./genpept2piana.log" --database-information="protein sequences,protein attributes, identifiers cross-references"

......................................................................
4.1.4 PARSE NCBI NON REDUNDANT (NR) dataset --> protein information:
sequences and codes from NCBI (and some uniprot correspondences)
......................................................................
- download ncbi nr data (file nr.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/ncbi_nrDB/)

- parse nr data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/nrParser execute the following
      command:

      $> python2.3 nr2piana.py --input-file-name="../../../data/externalDBs/ncbi_nrDB/nr" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="nr" --database-version="Apr 09 2007" --database-description="non-redundant ncbi database" --database-information="protein sequences,protein attributes,identifiers cross-references" --time-control

.......................................................................
4.1.5. PARSE NCBI2PDB correspondences --> protein information:
correspondences between pdb codes and gi codes
.......................................................................

- download ncbi2pdb data (file pdbaa.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse ncbi2pdb data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 pdbaa2piana.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/pdbaa" --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="ncbi2pdb_pdbaa" --database-version="Apr 09 2007" --database-description="correspondence between pdb and gi identifiers" --database-information="identifiers cross-references" --piana-dbuser="piana" --log-file="./pdbaa2piana.log"

......................................................................
4.1.6.
PARSE NCBI2UNIPROT correspondences --> protein information:
correspondences between uniprot codes and gi codes
......................................................................

- download ncbi2uniprot data (file swissprot.gz in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse ncbi2uniprot data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 swissprot2piana.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/swissprot" --piana-dbname=pianaDB_paper --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="ncbi2uniprot swissprot" --database-version="Apr 09 2007" --database-description="correspondences between uniprot and gi identifiers" --database-information="identifiers cross-references" --time-control --log-file="./ncbiswissprot2piana.log"

.....................................................................
4.1.7. PARSE PDB2UNIPROT correspondences --> protein information:
correspondence between pdb codes and uniprot codes
.....................................................................
- download mapping.txt (plain text file) from http://www.bioinf.org.uk/pdbsprotec/
  (save it to directory piana/data/externalDBs/pdbsprotecDB/)

- parse pdb2uniprot data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/pdbsprotecParser execute the
      following command:

      $> python2.3 pdbsprotec2piana.py --input-file-name="../../../data/externalDBs/pdbsprotecDB/mapping.txt" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="pdbsprotec" --database-version="15-Jan-2007" --database-description="correspondences between pdb and uniprot identifiers" --database-information="identifiers cross-references" --time-control

......................................................................
4.1.8. PARSE geneID correspondences --> protein information:
correspondences between gi identifiers and geneID identifiers
......................................................................

- download gene2accession data (file gene2accession.gz in
  ftp://ftp.ncbi.nih.gov/gene/ )

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/misc_ncbiDB/)

- parse gene2accession data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 gene2accession_parser.py --input-file-name="../../../data/externalDBs/misc_ncbiDB/gene2accession" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="gene" --database-version="" --database-description="correspondences between ncbi accession number and geneID identifiers" --database-information="identifiers cross-references" --time-control --log-file="./gene2accession2piana.log"

.....................................................................
4.1.9.
PARSE geneID - geneName correspondences --> protein information:
correspondence between geneID identifiers and geneName identifiers
.....................................................................

- download gene_info data (file gene_info in ftp://ftp.ncbi.nih.gov/gene/ )

- uncompress the file (using gunzip -d)

- parse gene_info data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/misc_ncbiParser execute the
      following command:

      $> python2.3 gene_info2piana.py --input-file-name="../../../data/externalDBs/gene_info" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="gene_info" --database-version="Apr 19 2007" --database-description="Gene NCBI database - gene_info" --time-control --log-file="./gene_info2piana.log" --database-information="identifiers cross-references"

.....................................................................
4.1.10. PARSE RefSeq correspondences --> protein information:
correspondence between RefSeq identifiers and gi identifiers
.....................................................................

- download RefSeq catalog data (file RefSeq-release??.catalog.gz in
  ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/ )

- uncompress the file (using gunzip -d)

- parse RefSeq catalog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/refseqParser execute the
      following command:

      $> python2.3 refseq2piana.py --input-file-name="../../../data/externalDBs/refseq/RefSeq-release22.catalog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --tax-id-file="../../../data/externalDBs/taxonomyDB/gi_vs_tax/gi_taxid_prot.dmp" --database-name="refseq" --database-version="release22 April 2007" --database-description="NCBI RefSeq Database" --piana-dbuser="piana" --time-control --log-file="./refseq2piana.log" --database-information="identifiers cross-references"

..................................................................
4.1.11.
PARSE COG AND KOG --> protein information: clusters of orthologous genes
(KOG for 7 eukaryotic genomes and COG for 66 complete genomes)
.................................................................

- download cog data (files whog and myva=gb in ftp://ftp.ncbi.nih.gov/pub/COG/COG/)
- download kog data (files kog and kyva=gb in ftp://ftp.ncbi.nih.gov/pub/COG/KOG/)
  (save them to directory piana/data/externalDBs/cogDB/)

- parse cog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/cogParser execute the following
      commands:

      $> python2.3 xyva_gb2piana.py --input-file-name="../../../data/externalDBs/cogDB/myva=gb" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="COG-myva=gb" --database-version="2003" --database-description="Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="cog_myva2piana.log"

      $> python2.3 xog2piana.py --input-file-name="../../../data/externalDBs/cogDB/whog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="COG-whog" --database-version="2003" --database-description="Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="cog_xog2piana.log"

- parse kog data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/cogParser execute the following
      commands:

      $> python2.3 xyva_gb2piana.py --input-file-name="../../../data/externalDBs/cogDB/kyva=gb" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="KOG-kyva=gb" --database-version="2003" --database-description="Eukaryotic Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes"

      $> python2.3 xog2piana.py --input-file-name="../../../data/externalDBs/cogDB/kog" --piana-dbname="pianaDB_paper" --piana-dbhost=localhost --database-name="KOG-kog" --database-version="2003"
--database-description="Eukaryotic Cluster of orthologous genes" --time-control --database-information="identifiers cross-references,protein attributes" --log-file="kog_xog2piana.log"

.............................................................
4.1.12. PARSE SCOP --> protein information: structural domains
information
.............................................................

- download SCOP data (file dir.cla.scop.txt_X.XX in
  http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html )
  (save it to directory piana/data/externalDBs/scopDB/)

- parse scop data and insert information into pianaDB_paper
  --> on directory piana/code/dbParsers/scopParser execute the following
      command:

      $> python2.3 scop2piana.py --input-file-name="../../../data/externalDBs/scopDB/dir.cla.scop.txt_1.71" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="SCOP" --database-version="Release 1.71" --database-description="SCOP - Structural Classification of Proteins" --database-information="protein attributes" --time-control --log-file="scop2piana.log"

..............................................................
4.1.13. PARSE GO --> protein information: Gene Ontology terms for
proteins
...............................................................

- download GO data (file go_dateYYYYMM-assocdb-tables.tar.gz in
  http://archive.godatabase.org/latest/)

- uncompress the file (using tar -xzvf)
  (save it to directory piana/data/externalDBs/goDB/)

-> Instead of inserting the information directly into the piana
   database, we first create a local database for this data and then
   transfer the data from that local database to the piana database.
It simplifies the parsing, and it is always nice to have a separate
database for a particular external database.

- create a local go database goDB on your mysql server machine

  - create the database (on some systems, this can only be done with
    root rights)

    [localhost]$> mysql
    mysql> create database name_of_your_go_db;

  - the tables of the database will be created by parseGO.py (it is
    transparent to you, don't worry about it)
    -> you'll need rights for creating tables... if not, ask the root
       to run parseGO.py for you

- insert go data into your local mysql go database

  $> cd piana/code/dbParsers/goParser

  -> Attention: if it is the first time you are creating the go
     database, this script will generate errors when trying to 'drop'
     the former tables. You can ignore these errors...
     Example of error to ignore: ERROR 1051 at line 1: Unknown table 'assoc_rel'

  $goParser> python2.3 parseGO.py --input-go-dir="./../../../data/externalDBs/GO/go_200704-assocdb-tables" --go-dbname="name_of_your_go_db" --go-dbhost="localhost"

- insert go information to pianaDB_paper

  $goParser> python2.3 go2piana.py --go-dbname="name_of_your_go_db" --go-dbhost="localhost" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="go" --database-version="April 2007" --database-description="GO. Gene Ontology" --insert-protein-go --insert-go-info --time-control --database-information="protein attributes"

If you are going to do clustering based on GO, you can add the flags
'--insert-go-distance' and '--threshold=4'. --threshold=4 means that the
distance will be set to INFINITE when the terms are separated by more
than 4. This speeds up the process of calculating distances and does not
affect the clustering, because those terms are too far away. However, be
ready for a very long parsing...
it takes several days to calculate distances for all GO terms. I suggest
instead that you parse GO without the flag '--insert-go-distance' and
then calculate the distances only for GOs associated to proteins in your
network of interest, using the flags '--input-file' and
'--input-proteins-type'.

For example:

   $goParser> (TO BE COMPLETED)

To learn more about calculating distances only for a limited number of
GO terms, read
piana/code/dbParsers/goParser/README.limiting_parsing_to_specific_gos

* =========================================== *
*    4.2 PARSING DATABASES OF INTERACTIONS    *
* =========================================== *

Populate PIANA with data from third-party interaction databases by using
the parsers provided. Remember as well that if you have interactions in
a text file, you can also use it directly instead of inserting it into
the database. Read command 'add-interactions' in
piana/code/execs/conf_files/general_template.piana_conf

......................................................
4.2.1 PARSE DIP --> protein interactions
......................................................

- download DIP tabulated data (file dip????????.txt in section
  "Files - Full" of http://dip.doe-mbi.ucla.edu ) (registration required)
  (save it to directory piana/data/externalDBs/dipDB/)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/dipDB/)

- transfer information from file to pianaDB_paper
  --> on directory piana/code/dbParsers/dipParser execute the following
      command:

      $> python2.3 diptxt2piana.py --input-file-name=../../../data/externalDBs/dipDB/dip????????.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="dip" --database-version="19.2.2007" --database-description="DIP (Database of Interacting Proteins)" --time-control --log-file="diptxt2piana.log" --database-information="protein-protein interactions"

.....................................................................
4.2.2.
PARSE MIPS --> protein interactions
.....................................................................

- download MIPS data (file mppi.gz from http://mips.gsf.de/proj/ppi/)

- uncompress the file (using gunzip -d)
  (save it to directory piana/data/externalDBs/mipsDB/)

-> Instead of inserting the information directly into the piana
   database, we first create a local database for this data and then
   transfer the data from that local database to the piana database.
   It simplifies the parsing, and it is always nice to have a separate
   database for a particular external database.

- create a local mips database mipsDB on your mysql server machine,
  using script piana/code/dbCreation/create_psi_tables.sql

  -> create the mips database where mips data will be inserted

     [localhost]$> mysql
     mysql> create database mipsDB;

  -> create tables
     --> on directory piana/code/dbCreation execute the following command:

     $dbCreation> mysql --database=mipsDB --host=localhost < create_psi_tables.sql

- parse MIPS data and insert information into your local mips database
  mipsDB
  --> on directory piana/code/dbParsers/psiParser execute the following
      command:

      $psiParser> python parsePSI.py --psi-dbname=mipsDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/mipsDB/mppi --mode=psiDB --psi-db=mips

- transfer information from mipsDB to pianaDB_paper
  --> on directory piana/code/dbParsers/psiParser execute the following
      command:

      $> python2.3 psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=mipsDB --psi-dbhost=localhost --psi-db=mips --database-name="mips" --database-version="March 2007" --database-description="The MIPS Mammalian Protein-Protein Interaction Database" --insert-psidb-ids --time-control --log-file="mips2piana.log" --database-information="identifiers cross-references,protein-protein interactions"

...................................................................
4.2.3. PARSE HPRD
...................................................................
- download HPRD tabulated data (file HPRD_Release_6_01012007.txt from
  http://www.hprd.org/download ) (registration required)
  (save it to directory piana/data/externalDBs/hprdDB/)

- uncompress the files (using gunzip -d)
  (save them to directory piana/data/externalDBs/hprdDB/)

- transfer information from file to pianaDB_paper
  --> on directory piana/code/dbParsers/hprdParser execute the following
      command:

      $> python2.3 hprd2piana.py --input-file-name="../../../data/externalDBs/hprdDB/HPRD_Release_6_01012007.txt" --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="hprd" --database-version="Release_6_01012007" --database-description="Human Protein Reference Database" --log-file="hprd2piana.log" --time-control --database-information="identifiers cross-references,protein-protein interactions"

....................................................................
4.2.4. PARSE BIND --> protein interactions
....................................................................

Some interactions provided by BIND are 'ambiguous': only gene names or
protein descriptions are given (as opposed to cases where 'good' codes
such as uniprot are provided). PIANA checks whether there are several
sequences associated with those geneNames and, in case it finds the
geneName to be ambiguous, it inserts the interaction into the PIANA
database labeled as 'bind_c'. Then, the user can choose to ignore all
_c interactions using the configuration file parameter
ignore-unreliable

...................................................................
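The 'ambiguous geneName' rule described above can be sketched in a few
lines of Python. This is illustrative only: the function and the
name_to_sequences mapping are hypothetical, not the actual PIANA code.

```python
def label_interaction(source_db, gene_name, name_to_sequences):
    # Sketch of the ambiguity rule described above (hypothetical names).
    # name_to_sequences maps a gene name to the sequences known for it.
    # A name matching more than one sequence is ambiguous, so the
    # interaction is labeled '<db>_c' and can later be skipped via the
    # ignore-unreliable configuration parameter.
    sequences = name_to_sequences.get(gene_name, [])
    if len(sequences) > 1:
        return source_db + "_c"   # ambiguous: several sequences match
    return source_db              # unambiguous: keep the plain label

# example with made-up data:
name_to_sequences = {"rad51": ["seq1"], "p53": ["seqA", "seqB"]}
label_interaction("bind", "rad51", name_to_sequences)  # -> "bind"
label_interaction("bind", "p53", name_to_sequences)    # -> "bind_c"
```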
 - download BIND data (files *.psi.xml.gz from ftp://ftp.blueprint.org/pub/BIND/data/divisions/psi)

   --> you can download all the datasets or just those that are of interest to you

 - uncompress the files (using gunzip -d)
   (save them to directory piana/data/externalDBs/bindDB/)

   -> Instead of inserting the information directly into the piana database,
      we first create a local database for this data and then transfer it
      from the local database to the piana database. This simplifies the
      parsing, and it is always nice to have a separate database for each
      external database.

 - create a local bind database bindDB on your mysql server machine, using
   script piana/code/dbCreation/create_psi_tables.sql

   -> create the bind database where bind data will be inserted:

        [localhost]$> mysql
        mysql> create database bindDB;

   -> create tables

      --> on directory piana/code/dbCreation execute the following command:

          $dbCreation> mysql --database=bindDB --host=localhost < create_psi_tables.sql

 - parse BIND data and insert the information into your local bind database bindDB

   --> on directory piana/code/dbParsers/psiParser execute the following commands
       (one for each bind file you have downloaded):

       $psiParser> python parsePSI.py --psi-dbname=bindDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/bindDB/bindfungi.1.psi.xml --mode=psiDB --psi-db=bind
       $psiParser> python parsePSI.py --psi-dbname=bindDB --psi-dbhost=localhost --input-file=../../../data/externalDBs/bindDB/file_whatever.psi.xml --mode=psiDB --psi-db=bind
       $psiParser> python parsePSI.py .........................................

 - transfer information from bindDB to pianaDB_paper

   --> on directory piana/code/dbParsers/psiParser execute the following command:
       $> python2.3 psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=bindDB --psi-dbhost=localhost --psi-db=bind --database-name="bind" --database-version="April 2007" --database-description="Biomolecular Interaction Network Database" --insert-psidb-ids --time-control --log-file="bind2piana.log" --database-information="identifiers cross-references,protein-protein interactions"

......................................................
4.2.5. PARSE IntAct   --> protein interactions
......................................................

 - download IntAct tabulated data (file intact.zip from ftp://ftp.ebi.ac.uk/pub/databases/intact/current)
   (save it to directory piana/data/externalDBs/intactDB/)

 - uncompress the files (using unzip)
   (save them to directory piana/data/externalDBs/intactDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/intactParser execute the following command:

       $> python2.3 intact2piana.py --input-file-name=../../../data/externalDBs/intactDB/intact.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-name="intact" --database-version="23 april 2007" --database-description="IntAct protein-protein interaction database" --database-information="protein-protein interactions" --time-control --log-file="./intact2piana.log"

......................................................
4.2.6. PARSE BioGRID   --> protein interactions
......................................................
 - download BioGRID data (file BIOGRID-ALL-?.?.??.tab.zip from http://www.thebiogrid.org/downloads.php)
   (save it to directory piana/data/externalDBs/biogridDB/)

 - uncompress the files (using unzip)
   (save them to directory piana/data/externalDBs/biogridDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/biogridParser execute the following command:

       $> python2.3 biogrid2piana.py --input-file=../../../data/externalDBs/biogridDB/BIOGRID-ALL-SINGLEFILE-2.0.26.tab.txt --piana-dbname="pianaDB_paper" --piana-dbhost="localhost" --database-version="2.0.26" --database-description="The BioGRID's curated set of physical and genetic interactions" --database-information="protein-protein interactions" --time-control --log-file="./biogrid2piana.log" --database-name="biogrid"

......................................................
4.2.7. PARSE MINT   --> protein interactions
......................................................

 - download MINT data (file year-day-month-full.mitab25.txt) from
   ftp://mint.bio.uniroma2.it/pub/release/MITAB/ from the most recent
   directory (e.g. 2007-04-05)
   (save it to directory piana/data/externalDBs/mintDB/)

 - uncompress the files (using gunzip -d)
   (save them to directory piana/data/externalDBs/mintDB/)

 - transfer information from file to pianaDB_paper

   --> on directory piana/code/dbParsers/mintParser execute the following command:

       $> python2.3 mint2piana.py --input-file-name=./../../../data/externalDBs/mintDB/2007-04-05-full.mitab25.txt --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-name=mint --database-version=2007-04-05 --database-description="the Molecular INTeraction database" --database-information="protein-protein interactions"

......................................................................
PARSE any interaction database that complies with the HUPO PSI standard 1.0
(http://psidev.sourceforge.net/mi/xml/doc/user/)
......................................................................
UNFINISHED

 - for any psi data file, follow the same steps as for the MIPS, HPRD and BIND data:

   1. create a local database xxxDB for these data, containing the tables
      listed in piana/code/dbCreation/create_psi_tables.sql

   2. use parsePSI.py to populate the local database xxxDB

      $> python parsePSI.py --psi-dbname=xxxDB --psi-dbhost=localhost --input-file=interactions.hupo_psi_standard --mode=psiDB --psi-db=flag

      (Attention! flag can be mips or hprd. You have to choose the one that
      most closely resembles the data you are parsing... it would be great
      to have a standard that is respected by everyone, but since that is
      not the case, the parser has to adapt to the specificities of the
      different databases... if you don't know which flag to use, set it to
      hprd (if it doesn't work, try mips ;-)
      (if it still doesn't work, you'll have to take a look at the data and
      modify the content handler PSIContentHandler.py.... Good luck!!! )))

   3. transfer the information from xxxDB to your pianaDB

      $> python psi2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --psi-dbname=xxxDB --psi-dbhost=localhost --psi-db=datalabel

      Attention! datalabel is the source database label that will be
      associated to the parsed data... and it must appear in the
      dictionaries (or lists) source_databases, interaction_databases and
      interaction_source_databases_colors in file
      piana/code/PianaDB/PianaGlobals.py

      Attention! before transferring the information to your piana database,
      make sure that all detection methods in your psi data files appear in
      PianaGlobals.method_names. To check this, do a
      "select distinct method from interactionMethod;" once you have filled
      your local psi database. Then, check in PianaGlobals.method_names that
      each of those methods appears in one dictionary value.
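      That check can be sketched in a few lines of python. This is an
      illustrative sketch only: the function name `missing_methods` is made
      up, and the dictionary shape assumed here (key -> list of method
      strings) is an assumption; consult the real structure of method_names
      in piana/code/PianaDB/PianaGlobals.py before relying on it.

      ```python
      # Sketch: verify that every detection method found in the local psi
      # database appears in some value of a method_names-style dictionary.

      def missing_methods(db_methods, method_names):
          """Return the detection methods that appear in no dictionary value."""
          known = set()
          for values in method_names.values():
              known.update(values)
          return [m for m in db_methods if m not in known]

      # db_methods would come from:
      #   select distinct method from interactionMethod;
      db_methods = ["two hybrid", "coimmunoprecipitation", "new exotic method"]
      method_names = {
          "y2h": ["two hybrid", "2h"],           # hypothetical entries
          "coip": ["coimmunoprecipitation"],
      }
      print(missing_methods(db_methods, method_names))  # -> ['new exotic method']
      ```

      Any method reported as missing must be added to method_names before
      running psi2piana.py.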
      Do not change the keys of the method_names dictionary: if you think
      none of the listed keys corresponds to the method in your psi data,
      you must create a new dictionary entry with the key you wish to give
      to that method and the value used in your psi data.

      Attention! if this doesn't work for your PSI file (it happens...
      standards are standard, but their use is not always standard) you
      have another possibility: use the XMLFlattener provided at the HUPO
      web site (http://psidev.sourceforge.net/mi/xml/doc/user/ section
      "tools") to convert your PSI file to a PIANA psi_flat format. Then,
      use the parser provided with the code:
      piana/code/dbParsers/psi_flatParser/psi_flat2piana.py
      Read piana/code/dbParsers/psi_flatParser/README.psi_flat_format for
      more info on how to do this.

      Attention! If it still doesn't work, you always have the possibility
      of writing your own parser (in any programming language) to transform
      the PSI data file into a piana text format (see below, section
      "PARSE your own interaction data") and then using the parser
      piana_text_int2pianaDB.py (see below)

....................................................................
4.2.8. PARSE STRING   --> predictions of protein interactions and COG
       (Cluster of Orthologous Genes) information for proteins
...................................................................

UNFINISHED

....................................................................
4.2.9. PARSE predictions of interactions based on distant
       sequence/structure patterns   --> protein interactions
....................................................................

UNFINISHED

 - (this data is provided both in pianaDB_limited (http://sbi.imim.es/piana#db)
   and as a flat text file in piana/data/externalDBs/oriDB/interact.dat)

 - we have called 'ori' the set of interactions that were found as explained
   in this article: (Espadaler J., O.
   Romero-Isart, et al. (2005) "Prediction of protein-protein interactions
   using distant conservation of sequence patterns and structure
   relationships" Bioinformatics 21(16):3360-8)

 - parse 'ori' data and insert the information into your local piana
   database pianaDB_paper

   --> on directory piana/code/dbParsers/oriParser execute the following command:

       $> python ori2piana.py --piana-dbname=pianaDB_paper --piana-dbhost=localhost --database-version="2007.1.16" --database-name=ori --database-description="predicted interactions from distant structure/sequence patterns" --database-information="protein-protein interactions" --log-file="ori2piana.log" --time-control

..................................................................
4.2.10. PARSE your own data about protein interactions
..................................................................

UNFINISHED

============================================================
5. ELIMINATE AN EXTERNAL DATABASE PREVIOUSLY INSERTED INTO A PIANA DATABASE
============================================================

If you want to delete from your PIANA database the information coming from
one or more specific external databases without deleting the whole database,
you must execute the following command in piana/code/dbModification:

$> python2.3 delete_info_multiple_external_db_from_piana_db.py --piana-dbname=pianaDB_limited --piana-dbhost=sefarad

To specify which databases you want to drop, open the script with a text
editor and add the database names to the "list_databases" variable, as
follows:

list_databases=["intact","biogrid","hprd","bind","mint","mips"]

If you don't remember the name assigned to the database you want to delete,
you can obtain it by executing the following command in the directory
piana/code/execs:

$> python2.3 piana.py --print-reference-card

When external databases are deleted, all information in the database is
checked, so it can take at least one hour to finish.
All information associated with the deleted database will be removed unless
another database contains the same information.

============================================================
6. UPDATING A PIANA DATABASE
============================================================

If you want to maintain and add new information to an existing PIANA
database, you can just add the new version of an external database as
explained in this document. There is only one restriction: two external
databases cannot have the same external database identifier (the names that
are given to external databases when parsing data).

If you want to check which names are currently in use, execute the following
command in piana/code/execs:

$> python2.3 piana.py --print-reference-card

and search the output for the currently used databases. You can insert
different versions of the same database, but you must assign each version a
new database identifier (database-name parameter).
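One way to avoid clashes is to build the new database-name from the base
name plus the version. The helper below is a hypothetical sketch (not part
of PIANA): `unique_database_name` and its inputs are made up, and the set of
used names would come from reading the output of
"piana.py --print-reference-card".

```python
# Sketch (hypothetical helper): pick a database-name that does not clash
# with the names already present in the PIANA database.

def unique_database_name(base, used_names, version):
    """Append the version (and a counter if still taken) until the name is free."""
    candidate = "%s_%s" % (base, version)
    suffix = 1
    while candidate in used_names:
        suffix += 1
        candidate = "%s_%s_%d" % (base, version, suffix)
    return candidate

# names already in use, e.g. taken from --print-reference-card output
used = {"mips", "hprd", "bind", "intact_2007"}
print(unique_database_name("intact", used, "2007"))    # -> intact_2007_2
print(unique_database_name("biogrid", used, "2.0.26")) # -> biogrid_2.0.26
```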