PhyST: Download, installation instructions, and tutorial


Dependencies
Before using PhyST, install the following dependencies and modules on your computer. They allow PhyST to run on your operating system.
  • Mechanize
  • Biopython
  • Beautiful Soup
  • time
  • cookielib
  • math
  • argparse
  • multiprocessing
  • ctypes
  • Please visit the Python Standard Library documentation for more details on the function of each module.
  • For detailed installation procedures for these modules, please check the BioV Suite documentation.
  • PhyST was designed and tested on Mac OS X (El Capitan 10.11.4), but it will likely run on any other Unix system.
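
Before running PhyST, it can help to confirm that each module in the list imports cleanly. The sketch below is only an illustration and is not part of PhyST. The import names Bio (Biopython) and bs4 (Beautiful Soup 4) are assumptions about the installed package versions, and cookielib exists only in Python 2 (it was renamed http.cookiejar in Python 3), so it will be reported as missing on a Python 3 interpreter.

```python
import importlib

# Import names are assumptions: Biopython installs as "Bio" and
# Beautiful Soup 4 as "bs4"; cookielib is a Python 2 module.
REQUIRED = ["mechanize", "Bio", "bs4", "time", "cookielib",
            "math", "argparse", "multiprocessing", "ctypes"]


def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    print("Missing modules: " + (", ".join(absent) if absent else "none"))
```

Any name reported as missing would need to be installed, or mapped to its equivalent for the Python version in use, before PhyST can be expected to start.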

The following links must be working in order for PhyST to run:
The following programs must be installed to perform inferences of TMS and generate multiple alignments:


Download the program:
PhyST is compressed with zip. Right-click on the following link and select 'Save Link As': physt.zip

Expand the file using the following command in your terminal:
unzip physt.zip

Make physt.py executable using the command:
chmod +x physt.py

Make sure PhyST is accessible in your environment or is located in a directory of your choosing.


Before starting:
Here is a brief description of the most important command-line arguments that PhyST accepts:
  • 1. A detailed explanation of all input arguments can be displayed with --help or -h.
  • 2. The order of the input parameters does not matter (e.g. writing 1.E.5 -h -u is equivalent to -h 1.E.5 -u). In addition, a single - can be used to combine multiple options (e.g. physt.py 1.E.5 -u -f can be rewritten as physt.py 1.E.5 -uf).
  • 3. The only required argument is the TCDB ID (e.g. 1.E.41). All other input parameters take default values.
  • 4. By default, the program runs two psi-blast iterations in an attempt to capture a broader range of genetic variability within the family under analysis. The number of iterations can be changed with the option -i.
  • 5. The option -f allows PhyST to run on any protein family defined by the user, even if it is not listed in TCDB.
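
The behavior described in points 2 to 5 (free ordering, combined short flags, one required input, defaults for everything else) is what Python's argparse module provides out of the box. The following is a minimal sketch of a parser with the same shape. It is not PhyST's actual code; the function name build_parser and the attribute names (families, run_clustal, from_files, iterations) are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options described above."""
    parser = argparse.ArgumentParser(prog="physt.py")
    # One or more TCDB family IDs (or FASTA paths when -f is given).
    parser.add_argument("families", nargs="+")
    parser.add_argument("-u", dest="run_clustal", action="store_true",
                        help="run Clustal Omega on the sequence files")
    parser.add_argument("-f", dest="from_files", action="store_true",
                        help="treat inputs as FASTA files, not TCDB IDs")
    parser.add_argument("-i", dest="iterations", type=int, default=2,
                        help="number of psi-blast iterations")
    return parser


parser = build_parser()
separate = parser.parse_args(["1.E.5", "-u", "-f"])
combined = parser.parse_args(["-uf", "1.E.5"])
# Both spellings should yield the same parsed options.
print(separate == combined)
```

With a parser built this way, -h and --help are generated automatically, and omitting -i leaves the iteration count at its default of two.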


Example 1: Using default settings.
  • Command-line Input: ./physt.py 1.E.5

  • Explanation: when only TCDB ID numbers are given, the program runs with default values for all other input parameters. That is, PhyST:
    • does not run Clustal Omega.
    • runs 2 psi-blast iterations with an E-value cutoff of 1e-4 (0.0001).
    • requires a minimum alignment coverage of 50% for the query protein.
    • generates an output file with the sequences of the top psi-blast hits in fasta format.
    • runs HMMTOP to infer the TMS topology of the top psi-blast hits.
  • The program estimates the phyla and TMS composition within the input family (1.E.5) and returns a flat file named after that family (i.e. 1.E.5.txt). The output file is placed in the same directory where physt.py was executed.


  • Output example 1 [figure: PhyST output for family 1.E.5]


Example 2: Running multiple families and generating a multiple alignment with Clustal Omega
If instructed, PhyST will create a multiple alignment with all the homologous proteins found for any given TCDB family. There are, however, important factors to take into consideration:
  • When using multiple families as input, separate the TCDB identification numbers with at least one space; do not use commas or any other character (e.g. 1.E.5 1.E.21 instead of 1.E.5,1.E.21).
  • PhyST generates a file for each input TCDB family containing the sequences, in fasta format, of all homologous proteins found. These sequences can be used for many types of analysis, one of which is generating multiple alignments. If the option -u is given, PhyST runs Clustal Omega for the user. If this option is not given, the user can still perform the alignment with any other program after PhyST finishes.
  • Command-line Input: ./physt.py 1.E.9 1.E.21 -u

  • Explanation: The option -u instructs PhyST to run Clustal Omega on the sequence files generated (with extension .faa) for each input family (i.e. 1.E.9 and 1.E.21). This produces a file with extension .aln containing the multiple alignment for each family.

  • When using the -u option, a file with extension .aln is created in the directory named clustalout, which also contains the sequence files created for both families.
    • If the user prefers an alternative to Clustal Omega, the sequence files can be used with other programs such as ClustalX to generate the alignment.
    • All the other parameters including query coverage, E-value cutoff and number of psi-blast iterations take default values as described in Example 1 above.
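
When PhyST is launched from a script, the space-separated rule for family IDs is easy to enforce by building the argument list programmatically. The sketch below is an assumption-laden illustration: the helper physt_command is hypothetical, and the regular expression assumes that a family-level TCDB ID always has the shape number, capital letter, number (as in 1.E.5), which may be stricter than what PhyST itself accepts.

```python
import re

# Assumed shape of a family-level TCDB ID such as 1.E.5:
# class number, subclass letter, family number.
FAMILY_ID = re.compile(r"^\d+\.[A-Z]\.\d+$")


def physt_command(families, align=False):
    """Build the argument list for a PhyST run on TCDB families."""
    for fam in families:
        if not FAMILY_ID.match(fam):
            raise ValueError("not a family-level TCDB ID: %r" % fam)
    cmd = ["./physt.py"] + list(families)
    if align:
        cmd.append("-u")  # ask PhyST to run Clustal Omega
    return cmd


# Prints: ./physt.py 1.E.9 1.E.21 -u
print(" ".join(physt_command(["1.E.9", "1.E.21"], align=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues. A comma-joined string such as "1.E.5,1.E.21" is rejected with a ValueError before PhyST is ever started.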

  • Output of Example 2
    [figure: PhyST output for family 1.E.9]
    [figure: PhyST output for family 1.E.21]


Example 3: Running PhyST with a custom number of psi-blast iterations and generating a multiple alignment with Clustal Omega.
PhyST allows the user to run any number of psi-blast iterations on any query family using the -i input parameter. By default PhyST runs two iterations but any number greater than zero is acceptable.
  • It is recommended to run more than one psi-blast iteration in order to capture as much genetic variability as possible within individual families.
  • Command-line Input: ./physt.py 1.E.7 -u -i 3

  • Explanation: The family 1.E.7 is run with three psi-blast iterations, as indicated by -i 3. The flag -u instructs PhyST to use the sequence file with extension .faa in the clustalout directory and generate a multiple alignment with Clustal Omega, which is saved in a file with extension .aln in the same directory.

  • Output of Example 3 [figure: PhyST output for family 1.E.7]


Example 4: Running multiple families with a custom E-value cutoff and minimum alignment coverage, and creating a multiple alignment using Clustal Omega.
PhyST allows the user to customize the psi-blast E-value cutoff and the minimum alignment coverage with the -e and -a input options.
  • The option -e should be followed by the E-value cutoff (e.g. -e 0.0001 or, equivalently, -e 1e-4).
  • To change the minimum alignment coverage, use the option -a followed by an integer value denoting a percentage (e.g. -a 70 for 70% minimum coverage).
  • Command-line Input: ./physt.py 1.E.8 1.E.34 -e 1e-6 -a 70 -u

  • Explanation:
    The argument -e 1e-6 instructs PhyST to use 0.000001 as the psi-blast E-value cutoff for both families.

    The argument -a 70 instructs PhyST to select psi-blast hits where at least 70% of the query protein is aligned.

    The flag -u instructs PhyST to run Clustal Omega on the output sequences and a multiple alignment file (with extension .aln) is created in the directory clustalout.
    • All other input parameters take default values as described in Example 1 above.
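
To make the combined effect of -e and -a concrete, here is a small sketch of how a hit could be tested against both cutoffs. It is an illustration under stated assumptions, not PhyST's implementation: the hit dictionaries and their keys are hypothetical, coverage is assumed to be the aligned portion of the query divided by the query length, and both comparisons are assumed to be inclusive.

```python
def passes_cutoffs(hit, query_length, evalue_cutoff=1e-4, min_coverage=50):
    """Return True if a hit meets both the E-value and coverage cutoffs.

    hit is a hypothetical dict with an 'evalue' and the 1-based,
    inclusive 'query_start' and 'query_end' of the aligned region.
    """
    aligned = hit["query_end"] - hit["query_start"] + 1
    coverage = 100.0 * aligned / query_length
    return hit["evalue"] <= evalue_cutoff and coverage >= min_coverage


# Made-up hits against a 200-residue query, for illustration only.
hits = [
    {"id": "hitA", "evalue": 1e-20, "query_start": 1, "query_end": 180},
    {"id": "hitB", "evalue": 1e-5, "query_start": 10, "query_end": 190},
    {"id": "hitC", "evalue": 1e-30, "query_start": 50, "query_end": 120},
]

# Equivalent of -e 1e-6 -a 70: only hitA satisfies both cutoffs.
kept = [h["id"] for h in hits if passes_cutoffs(h, 200, 1e-6, 70)]
print(kept)
```

In this toy data, hitB fails the stricter E-value cutoff and hitC fails the coverage requirement (71 of 200 residues, about 36%). Under the default values of Example 1, hitB would also be retained.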

  • Output of Example 4
    [figure: PhyST output for family 1.E.8]
    [figure: PhyST output for family 1.E.34]


Example 5: Running custom protein families that are not present in TCDB.
PhyST allows the user to run protein families that are not necessarily classified in TCDB. As long as all input proteins are evolutionarily related, they do not need to be transporters. This is because PhyST assumes that any protein in the family is representative of the entire family. That is, running psi-blast on any family member will tend to give similar results, and the differences can largely be captured by running more than one psi-blast iteration.
  • PhyST uses the -f flag to indicate that it will read files with sequences in fasta format instead of TCDB family identifiers.
  • The sequences of the top psi-blast matches and the multiple alignments will be placed in a directory named unknownfam, which is created in the same directory where physt.py is executed.
  • Command-line Input: ./physt.py -u -f unknownfam/seqs2.faa unknownfam/seqs1.faa

  • Explanation: In this example PhyST will interpret the files at the given paths, unknownfam/seqs2.faa and unknownfam/seqs1.faa, as input families. It will then run psi-blast on the members of each family using the default two iterations. The sequences of the top hits will be saved in the directory unknownfam, and finally a multiple alignment will be generated with Clustal Omega (i.e. a file with extension .aln).
    • When the -f input parameter is given, the user is required to provide PhyST with files containing the family sequences in fasta format. It is therefore important to specify the location of each FASTA file correctly by giving full or relative paths to the corresponding input files (e.g. /Users/joe/project/families/myfamily.faa, ../families/myfamily.faa, etc.).
    • Families are processed in order of appearance on the command line.
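
Because -f depends on well-formed FASTA input, it can be worth checking a file before handing it to PhyST. The reader below is a minimal, dependency-free sketch (the function name read_fasta is hypothetical, and Biopython's SeqIO could be used for the same purpose). It only checks the basic structure: every sequence line must be preceded by a header line starting with ">".

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is None:
                raise ValueError("sequence data before first '>' header")
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Example with a path taken from the command above (file must exist):
# records = read_fasta("unknownfam/seqs1.faa")
# print(len(records), "sequences found")
```

An empty result or a ValueError suggests that the path is wrong or that the file is not in fasta format, which is easier to diagnose here than after psi-blast has started.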


    Output for family seqs2 outside TCDB [figure: PhyST output for family seqs2]

    Output for family seqs1 outside TCDB [figure: PhyST output for family seqs1]