TFpredict has moved to https://github.com/draeger-lab/TFpredict.

Documentation

Introduction

TFpredict identifies transcription factors (TFs) and annotates their structure based on sequence homology features that are inferred from the amino acid sequence using BLAST. In short, TFpredict combines sequence similarity searching with supervised machine learning methods (e.g., SVM, KNN, Naive Bayes) from the WEKA package to identify TFs and to predict their structural superclass. Furthermore, the sequence regions spanned by DNA-binding domains are identified using the domain detection tool InterProScan in conjunction with a gene ontology-based filter. If TFpredict is used together with SABINE, the DNA motif recognized by the TF can be determined in a further prediction step. For this purpose, TFpredict generates a machine-readable text file that SABINE can post-process to infer the DNA motif.

How to get started

TFpredict is available as a stand-alone version and as an online version. We recommend using the online version, as it does not require any installation and provides a user-friendly web interface. If you prefer to install your own local copy of TFpredict, you can get the latest stand-alone version from our download section. The stand-alone version of TFpredict has a command-line interface that can be used for the batch processing of multiple protein sequences given in FASTA format. For convenience, TFpredict uses the web service version of InterProScan. Thus, installing the Perl stand-alone version of InterProScan (approx. 40 GB) is not required. To support applications that require the processing of a large number of sequences (e.g., the genome-wide prediction of TFs in a specific organism), TFpredict can alternatively be used with a local installation of InterProScan.

Installation

To extract the tool from the packed archive, which can be obtained from our download section, use the command:

tar -xzf tf_predict.tar.gz


TFpredict is implemented entirely in Java and provided as a runnable JAR file. All platforms (Windows, Mac, Linux) are supported, provided that Java (JDK 1.6 or later) and BLAST (NCBI BLAST 2.2.27+ or later) are installed. You can download the latest version of BLAST from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

Requirements

  • Java (JDK 1.6 or later)
  • BLAST (NCBI BLAST 2.2.27+ or later)

The analysis framework of TFpredict is written entirely in Java. Thus, it requires that a Java Virtual Machine (JDK version 1.6 or newer) is installed on your system.
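
The version requirements above could be checked with a small shell sketch like the following. The helper names are hypothetical and not part of TFpredict. The sketch assumes that "sort -V" (version sort, as in GNU coreutils) is available and that "java -version" and "blastp -version" print their version in the usual form shown in the comments.

```shell
# Sketch only: these helpers are hypothetical and not part of TFpredict.

# Succeeds if version $1 is greater than or equal to version $2.
# Assumes "sort -V" (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extracts the quoted version from the first line of "java -version" output,
# e.g. 'openjdk version "11.0.2" 2019-01-15' yields 11.0.2.
java_version_from() {
  printf '%s\n' "$1" | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Extracts the version from the first line of "blastp -version" output,
# e.g. 'blastp: 2.2.31+' yields 2.2.31.
blast_version_from() {
  printf '%s\n' "$1" | sed -n '1s/^[^:]*: *\([0-9][0-9.]*\).*/\1/p'
}
```

A possible use is version_ge "$(java_version_from "$(java -version 2>&1)")" 1.6, and likewise version_ge "$(blast_version_from "$(blastp -version)")" 2.2.27. Note that "java -version" writes to standard error, hence the redirection.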

Starting the program

In order to use the stand-alone version of TFpredict, execute the runnable jar file TFpredict.jar:

java -jar TFpredict.jar <input_fasta_file> [<OPTIONS>]

Please note that the path to your BLAST installation has to be provided, either by setting the environment variable BLAST_PATH=<pathToBlast> or by using the command-line argument -blastPath <pathToBlast>. Here, <pathToBlast> is the directory above the bin folder that contains the BLAST executables.
To display the usage of the script and an overview of the command line options, use the command:

java -jar TFpredict.jar --help
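
Since the BLAST path must point to the directory above the bin folder, a quick check before starting TFpredict may avoid a failed run. The following is a minimal sketch. The function name is hypothetical, and testing for an executable named blastp is an assumption, as this documentation does not state which BLAST programs TFpredict invokes (on Windows the file would be blastp.exe).

```shell
# Hypothetical helper, not part of TFpredict.
# Succeeds if directory $1 has a bin subfolder with an executable blastp.
# Checking for blastp specifically is an assumption made for this sketch.
blast_path_ok() {
  [ -n "$1" ] && [ -d "$1/bin" ] && [ -x "$1/bin/blastp" ]
}
```

For example, blast_path_ok "$BLAST_PATH" || echo "BLAST_PATH does not look like a BLAST installation" >&2 reports a misconfigured environment variable, including the common mistake of pointing it at the bin folder itself.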


Short tutorial
This tutorial describes how to analyze a protein sequence of interest with the stand-alone version of TFpredict. The documentation of the online version can be found below the input mask of the respective tools.

First, you have to generate an input file in FASTA format (see format specification or example input file).
For each protein under study, the input file should contain a header line with an identifier, followed by the amino acid sequence (see the section "TFpredict input file").


To run TFpredict on the example input file, use the command:

java -jar TFpredict.jar example.input


If you want to post-process the results generated by TFpredict with SABINE in order to predict DNA motifs for transcription factors identified among the input protein sequences, you have to pass two additional arguments to the program. First, the destination to which the output file is written has to be specified, and second, the correct species has to be provided. Please ensure that the given species is supported by SABINE (see List of supported species).

An example call of the program that enables post-processing of the results with SABINE is shown here:

java -jar TFpredict.jar example.input -sabineOutfile example.output -species "Homo sapiens"
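
Because the output file and the species have to be given together, the call above can be wrapped in a small function that refuses to run unless both are present. This is only a sketch. The function name and the TFPREDICT_JAR environment variable are hypothetical conveniences, not part of TFpredict.

```shell
# Hypothetical wrapper, not part of TFpredict.
# TFPREDICT_JAR is an assumed environment variable for the location of the
# JAR file. It defaults to TFpredict.jar in the current directory.
run_tfpredict_for_sabine() {
  if [ "$#" -ne 3 ]; then
    echo "usage: run_tfpredict_for_sabine <input_fasta> <sabine_outfile> <species>" >&2
    return 2
  fi
  # Quoting keeps a two-word species name such as "Homo sapiens" as one argument.
  java -jar "${TFPREDICT_JAR:-TFpredict.jar}" "$1" -sabineOutfile "$2" -species "$3"
}
```

With this helper, run_tfpredict_for_sabine example.input example.output "Homo sapiens" issues the same call as shown above.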


TFpredict returns an output file (see format specification or example output file) that contains the results of the performed prediction steps in the SABINE input file format.

If you have a local installation of InterProScan that TFpredict should use, you have to pass the path to the main InterProScan executable as an argument to TFpredict.

Assuming that InterProScan was installed to the directory "/opt/iprscan", you could use the following command:

java -jar TFpredict.jar example.input -iprscanPath /opt/iprscan/bin/iprscan
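
If the same script is meant to run on machines with and without a local InterProScan, the option can be added only when the executable is actually present. Otherwise the option is omitted and TFpredict uses the InterProScan web service. The helper below is a hypothetical sketch and not part of TFpredict.

```shell
# Hypothetical helper, not part of TFpredict.
# Prints "-iprscanPath <path>" if <path> is an executable file and prints
# nothing otherwise, so that the web service version of InterProScan is used.
iprscan_option() {
  if [ -f "$1" ] && [ -x "$1" ]; then
    printf '%s %s\n' "-iprscanPath" "$1"
  fi
}
```

A call could then look like java -jar TFpredict.jar example.input $(iprscan_option /opt/iprscan/bin/iprscan). The command substitution is deliberately left unquoted so that it expands to two arguments or to none, which means this sketch does not handle paths containing spaces.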



Command-line options
The optional command-line arguments accepted by TFpredict are listed below.

-sabineOutfile
Output file for post-processing of the results with SABINE.
 
-species <organism name>
Organism name (e.g., Homo sapiens). This argument has to be specified if an output file for SABINE is to be created (see the list of organisms supported by SABINE).
 
-tfClassifier <classifier name>
Classifier used for TF/non-TF classification (possible values: SVM_linear, NaiveBayes, KNN)
 
-superClassifier <classifier name>
Classifier used for superclass prediction (possible values: SVM_linear, NaiveBayes, KNN)
 
-iprscanPath
Path to iprscan executable. Only needed if you have a local installation of InterProScan which shall be used by TFpredict.
 
-blastPath
Path to the BLAST installation directory, i.e., the directory above the bin folder that contains the BLAST executables (e.g., /opt/blast/latest). Only needed if the environment variable BLAST_PATH is not set.
 
-ignoreCharacteristicDomains
No classification based on predefined InterPro domains.
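
When the classifier is chosen by a script, it may help to check the name against the values listed for -tfClassifier and -superClassifier before starting a long run. The following sketch uses a hypothetical function name. Treating the names as case-sensitive is an assumption, since this documentation does not say how TFpredict handles other spellings.

```shell
# Hypothetical helper, not part of TFpredict.
# Succeeds only for the classifier names listed in this documentation.
# Matching is case-sensitive, which is an assumption of this sketch.
is_supported_classifier() {
  case "$1" in
    SVM_linear|NaiveBayes|KNN) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: is_supported_classifier "$clf" && java -jar TFpredict.jar example.input -tfClassifier "$clf".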
 


TFpredict file format specification

To analyze a given protein with TFpredict, the tool needs the corresponding amino acid sequence and organism. This information has to be formatted as specified in the TFpredict input file format description.

The results of TFpredict are returned to the user via the standard output. Optionally, an output file can be generated which can be processed using SABINE in order to predict the DNA-binding specificity of transcription factors identified among the protein sequences analyzed by TFpredict. See the SABINE input file format specification for a detailed description of the file format.
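
Because the results are written to the standard output, they can be saved with an ordinary shell redirection. The sketch below runs TFpredict on several FASTA files and stores each report next to its input. The function name, the TFPREDICT_JAR variable, the ".fasta" extension, and the ".tfpredict.txt" suffix are all assumptions made for illustration.

```shell
# Hypothetical helper, not part of TFpredict.
# Runs TFpredict on every *.fasta file in directory $1 and saves the standard
# output of each run as <name>.tfpredict.txt in the same directory.
tfpredict_batch() {
  for fasta in "$1"/*.fasta; do
    [ -e "$fasta" ] || continue   # nothing matched the pattern
    java -jar "${TFPREDICT_JAR:-TFpredict.jar}" "$fasta" > "${fasta%.fasta}.tfpredict.txt"
  done
}
```

Calling tfpredict_batch ./proteins would produce one report per input file. A single FASTA file with several sequences can of course be passed to TFpredict directly, without such a loop.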

The input file format description specifies the input data for an individual TF. You can pack multiple TFs in one input file to sequentially process larger datasets with SABINE. In addition to the general description of the file formats, example input and output files for SABINE are provided.

TFpredict input file

>Identifier_1
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

>Identifier_2
...


>Identifier_3
...
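
Before a long run, an input file in the format shown above can be given a basic sanity check. The following sketch is not the parser used by TFpredict. It only verifies that the file starts with a ">" header line and that all sequence lines consist of letters (plus "*" and "-", which some FASTA files contain). The function name is hypothetical.

```shell
# Hypothetical helper, not part of TFpredict: basic check of a protein FASTA file.
# Succeeds if the first non-empty line is a header starting with ">" and every
# other non-empty line contains only letters, "*" or "-".
fasta_ok() {
  awk '
    /^[ \t]*$/       { next }              # ignore blank lines
    /^>/             { headers++; next }   # header line
    headers == 0     { bad = 1; exit }     # sequence data before any header
    !/^[A-Za-z*-]+$/ { bad = 1; exit }     # unexpected character in a sequence line
    END              { exit (bad || headers == 0) ? 1 : 0 }
  ' "$1"
}
```

For example, fasta_ok example.input && java -jar TFpredict.jar example.input starts TFpredict only if the check passes. Files with Windows line endings would be rejected by this simple check.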





Johannes Eichner
http://www.ra.cs.uni-tuebingen.de/software/SABINE/intro.htm
© 2008 University of Tübingen, Germany