TFpredict has moved to https://github.com/draeger-lab/TFpredict.

Introduction

Transcription factors (TFs) are key regulators of cell- and tissue-specific gene expression and play a crucial role in the orchestration of diverse biological processes, such as cell differentiation and the adaptation to changed environmental conditions. The induction or activation of target genes is achieved when the DNA-binding domain(s) of a TF specifically recognize a DNA motif located in the corresponding promoter regions. These specific interactions between TFs and their target genes are highly relevant for a more profound understanding of the transcriptional regulation of gene expression in eukaryotes.

In recent work, we presented a novel method that uses Support Vector Regression to infer the DNA motif recognized by a given TF from sequence-based features. This method has been implemented in the tool SABINE (Stand-Alone BINding specificity Estimator), which is also available from our website. Besides the protein sequence, SABINE requires knowledge of the structural superclass and the DNA-binding domains of the input TF. Here, we present TFpredict, a tool that can 1) reliably distinguish TFs from other proteins, 2) predict the structural superclass of a TF, and 3) detect its DNA-binding domains. As TFpredict returns all structural information needed by SABINE to predict the DNA motif of a given TF, we recommend the combined use of the two complementary tools.

TFpredict employs supervised machine learning methods implemented in the WEKA package to classify protein sequences. First, a binary classifier discriminates TFs from other proteins (Non-TFs). In a second step, a multi-class classifier predicts the superclass (Basic domain, Zinc Finger, Helix-turn-helix, Beta scaffold or Other). The second prediction step is complemented by a look-up in the TransFac TF Classification, in which the superclass of the input TF may already be annotated. To obtain the feature representation of an input sequence, a BLAST+ search is performed, and the homology of the query sequence to other proteins of known class is captured using a novel feature representation called bit score percentile features. Next, the domain composition of the query sequence is reconstructed using the tool InterProScan, and the DNA-binding domains are identified by filtering the detected domains against a pre-defined set of GO terms.
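
The exact definition of the bit score percentile features is not given in this text. As a rough illustration only, the sketch below assumes that the BLAST+ hits of a query are grouped by the known class of the hit protein and that each class is summarized by a few percentiles of its bit scores. The function names, the chosen percentiles, and the zero default for classes without hits are hypothetical and are not taken from the TFpredict source code. Parsing the BLAST+ output is omitted.

```python
from collections import defaultdict


def percentile(sorted_values, q):
    """Return the q-th percentile (0-100) of a sorted list, using linear interpolation."""
    if not sorted_values:
        return 0.0  # assumption: a class without hits contributes zeros
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bit_score_percentile_features(hits, classes, percentiles=(0, 25, 50, 75, 100)):
    """Build a fixed-length feature vector for one query sequence.

    hits    -- iterable of (class_label, bit_score) pairs, one per BLAST hit
    classes -- ordered class labels, e.g. ("TF", "Non-TF")
    """
    by_class = defaultdict(list)
    for label, score in hits:
        by_class[label].append(float(score))

    features = []
    for label in classes:
        scores = sorted(by_class.get(label, []))
        features.extend(percentile(scores, q) for q in percentiles)
    return features


# Hypothetical hits for a single query: (class of the hit protein, bit score)
hits = [("TF", 210.0), ("TF", 95.5), ("TF", 40.0), ("Non-TF", 33.1)]
features = bit_score_percentile_features(hits, classes=("TF", "Non-TF"))
```

A vector of this kind has the same length for every query, regardless of how many hits the search returns, which is what a standard classifier requires as input. For the superclass prediction step, the same function could be called with the five superclass labels instead of the two labels used here.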