Documentation



Introduction

The data analysis pipeline RPPApipe offers various tools for the preprocessing, annotation, visualization and statistical analysis of reverse-phase protein array (RPPA) data. Most notably, RPPApipe provides special support for complex experimental designs, in which the expression and modification of specific signaling proteins are monitored over time in response to diverse environmental stimuli. In RPPApipe we implemented various novel concepts for the simplification of complex time-resolved expression profiles as well as for the detection and visualization of differentially expressed and/or modified proteins. The RPPApipe software is part of our web-based Bioinformatics Toolbox.

How to get started

First, you should prepare your raw expression data in conformity with the RPPApipe format specification. Please note that RPPApipe does not offer functions for image analysis. As input data the tool expects a tab-separated file containing a positive expression value for each protein and sample, which corresponds to the protein-normalized, blank-corrected mean fluorescence intensity of the respective spots (see raw data). You can either prepare your own data or use an example dataset to become familiar with the tool. For convenience, we prepared a short tutorial which can be used as a starting point.

Short tutorial

After finishing this tutorial you will know the basic functions offered by RPPApipe. In this tutorial we assume that we have a dataset consisting of time-series measurements for two different treatments which shall be compared to a common control. First, please download the corresponding RPPA data and class file specifying the experimental design of our example study. Please note that you can also download example datasets for other experimental designs.



You can import your raw data into the RPPApipe software by using the upload tool. Please upload the two files according to the instructions shown in the web interface. Note that the appropriate file type has to be selected for the RPPA data file and the class file, respectively, before upload. The uploaded files can be inspected by clicking the eye symbol in the corresponding history items displayed in the right frame.



In the next step, the uploaded raw data has to be labeled according to the study design. For this purpose, start the sample class definition tool and select 'Class file' as method for the definition of sample classes. Then, select the uploaded class file and click 'Execute'. As a result, the design information contained in the class file will be stored in the column headers of the expression data file. Please note that the sample class information could also have been added manually. However, using class files may be more convenient for large datasets with complex study designs and has the advantage of allowing for the use of workflows.



In order to scale and annotate the data, switch to the preprocessing tool by clicking on the link in the left frame. Then, select 'median centering' as scaling method and click the checkbox 'Use controls as reference for scaling'. Make sure that the log-transformation option is checked and select 'Descriptions' and 'UniProt IDs' in the annotation menu. When clicking 'Execute', the raw data values will be divided by the median of the selected control samples and then log2-transformed. Negative values in the dataset, which correspond to invalid measurements caused, for instance, by high background signals, will be estimated based on the KNN method. Additionally, the matching gene descriptions and UniProt IDs will be retrieved via the BioMart webservice and added to the expression table.



Differential protein expression can be detected based on statistical methods applicable to multi-condition, time-series datasets. Switch to the differential expression detection tool and select the appropriate design ('replicated time-series data'). Select 'mean fold-change' as statistical method to compute the fold-changes averaged across time points for each analyte and each of the two treatments. Check the option 'Sort proteins' to ensure that deregulated proteins will appear on top and click 'Execute'. The returned expression table contains two additional columns, each corresponding to the fold-changes calculated for a certain treatment.
Apply the same tool again in order to assess the significance of differential expression. To this end, select the design type 'Replicated time-series data', choose 'Linear Models (LIMMA)' as statistical method, check the option 'Correct for multiple testing' and select the 'Benjamini-Hochberg method' for p-value adjustment. The resulting table should contain one additional column corresponding to the adjusted p-values obtained from a moderated F-test.



You may visualize the proteins which are differentially expressed or modified upon certain treatments by means of volcano plots. For this purpose, switch to the plot generator tool by clicking on the link in the left frame. Then select the uploaded RPPA data file and choose 'Volcano plot for differential expression' as plot type. Leave the checkbox 'Use experimental design' activated and click 'Execute' to generate the plots automatically based on the provided design information. The generated PDF file can be viewed in your browser or downloaded by clicking the eye or disk symbol in the corresponding history item.



As treatment-induced differences in the expression levels of unmodified and modified proteins may also be of interest, the same procedure can be applied for the plot type 'Volcano plot for differential modification'. Please note that the resulting plot is a unique feature of RPPApipe and was specifically conceived and implemented for use with RPPA data.



In order to simplify the interpretation of the observed effects, the analyzed proteins can be mapped to canonical pathways from KEGG, BioCarta or Reactome. The amount of deregulation observed in the pathways covered best by the proteins profiled in the given experiment can be visualized as a bar plot by selecting 'Pathway profile diagram' as plot type.



The clustering tool can be used to perform a cluster analysis on your dataset. When the box 'use experimental design' is checked, the complete statistical analysis, cluster analysis and visualization are performed automatically based on the given design information. Leave this option checked and select 'regulation states' as data used for clustering. Leave the default parameters and click 'Execute'. Regulation states provide an abstracted, discrete representation of time-series expression profiles and simplify the interpretation of the effects observed in complex datasets. After this data transformation step, which involves the calculation of fold-changes and statistics at individual time points and for the complete time-series, a clustering is performed based on a custom-designed scoring matrix and the results are returned in the form of a heatmap with adjacent dendrograms. As explained in the legend of the heatmap plot, the colors of the cells correspond to different regulation states and the numbers refer to the time points of the first response (i.e., deregulation). Please note that this feature is uniquely offered by the RPPApipe software and may considerably enhance the interpretability of effects observable from complex RPPA datasets.



Using the enrichment analysis tool you can detect overrepresented pathways or functional relationships among the proteins in each cluster. Choose 'clustering result' as input type and select the zip archive with the protein lists returned from your cluster analysis. Leave the default parameters and click 'Execute' to perform a pathway enrichment analysis against the KEGG database. The generated Excel spreadsheet can be downloaded by clicking the 'disk symbol' in the history item. The file will contain a table for each cluster. In each table the significantly enriched pathways are listed along with a significance value (p-value from hypergeometric test) and the proteins in common between the corresponding pathway and cluster.



For the detailed inspection of individual pathways of interest, we recommend using the InCroMAP software. We designed this software such that its input format is compatible with the format used by RPPApipe. You can use RPPApipe to process your raw data, perform a statistical and/or cluster analysis and detect relevant signaling pathways. The calculated fold-change and p-value columns can then be imported into InCroMAP and pathways of interest can be interactively displayed and overlaid with your RPPA expression data.



Tool descriptions

Every analysis tool which is part of the RPPApipe framework contains a short description and step-by-step instructions on how to use the tool directly below its input mask. Additionally, you can find further information on these tools in the following:

Upload data

After image analysis you can import your expression data and study design as text files complying with the file format specification of RPPApipe. Currently, our software supports three different types of experimental designs. If you want to analyze a small dataset which shall not be processed automatically using a predefined workflow, the interactive sample class definition tool may be a more appropriate alternative method to import your study design.

Define sample classes

Using this tool you can specify the experimental design of your study. This is achieved by manually assigning data columns (samples) in your expression data file to specific groups and time points. Alternatively, a class file which was uploaded in advance can be used for design specification. The latter option is also compatible with workflows. Currently, RPPApipe supports different study designs which cover most applications of the RPPA platform. Most notably, special support is provided for studies in which time-course measurements shall be compared and multiple replicates are available for each time point. The given design information is stored in the column headers corresponding to the different samples in your expression data file.

Preprocess and annotate data

This tool offers various functions for data preprocessing and annotation with additional metadata, such as cross-references to external databases. We recommend centering your data to the median of the control samples. Performing this scaling step will simplify the interpretation of your data, as after log-transformation positive values indicate upregulation and negative values indicate downregulation relative to the controls. Please note that all subsequent analysis tools expect the data to be log-transformed. Thus, the corresponding checkbox in the web interface should be checked. If your dataset contains negative values, for instance, as a consequence of background subtraction, these values will become undefined after log-transformation and have to be estimated using one of the implemented approaches for missing value imputation. This step is automatically omitted if all values are positive. Finally, you can add annotation columns to your data, which can be obtained via the BioMart ID conversion webservice for the organisms human, mouse and rat.
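
The scaling and log-transformation described above can be sketched in a few lines of Python. This is only an illustration, not the code used by RPPApipe: the function name center_and_log2 is hypothetical, and it assumes that the median is taken per protein over the control samples. Non-positive values are marked as missing (None) so that they could later be imputed; the imputation itself is not shown.

```python
import math
from statistics import median


def center_and_log2(values, control_idx):
    """Scale one protein's raw values by the median of its control samples
    and log2-transform the ratios.

    Assumption: the reference median is computed per protein.
    Non-positive ratios cannot be log-transformed and are returned as None.
    """
    ref = median(values[i] for i in control_idx)
    if ref <= 0:
        raise ValueError("control median must be positive")
    scaled = []
    for v in values:
        ratio = v / ref
        scaled.append(math.log2(ratio) if ratio > 0 else None)
    return scaled


# Two control samples (indices 0 and 1) followed by three treated samples.
raw = [4.0, 4.0, 8.0, 2.0, -1.0]
processed = center_and_log2(raw, control_idx=[0, 1])
```

After this step the controls are centered at zero, a doubled signal becomes 1, a halved signal becomes -1, and the negative raw value is flagged as missing.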

Detect differential expression

After labeling your data according to your experimental design, you can use this tool to detect differential protein expression by statistical analysis. Depending on your study design, the tool offers functions for the calculation of fold-changes and diverse statistics. If the paired sample groups design is chosen, the fold-changes are computed for all predefined group pairs. For the multiple sample groups design, the fold-changes are computed for all possible contrasts, i.e., all pairs of groups, and for the replicated time-series data design the fold-changes are averaged across the time points. The p-values returned from statistical analyses can be corrected for multiple hypothesis testing using an appropriate method (e.g., Bonferroni, Benjamini-Hochberg, etc.). Optionally, all proteins in your dataset can be sorted according to their differential expression or filtered based on a predefined cutoff value.
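
To illustrate what the two adjustment methods named above do to a list of p-values, here is a minimal sketch of the standard Bonferroni and Benjamini-Hochberg procedures. The function names are hypothetical and the code is not taken from RPPApipe; it only shows the textbook definitions.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(raw_p)
bf = bonferroni(raw_p)
```

Bonferroni is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and usually retains more proteins at the same cutoff.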

Generate plots

After calculating fold-changes and p-values for your dataset, you can employ this tool to visualize differential protein modification, differential expression and pathway deregulation. Among the currently implemented visualizations are, for instance, heatmaps, volcano plots for differential expression and modification, respectively, as well as venn diagrams and pathway profile diagrams. As described in more detail in the section plot types, specific data columns (e.g., fold-changes or p-values) which shall be used for plotting have to be selected. For volcano plots or venn diagrams you can also specify appropriate cutoff values for the detection of differential expression. All generated plots can be downloaded in PDF format.
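
The role of the cutoff values can be made concrete with a small sketch. It assumes that a protein counts as deregulated when its absolute log2(fold-change) reaches the fold-change cutoff and its p-value does not exceed the p-value cutoff. The function names and the default cutoffs (1.0 and 0.05) are hypothetical and serve only as an illustration.

```python
import math


def volcano_point(log2_fc, p_value):
    """Coordinates of one protein in a volcano plot:
    x = log2(fold-change), y = -log10(p-value). Requires p_value > 0."""
    return log2_fc, -math.log10(p_value)


def is_deregulated(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """True if the protein passes both cutoffs (assumed selection rule)."""
    return abs(log2_fc) >= fc_cutoff and p_value <= p_cutoff


def venn_two_sets(deregulated_a, deregulated_b):
    """Split two sets of deregulated proteins into the three venn regions."""
    a, b = set(deregulated_a), set(deregulated_b)
    return a - b, a & b, b - a


x, y = volcano_point(1.5, 0.01)
only_a, both, only_b = venn_two_sets({"AKT1", "BAD"}, {"BAD", "ALDH1A1"})
```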

Cluster data

You can use this tool either directly after data preprocessing or after having finished your statistical analysis to detect clusters of co-expressed proteins in your dataset. In the former case the statistical analysis will be performed automatically based on the provided study design information and you can decide whether fold-changes or regulation states shall be clustered. In the latter case you have to select the processed data or precomputed fold-change columns which shall be used for clustering. In both cases a heatmap with a colorbar indicating the cluster memberships of the proteins is generated. The clusters can also be downloaded as a zip archive containing a text file for each cluster. You can find additional information on the different types of heatmaps and clustering parameters directly in the web interface of the tool.

Perform enrichment analysis

In order to assist you in drawing biological hypotheses from your dataset and clustering results, we implemented the enrichment analysis tool. The tool can be employed in two different contexts. Firstly, you can use it to detect gene sets overrepresented among the deregulated proteins observed for your a priori defined sample groups. Secondly, it can be applied to further process the outcome of a cluster analysis. Currently, we support the calculation of enrichments against the pathway databases KEGG, Reactome and BioCarta as well as against specific types of Gene Ontology terms. We use a specifically adapted implementation of the hypergeometric overrepresentation test which accounts for the fact that RPPA experiments cannot be performed for the whole proteome, but only for selected proteins of interest. Thus, we limit the universe to proteins which are both included in the experiments and contained in one of the tested gene sets.
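
The restricted universe described above can be sketched as follows. This is a minimal illustration of a hypergeometric overrepresentation test with a universe limited to measured proteins that occur in at least one tested gene set; it is not the adapted implementation used by RPPApipe, and all function and variable names are hypothetical.

```python
from math import comb


def restricted_universe(measured, gene_sets):
    """Proteins that were measured AND are contained in at least one gene set."""
    in_any_set = set().union(*gene_sets.values())
    return set(measured) & in_any_set


def enrichment_p_value(universe, gene_set, hits):
    """P(overlap >= observed) under the hypergeometric distribution.

    Gene set and hits are first intersected with the universe.
    """
    universe = set(universe)
    gene_set = set(gene_set) & universe
    hits = set(hits) & universe
    N, K, n = len(universe), len(gene_set), len(hits)
    k = len(gene_set & hits)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return tail / comb(N, n)


measured = {"A", "B", "C", "D", "E", "F"}
gene_sets = {"pathway1": {"A", "B", "C", "X"}, "pathway2": {"C", "D", "Y"}}
universe = restricted_universe(measured, gene_sets)  # E and F are dropped
p = enrichment_p_value(universe, gene_sets["pathway1"], hits={"A", "B"})
```

Because the universe shrinks from the whole proteome to the few proteins actually profiled, the resulting p-values are far less optimistic than those of an unrestricted test.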

Workflows

All tools provided by the modular RPPA software were designed for application in custom-built or predefined workflows. Please note that the use of workflows requires registration and logging in to a user account. You can choose between different predefined workflows when clicking on the link "all workflows" which is displayed below the tools in the left frame. After login you can also click the button 'Workflow' displayed in the top menu to create your own workflow. For the creation of custom workflows an interactive editor is provided. Save your workflow and it will be shown along with the predefined workflows when clicking the link "all workflows" in the left frame.



Data types

The tools which are part of the RPPApipe software operate on specific types of data which are described in the following.

Raw data

We recommend using the image analysis software of the array manufacturer or free software packages, such as P-SCAN, to infer numerical signal levels from the array images. First, a background correction should be performed by subtraction of the spot signals of blank assays. Normally, each sample and protein is represented on the array by multiple spots corresponding to a serial dilution. The mean fluorescence intensities (MFI) can be calculated for each sample and protein from a dose-response curve of the blank-corrected raw fluorescence intensities (RFI). Next, normalized fluorescence intensities (NFI) are calculated by correcting for the amount of immobilized protein on each spot. For more detailed information on RPPA image processing, the reader is referred to the publications of Spurrier et al. and Pirnia et al.

Processed data

The generation of processed data typically involves the following steps: (1) the labeling of your raw data according to your experimental design, (2) the scaling of your data (e.g., centering to the median of the controls), (3) the log-transformation of the data and (4) the annotation of the proteins with additional metadata. After data preprocessing and annotation a statistical analysis can be performed. Please note that all further analysis steps, for instance, the calculation of fold-changes, will only produce valid results if data preprocessing was performed properly.

Fold-changes

Fold-changes correspond to the ratio of the expression levels observed for two conditions (e.g., treatment and corresponding control). Please note that since the processed RPPA data is expected to be log-transformed, we compute the difference between the two conditions instead of the ratio. The result is a log2(fold-change) which is defined on a symmetric scale with center zero. For instance, a log2(fold-change) of 1 corresponds to two-fold upregulation, a log2(fold-change) of 2 indicates 4-fold upregulation and a log2(fold-change) of -1 corresponds to two-fold downregulation. Please also note that the method used for fold-change calculation depends on the experimental design. Specifically, fold-changes can be computed for defined pairs of groups (paired sample groups), all pairs of groups (multiple sample groups) or averaged over time (replicated time-series data).
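
The relationship between differences on the log2 scale and ordinary ratios can be checked with a short sketch. It assumes that replicates within a group are summarized by their mean, which the text above does not specify; the function names are hypothetical.

```python
from statistics import mean


def log2_fold_change(treatment, control):
    """Difference of the group means of log2-transformed values."""
    return mean(treatment) - mean(control)


def to_linear_ratio(log2_fc):
    """Convert a log2(fold-change) back to an ordinary ratio."""
    return 2.0 ** log2_fc


def mean_fold_change_over_time(treatment_by_time, control_by_time):
    """Average the per-time-point log2(fold-changes) of a time series."""
    per_time = [
        log2_fold_change(t, c)
        for t, c in zip(treatment_by_time, control_by_time)
    ]
    return mean(per_time)


fc = log2_fold_change([1.0, 3.0], [0.5, 1.5])          # 2.0 - 1.0
fc_time = mean_fold_change_over_time(
    [[1.0, 1.0], [3.0, 3.0]],                           # treatment, 2 time points
    [[0.0, 0.0], [1.0, 1.0]],                           # control, 2 time points
)
```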

P-values

RPPApipe offers various statistical methods which can be used for assessing the significance of differential expression based on p-values. Depending on the experimental design appropriate statistical approaches for data analyses are proposed. If desired, the p-values can also be corrected for multiple hypotheses testing using established adjustment methods.

Regulation states

Since the reverse-phase protein array platform is typically applied in complex studies, which involve the profiling of multiple experimental conditions over time, we developed a concept for transforming the time-resolved expression profiles to discrete expression states. These expression states provide an abstracted and simplified representation of the complex expression changes observed under a specific condition. For this purpose, the replicated time-series measurements for two compared conditions (e.g., treatment vs. control or mutant vs. wild-type) are condensed to discrete classes, indicating the direction and strength of differential expression. These classes (i.e., regulation states) are derived from various statistical comparisons performed for individual time points as well as the whole time-series. Differential expression at individual time points is determined based on fold-changes and a moderated t-test implemented in the LIMMA package for R/Bioconductor. Deregulation over time is detected based on mean fold-changes and a moderated F-test from the LIMMA statistics. The following graphic illustrates how regulation states are inferred from the time-series expression profile observed for an analyte under a specific condition:



For each of the five time points t_i in the above illustration, a fold-change f_i and a t-test p-value p_i are calculated. Additionally, the mean fold-change across all time points and an F-test p-value are computed for the complete time-series. The regulation states are then assigned based on the following definitions:



Plot types

Our software offers various types of plots which can be used to visualize differentially expressed and modified proteins given your RPPA data. These visualizations include different types of heatmaps and volcano plots as well as venn diagrams and special plots for depicting pathway alterations. A detailed description of all plot types and variants is given in the following.

Heatmaps

We implemented different types of heatmaps in the RPPApipe software, which may be generated in order to inspect the results of a cluster analysis or simply for data visualization.

Ordinary heatmaps

Ordinary heatmaps display processed expression data or fold-changes by means of a color gradient ranging from blue (downregulation) to red (upregulation). If only data visualization without cluster analysis is required, the RPPApipe plot generator tool can be employed. An example heatmap is shown in the following:



Heatmaps from clustering of regulation states

If the cluster analysis tool is used for heatmap generation, dendrograms may be added to the heatmap and a colorbar will indicate the formed clusters. If time-series data shall be clustered and the user chooses to infer regulation states automatically based on the experimental design, the tool will return a condensed heatmap, which shows the direction, strength and onset of deregulation for each protein and experimental condition. In the following example the colors correspond to regulation states and the numbers indicate the time point of the first response:



Heatmaps from clustering of fold-changes

If a cluster analysis is performed on the basis of fold-changes, the heatmap is not condensed and is complemented with profile plots. These plots show the expression profiles of the proteins in each cluster. Each point on the x-axis corresponds to one of the sample groups and the log2(fold-changes) are plotted on the y-axis. A red line indicates the cluster centroids, and the standard deviations are depicted as a grey, transparent tube. The colors used in the legends are consistent with the colors used in the colorbar on top of the corresponding heatmap plot.





Volcano plots

RPPApipe is capable of generating two complementary types of volcano plots. The first variant is an ordinary volcano plot for differential expression. The second one is a unique feature of this software and depicts differential modification in the sense that a protein is modified in response to a certain experimental condition.

Volcano plot for differential expression

In a volcano plot the strength of the deregulation measured in terms of the log2(fold-change) is plotted against the significance of differential expression given by the -log10(p-value). Each protein corresponds to one dot, whereby downregulated proteins are located in the upper-left and upregulated ones in the upper-right area of the plot. As for the heatmap plot, a column containing the protein identifiers can be selected. For the sake of clarity, these identifiers are only shown for a limited number of strongly deregulated proteins. Adequate p-value and fold-change cutoffs, plotted as horizontal and vertical dashed lines, respectively, can be defined by the user. An example plot is shown below:



Volcano plot for differential modification

This illustration was derived from an ordinary volcano plot and specifically adjusted for use with RPPA data, where typically different post-translational modifications of the same protein are quantitatively profiled. The visualization proposed here facilitates the identification of differentially modified proteins, such as signaling proteins which are phosphorylated in response to a specific extracellular stimulus. For this purpose, the user has to select the respective metadata columns containing the gene symbols and modifications. Unmodified proteins and their protein modifications are indicated by different point shapes and connected by edges. These edges are highlighted and the corresponding protein names are displayed if the fold-change between the two protein forms exceeds a predefined cutoff. An example plot is shown below:



Venn diagrams

For the generation of venn diagrams you have to define fold-change and p-value cutoffs for the selection of deregulated proteins. Furthermore, the respective data columns have to be provided for the two compared sample groups. Along with the corresponding numbers, the identifiers of the top upregulated and downregulated proteins are displayed in red and green, respectively, for each subset of the venn diagram. The venn diagram for three sets can be generated in a similar manner as the one for two sets described above. An example plot is shown below:

                  

Pathway profile diagram

Since reverse-phase protein array analyses typically focus on a small fraction of the proteome, ordinary enrichment analysis may be inappropriate. Nonetheless, the alterations in relevant signaling pathways become apparent from the special diagrams provided here. Our concept is to map the differentially expressed analytes to genes which in turn correspond to pathway nodes in the databases KEGG, BioCarta or Reactome. Depending on your study design, the best covered pathways, i.e., the pathways in which the most proteins were measured, are considered and the fraction of deregulated proteins among the measured proteins is illustrated as a bar plot. An example plot is shown below:





Experimental designs

RPPApipe supports three different types of study designs, which are explained in the following. Your experimental design can be imported into the software in the form of a class file, which can be used to assign your samples to different groups and time points.

Multiple sample groups

If you would like to compare protein expression between more than two different groups of samples, you should choose the 'multiple sample groups' design. Example applications for this design are, for instance, the comparison of different organs, tissues or cell types.

Paired sample groups

The 'paired sample groups' design is applicable to studies, in which differential protein expression shall be compared between paired groups of samples at a fixed point in time. This design is appropriate if, for instance, diverse treatments shall be compared to the corresponding controls, or different tumors shall be compared to normal tissue.

Replicated time-series data

Since the RPPA platform is particularly useful for profiling protein expression in large numbers of samples depending on different experimental factors, such as dose and time, we also support experimental designs in which time-series measurements have been performed for different conditions, each with multiple biological replicates. A typical use case would be a study in which the effect of different treatments has been monitored over time.



RPPApipe file format specification

In the following we provide a detailed description of file types which can be imported into the RPPApipe software. Expression data which has been derived from the array images as described in the subsection image analysis can be uploaded as an RPPA expression data file in which rows correspond to modified/unmodified proteins and columns correspond to samples. Uploading a class file makes it possible to group the samples according to your experimental design.

RPPA expression data file

Analyte_ID   Gene_Symbol  Modification  Sample_1      Sample_2      Sample_3      ...
AKT1         AKT1         none          4.7483297433  5.2135031386  5.0656299063  ...
AKT1_P-S473  AKT1         P-S473        5.2070458835  5.0207812348  4.3841827262  ...
AKT1_P-T308  AKT1         P-T308        4.2194979364  5.0897130601  5.0578841769  ...
ALDH1A1      ALDH1A1      none          5.0828127164  4.9394865723  5.0071733934  ...
BAD_P-S112   BAD          P-S112        5.2488300371  5.0015689872  5.0362211715  ...
...          ...          ...           ...           ...           ...           ...
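
As a rough sketch of how a file with this layout could be read outside of RPPApipe, the following Python snippet parses a tab-separated table with three leading metadata columns (Analyte_ID, Gene_Symbol, Modification) followed by one column per sample. The function name read_rppa_table and the parameter n_meta are hypothetical; the number of metadata columns may differ in your own files.

```python
import csv
import io


def read_rppa_table(handle, n_meta=3):
    """Read a tab-separated RPPA expression table.

    Returns the sample names and a list of (metadata, values) pairs,
    one pair per analyte row.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[n_meta:]
    rows = []
    for fields in reader:
        if not fields:  # skip empty lines
            continue
        meta = dict(zip(header[:n_meta], fields[:n_meta]))
        values = [float(x) for x in fields[n_meta:]]
        rows.append((meta, values))
    return samples, rows


# Small in-memory example using values from the first table row.
example = io.StringIO(
    "Analyte_ID\tGene_Symbol\tModification\tSample_1\tSample_2\n"
    "AKT1\tAKT1\tnone\t4.7483297433\t5.2135031386\n"
)
samples, rows = read_rppa_table(example)
```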

Class file

A class file contains one row for each class in your microarray dataset. Each line consists of a 'class label', an equals sign and a list of comma-separated column indices corresponding to the columns in the RPPA expression data file which shall be assigned to the respective class label. An exemplary class file for an expression data file with 2 metadata columns and 12 data columns for the organs liver, kidney, heart and brain, each measured in triplicate, is shown here:

liver = 3,4,5
kidney = 6,7,8
heart = 9,10,11
brain = 12,13,14
...

If you would like to compare pairs of sample groups, for instance, liver vs. kidney and heart vs. brain, you can use the following class file format:

liver = 3,4,5 : kidney = 6,7,8
heart = 9,10,11 : brain = 12,13,14
...

If we assume we have data for a second time point available in the columns 15-26, we can specify comparisons of time course experiments using the following format:

(liver_3day = 3,4,5 ; liver_14day = 15,16,17) : (kidney_3day = 6,7,8 ; kidney_14day = 18,19,20)
(heart_3day = 9,10,11 ; heart_14day = 21,22,23) : (brain_3day = 12,13,14 ; brain_14day = 24,25,26)
...

Please note that the sample group identifiers have to be of the form '[groupID]_[timepointID]'. If your class file does not contain appropriately formatted group IDs, generic identifiers (e.g., T1, T2, etc.) will be used in the generated plots.
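
The three class file variants shown above share a simple structure, which the following sketch tries to capture: ':' separates the two sides of a comparison, ';' separates the time points within a parenthesized side, and each group has the form 'label = indices'. This is not the parser used by RPPApipe, and all function names are hypothetical. Splitting the identifier at the last underscore is likewise an assumption about the '[groupID]_[timepointID]' convention.

```python
def parse_group(text):
    """'liver = 3,4,5' -> ('liver', [3, 4, 5])"""
    label, _, columns = text.partition("=")
    return label.strip(), [int(c) for c in columns.split(",")]


def parse_side(text):
    """One side of a comparison: a single group, or a parenthesized list of
    ';'-separated groups (one per time point)."""
    text = text.strip()
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    return [parse_group(part) for part in text.split(";")]


def parse_class_file(lines):
    """Return one entry per non-empty line; each entry is a list of sides."""
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            entries.append([parse_side(side) for side in line.split(":")])
    return entries


def split_group_id(label):
    """'liver_3day' -> ('liver', '3day'); labels without '_' have no time point."""
    group, sep, timepoint = label.rpartition("_")
    return (group, timepoint) if sep else (label, None)


simple = parse_class_file(["liver = 3,4,5", "kidney = 6,7,8"])
paired = parse_class_file(["liver = 3,4,5 : kidney = 6,7,8"])
series = parse_class_file(
    ["(liver_3day = 3,4,5 ; liver_14day = 15,16,17) : "
     "(kidney_3day = 6,7,8 ; kidney_14day = 18,19,20)"]
)
```

In the examples above the first data column has index 3 for a file with 2 metadata columns, so the indices appear to be 1-based and to include the metadata columns.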

Example datasets

Experimental design          RPPA expression data file  Class file
Multiple sample groups       rppa_data_multi.csv        class_file_multi.txt
Paired sample groups         rppa_data_paired.csv       class_file_paired.txt
Replicated time-series data  rppa_data_timeseries.csv   class_file_timeseries.txt



Johannes Eichner
http://www.ra.cs.uni-tuebingen.de/software/SABINE/intro.htm
© 2008 University of Tübingen, Germany