MacLeaps Quick Start Guide

Last changed: 2013-06-25 15:30

1. Download MacLeaps

Download the latest stable version of MacLeaps to your computer from our downloads page.
This Quick Start Guide will focus on the command line version of MacLeaps, but all explanations given here also hold for the GUI version.

To execute MacLeaps on the command line, open the command line / shell and navigate to the folder where you extracted the downloaded archive. There you should find a file called MacLeaps.jar, which you can run by executing the command:
java -jar MacLeaps.jar
(On Windows systems you might need to add the Java binaries to your PATH variable first. If you don't know how to do this, there is a nice tutorial in this blog article.)

2. Download PLINK

MacLeaps requires the genotype files to be in the PLINK format and uses the PLINK tool for data management and some simple SNP statistics. You can download PLINK for your operating system from the PLINK website.
Extract it on your computer and remember the path to the executables.
Note: You can also add the path to your PATH environment variable, so that MacLeaps can find PLINK itself.

Parameter: -pp [path_to_plink_directory]
Example: -pp /home/smith/bin/plink
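
Before starting an analysis, it can help to check whether PLINK can be located at all. The following Python snippet is only a sketch and not part of MacLeaps: the function name find_plink and the executable name "plink" are assumptions made for illustration. It looks either in a given directory (as you would pass it to -pp) or in the PATH environment variable.

```python
import shutil
from typing import Optional


def find_plink(plink_dir: Optional[str] = None, name: str = "plink") -> Optional[str]:
    """Return the full path of the PLINK executable, or None if it is not found.

    plink_dir plays the role of the directory passed to -pp. When it is None,
    the directories listed in the PATH environment variable are searched.
    """
    return shutil.which(name, path=plink_dir)
```

If find_plink() returns None, either add the PLINK directory to PATH or pass it explicitly via -pp.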

3. Prepare your data

Tools such as PLINK already offer a wide spectrum of quality control methods and other functions to pre-process GWAS data. MacLeaps was not designed to be a front end for PLINK, because PLINK already does this job well on its own.
Any data preparation, such as removing individuals or SNPs with a high missing rate or markers that violate the Hardy-Weinberg equilibrium, has to be performed before using MacLeaps. Note, though, that Support Vector Machines (SVMs) cannot handle missing values, so you should consider imputing missing genotypes before handing the data to MacLeaps.

In the end, you should have your GWAS data ready either in the normal ped file format (*.ped) or in the binary ped file format (*.bed). Since GWAS data can be very large, we suggest that you use the binary file format, which can easily be 20 times smaller than the ped files.

Parameter: -p [genotype_data_file]
Example: -p /data/T2D_prefiltered.bed
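
The filtering steps mentioned above map to standard PLINK options (--geno, --mind, --hwe, --make-bed). The sketch below only assembles such a command line in Python. The helper name plink_qc_command and the threshold values are illustrative assumptions, not recommendations from the MacLeaps authors, so choose thresholds that suit your study.

```python
from typing import List


def plink_qc_command(bfile_prefix: str, out_prefix: str,
                     max_snp_missing: float = 0.05,
                     max_ind_missing: float = 0.05,
                     hwe_p: float = 1e-6) -> List[str]:
    """Build a PLINK call that filters a binary file set and writes a new one.

    All default thresholds are placeholders chosen for illustration only.
    """
    return [
        "plink",
        "--bfile", bfile_prefix,          # input prefix (.bed/.bim/.fam)
        "--geno", str(max_snp_missing),   # drop SNPs with a high missing rate
        "--mind", str(max_ind_missing),   # drop individuals with a high missing rate
        "--hwe", str(hwe_p),              # drop markers violating Hardy-Weinberg
        "--make-bed",                     # write the result in binary format
        "--out", out_prefix,
    ]
```

The returned list can be passed to subprocess.run(). Note that PLINK takes a file prefix, whereas the MacLeaps example above passes the resulting .bed file to -p.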

4. Prepare an output folder

The processing steps of MacLeaps write their intermediate results to files on your disk. This may require some disk space, but it allows MacLeaps to resume an aborted analysis as quickly as possible. It also allows for a quick rerun of analyses with slightly changed parameters, because only those steps that differ have to be repeated.

Create an output folder that you have write-access to and that has enough free disk space. A meaningful name is also helpful to keep track of multiple analyses, e.g. "T1D_combined_cohorts_prefiltered".

Parameter: -o [output_directory]
Example: -o /results/T2D_5CV
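
The two requirements named above (write access and enough free disk space) can be checked up front. This is a small sketch in Python. The function name prepare_output_dir and the min_free_bytes parameter are hypothetical, and how much space an analysis actually needs depends on your data.

```python
import os
import shutil


def prepare_output_dir(path: str, min_free_bytes: int = 0) -> int:
    """Create the output folder if needed and return the free space in bytes.

    Raises an error if the folder is not writable or if less than
    min_free_bytes are available on the file system that holds it.
    """
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"no write access to {path}")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        raise OSError(f"only {free} bytes free in {path}")
    return free
```

For example, prepare_output_dir("/results/T2D_5CV", 10 * 1024**3) would insist on roughly 10 GB of free space before the analysis starts.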

5. Choose a validation method

A) k-fold cross-validation

In a k-fold cross-validation, the GWAS data is randomly split into k equally sized subsets of individuals, where the ratio of cases to controls is preserved (stratification). Each subset is used once for validation by training an SVM model on all other subsets combined and comparing the predicted class with the actual class of the validation subset.
Common choices for k are 2, 5 or 10.

Parameter: -k [folds]

You also need to supply a studyname, which will be the prefix of all files generated during the analysis.

Parameter: -s [studyname]
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -o T2D_5CV
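
To make the idea of a stratified split concrete, here is a minimal Python sketch. It is not the MacLeaps implementation: the function name, the fixed seed and the round-robin assignment are assumptions for illustration. Phenotypes follow the PLINK coding (1 = control, 2 = case).

```python
import random
from typing import Dict, List


def stratified_folds(labels: Dict[str, int], k: int, seed: int = 0) -> List[List[str]]:
    """Split individuals into k folds while preserving the case/control ratio.

    labels maps an individual ID to its phenotype. Each class is shuffled
    separately and then dealt out to the folds in turn.
    """
    rng = random.Random(seed)
    folds: List[List[str]] = [[] for _ in range(k)]
    for cls in sorted(set(labels.values())):
        members = sorted(ind for ind, c in labels.items() if c == cls)
        rng.shuffle(members)
        for position, ind in enumerate(members):
            folds[position % k].append(ind)
    return folds
```

Each fold then serves once as the validation subset, with the remaining k - 1 folds combined for training. If a class size is not divisible by k, the fold sizes in this sketch differ by a few individuals.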

B) Between-study validation

This is essentially a 2-fold cross-validation, except that the data is not split randomly; instead, the split is defined by a file containing a list of individuals. A common use case is when you have two separate studies and want to investigate how well a model trained on one performs on the other. For this, both studies have to be merged into a single PLINK file, and the split file must contain only the individuals of one of the studies.
Note: Each individual is expected to be identified by Family ID (FID) and Individual ID (IID), so you can use .fam files (subsequent columns will be ignored).

Parameter: -sf [splitfile]

You need to supply two studynames here, separated by a slash ('/'), which will be the prefixes of all files generated during the analysis.

Parameter: -s [studynameA/studynameB]
Example: java -jar MacLeaps.jar -p T2D_combined.bed -sf /data/french_cohort.fam -s french/german -o T2D_french_vs_german
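
The role of the split file can be illustrated with a short Python sketch. It reads the first two columns (FID and IID) of each line and partitions the merged data set accordingly. The function names are hypothetical, and this shows the concept only, not how MacLeaps handles the file internally.

```python
from typing import Iterable, List, Set, Tuple

Individual = Tuple[str, str]  # (FID, IID)


def read_ids(lines: Iterable[str]) -> Set[Individual]:
    """Collect (FID, IID) pairs; any further columns are ignored."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add((fields[0], fields[1]))
    return ids


def split_by_study(all_ids: List[Individual],
                   split_ids: Set[Individual]) -> Tuple[List[Individual], List[Individual]]:
    """Return (individuals listed in the split file, all remaining individuals)."""
    listed = [ind for ind in all_ids if ind in split_ids]
    remaining = [ind for ind in all_ids if ind not in split_ids]
    return listed, remaining
```

With open("/data/french_cohort.fam") as the source of the lines, the first list would hold the French cohort and the second list everyone else in the merged file.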

C1) Model training only

There are situations in which you cannot or may not share the genetic information with other people, e.g. because of legal issues or limited bandwidth. Or you may simply want to train a model once and store it for later use.
To make sure that the trained model can be checked for compatibility with other data sets, the allele information and SNP subsets need to be stored, too. Using this option will create a ZIP file containing all relevant files. This file will be named "[studyname].zip".

Parameter: -cm
Example: java -jar MacLeaps.jar -p T2D_german.bed -cm -s T2D_german -o T2D_german_model

C2) Model validation only

Models created as described in the previous step can then be validated on other data sets. For this, the files in the ZIP file need to be extracted into a folder of your choice. The model is then used to predict the phenotypes of the given data set, and the predictions are compared with the actual phenotypes to yield the predictive performance of the model.

Parameter: -mo [model_bim_file]
Example: java -jar MacLeaps.jar -p T2D_french.bed -mo T2D_german.bim -s T2D_french -o T2D_german_on_french
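
Extracting the model archive can be done with any ZIP tool. As a sketch, the following Python function unpacks "[studyname].zip" into a folder and lists the .bim files it finds, so that you know which path to pass to -mo. The function name is hypothetical, and the assumption that the archive contains a .bim file is based solely on the -mo example above.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_model(zip_path: str, target_dir: str) -> List[Path]:
    """Extract a model archive and return the .bim files found in it."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(target.rglob("*.bim"))
```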

6. Additional options for the analysis

SNP selection

The default method for selecting different SNP subsets is to only include those SNPs that reach a certain p-value threshold in the training set. Those thresholds are 10^-8, 10^-7, 10^-6, 10^-5, 10^-4 and 10^-3. (Note: Files resulting from this filtering will contain the string "p1e-8", "p1e-7", etc. in their name.)
There is also the option to only include the top n SNPs in the training set, where the best 3, 10, 30, 100, 300 and 1000 SNPs based on their p-value will be selected.

Parameter: -t
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -t -o T2D_5CV_topN
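
The difference between the two selection modes can be sketched in a few lines of Python. The function names and the dictionary input are assumptions for illustration. In MacLeaps the p-values come from the training set of each fold.

```python
from typing import Dict, List


def select_by_threshold(pvalues: Dict[str, float], threshold: float) -> List[str]:
    """Default mode: keep every SNP whose p-value is below the threshold."""
    return sorted(snp for snp, p in pvalues.items() if p < threshold)


def select_top_n(pvalues: Dict[str, float], n: int) -> List[str]:
    """Top-n mode (-t): keep the n SNPs with the smallest p-values."""
    return sorted(pvalues, key=pvalues.get)[:n]
```

A threshold can yield an empty subset when no SNP is significant enough, whereas the top-n mode always returns n SNPs as long as at least n are available.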

SVM with Radial Basis Function (RBF) kernel

In addition to the standard linear SVM, an SVM with an RBF kernel can be used to train a model.

Parameter: -dr or --do-rbf

Since this introduces an additional parameter gamma into the SVM model, the optimal value has to be determined for each model training using a time-consuming grid search. The number of repeats and folds for this inner cross-validation can be specified by the following options:

Parameter: -gr [gridsearch_repeats] -gf [gridsearch_folds]
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -dr -gr 5 -gf 2 -o T2D_5CV
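
For orientation, the standard RBF kernel is K(x, z) = exp(-gamma * ||x - z||^2). The Python sketch below shows this formula and a rough count of the model fits a grid search causes. The number of gamma candidates tried by MacLeaps is not stated in this guide, so the value used in the note below is purely an assumption.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def gridsearch_fits(n_gamma_candidates: int, repeats: int, folds: int) -> int:
    """Number of inner SVM fits for one model training, assuming every
    candidate is evaluated in a repeated cross-validation."""
    return n_gamma_candidates * repeats * folds
```

With -gr 5 -gf 2 and, say, 10 candidate values for gamma, gridsearch_fits(10, 5, 2) gives 100 inner fits per model training, which illustrates why this option is slow.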

There is also an experimental heuristic that guesses the optimal parameters for the RBF kernel much faster, but the features need to be normalized for it, and optimality cannot be guaranteed.

Parameter: -n -ht
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -dr -n -ht -o T2D_5CV

General pipeline options

By default, all available CPUs will be used for the SVM analysis, i.e. the number of threads will be set to the number of CPUs. You can specify a different number of threads using this option:

Parameter: -nt [number_of_threads]
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -o T2D_5CV -nt 8

To speed up file reading and processing, MacLeaps will convert non-binary PLINK files (.ped) to binary PLINK files (.bed) before the analysis starts. If you want to suppress this behavior, you can tell MacLeaps to run in minimalistic mode.

Parameter: -m
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -o T2D_5CV -m

By default, MacLeaps only outputs general information during the analysis. If errors or other problems occur, you can tell it to be more verbose about what it is currently doing.

Parameter: -v
Example: java -jar MacLeaps.jar -p T2D.bed -s T2D_5CV -k 5 -o T2D_5CV -v

7. Interpreting the results

In all modes of operation that perform a validation (that is, everything except "Model training only"), two to three CSV files will be generated that contain the results of the analysis. The values in each line represent the results across all folds of the cross-validation (if applicable) for the SNP subset named in the first column:
[studyname].snpcount.csv
Statistics about the number of SNPs selected for the various thresholds. Columns:
  1. SNP subset name, e.g. 1e-5 for SNPs with p-value < 10^-5 or 30 for the top 30 SNPs
  2. Minimum number of SNPs
  3. Average number of SNPs
  4. Maximum number of SNPs
  5. Standard deviation of the number of SNPs
Example:
#SNPs min avg max stdev
1e-8 0 1.40 3 1.14
1e-7 0 2.80 4 1.64
1e-6 3 6.00 9 2.45

[studyname].auc.lin.csv
Performance results for the linear SVM. Columns:
  1. SNP subset name, e.g. 1e-5 for SNPs with p-value < 10^-5 or 30 for the top 30 SNPs
  2. Minimum AUC during model training
  3. Average AUC during model training
  4. Maximum AUC during model training
  5. Minimum AUC in the validation
  6. Average AUC in the validation
  7. Maximum AUC in the validation
  8. Standard deviation of the AUC in the validation
  9. Number of folds that contain at least one SNP in this subset
Example:
#train. min avg max valid. min avg max stdev count
1e-8 0.58660 0.63733 0.69780 0.54910 0.57905 0.59530 0.02070 4
1e-7 0.63810 0.68340 0.71890 0.53290 0.57758 0.62880 0.04080 4
1e-6 0.64990 0.72140 0.78110 0.62430 0.65854 0.68360 0.02447 5

[studyname].auc.rbf.csv
Performance results for the SVM with RBF kernel. Same format as for linear SVM.
[studyname].auc.nlin.csv
Performance results for the linear SVM with normalized data. Same format as for linear SVM.
[studyname].auc.nrbf.csv
Performance results for the SVM with RBF kernel and normalized data. Same format as for linear SVM.
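
The AUC files can be post-processed with a few lines of code. The following Python sketch parses lines in the nine-column layout described for [studyname].auc.lin.csv and picks the SNP subset with the best average validation AUC. The column names are made up for this example, and because the exact delimiter is not specified in this guide, the parser accepts commas, semicolons and whitespace.

```python
import re
from typing import Dict, Iterable, List, Union

AUC_COLUMNS = ["subset", "train_min", "train_avg", "train_max",
               "valid_min", "valid_avg", "valid_max", "valid_stdev", "folds"]

Row = Dict[str, Union[str, float, int]]


def parse_auc_lines(lines: Iterable[str]) -> List[Row]:
    """Parse result lines; empty lines and '#' header lines are skipped."""
    rows: List[Row] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = re.split(r"[,;\s]+", line)
        row: Row = {"subset": fields[0]}
        for name, value in zip(AUC_COLUMNS[1:-1], fields[1:-1]):
            row[name] = float(value)
        row["folds"] = int(fields[-1])
        rows.append(row)
    return rows


def best_subset(rows: List[Row]) -> Row:
    """Return the row with the highest average validation AUC."""
    return max(rows, key=lambda row: row["valid_avg"])
```

Applied to the three example lines of the linear SVM above, best_subset() selects the 1e-6 subset with an average validation AUC of 0.65854.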


A. Using simulation data

In general, it does not matter where the genotype information comes from; it can be real measured data or simulated data. PLINK offers the functionality to generate simulation data based on a few parameters. Instead of creating the data set manually and then using it as input, you can start MacLeaps with some additional options, and it will call PLINK to create the simulation data automatically.

To use this option, the user needs to create a file containing simulation parameters in a format described on the PLINK website. This file tells PLINK what kinds of SNPs to simulate. The format looks roughly like this:
99900   null      0.05 0.95  1.00 1.00
100     disease   0.05 0.95  1.50 mult
    
This means that 99900 SNPs (named "null_0", "null_1", ...) will be simulated with a "risk" variant whose allele frequency lies between 0.05 and 0.95 and whose odds ratio (OR) is 1.0 in both the heterozygous and the homozygous case. In addition, 100 SNPs (named "disease_0", ...) will be simulated with an allele frequency between 0.05 and 0.95, an OR of 1.5 in the heterozygous case and multiplicative risk, i.e. an OR of 1.5 * 1.5 = 2.25 in the homozygous case.

You can also find an example under examples/gwas.sim in the downloaded ZIP file.
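
The lines of such a parameter file can also be generated programmatically. The Python sketch below is an illustration under assumptions: the function names are hypothetical and the tab separator is a choice made here. It formats one line per SNP set and shows how the homozygous odds ratio follows from the "mult" keyword.

```python
def homozygous_or(het_or: float, hom: str) -> float:
    """Odds ratio in the homozygous case.

    hom is either the keyword "mult" (multiplicative risk, i.e. the square
    of the heterozygous OR) or an explicit value such as "1.00".
    """
    if hom == "mult":
        return het_or ** 2
    return float(hom)


def sim_line(n_snps: int, label: str, lower_freq: float, upper_freq: float,
             het_or: float, hom: str) -> str:
    """Format one line of a simulation parameter file (tab-separated here)."""
    return (f"{n_snps}\t{label}\t{lower_freq:.2f}\t{upper_freq:.2f}"
            f"\t{het_or:.2f}\t{hom}")
```

For the example above, sim_line(100, "disease", 0.05, 0.95, 1.5, "mult") produces the second line of the file, and homozygous_or(1.5, "mult") evaluates to 2.25.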

Having prepared such a file you can supply it to MacLeaps together with additional parameters that specify how many cases and controls and what disease prevalence should be simulated.

Parameter: --sim-file [simulation_parameters] --sim-ncases [no_of_cases] --sim-ncontrols [no_of_controls] --sim-prevalence [prevalence]
Example: java -jar MacLeaps.jar --sim-file gwas.sim --sim-ncases 500 --sim-ncontrols 500 --sim-prevalence 0.05 -s simulation -o simulation