Machine Learning Analysis Pipeline for
Genome-Wide Association Study SNP Data
A stand-alone version is available as a runnable JAR file in our downloads section.
Machine learning methods can detect more complex relations between single nucleotide polymorphisms (SNPs) and diseases in GWAS data
than simple statistical methods that look at each SNP separately.
However, this often requires several tools, extensive data conversion, and a considerable amount of user interaction.
We developed an automated pipeline that uses state-of-the-art machine learning algorithms to create a disease risk model
based on a given GWAS SNP dataset and assesses its predictive performance for unseen datasets.
The pipeline can either use a first dataset for training the model and a second for validation, or perform a nested k-fold cross-validation on a single dataset.
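As a rough illustration of the nested k-fold mode, the following sketch generates the index splits. It is not the pipeline's own code: the function names, the fold counts, and the simple unstratified shuffling are assumptions made for this example (a real implementation would likely keep the case/control ratio similar in every fold).

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices 0..n_samples-1 into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so the outer test fold is never used for model selection.
    """
    outer_folds = k_fold_indices(n_samples, outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [s for j, fold in enumerate(outer_folds) if j != i for s in fold]
        # Hypothetical choice: reuse the shuffled order for the inner folds.
        inner_folds = [outer_train[m::inner_k] for m in range(inner_k)]
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [s for n, fold in enumerate(inner_folds) if n != m for s in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits
```

The point of the nesting is that any choice made on the inner splits (for example a p-value threshold or a kernel parameter) is evaluated on an outer fold that played no part in that choice.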
For each training set, a basic case/control association analysis is performed to estimate the association between each individual SNP and the phenotype.
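The text does not say which test is used for the association analysis. One common choice for a basic case/control analysis is a chi-square test on allele counts, sketched below under that assumption; the function name and the count layout are hypothetical.

```python
import math


def allelic_chi2_pvalue(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele count table."""
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_minor + ctrl_minor, case_major + ctrl_major]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # monomorphic SNP or empty group: no evidence of association
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (table[r][c] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

In a cross-validation setting the counts would come from the training samples only, so that the validation samples do not influence which SNPs are selected.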
Using this information, the dataset is filtered to create multiple subsets, each containing only SNPs whose p-value lies below a given threshold.
For each subset, a model is trained using a support vector machine (LIBSVM, with linear and RBF kernels) and tested on its corresponding validation subset.
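The filtering step can be pictured as follows. This is a minimal sketch, not the pipeline's implementation: the function name, the SNP identifiers, and the threshold values in the usage example are made up for illustration.

```python
def filter_by_threshold(p_values, thresholds):
    """Map each p-value threshold to the SNP ids whose p-value lies below it."""
    return {
        t: [snp for snp, p in p_values.items() if p < t]
        for t in thresholds
    }


# Hypothetical p-values from the association analysis on a training set.
training_p_values = {"rs1": 1e-6, "rs2": 0.003, "rs3": 0.2}
subsets = filter_by_threshold(training_p_values, [1e-4, 0.01, 1.0])
```

Each resulting SNP list would then be applied to both the training set and its validation set, so that the model is trained and tested on the same features.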
The prediction performance is measured as the area under the ROC curve (AUC) and visualized in a plot showing average AUC and standard deviation for each p-value threshold.
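For readers unfamiliar with the measure, the sketch below computes the AUC directly from its rank interpretation and then summarizes it per threshold. The function names are hypothetical, labels are assumed to be 1 for cases and 0 for controls, and the use of the sample standard deviation is an assumption, since the text does not specify which variant is plotted.

```python
import statistics


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(auc_by_threshold):
    """Return {threshold: (mean AUC, sample standard deviation)} across folds."""
    return {
        t: (statistics.mean(v), statistics.stdev(v) if len(v) > 1 else 0.0)
        for t, v in auc_by_threshold.items()
    }
```

The pairwise comparison is quadratic in the number of samples, which is fine for illustration; a sorting-based computation would be preferable for large cohorts.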
The only required input for this pipeline is the SNP data; all other parameters have default values that the user can override if desired.
During the whole process, the pipeline takes care of the necessary conversions between different data formats
and stores all intermediate data and final results to allow for subsequent analysis of single steps.
Additionally, if a second analysis is performed on the same dataset with different parameters, e.g., with an additional p-value threshold,
the pipeline reuses intermediate data that is still valid instead of repeating the steps that created it.
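How the pipeline decides that stored data is still valid is not described here. One simple way to get this behaviour, shown purely as an assumption, is to store each step's output under a key derived from the step name and its parameters and to recompute only when no matching file exists. The function name, the JSON storage format, and the file naming are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cached_step(cache_dir, step_name, params, compute):
    """Run compute() only if no stored result exists for this step and parameters."""
    key_source = json.dumps([step_name, params], sort_keys=True).encode()
    key = hashlib.sha256(key_source).hexdigest()[:16]
    path = Path(cache_dir) / f"{step_name}-{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # still valid: reuse stored result
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

With such a scheme, adding one more p-value threshold would create a new key only for the new subset and its model, while the association results and the existing subsets would be read back from disk.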