The Metabolomics Benchmark Dataset (MBD)
----------------------------------------

IR spectroscopy is fast, but it measures additive superpositions of compound signals. Practical metabolomics requires a fast screening method for metabolic monitoring. Factorisation methods aim to estimate the original signals from mixture signals, but their development requires constructed datasets with known subsignals. Here, we facilitate method development by providing an artificial IR benchmark dataset that implements a metabolomics data mining situation.

The MBD is a benchmark dataset specially designed to facilitate the development of factorisation methods for metabolomic data mining. To this end, it consists of additive mixtures of 4 signal components: PNU (Base), Mannitol (A), Ascorbic Acid (B), and Bovine Serum Albumin (C). Base, a complex testing mixture for IR devices, serves as the background, while the other compounds represent spiked-in signals. A and B are constitutive signals that are spiked into every sample, but can be regulated to an (u)p, (d)own, or (n)ormal level. C represents a facultative signal that indicates an altered metabolic state.

A metabolic state is therefore described by the following regular expression code:

    (u|n|d)(u|n|d)(0|1)

where the first term denotes the state of A, the second term the state of B, and the final binary term the presence of C. Thus, 'nn0' indicates that A and B have normal concentration states and C is absent. Similarly, 'un1' indicates that A is upregulated, B keeps its normal concentration, and C is spiked in.

Detailed information about the composition of each single sample is given in the file 'pipetting-plan.xls', which can be opened with a spreadsheet program. It contains two tables, 'SampleDefinition' and 'PipettingPlan'.
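Before turning to the two tables, the state code described above can be decoded mechanically. The following Python sketch is only an illustration and not part of the provided R software; the names 'parse_state' and 'MetabolicState' are hypothetical, and only the pattern itself is taken from the description.

```python
import re
from typing import NamedTuple

# Pattern taken from the description: (u|n|d)(u|n|d)(0|1)
STATE_PATTERN = re.compile(r"(u|n|d)(u|n|d)(0|1)")

# Readable names for the three regulation levels of A and B.
LEVELS = {"u": "up", "n": "normal", "d": "down"}


class MetabolicState(NamedTuple):
    """Hypothetical container for a decoded state code."""
    a_level: str      # regulation level of A (Mannitol)
    b_level: str      # regulation level of B (Ascorbic Acid)
    c_present: bool   # True if C (Bovine Serum Albumin) is spiked in


def parse_state(code: str) -> MetabolicState:
    """Decode a three-character state code such as 'nn0' or 'un1'."""
    match = STATE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid state code: {code!r}")
    a, b, c = match.groups()
    return MetabolicState(LEVELS[a], LEVELS[b], c == "1")


if __name__ == "__main__":
    for code in ("nn0", "un1"):
        print(code, parse_state(code))
```

Using 'fullmatch' rather than 'match' rejects codes with trailing characters, so a malformed label raises an error instead of being silently truncated.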
The 'SampleDefinition' table gives, for each mastermix (+X, X, -X), its concentration (with respect to PNU), the quantity of each mastermix used to compose each state, and the final concentration of each compound in a standard sample. The 'PipettingPlan' table provides the quantities of each mastermix that were actually pipetted. For each state, one half of the samples was created using the standard definition from 'SampleDefinition'. For each sample of the other half, the standard quantities were altered by random numbers between -5 and 5 obtained from www.random.org. For instance, the row

<---snip--->
Sample ID  State Count  State ID  base [µl]  -A [µl]  A [µl]  +A [µl]  -B [µl]  B [µl]  +B [µl]  C [µl]  Rand base [µl]  Rand -A [µl]  Rand A [µl]  Rand +A [µl]  Rand -B [µl]  Rand B [µl]  Rand +B [µl]  Rand C [µl]  Real base [µl]  Real -A [µl]  Real A [µl]  Real +A [µl]  Real -B [µl]  Real B [µl]  Real +B [µl]  Real C [µl]  Sample Randomised  Sample Name
...
78  nn0  50  0  10  0  0  10  0  0  0  -5  5  4  -5  -4  -2  -2  50  0  15  0  0  6  0  0  1  irval_nn0_no078_sid078_r1
...
<---snip--->

in table 'PipettingPlan' means that in the sample 'irval_nn0_no078_sid078_r1' of the class 'nn0' the standard quantities of 50 µl Base, 10 µl A, 10 µl B, and 0 µl C have been altered to 50 µl Base, 15 µl A, 6 µl B, and 0 µl C by adding 5 µl to the quantity of A and -4 µl to the quantity of B.
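The arithmetic of this example row can be checked with a short Python sketch. The three value lists are copied from the row above. The rule in 'apply_offsets' (a hypothetical helper) is an assumption inferred from this single row: the random offset seems to take effect only for mastermixes whose standard quantity is nonzero, since for example '+A' has a random number of 4 but a real quantity of 0.

```python
# Mastermix columns in the order used by the 'PipettingPlan' table.
MIXES = ["base", "-A", "A", "+A", "-B", "B", "+B", "C"]

# Values copied from the example row of sample 'irval_nn0_no078_sid078_r1'.
standard = dict(zip(MIXES, [50, 0, 10, 0, 0, 10, 0, 0]))   # standard quantities [µl]
rand = dict(zip(MIXES, [0, -5, 5, 4, -5, -4, -2, -2]))     # 'Rand' columns [µl]
real = dict(zip(MIXES, [50, 0, 15, 0, 0, 6, 0, 0]))        # 'Real' columns [µl]


def apply_offsets(standard, rand):
    """Return the pipetted quantities for one sample.

    ASSUMPTION (inferred from the example row only): a random offset is
    added only where the standard quantity is nonzero; mastermixes that
    are not part of the state stay at 0 µl.
    """
    return {
        mix: standard[mix] + rand[mix] if standard[mix] != 0 else 0
        for mix in standard
    }


if __name__ == "__main__":
    computed = apply_offsets(standard, rand)
    for mix in MIXES:
        print(f"{mix:>4}: standard={standard[mix]:>3} rand={rand[mix]:>3} "
              f"real={computed[mix]:>3}")
    print("matches 'Real' columns:", computed == real)
```

Under this assumption the computed quantities reproduce the 'Real' columns of the row; other rows of the table may follow a different rule and should be checked against the spreadsheet.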
The following table lists the measured states, the number of samples per state, and their meaning:

samples   A   B   C   code   definition
-----------------------------------------------------------------------------------
150       n   n   0   nn0    Normal healthy state
 50       n   n   1   nn1    Disease state with C and no cross-regulation
 50       u   n   0   un0    Disease state A up regulated
 50       u   n   1   un1    Disease state A up regulated by C
 50       d   n   0   dn0    Disease state A down regulated
 50       d   n   1   dn1    Disease state A down regulated by C
 50       n   u   0   nu0    Disease state B up regulated
 50       n   u   1   nu1    Disease state B up regulated by C
 50       n   d   0   nd0    Disease state B down regulated
 50       n   d   1   nd1    Disease state B down regulated by C

Each sample was measured three times to obtain averaged signals. The raw measurements are available in mbd-xy-raw.tar.gz, which comprises the mixture signals diluted in PNU. The pure signal triplicates of the spiked-in compounds are available in compound-xy-raw.zip. All files are xy-files, each containing a two-column table where the first column holds the wavelength and, separated by a comma, the second column holds the measured intensity. Files having the same filename in different directories correspond to the same sample.

In addition to the raw files, we also provide a preprocessed dataset as well as benchmark software written in R. The preprocessed dataset consists of averaged and smoothed IR spectra stored in tables (named 'X.tab'). Only spectra having a Pearson correlation above 0.95 were used; these were first smoothed using the Savitzky-Golay method (filter length 15) and subsequently averaged. The corresponding class information is available in tables named 'Y.tab' (containing +1 if a signal belongs to a class and -1 otherwise). Supporting information for each preparation is stored in files named 'Info.tab'. All tab files were created using the R command 'write.table' with headers and row names; the row names are the related filenames.
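As an illustration of the xy format and the triplicate averaging described above, the following Python sketch parses comma-separated xy text and averages replicate spectra. It is not the preprocessing that produced 'X.tab'. Two points are assumptions: the description does not say which spectra were correlated with each other, so the sketch requires every pair of replicates to exceed the threshold, and the Savitzky-Golay smoothing step is omitted. The function names are hypothetical.

```python
import math
from itertools import combinations


def parse_xy(text):
    """Parse 'wavelength,intensity' lines into two lists of floats."""
    wavelengths, intensities = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y = line.split(",")
        wavelengths.append(float(x))
        intensities.append(float(y))
    return wavelengths, intensities


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def average_replicates(replicates, min_r=0.95):
    """Average replicate intensity vectors point by point.

    ASSUMPTION: the replicates are accepted only if every pairwise
    Pearson correlation exceeds min_r; otherwise None is returned.
    """
    for a, b in combinations(replicates, 2):
        if pearson(a, b) <= min_r:
            return None
    return [sum(values) / len(values) for values in zip(*replicates)]


if __name__ == "__main__":
    # Synthetic data for demonstration only, not taken from the dataset.
    texts = [
        "1000,0.1\n1001,0.2\n1002,0.4\n",
        "1000,0.2\n1001,0.4\n1002,0.8\n",
        "1000,0.3\n1001,0.6\n1002,1.2\n",
    ]
    spectra = [parse_xy(t)[1] for t in texts]
    print(average_replicates(spectra))
```

Real spectra would be read from the extracted archives, for example with 'open(path).read()', before being passed to 'parse_xy'.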
In addition to the pure preprocessed signals, first- and second-order derivatives were computed by the Savitzky-Golay method.

The provided software comprises a driver routine for comparing four different factorisation methods (ICA, NMF, PLSR, and BrierScoreMF) with respect to their feature detection ability. To this end, each dataset is first factorised, and then the wavelet representation of its signal decomposition is compared to the wavelet representation of the spiked-in signals. Based on this comparison and a randomisation control, the significance of feature detection can be estimated.

To run the software, first install the following R packages for your R installation:

    wavelets
    pls
    fastICA
    corpcor
    bootstrap
    Matrix
    NMFN

The driver 'skript-wavelets-compare-factorisations.R' has the following command line interface when run using 'Rscript':

    usage: skript-wavelets-compare-factorisations.R

For the reference, it uses a 'reference.tab' file comprising the preprocessed signals for Base, A, B, and C from a subdirectory of 'reference/'. For the benchmark data, please specify a related preprocessed directory from the 'benchmark/' subdirectory. The parameter 'wt-filter' specifies the class of wavelets used for comparison.

A more detailed description is part of an article that is currently submitted. Please do not hesitate to contact me in case of further questions: carsten.henneges@uni-tuebingen.de