1 Variability and heterogeneity in high-dimensional data

Individual diversity and variability are among the most complex issues to deal with in high-dimensional studies of large populations, such as those currently performed in biomedical analyses using omic technologies. DECO is a method that combines two main computational procedures: (i) a Recurrent-sampling Differential Analysis (RDA), which performs combinatorial sampling without replacement to select multiple sample subsets, each followed by a differential analysis (LIMMA); and (ii) a Non-Symmetrical Correspondence Analysis (NSCA) on the differential events, which improves the characterization and assignment of features and samples in a common multidimensional space. This second step combines in a single statistic (the h-statistic) both the feature-sample changes detected and the predictor-response information provided by NSCA.
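To make the recurrent-sampling idea in step (i) more concrete, the following is a minimal Python sketch, not DECO's actual implementation. It repeatedly draws small subsets from two classes without replacement, applies a differential test to each pair of subsets, and counts how often each feature (and each sample) takes part in a differential event. Several things here are assumptions made for illustration only: a plain Welch t-statistic stands in for the LIMMA moderated test, and the function names, the cutoff, the subset size, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev


def welch_t(a, b):
    """Welch t-statistic (assumption: a simple stand-in for LIMMA)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def recurrent_sampling(data, labels, subset_size=3, iterations=400,
                       t_cutoff=4.0, seed=0):
    """Count differential events per feature over many random subsets.

    data   : dict mapping feature name -> list of values, one per sample
    labels : list of class labels ("A" or "B"), one per sample
    """
    rng = random.Random(seed)
    idx_a = [i for i, lab in enumerate(labels) if lab == "A"]
    idx_b = [i for i, lab in enumerate(labels) if lab == "B"]
    feature_hits = Counter()            # feature -> number of events
    sample_hits = defaultdict(Counter)  # feature -> sample -> events
    for _ in range(iterations):
        # random.sample draws without replacement within each class
        sub_a = rng.sample(idx_a, subset_size)
        sub_b = rng.sample(idx_b, subset_size)
        for feature, values in data.items():
            t = welch_t([values[i] for i in sub_a],
                        [values[i] for i in sub_b])
            if abs(t) >= t_cutoff:
                feature_hits[feature] += 1
                sample_hits[feature].update(sub_a + sub_b)
    return feature_hits, sample_hits


# Hypothetical toy data: 6 samples of class A followed by 6 of class B.
labels = ["A"] * 6 + ["B"] * 6
base = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]
data = {
    # changed in every class B sample
    "g_all": base + [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
    # changed only in a subgroup of class B (samples 6, 7 and 8)
    "g_sub": base + [5.0, 5.1, 4.9, 1.0, 1.1, 0.9],
    # no change between classes
    "g_null": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0],
}

feature_hits, sample_hits = recurrent_sampling(data, labels)
for name in data:
    print(name, feature_hits[name])
```

In this toy setting, a feature that changes in every class B sample is flagged in every iteration, while a feature that changes only in a subgroup is flagged only when the draw happens to pick that subgroup. The per-sample counts then point back to the samples driving the signal, which is the kind of feature-sample association that the later NSCA step is described as characterizing.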

The statistical procedures followed in both parts of the method are detailed in the original publication [1]. This brief vignette explains how to use DECO to analyze multidimensional datasets that may include heterogeneous samples. The aim is to improve the characterization and stratification of complex sample series, focusing mostly on large patient cohorts, where outlier or mislabeled samples are likely to be present.

Thus, DECO can reveal exclusive associations between features and samples based on specific differential signals and provide a better way to stratify populations using large-scale multidimensional data. The method can be applied to data derived from different omic technologies, since LIMMA has been demonstrated to be applicable to non-transcriptomic data. Throughout this vignette, we use genome-wide expression data obtained with microarrays or with RNA-seq (for genes, miRNAs, ncRNAs, etc.).