Simultaneously generating multiple omic measurements (e.g. transcriptomics, metabolomics, proteomics or epigenomics) of the same molecular system for one particular study is not always possible. As a consequence, researchers sometimes combine compatible data generated in different labs or in different batches. In such cases, data will usually be affected by an unwanted effect associated to the experimentation event (lab, batch, technology, etc.) that, especially for high throughput molecular assays, may result in important levels of noise contaminating the biological signal. This unwanted source of variation is commonly known as ``batch effect’’ and is very frequently seen as the first source of variability in the omic dataset, standing out over the experimental conditions under study.
Removing batch effects becomes then necessary in order to obtain meaningful results from statistical analyses. Provided that the omic experiment has been designed in such a way that batch effects are not confounded with the effects of interest (treatment, disease, cell type, etc.), the so-called Batch Effect Correction Algorithms (BECAs) can be used to remove, or at least mitigate, systematic biases.Therefore these methods are extremely useful to combine data from different laboratories or measured at different times. One of these BECAs is the ARSyN method , which relies on the ANOVA-Simultaneous Components Analysis (ASCA) framework to decompose the omic signal into experimental effects and other unwanted effects. ARSyN applies Principal Component Analysis (PCA) to estimate the systematic variation due to batch effect and then removes it from the original data.
BECAs have been traditionally applied to remove batch effects from omic data of the same type, as for example gene expression. However, while removing batch effects from a single omic data type with an appropriate experimental design is relatively straightforward, it can become unapproachable when dealing with multiomic datasets. In the multiomic scenario, each omic modality may have been measured by a different lab or at a different moment in time, and so it is obtained within a different batch. When this is the case, the batch effect will be confounded with the ``omic type effect’’ and will be impossible to remove from the data. However, in some scenarios, the multiomic batch effect can be corrected. MultiBaC is the first BECA dealing with batch effect correction in multiomic datasets. MultiBaC can remove batch effects across different omics generated within separate batches provided that at least one common omic data type is included in all the batches.
The MultiBaC package includes two BECAs: the ARSyN method for correcting batch effect from a single omic data type and the MultiBac method, which deals with the batch effect problem on multi-omic assays.
ARSyN (ASCA Removal of Systematic Noise) is a method that combines Analysis of Variance (ANOVA) and Principal Component Analysis (PCA) for the identification of structured variation from the estimated ANOVA models for experimental and unwanted effects on an omic data matrix. ARSyN can remove undesired signals to obtain noise-filtered data for further analysis . In MultiBaC package, ARSyN has been adapted for filtering the noise associated to identified or unidentified batch effects. This adaptation has been called ARSyNbac.
In the ARSyN method, the ANOVA model separates the signal identified with each one of the factors involved in the experimental design from the residuals. The algorithm can be applied on multi-factorial experimental designs. One of the factors in the model can be the batch each sample belongs to, if this information is known. In such case, the ANOVA model is applied to separate the batch effect from the remaining effects and residuals. The PCA analysis will hence detect the possible existence of a structured variation due to the batch effect, that is identified with the principal components explaining a given proportion of the total variation in the data, which can be set by the user.
However, ARSyN can also be applied when the batch factor is not known since the PCA on the residual matrix can detect correlated structure associated to a source of variation not included in the experimental design. We alert of this signal in the residuals when the first eigenvalues of the PCA are noticeably higher than the rest, because if there is not any structure the eigenvalues will be approximately equal. In this case the selection of components is controlled by the beta argument. Components that represent more than beta times the average variability are identified as systematic noise and removed from the original data.
Nueda MJ, Ferrer A, Conesa A.(2012). ARSyN: A method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics 13:553-66.
The yeast expression example data sets were collected from the Gene Expression Omnibus (GEO) database and from three different laboratories (batches). In all of them, the effect of glucose starvation in yeast was analyzed. Lab A is the Department of Biochemistry and Molecular Biology from Universitat de Valencia (accession number GSE11521) ; Lab B is the Department of Molecular and Cellular Biology from Harvard University (accession number GSE56622) ; and Lab C is the Department of Biology from Johns Hopkins University (accession number GSE43747) .
After a proper data pre-processing for each case, a voom transformation (with limma R package) was applied when necessary. Finally TMM normalization was performed on the whole set of samples from all labs. A reduced dataset was obtained by selecting 200 omic variables from each data matrix and just 3 samples from lab A. This yeast multiomic reduced dataset is included in MutiBaC package to illustrate the usage of the package. The gene expression matrices can be loaded by using the data(“multiyeast”) instruction.
The three studies used equivalent yeast strains and experimental conditions but, as shown in Figure 1 , the main effect on expression is due to data belonging to different labs, which are the batches in this case.
The MultiBaC package uses MultiAssayExperiment objects, a type of Bioconductor container for multiomic studies, that can be created from a list of matrices or data.frame objects. These matrices must have features in rows and samples in columns as shown next for one of the data matrices from the yeast example (A.rna). It is important that all data matrices share the variable space. If the number of omic variables and order are not the same, the createMbac function will select the common variables. Hence, it is mandatory that rows are named with the same type of identifiers.
## A.rna_Glu+_1 A.rna_Glu+_2 A.rna_Glu+_3 A.rna_Glu-_1 A.rna_Glu-_2 ## YOR324C 7.174264 6.976815 7.482661 8.020596 7.736636 ## YGL104C 4.239493 4.284775 3.100898 4.957403 3.673252 ## YOR142W 8.819824 8.496966 9.026971 10.374525 10.294006 ## YOR052C 6.721211 7.011932 7.557519 8.504503 8.586738 ## YGR038W 5.878483 5.894121 6.468361 7.856822 7.806318 ## YER087W 5.131089 5.295029 5.750657 2.920250 5.496136 ## A.rna_Glu-_3 ## YOR324C 7.922425 ## YGL104C 1.587184 ## YOR142W 10.979224 ## YOR052C 8.221168 ## YGR038W 8.108436 ## YER087W 2.879254
A MultiAssayExperiment object needs to be created for each batch (lab in this example). The mbac new data structure is a list of MultiAssayExperiment objects and can be easily generated with the createMbac function in the package. The resulting mbac object will be the ARSyNbac input.
data_RNA<- createMbac (inputOmics = list(A.rna, B.rna, C.rna), batchFactor = c("A", "B", "C"), experimentalDesign = list("A" = c("Glu+", "Glu+", "Glu+", "Glu-", "Glu-", "Glu-"), "B" = c("Glu+", "Glu+", "Glu-", "Glu-"), "C" = c("Glu+", "Glu+", "Glu-", "Glu-")), omicNames = "RNA")
These are the arguments for the createMbac function:
The mbac R structure generated by the createMbac function is an S3 object that initially contains just one slot, the ListOfBatches object. This mbac structure will incorporate more elements as they are created when running ARSyNbac or MultiBaC functions. These new slots are CorrectedData, PLSmodels, ARSyNmodels or InnerRelation and are described next:
The function to remove batch effects or unwanted noise from a single omic data matrix in the MultiBaC package is the ARSyNbac function, which allows for the following arguments:
ARSyNbac (mbac, batchEstimation = TRUE, Interaction=FALSE, Variability = 0.90, beta = 2, modelName = "Model 1", showplot = TRUE)
Therefore, the ARSyNbac function offers two types of analysis: ARSyNbac batch effect correction, when the batch information is provided, and ARSyNbac noise reduction, if the batch information is unknown. In the following sections we explain how to proceed with each one of them.
When the batch is identified in the batchFactor argument of the mbac input object, its effect can be estimated and removed by choosing batchEstimation = TRUE. Moreover, a possible interaction between the experimental factors and the batch factor can be studied by setting interaction=TRUE.
par(mfrow = c(1,2)) arsyn_1 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95, batchEstimation = TRUE, Interaction = FALSE) plot(arsyn_1, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", levels(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 3), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2))
According to the left plot in Figure 2, two principal components (PCs) have been selected to explain al least 95% of the total variability of batch effect. These two PCs are shown in the right panel and now the main source of variability in the data (PC1) is given by the experimental condition, while samples are not clustered by batches anymore.
par(mfrow = c(1,2)) arsyn_2 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95, batchEstimation = TRUE, Interaction = TRUE) plot(arsyn_2, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", levels(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 3), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2))