1 Introduction

Data from different experimental platforms and/or batches exhibit systematic variation – i.e., batch effects. Therefore, when conducting joint analysis of data from different batches, a key first step is to align the datasets.

corralm is a multi-table adaptation of correspondence analysis designed for single-cell data. It applies multi-dimensional optimized scaling and matrix factorization to compute integrated embeddings across the datasets. These embeddings can then be used in downstream analyses such as clustering, cell type classification, and trajectory analysis.

See the vignette for corral for dimensionality reduction of a single matrix of single-cell data.
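
For intuition about what correspondence analysis computes, here is a minimal base-R sketch of classical CA on one small, made-up count table: convert counts to proportions, take standardized residuals against the independence model, and factor them with SVD. This is the textbook formulation, not the actual implementation in corral or corralm, and it does not show the multi-table step. The matrix `counts` and all variable names are hypothetical.

```r
# Toy count matrix: 4 genes (rows) x 5 cells (columns); values are made up.
counts <- matrix(c(10, 0, 3, 1, 7,
                    2, 8, 0, 5, 1,
                    0, 4, 9, 2, 0,
                    6, 1, 2, 0, 8),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

P <- counts / sum(counts)         # correspondence matrix (proportions)
row_mass <- rowSums(P)            # gene masses
col_mass <- colSums(P)            # cell masses
E <- outer(row_mass, col_mass)    # expected proportions under independence
S <- (P - E) / sqrt(E)            # standardized (Pearson) residuals

# Matrix factorization step: SVD of the residual matrix.
# The last singular value is numerically zero because the residuals are centered.
sv <- svd(S)

# Principal coordinates of the cells: scale right singular vectors by the
# singular values, then divide each cell's row by the square root of its mass.
cell_coords <- sweep(sv$v, 2, sv$d, `*`) / sqrt(col_mass)
rownames(cell_coords) <- colnames(counts)

# A 2-D embedding of the cells, analogous in spirit to a reduced-dimension slot.
embedding <- cell_coords[, 1:2]
```

The sum of squared singular values equals the total inertia of the table, the Pearson chi-squared statistic divided by the grand total, which is why CA is often described as a decomposition of chi-squared distance.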

2 Loading packages and data

We will use the SCMixology datasets from the CellBench package (Tian et al. 2019).

library(corral)
library(SingleCellExperiment)
library(ggplot2)
library(CellBench)
library(MultiAssayExperiment)

scmix_dat <- load_all_data()[1:3]

These datasets include a mixture of three lung cancer cell lines:

  • H2228
  • H1975
  • HCC827

which was sequenced using three platforms:

  • 10X
  • CELseq2
  • Dropseq

scmix_dat
## $sc_10x
## class: SingleCellExperiment 
## dim: 16468 902 
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(16468): ENSG00000272758 ENSG00000154678 ... ENSG00000054219
##   ENSG00000137691
## rowData names(0):
## colnames(902): CELL_000001 CELL_000002 ... CELL_000955 CELL_000965
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## 
## $sc_celseq
## class: SingleCellExperiment 
## dim: 28204 274 
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(28204): ENSG00000281131 ENSG00000227456 ... ENSG00000148143
##   ENSG00000226887
## rowData names(0):
## colnames(274): A1 A10 ... P8 P9
## colData names(15): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## 
## $sc_dropseq
## class: SingleCellExperiment 
## dim: 15127 225 
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(15127): ENSG00000223849 ENSG00000225355 ... ENSG00000133789
##   ENSG00000146674
## rowData names(0):
## colnames(225): CELL_000001 CELL_000002 ... CELL_000249 CELL_000302
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Each sequencing platform captures a different set of genes (16468, 28204, and 15127 rows, respectively, in the output above). To apply corralm, the matrices must be matched by features (i.e., genes), so we will find the genes shared by all three datasets and subset each dataset to that intersection.

First, we will prepare the data by:

  1. adding the sequencing platform to the colData of each SCE (as a column named Method), and
  2. subsetting each SCE to the intersection of the genes.

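# Label every cell with the platform it was sequenced on
platforms <- c('10X','CELseq2','Dropseq')
for(i in seq_along(scmix_dat)) {
  colData(scmix_dat[[i]])$Method <- rep(platforms[i], ncol(scmix_dat[[i]]))
}

# Keep only the genes (rows) present in all three datasets,
# then convert back to a plain list of SingleCellExperiments
scmix_mae <- as(scmix_dat,'MultiAssayExperiment')
scmix_dat <- as.list(MultiAssayExperiment::experiments(MultiAssayExperiment::intersectRows(scmix_mae)))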
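
To make the feature-matching step concrete, the following is a small base-R sketch of the same idea on made-up matrices: intersect the row names across tables, then subset each table to the shared rows. It is an analogue for illustration only, not how MultiAssayExperiment implements intersectRows. The helper `make_toy` and the objects `toy`, `shared`, and `toy_matched` are hypothetical.

```r
# Hypothetical helper: build a small matrix with the given gene names as rows.
make_toy <- function(genes, n_cells) {
  matrix(seq_len(length(genes) * n_cells),
         nrow = length(genes),
         dimnames = list(genes, paste0("cell", seq_len(n_cells))))
}

# Three toy tables standing in for the three platforms; gene sets overlap partially.
toy <- list(
  a = make_toy(c("g1", "g2", "g3", "g4"), 3),
  b = make_toy(c("g2", "g3", "g4", "g5"), 2),
  c = make_toy(c("g4", "g3", "g6"), 2)
)

# Genes present in every table.
shared <- Reduce(intersect, lapply(toy, rownames))

# Subset each table to the shared genes, in the same order,
# so that row i refers to the same gene in every table.
toy_matched <- lapply(toy, function(m) m[shared, , drop = FALSE])

# Each table now has the same number of rows; the columns (cells) are untouched.
sapply(toy_matched, dim)
```

Indexing by name rather than by position is what guarantees that the rows line up across tables even when the original gene orders differ, as they do for the third toy table here.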