1 Exploring the 1.3 million brain cell scRNA-seq data from 10X Genomics

Package: TENxBrainData
Authors: Aaron Lun, Martin Morgan
Modification date: 30 December, 2017
Compilation date: 2024-11-05

The TENxBrainData package provides an R / Bioconductor resource for representing and manipulating the 1.3 million brain cell single-cell RNA-seq (scRNA-seq) data set generated by 10X Genomics. It makes extensive use of the HDF5Array package to avoid loading the entire data set into memory, instead storing the counts on disk as an HDF5 file and loading subsets of the data into memory upon request.

2 Initial work flow

2.1 Loading the data

We use the TENxBrainData function to download the relevant files from Bioconductor’s ExperimentHub web resource. This includes the HDF5 file containing the counts, as well as the metadata on the rows (genes) and columns (cells). The output is a single SingleCellExperiment object from the SingleCellExperiment package. This class extends SummarizedExperiment with a number of features specific to single-cell data.

library(TENxBrainData)
tenx <- TENxBrainData()
tenx
## class: SingleCellExperiment 
## dim: 27998 1306127 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames(1306127): AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 ...
##   TTTGTCAGTTAAAGTG-133 TTTGTCATCTGAAAGA-133
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
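
The gene- and cell-level annotations listed above (Ensembl identifiers and symbols for the rows; barcodes and library/mouse assignments for the columns) can be inspected with the standard rowData and colData accessors, for example:

head(rowData(tenx))
head(colData(tenx))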

The first call to TENxBrainData() will take some time due to the need to download some moderately large files. The files are then stored locally, so that subsequent calls, in the same or in new sessions, are fast.
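
To see exactly which resources are being fetched and cached, they can be listed directly from ExperimentHub; a minimal sketch (the matched records should include the HDF5 counts and the row and column metadata):

library(ExperimentHub)
hub <- ExperimentHub()
## list the records associated with this data set
query(hub, "TENxBrainData")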

The count matrix itself is represented as a DelayedMatrix from the DelayedArray package. This wraps the underlying HDF5 file in a container that can be manipulated in R. Each count represents the number of unique molecular identifiers (UMIs) assigned to a particular gene in a particular cell.

counts(tenx)
## <27998 x 1306127> DelayedMatrix object of type "integer":
##            AAACCTGAGATAGGAG-1 ... TTTGTCATCTGAAAGA-133
##     [1,]                    0   .                    0
##     [2,]                    0   .                    0
##     [3,]                    0   .                    0
##     [4,]                    0   .                    0
##     [5,]                    0   .                    0
##      ...                    .   .                    .
## [27994,]                    0   .                    0
## [27995,]                    1   .                    0
## [27996,]                    0   .                    0
## [27997,]                    0   .                    0
## [27998,]                    0   .                    0
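
Operations on a DelayedMatrix are not executed immediately: subsetting, for example, is simply recorded, and data are only read from the HDF5 file when a block is explicitly realized. As a small illustration, the first 100 genes of the first 5 cells can be pulled into an ordinary in-memory matrix:

sub <- counts(tenx)[1:100, 1:5]  ## still a DelayedMatrix; nothing read yet
m <- as.matrix(sub)              ## realization: this block is read from the HDF5 file
dim(m)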

2.2 Exploring the data

To quickly explore the data set, we compute some summary statistics on the count matrix. We increase the DelayedArray block size to indicate that up to 2 GB of memory can be used when loading data from disk.

options(DelayedArray.block.size=2e9)
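
Note that recent versions of DelayedArray control the block size through a dedicated setter rather than a global option; if the call above has no effect in your installation, the equivalent is likely:

## assumes a recent DelayedArray; the block size is given in bytes
DelayedArray::setAutoBlockSize(2e9)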

We are interested in the library size of each cell, colSums(counts(tenx)); the number of genes expressed per cell, colSums(counts(tenx) != 0); and the average expression of each gene across cells, rowMeans(counts(tenx)). A naive implementation might be

lib.sizes <- colSums(counts(tenx))
n.exprs <- colSums(counts(tenx) != 0L)
ave.exprs <- rowMeans(counts(tenx))

However, the data is read from disk, disk access is comparatively slow, and the naive implementation reads the data three times. Instead, we’ll divide the data into column ‘chunks’ of 5,000 cells; we do this on a subset of the data to reduce computation time during the exploratory phase.

tenx20k <- tenx[, seq_len(20000)]
chunksize <- 5000
cidx <- snow::splitIndices(ncol(tenx20k), ncol(tenx20k) / chunksize)

and iterate through the file, reading the data and accumulating statistics on each iteration.

lib.sizes <- n.exprs <- numeric(ncol(tenx20k))
tot.exprs <- numeric(nrow(tenx20k))
for (i in cidx) {
    message(".", appendLF=FALSE)
    ## realize this chunk of columns as an ordinary in-memory matrix
    m <- as.matrix(counts(tenx20k)[, i, drop=FALSE])
    lib.sizes[i] <- colSums(m)           ## library size per cell in this chunk
    n.exprs[i] <- colSums(m != 0)        ## number of detected genes per cell
    tot.exprs <- tot.exprs + rowSums(m)  ## running total of counts per gene
}
ave.exprs <- tot.exprs / ncol(tenx20k)
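
For routine use, the same summaries can be computed without writing the loop by hand: the DelayedMatrixStats package provides block-processed versions of the familiar matrixStats functions that operate directly on a DelayedMatrix (though each call still streams over the data once). A sketch of the equivalent calls:

library(DelayedMatrixStats)
lib.sizes2 <- colSums2(counts(tenx20k))        ## library size per cell
n.exprs2 <- colSums2(counts(tenx20k) != 0L)    ## detected genes per cell
ave.exprs2 <- rowMeans2(counts(tenx20k))       ## mean count per gene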

Since the calculations are expensive and their results might be useful in the future, we annotate the tenx20k object

colData(tenx20k)$lib.sizes <- lib.sizes
colData(tenx20k)$n.exprs <- n.exprs
rowData(tenx20k)$ave.exprs <- ave.exprs
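
If the annotated object is to be reused across sessions, it can be written back to disk together with its HDF5-backed counts using HDF5Array’s serialization helpers; a sketch (the directory name is arbitrary, and writing the counts for 20,000 cells takes a little time and disk space):

library(HDF5Array)
## writes a new HDF5 file for the counts plus a serialized object alongside it
saveHDF5SummarizedExperiment(tenx20k, dir="tenx20k_store")
## in a later session: tenx20k <- loadHDF5SummarizedExperiment("tenx20k_store")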

Library sizes follow an approximately log-normal distribution and are surprisingly small.

hist(
    log10(colData(tenx20k)$lib.sizes),
    xlab=expression(Log[10] ~ "Library size"),
    col="grey80"
)
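
To put a number on ‘surprisingly small’, the distribution can also be summarized directly:

summary(colData(tenx20k)$lib.sizes)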

Expression of only a few thousand genes is detected in each cell.

hist(colData(tenx20k)$n.exprs, xlab="Number of detected genes", col="grey80")