1 Introduction

Droplet-based single-cell RNA sequencing (scRNA-seq) facilitates measuring the transcriptomes of thousands of cells in a single run. Pooling cells from different samples or conditions before cell partitioning and library preparation can significantly lower costs and reduce batch effects. The task of assigning each cell of a pooled sample to its sample of origin is called demultiplexing. If genetically diverse samples are pooled, single nucleotide polymorphisms in coding regions can be used for demultiplexing. When working with genetically similar or identical samples, an additional experimental step is required to label the cells with a sample-specific barcode oligonucleotide before pooling. Several techniques have been developed to label cells or nuclei with oligonucleotides based on antibodies (Stoeckius et al. 2018; Gaublomme et al. 2019) or lipids (McGinnis et al. 2019). These oligonucleotides are termed hashtag oligonucleotides (HTOs) and are sequenced together with the RNA molecules of the cells resulting in an (HTOs x droplets) count matrix in addition to the (genes x droplets) matrix with RNA read counts.

The demuxmix package implements a method to demultiplex droplets based on HTO counts using negative binomial regression mixture models. demuxmix can be applied to the HTO counts only, but better results are often achieved if the total number of genes detected per droplet (not the complete transcription profile) is passed to the method along with the HTO counts to leverage the positive association between genes detected and HTO counts. Further, demuxmix provides estimated error rates based on its probabilistic mixture model framework, plots for data quality assessment, and multiplet identification, as outlined in the example workflows in this vignette. Technical details of the methods are described in the man pages.

2 Installation

The demuxmix package is available at Bioconductor and can be installed via BiocManager::install:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("demuxmix")

The package only needs to be installed once. Load the package into an R session with

library(demuxmix)

3 Quick start

A matrix of raw HTO counts (HTO x cells) and a vector with the number of detected genes per droplet are needed to run demuxmix with default settings. Empty and low-quality droplets should be removed before running demuxmix. A gene with at least one read is usually considered as detected. Here, we simulate a small example dataset.

library(demuxmix)

set.seed(2642)
class <- rbind(
    c(rep(TRUE,  220), rep(FALSE, 200)),
    c(rep(FALSE, 200), rep(TRUE,  220))
)
simdata <- dmmSimulateHto(class)
hto <- simdata$hto
dim(hto)
#> [1]   2 420
rna <- simdata$rna
length(rna) == ncol(hto)
#> [1] TRUE

The dataset consists of 420 droplets with cells labeled with two different HTOs. The first 200 droplets are singlets labeled with the first HTO, followed by another 200 singlets labeled with the second HTO. The remaining 20 droplets are doublets, which are positive for both HTOs. Next, we run demuxmix to assign droplets to HTOs.

dmm <- demuxmix(hto, rna = rna)
summary(dmm)
#>       Class NumObs     RelFreq   MedProb      ExpFPs          FDR
#> 1     HTO_1    198 0.475961538 0.9999994 0.419487123 0.0021186218
#> 2     HTO_2    197 0.473557692 0.9999985 0.287367473 0.0014587181
#> 3   singlet    395 0.949519231 0.9999992 0.706854596 0.0017895053
#> 4 multiplet     20 0.048076923 0.9999995 0.005837305 0.0002918652
#> 5  negative      1 0.002403846 0.9922442 0.007755814 0.0077558137
#> 6 uncertain      4          NA        NA          NA           NA
classes <- dmmClassify(dmm)
table(classes$HTO)
#> 
#>       HTO_1 HTO_1,HTO_2       HTO_2    negative   uncertain 
#>         198          20         197           1           4

The object dmm contains the mixture models used to classify the droplets. The data frame returned by summary shows that 198 droplets were assigned to HTO_1 and 197 to HTO_2, respectively. Since these results meet our expectations and the estimated error rates are reasonably low, we ran dmmClassify to obtain the classifications for each droplet as a data frame with one row per droplet. The first column HTO of this data frame contains the final classification results.

A histogram of the HTO values overlayed with the components from the mixture model can be plotted for quality control. The following command plots a panel with one histogram per HTO in the dataset.

plotDmmHistogram(dmm)
Density histograms overlayed with mixture probability mass function. The density histograms show the distribution of the HTO counts for the first HTO (upper figure) and the 2nd HTO (lower figure). The negative component of the mixture model representing non-tagged cells is shown in blue, and the positive component is in red.

Figure 1: Density histograms overlayed with mixture probability mass function
The density histograms show the distribution of the HTO counts for the first HTO (upper figure) and the 2nd HTO (lower figure). The negative component of the mixture model representing non-tagged cells is shown in blue, and the positive component is in red.

4 Demultiplexing droplets with demuxmix

4.1 Example datasets

Two example datasets are introduced in this vignette to illustrate a typical demuxmix workflow. The first dataset is a small simulated dataset used to generate the plots when building this vignette. The alternative second dataset is a real dataset and can be downloaded from the ExperimentHub via the scRNAseq package. Both datasets can be used to go through this vignette by running either the first (simulated data) or the second code block (real data) below. Since the real dataset is much larger, some commands may take up to one minute to complete, which is why this vignette was built with the simulated data.

4.1.1 Simulated dataset

Simulated HTO count data are generated for 650 droplets by the method dmmSimulateHto. The logical matrix class defines for each droplet (column) and HTO (row) whether the droplet is positive or negative for that HTO. Thus, the 3 x 650 matrix class below describes a dataset with 3 hashtags and 650 droplets, of which 50 are doublets (with cells tagged by HTO_1 and HTO_2). The remaining 600 droplets consist of 3 blocks of 200 singlets tagged by one of the three HTOs each.

library(demuxmix)
library(ggplot2)
library(cowplot)

set.seed(5636)
class <- rbind(
    c(rep( TRUE, 200), rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 50)),
    c(rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 200), rep( TRUE, 50)),
    c(rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 50))
)
simdata <- dmmSimulateHto(
    class = class,
    mu = c(600, 400, 200),
    theta = c(25, 15, 25),
    muAmbient = c(30, 30, 60),
    thetaAmbient = c(20, 10, 5),
    muRna = 3000,
    thetaRna = 30
)
hto <- simdata$hto
rna <- simdata$rna

htoDf <- data.frame(t(hto), HTO = colSums(hto), NumGenes = rna)
pa <- ggplot(htoDf, aes(x = HTO_1)) +
    geom_histogram(bins = 25)
pb <- ggplot(htoDf, aes(x = HTO_2)) +
    geom_histogram(bins = 25)
pc <- ggplot(htoDf, aes(x = HTO_3)) +
    geom_histogram(bins = 25)
pd <- ggplot(htoDf, aes(x = NumGenes, y = HTO)) +
    geom_point()
plot_grid(pa, pb, pc, pd, labels = c("A", "B", "C", "D"))