1 Introduction

Droplet-based single-cell RNA sequencing (scRNA-seq) facilitates measuring the transcriptomes of thousands of cells in a single run. Pooling cells from different samples or conditions before cell partitioning and library preparation can significantly lower costs and reduce batch effects. The task of assigning each cell of a pooled sample to its sample of origin is called demultiplexing. If genetically diverse samples are pooled, single nucleotide polymorphisms in coding regions can be used for demultiplexing. When working with genetically similar or identical samples, an additional experimental step is required to label the cells with a sample-specific barcode oligonucleotide before pooling. Several techniques have been developed to label cells or nuclei with oligonucleotides based on antibodies (Stoeckius et al. 2018; Gaublomme et al. 2019) or lipids (McGinnis et al. 2019). These oligonucleotides are termed hashtag oligonucleotides (HTOs) and are sequenced together with the RNA molecules of the cells resulting in an (HTOs x droplets) count matrix in addition to the (genes x droplets) matrix with RNA read counts.

The demuxmix package implements a method to demultiplex droplets based on HTO counts using negative binomial regression mixture models. demuxmix can be applied to the HTO counts only, but better results are often achieved if the total number of genes detected per droplet (not the complete transcription profile) is passed to the method along with the HTO counts to leverage the positive association between genes detected and HTO counts. Further, demuxmix provides estimated error rates based on its probabilistic mixture model framework, plots for data quality assessment, and multiplet identification, as outlined in the example workflows in this vignette. Technical details of the methods are described in the man pages.

2 Installation

The demuxmix package is available at Bioconductor and can be installed via BiocManager::install:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("demuxmix")

The package only needs to be installed once. Load the package into an R session with

library(demuxmix)

3 Quick start

A matrix of raw HTO counts (HTO x cells) and a vector with the number of detected genes per droplet are needed to run demuxmix with default settings. Empty and low-quality droplets should be removed before running demuxmix. A gene with at least one read is usually considered as detected. Here, we simulate a small example dataset.

library(demuxmix)

set.seed(2642)
class <- rbind(
    c(rep(TRUE,  220), rep(FALSE, 200)),
    c(rep(FALSE, 200), rep(TRUE,  220))
)
simdata <- dmmSimulateHto(class)
hto <- simdata$hto
dim(hto)
#> [1]   2 420
rna <- simdata$rna
length(rna) == ncol(hto)
#> [1] TRUE

The dataset consists of 420 droplets with cells labeled with two different HTOs. The first 200 droplets are singlets labeled with the first HTO, followed by another 200 singlets labeled with the second HTO. The remaining 20 droplets are doublets, which are positive for both HTOs. Next, we run demuxmix to assign droplets to HTOs.

dmm <- demuxmix(hto, rna = rna)
summary(dmm)
#>       Class NumObs     RelFreq   MedProb      ExpFPs          FDR
#> 1     HTO_1    198 0.475961538 0.9999994 0.419487123 0.0021186218
#> 2     HTO_2    197 0.473557692 0.9999985 0.287367473 0.0014587181
#> 3   singlet    395 0.949519231 0.9999992 0.706854596 0.0017895053
#> 4 multiplet     20 0.048076923 0.9999995 0.005837305 0.0002918652
#> 5  negative      1 0.002403846 0.9922442 0.007755814 0.0077558137
#> 6 uncertain      4          NA        NA          NA           NA
classes <- dmmClassify(dmm)
table(classes$HTO)
#> 
#>       HTO_1 HTO_1,HTO_2       HTO_2    negative   uncertain 
#>         198          20         197           1           4

The object dmm contains the mixture models used to classify the droplets. The data frame returned by summary shows that 198 droplets were assigned to HTO_1 and 197 to HTO_2, respectively. Since these results meet our expectations and the estimated error rates are reasonably low, we ran dmmClassify to obtain the classifications for each droplet as a data frame with one row per droplet. The first column HTO of this data frame contains the final classification results.

A histogram of the HTO values overlayed with the components from the mixture model can be plotted for quality control. The following command plots a panel with one histogram per HTO in the dataset.

plotDmmHistogram(dmm)