library(demuxSNP)
library(ComplexHeatmap)
library(viridisLite)
library(Seurat)
library(ggpubr)
library(dittoSeq)
library(utils)
library(EnsDb.Hsapiens.v86)
colors <- structure(viridis(n = 3), names = c("-1", "0", "1"))

1 Introduction

Multiplexing in scRNAseq involves the sequencing of samples from different patients, treatment types or physiological locations together, resulting in significant cost savings. The cells must then be demultiplexed, or assigned back to their respective groups. A number of experimental and computational methods have been proposed to facilitate this, but a universally robust algorithm remains elusive. Below, we introduce some existing methods, highlight the novel features of our approach and its advantages to the user.

1.1 Existing Methods

1.1.1 Cell Hashing

Cells from each group are labelled with a distinct tag (HTO or LMO) which is sequenced to give a counts matrix. Due to non-specific binding, these counts form a bimodal distribution. Such methods are generally computationally efficient. Their classification performance, however, is highly dependent on the tagging quality and many methods do not account for uncertainty in classification (Boggy et al. (2022), Kim et al. (2020) & Stoeckius et al. (2018)).

More recent methods, including demuxmix, assign a probability that a cell is from a particular group, or made up of multiple groups (doublet). This allows users to define a cut-off threshold for the assignment confidence. Accounting for uncertainty is an important feature for these types of algorithms. But, while they give the user greater flexibility in determining which cells to keep, this ultimately results in a trade-off between keeping cells which cannot be confidently called or discarding them - due to issues with tagging quality rather than RNA quality.

1.1.2 SNPs

The second class of methods exploits natural genetic variation between cells and so can only be used where the groups are genetically distinct. Demuxlet (Kang et al. (2018)) uses genotype information from each group to classify samples. This genotyping incurs additional experimental cost. To address this, Souporcell (Heaton et al. (2020)) and Vireo (Huang, McCarthy, and Stegle (2019)) among other methods were developed to classify cells based on their SNPs in an unsupervised manner. Without prior knowledge of the SNPs associated with each group, these unsupervised methods may confuse groups with lower cell counts for other signals in the data.

Demuxlet remains the standard often used to benchmark other methods but its more widespread adoption has been limited by the requirement of sample genotype information.

1.2 demuxSNP Motivation

With cell hashing, we can confidently demultiplex some but not all cells. Using these high confidence cells, we can learn the SNPs associated with each group. This SNP information can then be used to assign remaining cells (which we could not confidently call using cell hashing) to their most similar group based on their SNP profile.

Novel features:

  • Uses both cell hashing and SNP data. Current methods are limited to using one or the other.

  • Selects SNPs based on being located in a gene expressed in a large proportion of cells to reduce noise, computational cost and increase interpretability.

Impact:

  • Users can visually confirm validity (or lack thereof) of existing demultiplexing results in a tangible manner.
  • Users can recover otherwise high quality cells which could not be confidently assigned using other methods.
  • Cells from groups which are present in lower proportions may be classified better than with unsupervised SNP approaches.

Note: the approach used here differs from most SNP methods in that it is supervised. We attain knowledge of which SNPs are associated with which patients from the high confidence cells then use this to train a classifier. It is similar to demuxlet in the sense that the classifier uses group specific SNP information, however our method does not require the expense of genotyping and so may be much more widely applicable.

1.3 Installation


if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("demuxSNP")

## or for development version:

devtools::install_github("michaelplynch/demuxSNP")

1.4 Quick Usage

For full list of arguments and explanation of each function, please refer to relevant documentation.

1.5 Workflow

#Data loading
data(multiplexed_scrnaseq_sce, 
     commonvariants_1kgenomes_subset, 
     vartrix_consensus_snps, 
     package = "demuxSNP")

small_sce<-multiplexed_scrnaseq_sce[,1:100]
ensdb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86

#Preprocessing
top_genes<-common_genes(small_sce)
vcf_sub<-subset_vcf(commonvariants_1kgenomes_subset, top_genes, ensdb)
small_sce<-high_conf_calls(small_sce)

#Use subsetted vcf with VarTrix in default 'consensus' mode to generate SNPs 
#matrix

small_sce<-add_snps(small_sce,vartrix_consensus_snps[,1:100])

small_sce<-reassign(small_sce,k=5)

table(small_sce$knn)
#> 
#>  Doublet Hashtag1 Hashtag3 Hashtag4 Hashtag5 Hashtag6 
#>       22       10       16        7       13       32

1.5.1 Function Explanation

top_genes <- common_genes(sce): Returns the genes which are expressed (expression > 0) in the highest proportion of cells. These genes are used below to subset the .vcf file.

new_vcf <- subset_vcf(vcf = vcf, top_genes = top_genes, ensdb = ensdb): Subsets a supplied .vcf to SNP locations within the genes supplied. The ranges of the genes are extracted from the EnsDb object.

sce <- high_conf_calls(sce = sce, assay = "HTO"): Takes a SingleCellExperiment object with HTO altExp, runs demuxmix and returns a factor of assigned labels, a logical vector indicating high confidence calls and a logical vector indicating which cells to predict (all).

sce <- add_snps(sce = sce, mat = vartrix_consensus_snps, thresh = 0.8): Adds the SNP data from VarTrix (default consensus mode) to the SingleCellExperiment object as an altExp. Additionally, filters out SNPs with no reads in less than ‘thresh’ proportion of cells.

sce <- reassign(sce,
    k = 10,
    d = 10,
    train_cells = sce$train,
    predict_cells = sce$predict
)

Reassigns cells based on SNP profiles of high confidence cells. Singlet training data is based on high confidence singlet assignment. Doublets are simulated by systematically sampling and combining the SNP profiles of n cells pairs from each grouping combination. Cells to be used for training data are specified by “train_cells” (logical). Cells to be used for prediction are specified by “predict_cells” (logical), this may also include the training data.

2 Exploratory Analysis

We load three data objects: a SingleCellExperiment object containing RNA and HTO counts, a .vcf file of class CollapsedVCF containing SNPs from 1000 Genomes common variants and a matrix containing SNP information for each cell (we will show you how to generate this SNPs matrix using VarTrix outside of R). We have already removed low quality cells (library size<1,000 and percentage of genes mapping to mitochondrial genes>10%).


class(multiplexed_scrnaseq_sce)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"
class(commonvariants_1kgenomes_subset)
#> [1] "CollapsedVCF"
#> attr(,"package")
#> [1] "VariantAnnotation"
class(vartrix_consensus_snps)
#> [1] "matrix" "array"

The HTO or LMO distribution is usually bimodal, with a signal (high counts) and background distribution (low counts) caused by non-specific binding. Ideally, these distributions would be clearly separated with no overlap, but in practice, this is not always the case. In our example data, we see that the signal and noise overlap to varying extents in each group.

We will begin by running Seurat’s HTODemux, a popular HTO demultiplexing algorithm on the data.

logcounts(multiplexed_scrnaseq_sce) <- counts(multiplexed_scrnaseq_sce)
seurat <- as.Seurat(multiplexed_scrnaseq_sce)
seurat <- HTODemux(seurat)
seurat$hash.ID <- factor(as.character(seurat$hash.ID))
multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID

multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID

table(multiplexed_scrnaseq_sce$seurat)
#> 
#>  Doublet Hashtag1 Hashtag2 Hashtag3 Hashtag4 Hashtag5 Hashtag6 Negative 
#>      633      121       29      264      158      177      383      235

Although HTO library size of the Negative group is low, the RNA library size is similar to that of other groups, indicating that they may be misclassified as Negative due to their tagging quality rather than overall RNA quality.

dittoPlot(seurat, "nCount_HTO", group.by = "hash.ID")