library(demuxSNP)
library(ComplexHeatmap)
library(viridisLite)
library(Seurat)
library(ggpubr)
library(dittoSeq)
library(utils)
library(EnsDb.Hsapiens.v86)
colors <- structure(viridis(n = 3), names = c("-1", "0", "1"))
Multiplexing in scRNAseq involves the sequencing of samples from different patients, treatment types or physiological locations together, resulting in significant cost savings. The cells must then be demultiplexed, or assigned back to their respective groups. A number of experimental and computational methods have been proposed to facilitate this, but a universally robust algorithm remains elusive. Below, we introduce some existing methods, highlight the novel features of our approach and its advantages to the user.
Cells from each group are labelled with a distinct tag (HTO or LMO) which is sequenced to give a counts matrix. Due to non-specific binding, these counts form a bimodal distribution. Such methods are generally computationally efficient. Their classification performance, however, is highly dependent on the tagging quality and many methods do not account for uncertainty in classification (Boggy et al. (2022), Kim et al. (2020) & Stoeckius et al. (2018)).
More recent methods, including demuxmix, assign a probability that a cell is from a particular group, or made up of multiple groups (doublet). This allows users to define a cut-off threshold for the assignment confidence. Accounting for uncertainty is an important feature for these types of algorithms. But, while they give the user greater flexibility in determining which cells to keep, this ultimately results in a trade-off between keeping cells which cannot be confidently called or discarding them - due to issues with tagging quality rather than RNA quality.
The second class of methods exploits natural genetic variation between cells and so can only be used where the groups are genetically distinct. Demuxlet (Kang et al. (2018)) uses genotype information from each group to classify samples. This genotyping incurs additional experimental cost. To address this, Souporcell (Heaton et al. (2020)) and Vireo (Huang, McCarthy, and Stegle (2019)) among other methods were developed to classify cells based on their SNPs in an unsupervised manner. Without prior knowledge of the SNPs associated with each group, these unsupervised methods may confuse groups with lower cell counts for other signals in the data.
Demuxlet remains the standard often used to benchmark other methods but its more widespread adoption has been limited by the requirement of sample genotype information.
With cell hashing, we can confidently demultiplex some but not all cells. Using these high confidence cells, we can learn the SNPs associated with each group. This SNP information can then be used to assign remaining cells (which we could not confidently call using cell hashing) to their most similar group based on their SNP profile.
Novel features:
Uses both cell hashing and SNP data. Current methods are limited to using one or the other.
Selects SNPs based on being located in a gene expressed in a large proportion of cells to reduce noise, computational cost and increase interpretability.
Impact:
Note: the approach used here differs from most SNP methods in that it is supervised. We attain knowledge of which SNPs are associated with which patients from the high confidence cells then use this to train a classifier. It is similar to demuxlet in the sense that the classifier uses group specific SNP information, however our method does not require the expense of genotyping and so may be much more widely applicable.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("demuxSNP")
## or for development version:
devtools::install_github("michaelplynch/demuxSNP")
For full list of arguments and explanation of each function, please refer to relevant documentation.
#Data loading
data(multiplexed_scrnaseq_sce,
commonvariants_1kgenomes_subset,
vartrix_consensus_snps,
package = "demuxSNP")
small_sce<-multiplexed_scrnaseq_sce[,1:100]
ensdb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86
#Preprocessing
top_genes<-common_genes(small_sce)
vcf_sub<-subset_vcf(commonvariants_1kgenomes_subset, top_genes, ensdb)
small_sce<-high_conf_calls(small_sce)
#Use subsetted vcf with VarTrix in default 'consensus' mode to generate SNPs
#matrix
small_sce<-add_snps(small_sce,vartrix_consensus_snps[,1:100])
small_sce<-reassign(small_sce,k=5)
table(small_sce$knn)
#>
#> Doublet Hashtag1 Hashtag3 Hashtag4 Hashtag5 Hashtag6
#> 22 10 16 7 13 32
top_genes <- common_genes(sce)
: Returns the genes which are expressed
(expression > 0) in the highest proportion of cells. These genes are used below
to subset the .vcf file.
new_vcf <- subset_vcf(vcf = vcf, top_genes = top_genes, ensdb = ensdb)
:
Subsets a supplied .vcf to SNP locations within the genes supplied. The ranges
of the genes are extracted from the EnsDb object.
sce <- high_conf_calls(sce = sce, assay = "HTO")
:
Takes a SingleCellExperiment object with HTO altExp, runs
demuxmix
and returns a factor of assigned labels, a logical vector indicating high
confidence calls and a logical vector indicating which cells to predict (all).
sce <- add_snps(sce = sce, mat = vartrix_consensus_snps, thresh = 0.8)
:
Adds the SNP data from VarTrix (default consensus mode) to the
SingleCellExperiment object as an altExp. Additionally, filters out SNPs with no
reads in less than ‘thresh’ proportion of cells.
sce <- reassign(sce,
k = 10,
d = 10,
train_cells = sce$train,
predict_cells = sce$predict
)
Reassigns cells based on SNP profiles of high confidence cells. Singlet training data is based on high confidence singlet assignment. Doublets are simulated by systematically sampling and combining the SNP profiles of n cells pairs from each grouping combination. Cells to be used for training data are specified by “train_cells” (logical). Cells to be used for prediction are specified by “predict_cells” (logical), this may also include the training data.
We load three data objects: a SingleCellExperiment object containing RNA and HTO counts, a .vcf file of class CollapsedVCF containing SNPs from 1000 Genomes common variants and a matrix containing SNP information for each cell (we will show you how to generate this SNPs matrix using VarTrix outside of R). We have already removed low quality cells (library size<1,000 and percentage of genes mapping to mitochondrial genes>10%).
class(multiplexed_scrnaseq_sce)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"
class(commonvariants_1kgenomes_subset)
#> [1] "CollapsedVCF"
#> attr(,"package")
#> [1] "VariantAnnotation"
class(vartrix_consensus_snps)
#> [1] "matrix" "array"
The HTO or LMO distribution is usually bimodal, with a signal (high counts) and background distribution (low counts) caused by non-specific binding. Ideally, these distributions would be clearly separated with no overlap, but in practice, this is not always the case. In our example data, we see that the signal and noise overlap to varying extents in each group.
We will begin by running Seurat’s HTODemux, a popular HTO demultiplexing algorithm on the data.
logcounts(multiplexed_scrnaseq_sce) <- counts(multiplexed_scrnaseq_sce)
seurat <- as.Seurat(multiplexed_scrnaseq_sce)
seurat <- HTODemux(seurat)
seurat$hash.ID <- factor(as.character(seurat$hash.ID))
multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID
multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID
table(multiplexed_scrnaseq_sce$seurat)
#>
#> Doublet Hashtag1 Hashtag2 Hashtag3 Hashtag4 Hashtag5 Hashtag6 Negative
#> 633 121 29 264 158 177 383 235
Although HTO library size of the Negative group is low, the RNA library size is similar to that of other groups, indicating that they may be misclassified as Negative due to their tagging quality rather than overall RNA quality.
dittoPlot(seurat, "nCount_HTO", group.by = "hash.ID")