Elaborating on cell-based quality control in single-cell RNA-seq data

Aaron T. L. Lun1, Davis J. McCarthy2,3 and John C. Marioni1,2,4

1Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom
2EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
3St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia
4Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

2019-05-03

1 Overview

Low-quality cells can often yield misleading results in downstream analyses, by:

forming their own distinct cluster(s), complicating interpretation of the results. This can be most obviously driven by increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage. However, very small libraries can also form their own clusters due to shifts in the mean upon log-transformation.
containing genes that appear to be strongly “upregulated” due to the presence of very small size factors. This is most problematic if some transcripts are present at constant levels in the ambient solution for all cells (i.e., wells or droplets). Small counts will then be greatly inflated upon normalization with these size factors.
containing genes that appear to be strongly “downregulated” due to the loss of RNA upon cell damage. This seems most pronounced with ribosomal protein genes, though other cytoplasmic transcripts are likely to be affected.
distorting the characterization of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells.

As such, we need to remove these cells at the start of the analysis. Recall that we were defining low-quality cells as those with outlier values for various quality control (QC) metrics, using the isOutlier() and calculateQCMetrics() functions from the scater package (McCarthy et al. 2017). Here, we will examine some of the reasoning behind the outlier-based QC in more detail.

2 Assumptions of outlier identification

An outlier-based definition for low-quality cells assumes that most cells are of high quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. Another assumption is that the QC metrics are independent on the biological state of each cell. This ensures that any outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.

The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). Nonetheless, filtering based on outliers may not be appropriate in extreme cases where one cell type is very different from the others.

Systematic differences in the QC metrics can be handled to some extent using the batch= argument in the isOutlier() function. For example, setting batch to the plate of origin will identify outliers within each level of batch, using plate-specific median and MAD estimates. This is obviously useful for accommodating known differences in experimental processing, e.g., sequencing at different depth or different amounts of added spike-in RNA. We can also include biological factors in batch, if those factors could result in systematically fewer expressed genes or lower RNA content. However, this is not applicable in experiments where the factors are not known in advance.

3 Checking for discarded cell types

3.1 In the 416B data set

We can diagnose loss of distinct cell types during QC by looking for differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B data set.

library(SingleCellExperiment)
sce.full.416b <- readRDS("416B_preQC.rds")

library(scater)
lost <- calcAverage(counts(sce.full.416b)[,!sce.full.416b$PassQC])
kept <- calcAverage(counts(sce.full.416b)[,sce.full.416b$PassQC])

If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 1, indicating that the QC step did not inadvertently filter out a cell type in the 416B dataset.

# Avoid loss of points where either average is zero.
capped.lost <- pmax(lost, min(lost[lost>0]))
capped.kept <- pmax(kept, min(kept[kept>0]))

plot(capped.lost, capped.kept, xlab="Average count (discarded)", 
    ylab="Average count (retained)", log="xy", pch=16)
is.spike <- isSpike(sce.full.416b)
points(capped.lost[is.spike], capped.kept[is.spike], col="red", pch=16)
is.mito <- rowData(sce.full.416b)$is_feature_control_Mt
points(capped.lost[is.mito], capped.kept[is.mito], col="dodgerblue", pch=16)

Average counts across all discarded and retained cells in the 416B dataset. Each point represents a gene, with spike-in and mitochondrial transcripts in red and blue respectively.

Figure 1: Average counts across all discarded and retained cells in the 416B dataset. Each point represents a gene, with spike-in and mitochondrial transcripts in red and blue respectively.

We examine this more closely by computing log-fold changes between the average counts of the two pools. The predFC function stabilizes the log-fold change estimates by adding a prior count to the average of each pool. We only examine the log-fold changes rather than formally testing for differential expression, as we are not interested in penalizing intra-pool heterogeneity.

library(edgeR)
coefs <- predFC(cbind(lost, kept), design=cbind(1, c(1, 0)))[,2]
info <- data.frame(logFC=coefs, Lost=lost, Kept=kept, 
    row.names=rownames(sce.full.416b))
head(info[order(info$logFC, decreasing=TRUE),], 20)

##                       logFC      Lost        Kept
## ENSMUSG00000104647 6.844237  7.515034 0.000000000
## Nmur1              6.500909  5.909533 0.000000000
## Retn               6.250501 10.333172 0.196931018
## Fut9               6.096356  4.696897 0.010132032
## ENSMUSG00000102352 6.077614  9.393793 0.206637368
## ENSMUSG00000102379 6.029758  4.244690 0.000000000
## 1700101I11Rik      5.828821  4.483094 0.039199404
## Gm4952             5.698380  6.580862 0.172999108
## ENSMUSG00000106680 5.670156  3.389236 0.005234611
## ENSMUSG00000107955 5.554616  5.268508 0.132532601
## Gramd1c            5.446975  4.435342 0.103783669
## Jph3               5.361082  4.696897 0.139188080
## ENSMUSG00000092418 5.324462  3.395752 0.056931488
## 1700029I15Rik      5.316226  8.199510 0.394588776
## Pih1h3b            5.307439  2.546814 0.000000000
## ENSMUSG00000097176 5.275459  2.541927 0.003772842
## Olfr456            5.093383  2.186536 0.000000000
## ENSMUSG00000103731 5.017303  3.315016 0.107148909
## Klhdc8b            4.933215 19.861036 1.635081878
## ENSMUSG00000082449 4.881422  1.878759 0.000000000

Again, no obvious cell type markers are present in the top set of genes upregulated in the discarded pool. The magnitude of the log-fold changes is less important, attributable to imprecision with few cells in the discarded pool. Large log-fold changes can also be driven by enrichment or depletion of mitochondrial, ribosomal protein or nuclear genes upon cell damage.

3.2 In the PBMC data set

For comparison, we consider the PBMC data set in which we previously identified a platelet population (see the previous workflow). Recall that we relied on the use of the emptyDrops() method from the DropletUtils package to retain the platelets. In contrast, if we had used a naive threshold on the total unique molecular identifier (UMI) count, we would have removed this population during the cell calling step.

sce.pbmc <- readRDS("pbmc_data.rds")
wrong.keep <- sce.pbmc$total_counts >= 1000

lost <- calcAverage(counts(sce.pbmc)[,!wrong.keep])
kept <- calcAverage(counts(sce.pbmc)[,wrong.keep])

The presence of a distinct population in the discarded pool manifests in Figure 2 as a shift to the bottom-right for a number of genes. This includes PF4, PPBP and SDPR that are strongly upregulated in the platelets.

# Avoid loss of points where either average is zero.
capped.lost <- pmax(lost, min(lost[lost>0]))
capped.kept <- pmax(kept, min(kept[kept>0]))

plot(capped.lost, capped.kept, xlab="Average count (discarded)", 
    ylab="Average count (retained)", log="xy", pch=16)
platelet <- c("PF4", "PPBP", "SDPR")
points(capped.lost[platelet], capped.kept[platelet], col="orange", pch=16)

Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange.

Figure 2: Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange.

These platelet-specific genes are also present among the top set of positive log-fold changes.

coefs <- predFC(cbind(lost, kept), design=cbind(1, c(1, 0)))[,2]
info <- data.frame(logFC=coefs, Lost=lost, Kept=kept, 
    row.names=rownames(sce.pbmc))
head(info[order(info$logFC, decreasing=TRUE),], 20)

##              logFC      Lost       Kept
## PF4       6.615671 4.2289086 0.17421441
## PPBP      6.495522 4.8409530 0.27144834
## HIST1H2AC 6.291680 3.1195253 0.14435051
## GNG11     6.162314 2.5132359 0.10103056
## SDPR      5.950336 2.1430640 0.09754273
## TUBB1     5.610679 1.6478079 0.08989034
## CLU       5.462238 1.3202347 0.05559122
## ACRBP     5.325560 1.1933566 0.05436568
## NRGN      5.118699 1.3155255 0.12992550
## RGS18     5.009974 1.6512487 0.25377562
## MAP3K7CL  4.898185 0.9950268 0.08957073
## SPARC     4.646432 0.6559761 0.02487467
## MMD       4.616159 0.7450948 0.06367575
## PGRMC1    4.503524 0.7335820 0.08248177
## CMTM5     4.166574 0.4469937 0.01643304
## TSC22D1   4.120268 0.5128047 0.05920195
## HRAT92    4.116163 0.4214188 0.01144488
## GP9       4.091146 0.4610813 0.03703803
## CTSA      3.986470 0.8294337 0.27097756
## MARCH2    3.966528 0.5655020 0.12228759

3.3 Avoiding loss of cell types

If cell types are being incorrectly discarded, the most direct solution is to relax the QC filters by increasing nmads= in the isOutlier() calls. We can also avoid filtering on metrics that are associated with genuine biological differences between cell types. The most extreme approach would be to not perform any QC filtering at all, thus guaranteeing that all cell types in the data are retained. However, this obviously comes with an increased risk of retaining more low-quality damaged cells. Such cells will cause problems in downstream analyses as discussed above, which motivates the use of a more strict filter (at least on the first pass) in our workflows.

As an aside, it is worth mentioning that the true technical quality of a cell may be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to correctly discard an entire cell type during QC if all of its members are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol.

4 Alternative approaches to quality control

4.1 Using fixed thresholds

One alternative strategy is to set pre-defined thresholds on each QC metric. For example, we might remove all cells with library sizes below 100000 and numbers of expressed genes below 4000. This avoids any assumptions associated with the use of outliers to identify low-quality cells. However, it generally requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. For example, thresholds for read count-based data are simply not applicable for UMI-based data, and vice versa. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of RNA capture and sequencing.

4.2 Using PCA-based outliers

Another strategy is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. This is demonstrated below on a brain cell dataset from Tasic et al. (2016), using functions from scater.

# Obtaining the dataset.
library(scRNAseq)
data(allen)

# Setting up the data.
sce.allen <- as(allen, "SingleCellExperiment")
assayNames(sce.allen) <- "counts"
isSpike(sce.allen, "ERCC") <- grep("ERCC", rownames(sce.allen))

# Computing the QC metrics and running PCA.
library(scater)
sce.allen <- calculateQCMetrics(sce.allen)
sce.allen <- runPCA(sce.allen, use_coldata=TRUE, detect_outliers=TRUE)
table(sce.allen$outlier)

## 
## FALSE  TRUE 
##   374     5

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.

4.3 Using the gene expression profiles

For completeness, we note that outliers can also be identified from the gene expression profiles, rather than QC metrics. We consider this to be a risky strategy as it can remove high-quality cells in rare populations. Even if subpopulations are explicitly captured with a mixture model, removal of outlier cells will simply reinforce the existing model. This may be misleading if it understates the biological heterogeneity in each population.

5 Concluding remarks

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scRNAseq_1.9.0                        
##  [2] org.Mm.eg.db_3.8.2                    
##  [3] readxl_1.3.1                          
##  [4] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
##  [5] GenomicFeatures_1.36.0                
##  [6] Matrix_1.2-17                         
##  [7] edgeR_3.26.0                          
##  [8] limma_3.40.0                          
##  [9] BiocNeighbors_1.2.0                   
## [10] TENxBrainData_1.3.0                   
## [11] HDF5Array_1.12.0                      
## [12] rhdf5_2.28.0                          
## [13] Rtsne_0.15                            
## [14] BiocSingular_1.0.0                    
## [15] scran_1.12.0                          
## [16] scater_1.12.0                         
## [17] ggplot2_3.1.1                         
## [18] SingleCellExperiment_1.6.0            
## [19] SummarizedExperiment_1.14.0           
## [20] DelayedArray_0.10.0                   
## [21] BiocParallel_1.18.0                   
## [22] matrixStats_0.54.0                    
## [23] GenomicRanges_1.36.0                  
## [24] GenomeInfoDb_1.20.0                   
## [25] org.Hs.eg.db_3.8.2                    
## [26] AnnotationDbi_1.46.0                  
## [27] IRanges_2.18.0                        
## [28] S4Vectors_0.22.0                      
## [29] Biobase_2.44.0                        
## [30] BiocGenerics_0.30.0                   
## [31] BiocFileCache_1.8.0                   
## [32] dbplyr_1.4.0                          
## [33] knitr_1.22                            
## [34] BiocStyle_2.12.0                      
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_0.2.5              RSQLite_2.1.1                
##   [3] trimcluster_0.1-2.1           grid_3.6.0                   
##   [5] ranger_0.11.2                 munsell_0.5.0                
##   [7] destiny_2.14.0                codetools_0.2-16             
##   [9] statmod_1.4.30                sROC_0.1-2                   
##  [11] withr_2.1.2                   batchelor_1.0.0              
##  [13] colorspace_1.4-1              highr_0.8                    
##  [15] robustbase_0.93-4             vcd_1.4-4                    
##  [17] VIM_4.8.0                     TTR_0.23-4                   
##  [19] labeling_0.3                  cvTools_0.3.2                
##  [21] GenomeInfoDbData_1.2.1        bit64_0.9-7                  
##  [23] pheatmap_1.0.12               xfun_0.6                     
##  [25] ggthemes_4.1.1                diptest_0.75-7               
##  [27] R6_2.4.0                      ggbeeswarm_0.6.0             
##  [29] robCompositions_2.1.0         rsvd_1.0.0                   
##  [31] RcppEigen_0.3.3.5.0           locfit_1.5-9.1               
##  [33] flexmix_2.3-15                mvoutlier_2.0.9              
##  [35] reshape_0.8.8                 bitops_1.0-6                 
##  [37] assertthat_0.2.1              promises_1.0.1               
##  [39] scales_1.0.0                  nnet_7.3-12                  
##  [41] beeswarm_0.2.3                gtable_0.3.0                 
##  [43] beachmat_2.0.0                processx_3.3.0               
##  [45] rlang_0.3.4                   scatterplot3d_0.3-41         
##  [47] splines_3.6.0                 rtracklayer_1.44.0           
##  [49] lazyeval_0.2.2                BiocManager_1.30.4           
##  [51] yaml_2.2.0                    abind_1.4-5                  
##  [53] httpuv_1.5.1                  tools_3.6.0                  
##  [55] bookdown_0.9                  zCompositions_1.2.0          
##  [57] RColorBrewer_1.1-2            proxy_0.4-23                 
##  [59] dynamicTreeCut_1.63-1         Rcpp_1.0.1                   
##  [61] plyr_1.8.4                    progress_1.2.0               
##  [63] zlibbioc_1.30.0               purrr_0.3.2                  
##  [65] RCurl_1.95-4.12               ps_1.3.0                     
##  [67] prettyunits_1.0.2             viridis_0.5.1                
##  [69] cowplot_0.9.4                 zoo_1.8-5                    
##  [71] haven_2.1.0                   cluster_2.0.9                
##  [73] magrittr_1.5                  data.table_1.12.2            
##  [75] openxlsx_4.1.0                lmtest_0.9-37                
##  [77] truncnorm_1.0-8               mvtnorm_1.0-10               
##  [79] hms_0.4.2                     mime_0.6                     
##  [81] evaluate_0.13                 xtable_1.8-4                 
##  [83] smoother_1.1                  XML_3.98-1.19                
##  [85] rio_0.5.16                    mclust_5.4.3                 
##  [87] gridExtra_2.3                 compiler_3.6.0               
##  [89] biomaRt_2.40.0                tibble_2.1.1                 
##  [91] crayon_1.3.4                  htmltools_0.3.6              
##  [93] pcaPP_1.9-73                  later_0.8.0                  
##  [95] tidyr_0.8.3                   rrcov_1.4-7                  
##  [97] DBI_1.0.0                     ExperimentHub_1.10.0         
##  [99] fpc_2.1-11.2                  MASS_7.3-51.4                
## [101] rappdirs_0.3.1                boot_1.3-22                  
## [103] car_3.0-2                     sgeostat_1.0-27              
## [105] igraph_1.2.4.1                forcats_0.4.0                
## [107] pkgconfig_2.0.2               GenomicAlignments_1.20.0     
## [109] foreign_0.8-71                laeken_0.5.0                 
## [111] sp_1.3-1                      vipor_0.4.5                  
## [113] dqrng_0.2.0                   XVector_0.24.0               
## [115] NADA_1.6-1                    stringr_1.4.0                
## [117] callr_3.2.0                   digest_0.6.18                
## [119] pls_2.7-1                     Biostrings_2.52.0            
## [121] rmarkdown_1.12                cellranger_1.1.0             
## [123] DelayedMatrixStats_1.6.0      kernlab_0.9-27               
## [125] curl_3.3                      modeltools_0.2-22            
## [127] shiny_1.3.2                   Rsamtools_2.0.0              
## [129] Rhdf5lib_1.6.0                carData_3.0-2                
## [131] viridisLite_0.3.0             pillar_1.3.1                 
## [133] GGally_1.4.0                  lattice_0.20-38              
## [135] survival_2.44-1.1             httr_1.4.0                   
## [137] DEoptimR_1.0-8                interactiveDisplayBase_1.22.0
## [139] glue_1.3.1                    xts_0.11-2                   
## [141] zip_2.0.1                     prabclus_2.2-7               
## [143] simpleSingleCell_1.8.0        bit_1.1-14                   
## [145] class_7.3-15                  stringi_1.4.3                
## [147] blob_1.1.1                    AnnotationHub_2.16.0         
## [149] memoise_1.1.0                 dplyr_0.8.0.1                
## [151] irlba_2.3.3                   e1071_1.7-1

References

Ilicic, T., J. K. Kim, A. A. Kołodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1):29.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Tasic, B., V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, et al. 2016. “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics.” Nat. Neurosci. 19 (2):335–46.