Here, we describe a few additional analyses that can be performed with single-cell RNA sequencing data. These include detecting significant correlations between genes and regressing out the effect of cell cycle from the gene expression matrix.
All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scran_1.31.3 scater_1.31.2
## [3] ggplot2_3.5.1 scuttle_1.13.1
## [5] SingleCellExperiment_1.25.1 SummarizedExperiment_1.33.3
## [7] Biobase_2.63.1 GenomicRanges_1.55.4
## [9] GenomeInfoDb_1.39.14 IRanges_2.37.1
## [11] S4Vectors_0.41.7 BiocGenerics_0.49.1
## [13] MatrixGenerics_1.15.1 matrixStats_1.3.0
## [15] readxl_1.4.3 R.utils_2.12.3
## [17] R.oo_1.26.0 R.methodsS3_1.8.2
## [19] BiocFileCache_2.11.2 dbplyr_2.5.0
## [21] knitr_1.46 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.2 gridExtra_2.3
## [3] rlang_1.1.3 magrittr_2.0.3
## [5] compiler_4.4.0 RSQLite_2.3.6
## [7] DelayedMatrixStats_1.25.4 vctrs_0.6.5
## [9] pkgconfig_2.0.3 crayon_1.5.2
## [11] fastmap_1.1.1 magick_2.8.3
## [13] XVector_0.43.1 labeling_0.4.3
## [15] utf8_1.2.4 rmarkdown_2.26
## [17] UCSC.utils_0.99.7 ggbeeswarm_0.7.2
## [19] tinytex_0.50 purrr_1.0.2
## [21] bit_4.0.5 xfun_0.43
## [23] bluster_1.13.0 zlibbioc_1.49.3
## [25] cachem_1.0.8 beachmat_2.19.4
## [27] jsonlite_1.8.8 blob_1.2.4
## [29] highr_0.10 DelayedArray_0.29.9
## [31] BiocParallel_1.37.1 cluster_2.1.6
## [33] irlba_2.3.5.1 parallel_4.4.0
## [35] R6_2.5.1 bslib_0.7.0
## [37] limma_3.59.10 jquerylib_0.1.4
## [39] cellranger_1.1.0 Rcpp_1.0.12
## [41] bookdown_0.39 igraph_2.0.3
## [43] Matrix_1.7-0 tidyselect_1.2.1
## [45] abind_1.4-5 yaml_2.3.8
## [47] viridis_0.6.5 codetools_0.2-20
## [49] curl_5.2.1 lattice_0.22-6
## [51] tibble_3.2.1 withr_3.0.0
## [53] evaluate_0.23 pillar_1.9.0
## [55] BiocManager_1.30.22 filelock_1.0.3
## [57] generics_0.1.3 sparseMatrixStats_1.15.1
## [59] munsell_0.5.1 scales_1.3.0
## [61] glue_1.7.0 metapod_1.11.1
## [63] tools_4.4.0 BiocNeighbors_1.21.2
## [65] ScaledMatrix_1.11.1 locfit_1.5-9.9
## [67] cowplot_1.1.3 grid_4.4.0
## [69] edgeR_4.1.33 colorspace_2.1-0
## [71] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
## [73] BiocSingular_1.19.0 vipor_0.4.7
## [75] cli_3.6.2 rsvd_1.0.5
## [77] fansi_1.0.6 S4Arrays_1.3.7
## [79] viridisLite_0.4.2 dplyr_1.1.4
## [81] gtable_0.3.5 sass_0.4.9
## [83] digest_0.6.35 dqrng_0.3.2
## [85] SparseArray_1.3.7 ggrepel_0.9.5
## [87] farver_2.1.1 memoise_2.0.1
## [89] htmltools_0.5.8.1 lifecycle_1.0.4
## [91] httr_1.4.7 statmod_1.5.0
## [93] bit64_4.0.5
Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3): 129–57.
Bourgon, R., R. Gentleman, and W. Huber. 2010. “Independent filtering increases detection power for high-throughput experiments.” Proc. Natl. Acad. Sci. U.S.A. 107 (21): 9546–51.
Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article 39.
Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3): 751–54.
Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6): 712–24.
3 Comments on filtering by abundance
Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference. In applications involving hypothesis testing, these genes typically do not provide enough evidence to reject the null hypothesis, yet they still increase the severity of the multiple testing correction. The discreteness of the counts may also interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, many RNA-seq analysis pipelines remove low-abundance genes before applying downstream methods.
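To make the point about the multiple testing correction concrete, the following minimal sketch compares Benjamini-Hochberg adjusted p-values with and without a set of uninformative tests. All p-values here are made up for illustration, and the 45 low-abundance genes are simply assumed to yield p-values of 1.

```r
# Hypothetical p-values for five genes with real evidence against the null.
p.informative <- c(0.001, 0.002, 0.004, 0.008, 0.01)

# Hypothetical low-abundance genes, assumed to provide no evidence (p = 1).
p.low <- rep(1, 45)

# Adjustment using only the informative genes.
adj.filtered <- p.adjust(p.informative, method = "BH")

# Adjustment with the uninformative genes included in the same analysis;
# only the entries for the informative genes are kept for comparison.
adj.all <- p.adjust(c(p.informative, p.low), method = "BH")
adj.all <- adj.all[seq_along(p.informative)]

# The same genes receive larger adjusted p-values when the extra tests are kept.
data.frame(p = p.informative, adj.filtered, adj.all)
```

The informative genes have larger adjusted p-values when the uninformative tests are retained, which is the loss of power that filtering aims to avoid.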
The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness and to avoid zeroes, e.g., for normalization purposes. By comparison, the filter statistic for hypothesis testing is mainly required to be independent of the test statistic under the null hypothesis (Bourgon, Gentleman, and Huber 2010). Given these differences in priorities, we (or the relevant function) will filter at each step as appropriate, rather than applying a single filter for the entire analysis. For example, computeSumFactors() will apply a somewhat stringent filter based on the average count, while fitTrendVar() will apply a relatively relaxed filter based on the average log-expression. Other applications will not do any abundance-based filtering at all (e.g., denoisePCA()) to preserve biological signal from lowly expressed genes.

Nonetheless, if global filtering is desired, it is straightforward to achieve by subsetting the SingleCellExperiment object. The example below demonstrates how we could remove genes with average counts less than 1 in the HSC dataset. The number of TRUE values in demo.keep corresponds to the number of retained rows/genes after filtering.
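As a self-contained illustration of this kind of subsetting, the sketch below uses a small made-up count matrix in place of the HSC data and a plain row mean as the average count. Both are assumptions made for the sake of a runnable example. The same logical vector could be used to subset the rows of a SingleCellExperiment object.

```r
# Toy count matrix standing in for the HSC data (hypothetical values):
# 6 genes (rows) by 4 cells (columns).
counts <- matrix(
    c( 0,  0, 1,  0,
       5,  3, 8,  2,
       0,  0, 0,  0,
       2,  1, 0,  3,
      10, 12, 9, 15,
       1,  0, 0,  1),
    nrow = 6, byrow = TRUE,
    dimnames = list(paste0("gene", 1:6), paste0("cell", 1:4))
)

# Average count per gene. A simple row mean is an assumption here;
# a library size-adjusted average may be preferable for real data.
ave.counts <- rowMeans(counts)

# Logical vector marking genes with an average count of at least 1.
demo.keep <- ave.counts >= 1

# Subset the rows to retain only those genes.
filtered <- counts[demo.keep, , drop = FALSE]

# The number of TRUE values equals the number of retained genes.
summary(demo.keep)
```

In this toy example, the genes with average counts below 1 are dropped, and the number of TRUE values in demo.keep equals nrow(filtered).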