1 Introduction

1.1 Overview

This workflow template is for analyzing single cell RNA-seq (scRNA-seq) data. It is provided by systemPipeRdata, a companion package to systemPipeR (H Backman and Girke 2016). The content of this workflow steps are provided by this Seurat Tutorial. Similar to other systemPipeR workflow templates, a single command generates the necessary working environment. This includes the expected directory structure for executing systemPipeR workflows and parameter files for running command-line (CL) software utilized in specific analysis steps. For learning and testing purposes, a small sample (toy) data set is also included (mainly FASTQ and reference genome files). This enables users to seamlessly run the numerous analysis steps of this workflow from start to finish without the requirement of providing custom data. After testing the workflow, users have the flexibility to employ the template as is with their own data or modify it to suit their specific needs. For more comprehensive information on designing and executing workflows, users want to refer to the main vignettes of systemPipeR and systemPipeRdata.

The Rmd file (SPscrna.Rmd) associated with this vignette serves a dual purpose. It acts both as a template for executing the workflow and as a template for generating a reproducible scientific analysis report. Thus, users want to customize the text (and/or code) of this vignette to describe their experimental design and analysis results. This typically involves deleting the instructions how to work with this workflow, and customizing the text describing experimental designs, other metadata and analysis results.

1.2 Experimental design

Typically, the user wants to describe here the sources and versions of the reference genome sequence along with the corresponding annotations. The standard directory structure of systemPipeR (see here), expects the input data in a subdirectory named data and all results will be written to a separate results directory. The Rmd source file for executing the workflow and rendering its report (here SPscrna.Rmd) is expected to be located in the parent directory.

In this template, a toy single cell dataset is preprocessed/filtered by 10X. Samples taken from peripheral blood mononuclear cells (PBMCs), about 3000 cells.

This dataset will be downloaded on-the-fly as one of the workflow steps in this template. Users can also manually download the dataset and unzip into the data directory.

To use their own scRNA-Seq and reference genome data, users want to move or link the data to the designated data directory and execute the workflow from the parent directory using their customized Rmd file. Beginning with this template, users should delete the provided test data and move or link their custom data to the designated locations. Alternatively, users can create an environment skeleton (named new here) or build one from scratch.

1.3 Workflow steps

The default analysis steps included in this scRNA-Seq workflow template are listed below. Users can modify the existing steps, add new ones or remove steps as needed.

Default analysis steps

  1. Read in single cell count data.
  2. Basic stats on input data.
  3. Create some basic QC on cell counting.
  4. Normalization.
  5. Find high variable genes.
  6. Scaling.
  7. Dim reduction, PCA.
  8. Clustering with tSNE, uMAP.
  9. Find clustering markers (marker gene).
  10. Find cell types.
  11. Visualize cell types and clustering together.

1.4 Load workflow environment

The environment for this scRNA-Seq workflow is auto-generated below with the genWorkenvir function (selected under workflow="scrnaseq"). The name of the resulting workflow directory can be specified under the mydirname argument. The default NULL uses the name of the chosen workflow. An error is issued if a directory of the same name and path exists already. After this, the user’s R session needs to be directed into the resulting SPscrna directory (here with setwd).

library(systemPipeRdata)
genWorkenvir(workflow = "SPscrna", mydirname = "SPscrna")
setwd("SPscrna")

1.4.1 Input data: targets file

Typically for systemPipeR workflows, there is a targets file defines the input files (e.g. FASTQ or BAM) and sample information that will be called in command-line tools. However, this workflow does not require a targets file.

For users who are interested in learning more about targets file, here is a detailed description of the structure and utility of targets files.

2 Quick start

After a workflow environment has been created with the above genWorkenvir function call and the corresponding R session directed into the resulting directory (here SPscrna), the SPRproject function is used to initialize a new workflow project instance. The latter creates an empty SAL workflow container (below sal) and at the same time a linked project log directory (default name .SPRproject) that acts as a flat-file database of a workflow. Additional details about this process and the SAL workflow control class are provided in systemPipeR's main vignette here and here.

Next, the importWF function imports all the workflow steps outlined in the source Rmd file of this vignette (here SPscrna.Rmd) into the SAL workflow container. An overview of the workflow steps and their status information can be returned at any stage of the loading or run process by typing sal.

library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "SPscrna.Rmd", verbose = FALSE)
sal

After loading the workflow into sal, it can be executed from start to finish (or partially) with the runWF command. Running the workflow will only be possible if all dependent CL software is installed on a user’s system. Their names and availability on a system can be listed with listCmdTools(sal, check_path=TRUE). For more information about the runWF command, refer to the help file and the corresponding section in the main vignette here.

Running workflows in parallel mode on computer clusters is a straightforward process in systemPipeR. Users can simply append the resource parameters (such as the number of CPUs) for a cluster run to the sal object after importing the workflow steps with importWF using the addResources function. More information about parallelization can be found in the corresponding section at the end of this vignette here and in the main vignette here.

sal <- runWF(sal)

Workflows can be visualized as topology graphs using the plotWF function.

plotWF(sal)
Toplogy graph of scRNA-Seq workflow.

Figure 1: Toplogy graph of scRNA-Seq workflow

Scientific and technical reports can be generated with the renderReport and renderLogs functions, respectively. Scientific reports can also be generated with the render function of the rmarkdown package. The technical reports are based on log information that systemPipeR collects during workflow runs.

# Scientific report
sal <- renderReport(sal)
rmarkdown::render("SPscrna.Rmd", clean = TRUE, output_format = "BiocStyle::html_document")

# Technical (log) report
sal <- renderLogs(sal)

The statusWF function returns a status summary for each step in a SAL workflow instance.

statusWF(sal)

3 Workflow steps

The data analysis steps of this workflow are defined by the following workflow code chunks. They can be loaded into SAL interactively, by executing the code of each step in the R console, or all at once with the importWF function used under the Quick start section. R and CL workflow steps are declared in the code chunks of Rmd files with the LineWise and SYSargsList functions, respectively, and then added to the SAL workflow container with appendStep<-. Their syntax and usage is described here.

3.1 Load packages

The first step loads the required packages.

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'Seurat", "ggplot2", "ggpubr", "patchwork", "dplyr", "tibble",
    "readr'\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
    library(Seurat)
    library(dplyr)
    library(ggplot2)
    library(ggpubr)
    library(patchwork)
}, step_name = "load_packages")

3.2 Load data

In this example, the single cell data is preprocessed/filtered 10x data from a healthy donor. Samples taken from peripheral blood mononuclear cells (PBMCs), about 3000 cells.

Dataset can be downloaded with this link: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

If the link is not working, visit 10x website for updated links.

For your real data, please preprocess and put the dataset inside data directory

appendStep(sal) <- LineWise(code = {
    # unzip the data
    untar("data/pbmc3k_filtered_gene_bc_matrices.tar.gz", exdir = "data")
    # load data
    pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
    # Use dim to see the size of dataset, example data has
    # 2700 cells x 32738 genes
    dim(pbmc.data)
}, step_name = "load_data", dependency = "load_packages")

3.3 Simple count visulization

We can plot to see how many cells have good expressions.

appendStep(sal) <- LineWise(code = {
    at_least_one <- apply(pbmc.data, 2, function(x) sum(x > 0))
    count_p1 <- tibble::as_tibble(at_least_one) %>%
        ggplot() + geom_histogram(aes(x = value), binwidth = floor(nrow(pbmc.data)/400),
        fill = "#6b97c2", color = "white") + theme_pubr(16) +
        scale_y_continuous(expand = c(0, 0)) + scale_x_continuous(expand = c(0,
        0)) + labs(title = "Distribution of detected genes",
        x = "Genes with at least one tag")

    count_p2 <- tibble::as_tibble(MatrixGenerics::colSums(pbmc.data)) %>%
        ggplot() + geom_histogram(aes(x = value), bins = floor(ncol(pbmc.data)/50),
        fill = "#6b97c2", color = "white") + theme_pubr(16) +
        scale_y_continuous(expand = c(0, 0)) + scale_x_continuous(expand = c(0,
        0)) + labs(title = "Expression sum per cell", x = "Sum expression")

    png("results/count_plots.png", 1000, 700)
    count_p1 + count_p2 + patchwork::plot_annotation(tag_levels = "A")
    dev.off()
}, step_name = "count_plot", dependency = "load_data")