Contents

1 Introduction

scruff is a toolkit for processing single cell RNA-seq FASTQ reads generated by CEL-Seq and CEL-Seq2 protocols. It does demultiplexing, alignment, Unique Molecular Identifier (UMI) filtering, and transcript counting in an automated fashion and generates the gene count matrix, QC metrics and provides visualizations of data quality. This vignette provides a brief introduction to the scruff package by walking through the demultiplexing, alignment, and UMI-counting of a built-in publicly available example dataset (van den Brink, et al. 2017).

2 Quick Start

# Run scruff on example dataset
# NOTE: Requires Rsubread index and TxDb objects for the reference genome.
# For generation of these files, please refer to the Stepwise Tutorial.

library(scruff)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Get the paths to example FASTQ, FASTA, and GTF files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
    "vandenBrink_1h1_L001_R1_001.fastq.gz",
    package = "scruff")
v1h1R2 <- system.file("extdata",
    "vandenBrink_1h1_L001_R2_001.fastq.gz",
    package = "scruff")
vb1R1 <- system.file("extdata",
    "vandenBrink_b1_L001_R1_001.fastq.gz",
    package = "scruff")
vb1R2 <- system.file("extdata",
    "vandenBrink_b1_L001_R2_001.fastq.gz",
    package = "scruff")
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")

Build Rsubread alignment index. This is for the alignment step. For test purpose, here we are aligning the example FASTQ files to the genes on mitochondrial chromosome only.

# NOTE: Rsubread package does not support Windows environment.
if (!requireNamespace("Rsubread", quietly = TRUE)) {
    message("Package \"Rsubread\" needed for \"alignRsubread\"",
        " function to work. ",
        "Please install it if you are using Linux or macOS systems. ",
        "The function is not available in Windows environment.")
} else {
    # Create index files for GRCm38_MT.
    # For details, please refer to Rsubread user manual.
    # Specify the basename for Rsubread index
    indexBase <- "GRCm38_MT"
    Rsubread::buildindex(basename = indexBase,
        reference = fasta,
        indexSplit = FALSE)
}

Now that everything is ready, we can run scruff. In experiment 1h1, cell barcodes 95 and 96 are empty well controls. In experiment b1, cell barcode 95 is bulk sample containing 300 cells. These information can be set by the cellPerWell argument. We apply cell barcode and UMI correction by setting bcEdit to 1 and umiEdit to 1. scruff makes use of the SingleCellExperiment package. The following command returns a SingleCellExperiment object containing UMI filtered count matrix as well as gene and sample annotations and QC metrics.

data(barcodeExample, package = "scruff")

if (!requireNamespace("Rsubread", quietly = TRUE)) {
    message("Package \"Rsubread\" needed. ",
        "Please install it if you are using Linux or macOS systems. ",
        "The function is not available in Windows environment.")
} else {
    sce <- scruff(project = "example",
        experiment = c("1h1", "b1"),
        lane = c("L001", "L001"),
        read1Path = c(v1h1R1, vb1R1),
        read2Path = c(v1h1R2, vb1R2),
        bc = barcodeExample,
        index = indexBase,
        unique = FALSE,
        nBestLocations = 1,
        reference = gtf,
        bcStart = 1,
        bcStop = 8,
        bcEdit = 1,
        umiStart = 9,
        umiStop = 12,
        umiEdit = 1,
        keep = 75,
        cellPerWell = c(rep(1, 94), 0, 0, rep(1, 94), 300, 1),
        cores = 2,
        verbose = TRUE)
}

Visualize data quality.

data(sceExample, package = "scruff")
qc <- qcplots(sceExample)

3 Stepwise Tutorial For CEL-Seq Samples

3.1 Load Example Dataset

The scruff package contains 4 single cell RNA-seq FASTQ example files. Each file has 10,000 sequenced reads.

library(scruff)

# Get the paths to example FASTQ files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
    "vandenBrink_1h1_L001_R1_001.fastq.gz",
    package = "scruff")
v1h1R2 <- system.file("extdata",
    "vandenBrink_1h1_L001_R2_001.fastq.gz",
    package = "scruff")
vb1R1 <- system.file("extdata",
    "vandenBrink_b1_L001_R1_001.fastq.gz",
    package = "scruff")
vb1R2 <- system.file("extdata",
    "vandenBrink_b1_L001_R2_001.fastq.gz",
    package = "scruff")

3.2 Demultiplex and Assign Cell Specific Reads

Now the FASTQ files are ready to be demultiplexed. scruff package provides built-in predefined cell barcodes barcodeExample for demultiplexing the example dataset. In the example FASTQ files, read 1 contains cell barcode and UMI sequence information. Read 2 contains transcript sequences. The barcode sequence of each read starts at base 1 and ends at base 8. The UMI sequence starts at base 9 and ends at base 12. They can be set via bcStart, bcStop, and umiStart, umiStop arguments. Cell barcode correction can be set by bcEdit parameter. By default, reads with any nucleotide in the barcode and UMI sequences with sequencing quality lower than 10 (Phred score) will be excluded. The following command demultiplexes the example FASTQ reads and trims reads longer than 75 nucleotides. The command returns a SingleCellExperiment object whose colData contains the cell index, barcode, reads, percentage of reads assigned, sample, and FASTQ file path information for each cell. By default, the cell specific demultiplexed fastq.gz files are stored in ./Demultiplex folder.

data(barcodeExample, package = "scruff")
de <- demultiplex(project = "example",
    experiment = c("1h1", "b1"),
    lane = c("L001", "L001"),
    read1Path = c(v1h1R1, vb1R1),
    read2Path = c(v1h1R2, vb1R2),
    barcodeExample,
    bcStart = 1,
    bcStop = 8,
    bcEdit = 1,
    umiStart = 9,
    umiStop = 12,
    keep = 75,
    minQual = 10,
    yieldReads = 1e+06,
    verbose = TRUE,
    overwrite = TRUE,
    cores = 2)

3.3 Alignment

scruff provides an alignment function alignRsubread which is a wrapper function to align in Rsubread package. It aligns the reads to reference sequence index and outputs sequence alignment map files in “BAM” or “SAM” format. For demonstration purpose, the built-in mitochondrial DNA sequence from GRCm38 reference assembly GRCm38MitochondrialFasta will be used to map the reads. First, a Rsubread index for the reference sequence needs to be generated.

# Create index files for GRCm38_MT. For details, please refer to Rsubread
# user manual.
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
# NOTE: Rsubread package does not support Windows environment.
if (!requireNamespace("Rsubread", quietly = TRUE)) {
    message("Package \"Rsubread\" needed.",
        " Please install it if you are using Linux or macOS systems.",
        " The function is not available in Windows environment.\n")
} else {
    # Create index files for GRCm38_MT.
    # For details, please refer to Rsubread user manual.
    # Specify the basename for Rsubread index
    indexBase <- "GRCm38_MT"
    Rsubread::buildindex(basename = indexBase,
        reference = fasta,
        indexSplit = FALSE)
}

The following command maps the FASTQ files to GRCm38 mitochondrial reference sequence GRCm38_MT.fa and returns a SingleCellExperiment object. By default, the files are stored in BAM format in ./Alignment folder.

# Align the reads using Rsubread
if (requireNamespace("Rsubread", quietly = TRUE)) {
    al <- alignRsubread(de,
        indexBase,
        unique = FALSE,
        nBestLocations = 1,
        format = "BAM",
        overwrite = TRUE,
        verbose = TRUE,
        cores = 2)
}

3.4 UMI correction and Generation of Count Matrix

Example GTF file GRCm38_MT.gtf will be used for feature counting. Currently, scruff applies the union counting mode of the HTSeq Python package. The following command generates the UMI corrected count matrix for the example dataset by allowing correction of UMIs with Hamming distances smaller than 1 for each gene in each cell.

gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
# get the molecular counts of trancsripts for each cell
# In experiment 1h1, cell barcodes 95 and 96 are empty well controls.
# In experiment b1, cell barcode 95 is bulk sample containing 300 cells.
if (requireNamespace("Rsubread", quietly = TRUE)) {
    sce = countUMI(al,
        gtf,
        umiEdit = 1,
        format = "BAM",
        cellPerWell = c(rep(1, 94), 0, 0, rep(1, 94), 300, 1),
        verbose = TRUE,
        cores = 2)
}

3.5 Visualization of QC metrics

The data quality diagnostic information are contained in the colData of the returned SingleCellExperiment object sce. They can be visualized using the qcplots function.

data(sceExample, package = "scruff")
qc <- qcplots(sceExample)
qc