scruff 1.4.2
scruff is a toolkit for processing single cell RNA-seq FASTQ reads generated by CEL-Seq and CEL-Seq2 protocols. It does demultiplexing, alignment, Unique Molecular Identifier (UMI) filtering, and transcript counting in an automated fashion and generates the gene count matrix, QC metrics and provides visualizations of data quality. This vignette provides a brief introduction to the scruff package by walking through the demultiplexing, alignment, and UMI-counting of a built-in publicly available example dataset (van den Brink, et al. 2017).
# Run scruff on example dataset
# NOTE: Requires Rsubread index and TxDb objects for the reference genome.
# For generation of these files, please refer to the Stepwise Tutorial.
library(scruff)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Get the paths to example FASTQ, FASTA, and GTF files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
"vandenBrink_1h1_L001_R1_001.fastq.gz",
package = "scruff")
v1h1R2 <- system.file("extdata",
"vandenBrink_1h1_L001_R2_001.fastq.gz",
package = "scruff")
vb1R1 <- system.file("extdata",
"vandenBrink_b1_L001_R1_001.fastq.gz",
package = "scruff")
vb1R2 <- system.file("extdata",
"vandenBrink_b1_L001_R2_001.fastq.gz",
package = "scruff")
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
Build Rsubread alignment index. This is for the alignment step. For test purpose, here we are aligning the example FASTQ files to the genes on mitochondrial chromosome only.
# NOTE: Rsubread package does not support Windows environment.
if (!requireNamespace("Rsubread", quietly = TRUE)) {
message("Package \"Rsubread\" needed for \"alignRsubread\"",
" function to work. ",
"Please install it if you are using Linux or macOS systems. ",
"The function is not available in Windows environment.")
} else {
# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
reference = fasta,
indexSplit = FALSE)
}
Now that everything is ready, we can run scruff. In experiment 1h1, cell barcodes 95 and 96 are empty well controls. In experiment b1, cell barcode 95 is bulk sample containing 300 cells. These information can be set by the cellPerWell
argument. We apply cell barcode and UMI correction by setting bcEdit
to 1 and umiEdit
to 1. scruff makes use of the SingleCellExperiment package. The following command returns a SingleCellExperiment
object containing UMI filtered count matrix as well as gene and sample annotations and QC metrics.
data(barcodeExample, package = "scruff")
if (!requireNamespace("Rsubread", quietly = TRUE)) {
message("Package \"Rsubread\" needed. ",
"Please install it if you are using Linux or macOS systems. ",
"The function is not available in Windows environment.")
} else {
sce <- scruff(project = "example",
experiment = c("1h1", "b1"),
lane = c("L001", "L001"),
read1Path = c(v1h1R1, vb1R1),
read2Path = c(v1h1R2, vb1R2),
bc = barcodeExample,
index = indexBase,
unique = FALSE,
nBestLocations = 1,
reference = gtf,
bcStart = 1,
bcStop = 8,
bcEdit = 1,
umiStart = 9,
umiStop = 12,
umiEdit = 1,
keep = 75,
cellPerWell = c(rep(1, 94), 0, 0, rep(1, 94), 300, 1),
cores = 2,
verbose = TRUE)
}
Visualize data quality.
data(sceExample, package = "scruff")
qc <- qcplots(sceExample)
The scruff package contains 4 single cell RNA-seq FASTQ example files. Each file has 10,000 sequenced reads.
library(scruff)
# Get the paths to example FASTQ files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
"vandenBrink_1h1_L001_R1_001.fastq.gz",
package = "scruff")
v1h1R2 <- system.file("extdata",
"vandenBrink_1h1_L001_R2_001.fastq.gz",
package = "scruff")
vb1R1 <- system.file("extdata",
"vandenBrink_b1_L001_R1_001.fastq.gz",
package = "scruff")
vb1R2 <- system.file("extdata",
"vandenBrink_b1_L001_R2_001.fastq.gz",
package = "scruff")
Now the FASTQ files are ready to be demultiplexed. scruff package provides built-in predefined cell barcodes barcodeExample
for demultiplexing the example dataset. In the example FASTQ files, read 1 contains cell barcode and UMI sequence information. Read 2 contains transcript sequences. The barcode sequence of each read starts at base 1 and ends at base 8. The UMI sequence starts at base 9 and ends at base 12. They can be set via bcStart
, bcStop
, and umiStart
, umiStop
arguments. Cell barcode correction can be set by bcEdit
parameter. By default, reads with any nucleotide in the barcode and UMI sequences with sequencing quality lower than 10 (Phred score) will be excluded. The following command demultiplexes the example FASTQ reads and trims reads longer than 75 nucleotides. The command returns a SingleCellExperiment
object whose colData
contains the cell index, barcode, reads, percentage of reads assigned, sample, and FASTQ file path information for each cell. By default, the cell specific demultiplexed fastq.gz files are stored in ./Demultiplex
folder.
data(barcodeExample, package = "scruff")
de <- demultiplex(project = "example",
experiment = c("1h1", "b1"),
lane = c("L001", "L001"),
read1Path = c(v1h1R1, vb1R1),
read2Path = c(v1h1R2, vb1R2),
barcodeExample,
bcStart = 1,
bcStop = 8,
bcEdit = 1,
umiStart = 9,
umiStop = 12,
keep = 75,
minQual = 10,
yieldReads = 1e+06,
verbose = TRUE,
overwrite = TRUE,
cores = 2)
scruff provides an alignment function alignRsubread
which is a wrapper function to align
in Rsubread
package. It aligns the reads to reference sequence index and outputs sequence alignment map files in “BAM” or “SAM” format. For demonstration purpose, the built-in mitochondrial DNA sequence from GRCm38 reference assembly GRCm38MitochondrialFasta
will be used to map the reads. First, a Rsubread index for the reference sequence needs to be generated.
# Create index files for GRCm38_MT. For details, please refer to Rsubread
# user manual.
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
# NOTE: Rsubread package does not support Windows environment.
if (!requireNamespace("Rsubread", quietly = TRUE)) {
message("Package \"Rsubread\" needed.",
" Please install it if you are using Linux or macOS systems.",
" The function is not available in Windows environment.\n")
} else {
# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
reference = fasta,
indexSplit = FALSE)
}
The following command maps the FASTQ files to GRCm38 mitochondrial reference sequence GRCm38_MT.fa
and returns a SingleCellExperiment
object. By default, the files are stored in BAM format in ./Alignment
folder.
# Align the reads using Rsubread
if (requireNamespace("Rsubread", quietly = TRUE)) {
al <- alignRsubread(de,
indexBase,
unique = FALSE,
nBestLocations = 1,
format = "BAM",
overwrite = TRUE,
verbose = TRUE,
cores = 2)
}
Example GTF file GRCm38_MT.gtf
will be used for feature counting. Currently, scruff applies the union counting mode of the HTSeq Python package. The following command generates the UMI corrected count matrix for the example dataset by allowing correction of UMIs with Hamming distances smaller than 1 for each gene in each cell.
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
# get the molecular counts of trancsripts for each cell
# In experiment 1h1, cell barcodes 95 and 96 are empty well controls.
# In experiment b1, cell barcode 95 is bulk sample containing 300 cells.
if (requireNamespace("Rsubread", quietly = TRUE)) {
sce = countUMI(al,
gtf,
umiEdit = 1,
format = "BAM",
cellPerWell = c(rep(1, 94), 0, 0, rep(1, 94), 300, 1),
verbose = TRUE,
cores = 2)
}
The data quality diagnostic information are contained in the colData
of the returned SingleCellExperiment
object sce
. They can be visualized using the qcplots
function.
data(sceExample, package = "scruff")
qc <- qcplots(sceExample)
qc