Contents

1 Introduction

The raer (RNA Adenosine editing in R) package provides tools to characterize A-to-I editing in single cell and bulk RNA-sequencing datasets. Both novel and known editing sites can be detected and quantified beginning with BAM alignment files. At it’s core the raer package uses the pileup routines from the HTSlib C library (Bonfield et al. (2021)) to identify candidate RNA editing sites, and leverages the annotation resources in the Bioconductor ecosystem to further characterize and identify high-confidence RNA editing sites.

Here we demonstrate how to use the raer package to a) quantify RNA editing sites in droplet scRNA-seq dataset, b) identify editing sites with condition specific editing in bulk RNA-seq data, and c) predict novel editing sites from bulk RNA-seq.

2 Installation

The raer package can be installed from Bioconductor using BiocManager.

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("raer")

Alternatively raer can be installed from github using BiocManager::install("rnabioco/raer").

3 Characterizing RNA editing sites in scRNA-seq data

Here we will use the raer package to examine RNA editing in droplet-based single cell RNA-seq data. pileup_cells() enables quantification of edited and non-edited bases at specified sites from scRNA-seq data.

For this example we will examine a scRNA-seq dataset from human PBMC cells provided by 10x Genomics. The single cell data was aligned and processed using the 10x Genomics cellranger pipeline.

The PBMC scRNA-seq dataset from 10x Genomics, along with other needed files will downloaded and cached using pbmc_10x()from the raerdata ExperimentHub package. For this vignette, the BAM file was subset to retain 2 million alignments that overlap human RNA editing sites on chromosome 16.

pbmc_10x() returns a list containing a BamFile object, a GRanges object with known RNA editing sites from the REDIportal database Mansi et al. (2021), and a SingleCellExperiment populated with the gene expression data and cell type annotations.

library(raer)
library(raerdata)

pbmc <- pbmc_10x()

pbmc_bam <- pbmc$bam
editing_sites <- pbmc$sites
sce <- pbmc$sce

This dataset contains T-cell, B-cells, and monocyte cell populations.

library(scater)
library(SingleCellExperiment)
plotUMAP(sce, colour_by = "celltype")

3.1 Specifying sites to quantify

Next we’ll select editing sites to quantify. For this analysis we will use RNA editing sites cataloged in the REDIportal database Mansi et al. (2021).

editing_sites
## GRanges object with 15638648 ranges and 0 metadata columns:
##              seqnames    ranges strand
##                 <Rle> <IRanges>  <Rle>
##          [1]     chr1     87158      -
##          [2]     chr1     87168      -
##          [3]     chr1     87171      -
##          [4]     chr1     87189      -
##          [5]     chr1     87218      -
##          ...      ...       ...    ...
##   [15638644]     chrY  56885715      +
##   [15638645]     chrY  56885716      +
##   [15638646]     chrY  56885728      +
##   [15638647]     chrY  56885841      +
##   [15638648]     chrY  56885850      +
##   -------
##   seqinfo: 44 sequences from hg38 genome; no seqlengths

The sites to quantify are specified using a custom formatted GRanges object with 1 base intervals, a strand (+ or -), and supplemented with metadata columns named REF and ALT containing the reference and alternate base to query. In this case we are only interested in A->I editing, so we set the ref and alt to A and G. Note that the REF and ALT bases are in reference to strand. For a - strand interval the bases should be the complement of the + strand bases. Also note that these bases can be stored as traditional character vectors or as Rle() objects to save memory.

editing_sites$REF <- Rle("A")
editing_sites$ALT <- Rle("G")
editing_sites
## GRanges object with 15638648 ranges and 2 metadata columns:
##              seqnames    ranges strand |   REF   ALT
##                 <Rle> <IRanges>  <Rle> | <Rle> <Rle>
##          [1]     chr1     87158      - |     A     G
##          [2]     chr1     87168      - |     A     G
##          [3]     chr1     87171      - |     A     G
##          [4]     chr1     87189      - |     A     G
##          [5]     chr1     87218      - |     A     G
##          ...      ...       ...    ... .   ...   ...
##   [15638644]     chrY  56885715      + |     A     G
##   [15638645]     chrY  56885716      + |     A     G
##   [15638646]     chrY  56885728      + |     A     G
##   [15638647]     chrY  56885841      + |     A     G
##   [15638648]     chrY  56885850      + |     A     G
##   -------
##   seqinfo: 44 sequences from hg38 genome; no seqlengths

3.2 Quantifying sites in single cells using pileup_cells

pileup_cells() quantifies edited and non-edited UMI counts per cell barcode, then organizes the site counts into a SingleCellExperiment object. pileup_cells() accepts a FilterParam() object that specifies parameters for multiple read-level and site-level filtering and processing options. Note that pileup_cells() is strand sensitive by default, so it is important to ensure that the strand of the input sites is correctly annotated, and that the library-type is set correctly for the strandedness of the sequencing library. For 10x Genomics data, the library type is set to fr-second-strand, indicating that the strand of the BAM alignments is the same strand as the RNA. See quantifying Smart-seq2 scRNA-seq libraries for an example of using pileup_cells() to handle unstranded data and data from libraries that produce 1 BAM file for each cell.

To exclude duplicate reads derived from PCR, pileup_cells() can use a UMI sequence, supplied via the umi_tag argument, to only count 1 read for each CB-UMI pair at each editing site position. Note however that by default the bam_flags argument for the FilterParam class is set to include duplicate reads when using pileup_cells(). Droplet single cell libraries produce multiple cDNA fragments from a single reverse transcription event. The cDNA fragments have different alignment positions due to fragmentation despite being derived from a single RNA molecule. scRNA-seq data processed by cellranger from 10x Genomics will set the “Not primary alignment” BAM flag for every read except one read for each UMI. If duplicates are removed based on this BAM flag, then only 1 representative fragment for a single UMI will be examined, which will exclude many valid regions.

To reduce processing time many functions in the raer package operate in parallel across multiple chromosomes. To enable parallel processing, a BiocParallel backend can be supplied via the BPPARAM argument (e.g.  MultiCoreParam()).

outdir <- file.path(tempdir(), "sc_edits")
cbs <- colnames(sce)

params <- FilterParam(
    min_mapq = 255, # required alignment MAPQ score
    library_type = "fr-second-strand", # library type
    min_variant_reads = 1
)

e_sce <- pileup_cells(
    bamfile = pbmc_bam,
    sites = editing_sites,
    cell_barcodes = cbs,
    output_directory = outdir,
    cb_tag = "CB",
    umi_tag = "UB",
    param = params
)
e_sce
## class: SingleCellExperiment 
## dim: 3849 500 
## metadata(0):
## assays(2): nRef nAlt
## rownames(3849): site_chr16_83540_1_AG site_chr16_83621_1_AG ...
##   site_chr16_31453268_2_AG site_chr16_31454303_2_AG
## rowData names(2): REF ALT
## colnames(500): TGTTTGTCAGTTAGGG-1 ATCTCTACAAGCTACT-1 ...
##   GGGCGTTTCAGGACGA-1 CTATAGGAGATTGTGA-1
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

The outputs from pileup_cells() are a SingleCellExperiment object populated with nRef and nAlt assays containing the base counts for the reference (unedited) and alternate (edited) alleles at each position.

The sparseMatrices are also written to files, at a directory specified by output_directory, which can be loaded into R using the read_sparray() function.

dir(outdir)
## [1] "barcodes.txt.gz" "counts.mtx.gz"   "sites.txt.gz"
read_sparray(
    file.path(outdir, "counts.mtx.gz"),
    file.path(outdir, "sites.txt.gz"),
    file.path(outdir, "barcodes.txt.gz")
)
## class: SingleCellExperiment 
## dim: 3849 500 
## metadata(0):
## assays(2): nRef nAlt
## rownames(3849): site_chr16_83540_1_AG site_chr16_83621_1_AG ...
##   site_chr16_31453268_2_AG site_chr16_31454303_2_AG
## rowData names(2): REF ALT
## colnames(500): TGTTTGTCAGTTAGGG-1 ATCTCTACAAGCTACT-1 ...
##   GGGCGTTTCAGGACGA-1 CTATAGGAGATTGTGA-1
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Next we’ll filter the single cell editing dataset to find sites with an editing event in at least 5 cells and add the editing counts to the gene expression SingleCellExperiment as an altExp().

e_sce <- e_sce[rowSums(assays(e_sce)$nAlt > 0) >= 5, ]
e_sce <- calc_edit_frequency(e_sce,
    edit_from = "Ref",
    edit_to = "Alt",
    replace_na = FALSE
)
altExp(sce) <- e_sce[, colnames(sce)]

With the editing sites added to the gene expression SingleCellExperiment we can use plotting and other methods previously developed for single cell analysis. Here we’ll visualize editing sites with the highest edited read counts.

to_plot <- rownames(altExp(sce))[order(rowSums(assay(altExp(sce), "nAlt")),
    decreasing = TRUE
)]

lapply(to_plot[1:5], function(x) {
    plotUMAP(sce, colour_by = x, by_exprs_values = "nAlt")
})
## [[1]]

## 
## [[2]]