Contents

1 Introduction

This vignette demonstrates how to use the package sSNAPPY to compute directional single sample pathway perturbation scores by incorporating pathway topologies and changes in gene expression, utilizing sample permutation to test the significance of individual scores and comparing average pathway activities across treatments.

The package also provides many powerful and easy-to-use visualisation functions that helps visualising significantly perturbed pathways as networks, detecting community structures in pathway networks, and revealing pathway genes’ involvement in the perturbation.

2 To get ready

2.1 Installation

The package sSNAPPY can be installed using the package BiocManager

if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
# Other packages required in this vignette
pkg <- c("tidyverse", "magrittr", "ggplot2", "cowplot", "DT")
BiocManager::install(pkg)
BiocManager::install("sSNAPPY")
install.packages("htmltools")

2.2 Load packages

library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(cowplot)
library(DT)
library(htmltools)
library(patchwork)

2.3 Load data

The example dataset used for this tutorial can be loaded with data() as shown below. It’s a subset of data retrieved from Singhal H et al. 2016, where ER-positive primary breast cancer tumour tissues collected from 12 patients were split into tissue fragments of equal sizes for different treatments.

For this tutorial, we are only looking at the RNA-seq data from samples that were treated with vehicle, R5020(progesterone) OR E2(estrogen) + R5020 for 48 hrs. Tumour specimens were collected from 5 patients, giving rise to a total of 15 samples. To cut down the computation time, only half the expressed genes were randomly sampled to derive the example logCPM matrix. A more detailed description of the dataset can be assessed through the help page (?logCPM_example and ?metadata_example).

data(logCPM_example)
data(metadata_example)
# check if samples included in the logCPM matrix and metadata dataframe are identical
setequal(colnames(logCPM_example), metadata_example$sample)
## [1] TRUE
# View sample metadata
datatable(metadata_example,  filter = "top")

3 sSNAPPY workflow

3.1 Compute weighted single-sample logFCs (ssLogFCs)

It is expected that the logCPM matrix will be filtered to remove undetectable genes and normalised to correct for library sizes or other systematic artefacts, such as gene length or GC contents, prior to applying the sSNAPPY workflow. Filtration and normalisation have been performed on the example dataset.

Before single-sample logFC (ssFC) can be computed, row names of the logCPM matrix need to be converted to entrez ID. This is because all the pathway topology information retrieved will be in entrez ID. The conversion can be achieved through bioconductor packages AnnotationHub and ensembldb.

head(logCPM_example)

To compute the ssFC, samples must be in matching pairs. In our example data, treated samples were matched to the corresponding control samples derived from the same patients. Therefore the groupBy parameter of the weight_ss_fc() functions will be set to be “patient”.

weight_ss_fc() requires both the logCPM matrix and sample metadata as input. The column names of the logCPM matrix should be sample names, matching a column in the metadata. Name of the sample name column will be provided as the sampleColumn parameter. The function also requires the name of the metadata column that contains treatment information to be specified. The column with treatment information must be a factor with the control treatment set to be the reference level.

# Check that the baseline level of the treatment column is the control
levels(metadata_example$treatment)[1]
## [1] "Vehicle"
#compute weighted single sample logFCs
weightedFC <- weight_ss_fc(logCPM_example, metadata = metadata_example,
                           groupBy  = "patient", sampleColumn = "sample", 
                           treatColumn = "treatment")

The weight_ss_fc() function firstly computes raw ssFC for each gene by subtracting logCPM values of the control sample from the logCPM values of treated samples within each patient.

It has been demonstrated previously that in RNA-seq data, lowly expressed genes turn to have a larger variance (Law et al. 2014), which is also demonstrated by the plots below. To reduce the impact of this artefact, weight_ss_fc also weights each ssFCs by estimating the relationship between the gene-level variance and mean logCPM, and defining the gene-wise weight to be the inverse of the predicted variance of that gene’s mean logCPM value.

logCPM_example %>%
    as.data.frame() %>%
    mutate(
        sd = apply(., 1, sd),
        mean = apply(., 1, mean)
        ) %>%
    ggplot(
        aes(mean, sd)
    ) +
    geom_point() +
    geom_smooth(
        method = "loess") +
    labs(
        x = expression(Gene-wise~average~expression~(bar(mu[g]))),
        y = expression(Gene-wise~standard~deviation~(sigma[g]))
    ) +
    theme_bw() +
    theme(
        panel.grid = element_blank(),
        axis.title = element_text(size = 14)
    )