shinyDSP internal data processing pipeline explained

0.1 Introduction

The purpose of this vignette is to look under the hood and explain what a few underlying functions do to make this app work.

0.2 Data import

The human kidney data from Nanostring is loaded from ExperimentHub(). Two files are required: a count matrix and matching annotation file. Let’s read those files.

library(standR)
library(SummarizedExperiment)
library(ExperimentHub)
library(readr)
library(dplyr)
library(stats)

eh <- ExperimentHub()
AnnotationHub::query(eh, "standR")

## ExperimentHub with 3 records
## # snapshotDate(): 2025-04-11
## # $dataprovider: Nanostring
## # $species: NA
## # $rdataclass: data.frame
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH7364"]]' 
## 
##            title                   
##   EH7364 | GeomxDKDdata_count      
##   EH7365 | GeomxDKDdata_sampleAnno 
##   EH7366 | GeomxDKDdata_featureAnno

countFilePath <- eh[["EH7364"]]
sampleAnnoFilePath <- eh[["EH7365"]]

countFile <- readr::read_delim(unname(countFilePath), na = character())
sampleAnnoFile <- readr::read_delim(unname(sampleAnnoFilePath), na = character())

0.3 Variable(s) selection

A key step in analysis is to select a biological group(s) of interest. Let’s inspect sampleAnnoFile.

colnames(sampleAnnoFile)

##  [1] "SlideName"            "ScanName"             "ROILabel"            
##  [4] "SegmentLabel"         "SegmentDisplayName"   "Sample_ID"           
##  [7] "AOISurfaceArea"       "AOINucleiCount"       "ROICoordinateX"      
## [10] "ROICoordinateY"       "RawReads"             "TrimmedReads"        
## [13] "StitchedReads"        "AlignedReads"         "DeduplicatedReads"   
## [16] "SequencingSaturation" "UMIQ30"               "RTSQ30"              
## [19] "disease_status"       "pathology"            "region"              
## [22] "LOQ"                  "NormalizationFactor"  "RoiReportX"          
## [25] "RoiReportY"

sampleAnnoFile$disease_status %>% unique()

## [1] "DKD"    "normal"

sampleAnnoFile$region %>% unique()

## [1] "glomerulus" "tubule"

“disease_status” and “region” look like interesting variables. For example, we might be interested in comparing “normal_glomerulus” to “DKD_glomerulus”. The app can create a new sampleAnnoFile where two or more variables of interest can be combined into one grouped variable (under “Variable(s) of interest” in the main side bar).

new_sampleAnnoFile <- sampleAnnoFile %>%
    tidyr::unite(
        "disease_region", # newly created grouped variable
        c("disease_status","region"), # variables to combine
        sep = "_"
    )

new_sampleAnnoFile$disease_region %>% unique()

## [1] "DKD_glomerulus"    "DKD_tubule"        "normal_tubule"    
## [4] "normal_glomerulus"

Finally a spatial experiment object can be created.

spe <- standR::readGeoMx(as.data.frame(countFile),
                         as.data.frame(new_sampleAnnoFile))

0.4 Selecting groups to analyze

We merged “disease_status” and “region” to create a new group called, “disease_region”. The app allows users to pick any subset of groups of interest. In this example, the four possible groups are “DKD_glomerulus”, “DKD_tubule”,“normal_tubule”, and “normal_glomerulus”. The following script can subset spe to keep certain groups.

selectedTypes <- c("DKD_glomerulus", "DKD_tubule", "normal_tubule", "normal_glomerulus")
toKeep <- colData(spe) %>%
    tibble::as_tibble() %>%
    pull(disease_region)

spe <- spe[, grepl(paste(selectedTypes, collapse = "|"), toKeep)]

0.5 Applying QC filters

Regions of interest(ROIs) can be filtered out based on any quantitative variable in colData(spe) (same as colnames(new_sampleAnnoFile)). These options can be found under the “QC” nav panel’s side bar. Let’s say I want to keep ROIs whose “SequencingSaturation” is at least 85.

filter <- lapply("SequencingSaturation", function(column) {
    cutoff_value <- 85
    if(!is.na(cutoff_value)) {
        return(colData(spe)[[column]] > cutoff_value)
    } else {
        return(NULL)
    }
})

filtered_spe <- spe[,unlist(filter)]

colData(spe) %>% dim()

## [1] 231  24

colData(filtered_spe) %>% dim()

## [1] 217  24

We can see that 14 ROIs have been filtered out.

0.6 PCA

Two PCA plots (colour-coded by biological groups or batch) for three normalization schemes are automatically created in the “PCA” nav panel. Let’s take Q3 normalization as an example.

speQ3 <- standR::geomxNorm(filtered_spe, method = "upperquartile")
speQ3 <- scater::runPCA(filtered_spe)

speQ3_compute <- SingleCellExperiment::reducedDim(speQ3, "PCA")

Then, .PCAFunction() is called to generate the plot.

.PCAFunction(speQ3, speQ3_compute, "disease_region", selectedTypes, c(21,22,23,24), c("blue","blue","black","black"))

0.7 Design matrix

Let’s proceed with Q3 normalization. A design matrix is created next.

design <- model.matrix(~0+disease_region, data = colData(speQ3))

# Clean up column name
colnames(design) <- gsub("disease_region", "", colnames(design))

If confounding variables are chosen in the main side bar, those would be added to model.matrix as ~0 + disease_region + confounder1 + confounder2).

0.8 Creating a DGEList

edgeR is used to convert SpatialExperiment into DGEList, filter and estimate dispersion.

library(edgeR)
dge <- SE2DGEList(speQ3)
keep <- filterByExpr(dge, design)

dge <- dge[keep, , keep.lib.sizes = FALSE]

dge <- estimateDisp(dge, design = design, robust = TRUE)

0.9 Comparison between all biological groups

Recall our “selectedTypes” from above: “DKD_glomerulus”, “DKD_tubule”, “normal_tubule”, “normal_glomerulus”.

The following code creates all pairwise comparisons between them.

# In case there are spaces
selectedTypes_underscore <- gsub(" ", "_", selectedTypes)

comparisons <- list()

comparisons <- lapply(
    seq_len(choose(length(selectedTypes_underscore), 2)),
    function(i) {
        noquote(
            paste0(
                utils::combn(selectedTypes_underscore, 2,
                    simplify = FALSE
                )[[i]][1],
                "-",
                utils::combn(selectedTypes_underscore, 2,
                    simplify = FALSE
                )[[i]][2]
            )
        )
    }
)

con <- makeContrasts(
    # Must use as.character()
    contrasts = as.character(unlist(comparisons)),
    levels = colnames(design)
)

colnames(con) <- sub("-", "_vs_", colnames(con))

con

##                    Contrasts
## Levels              DKD_glomerulus_vs_DKD_tubule
##   DKD_glomerulus                               1
##   DKD_tubule                                  -1
##   normal_glomerulus                            0
##   normal_tubule                                0
##                    Contrasts
## Levels              DKD_glomerulus_vs_normal_tubule
##   DKD_glomerulus                                  1
##   DKD_tubule                                      0
##   normal_glomerulus                               0
##   normal_tubule                                  -1
##                    Contrasts
## Levels              DKD_glomerulus_vs_normal_glomerulus
##   DKD_glomerulus                                      1
##   DKD_tubule                                          0
##   normal_glomerulus                                  -1
##   normal_tubule                                       0
##                    Contrasts
## Levels              DKD_tubule_vs_normal_tubule DKD_tubule_vs_normal_glomerulus
##   DKD_glomerulus                              0                               0
##   DKD_tubule                                  1                               1
##   normal_glomerulus                           0                              -1
##   normal_tubule                              -1                               0
##                    Contrasts
## Levels              normal_tubule_vs_normal_glomerulus
##   DKD_glomerulus                                     0
##   DKD_tubule                                         0
##   normal_glomerulus                                 -1
##   normal_tubule                                      1

0.10 Fitting a linear regression model with `limma`

The app uses duplicateCorrelation() “[s]ince we need to make comparisons both within and between subjects, it is necessary to treat Patient as a random effect” limma user’s guide (Ritchie et al. 2015). limma-voom method is used as standR package recommends (Liu et al. 2023).

library(limma)
block_by <- colData(speQ3)[["SlideName"]]

v <- voom(dge, design)
corfit <- duplicateCorrelation(v, design,
    block = block_by
)

v2 <- voom(dge, design,
    block = block_by,
    correlation = corfit$consensus
)

corfit2 <- duplicateCorrelation(v, design,
    block = block_by
)

fit <- lmFit(v, design,
    block = block_by,
    correlation = corfit2$consensus
)

fit_contrast <- contrasts.fit(fit,
    contrasts = con
)

efit <- eBayes(fit_contrast, robust = TRUE)

0.11 Tables of top differentially expressed genes

For each contrast (a column in con), the app creates a table of top differentially expressed genes sorted by their adjusted P value.

# Keep track of how many comparisons there are
numeric_vector <- seq_len(ncol(con))
new_list <- as.list(numeric_vector)

# This adds n+1th index to new_list where n = ncol(con)
# This index contains seq_len(ncol(con)) 
# ex. new_list[[7]] = 1 2 3 4 5 6
# This coefficient allows ANOVA-like comparison in toptable()
if (length(selectedTypes) > 2) {
    new_list[[length(new_list) + 1]] <- numeric_vector
}

topTabDF <- lapply(new_list, function(i) {
    limma::topTable(efit,
        coef = i, number = Inf, p.value = 0.05,
        adjust.method = "BH", lfc = 1
    ) %>%
        tibble::rownames_to_column(var = "Gene")
})

# Adds names to topTabDF
if (length(selectedTypes) > 2) {
    names(topTabDF) <- c(
        colnames(con),
        colnames(con) %>% stringr::str_split(., "_vs_") %>%
            unlist() %>% unique() %>% paste(., collapse = "_vs_")
    )
} else {
    names(topTabDF) <- colnames(con)
}

topTabDF is now a list of tables where the list index corresponds to that of columns in con.

colnames(con)

## [1] "DKD_glomerulus_vs_DKD_tubule"        "DKD_glomerulus_vs_normal_tubule"    
## [3] "DKD_glomerulus_vs_normal_glomerulus" "DKD_tubule_vs_normal_tubule"        
## [5] "DKD_tubule_vs_normal_glomerulus"     "normal_tubule_vs_normal_glomerulus"

head(topTabDF[[1]])

##     Gene Type    logFC  AveExpr        t      P.Value    adj.P.Val        B
## 1 SPOCK1 gene 2.773094 7.073508 30.12186 1.201902e-89 2.159337e-85 193.6072
## 2   PLAT gene 3.953830 8.269033 34.17176 7.387340e-89 6.636048e-85 191.7033
## 3  ARMH4 gene 2.545175 7.168846 28.85332 1.067403e-85 6.392320e-82 184.6393
## 4  TIMP3 gene 2.898280 9.560524 29.27997 3.602942e-82 1.618262e-78 176.5764
## 5   FGF1 gene 3.486781 7.645384 30.66180 5.580483e-81 2.005179e-77 173.7881
## 6  PODXL gene 4.218256 8.531335 30.20394 5.207716e-79 1.559364e-75 169.3251

These tables are then shown to the user in the “Tables” nav panel.

0.12 Volcano plot and heatmap

No further serious data processing is performed at this point. The numbers in topTabDF is now parsed to create volcano plots and heatmaps in their respective nav panels.

Liu, Ning, Dharmesh D Bhuva, Ahmed Mohamed, Micah Bokelund, Arutha Kulasinghe, Chin Wee Tan, and Melissa J Davis. 2023. “standR: Spatial Transcriptomic Analysis for GeoMx DSP Data.” Nucleic Acids Research, November, gkad1026. https://doi.org/10.1093/nar/gkad1026.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

shinyDSP internal data processing pipeline explained

15 April 2025

Package

0.1 Introduction

0.2 Data import

0.3 Variable(s) selection

0.4 Selecting groups to analyze

0.5 Applying QC filters

0.6 PCA

0.7 Design matrix

0.8 Creating a DGEList

0.9 Comparison between all biological groups

0.10 Fitting a linear regression model with `limma`

0.11 Tables of top differentially expressed genes

0.12 Volcano plot and heatmap

shinyDSP internal data processing pipeline explained

15 April 2025

Package

0.1 Introduction

0.2 Data import

0.3 Variable(s) selection

0.4 Selecting groups to analyze

0.5 Applying QC filters

0.6 PCA

0.7 Design matrix

0.8 Creating a DGEList

0.9 Comparison between all biological groups

0.10 Fitting a linear regression model with limma

0.11 Tables of top differentially expressed genes

0.12 Volcano plot and heatmap

0.10 Fitting a linear regression model with `limma`