library(miloR)
library(SingleCellExperiment)
library(scater)
library(scran)
library(dplyr)
library(patchwork)

1 Introduction

Milo is a tool for analysis of complex single-cell datasets generated from replicated multi-condition experiments, which detects changes in composition between conditions. While differential abundance (DA) is commonly quantified in discrete cell clusters, Milo uses partially overlapping neighbourhoods of cells on a KNN graph. Starting from a graph that faithfully recapitulates the biology of the cell population, Milo analysis consists of three steps:

  1. Sampling of representative neighbourhoods
  2. Testing for differential abundance of conditions in all neighbourhoods
  3. Accounting for multiple hypothesis testing using a weighted FDR procedure that accounts for the overlap of neighbourhoods

In this vignette we will elaborate on how these steps are implemented in the miloR package.

2 Load data

For this demo we will use a synthetic dataset simulating a developmental trajectory, generated using dyntoy.

data("sim_trajectory", package = "miloR")

## Extract SingleCellExperiment object
traj_sce <- sim_trajectory[['SCE']]

## Extract sample metadata to use for testing
traj_meta <- sim_trajectory[["meta"]]

## Add metadata to colData slot
colData(traj_sce) <- DataFrame(traj_meta)

3 Pre-processing

For DA analysis we need to construct an undirected KNN graph of single cells. Standard single-cell analysis pipelines usually do this from distances in PCA space. We log-transform the counts and calculate principal components using scater. We also run UMAP for visualization purposes.

logcounts(traj_sce) <- log(counts(traj_sce) + 1)
traj_sce <- runPCA(traj_sce, ncomponents=30)
traj_sce <- runUMAP(traj_sce)

plotUMAP(traj_sce)

4 Create a Milo object

For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.

4.1 From SingleCellExperiment object

The Milo constructor takes as input a SingleCellExperiment object.

traj_milo <- Milo(traj_sce)
reducedDim(traj_milo, "UMAP") <- reducedDim(traj_sce, "UMAP")

traj_milo
## class: Milo 
## dim: 500 500 
## metadata(0):
## assays(2): counts logcounts
## rownames(500): G1 G2 ... G499 G500
## rowData names(0):
## colnames: NULL
## colData names(5): cell_id group_id Condition Replicate Sample
## reducedDimNames(2): PCA UMAP
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1

4.2 From AnnData object (.h5ad)

We can use the zellkonverter package to make a SingleCellExperiment object from an AnnData object stored as h5ad file.

library(zellkonverter)

# Obtaining an example H5AD file.
example_h5ad <- system.file("extdata", "krumsiek11.h5ad",
                            package = "zellkonverter")

example_h5ad_sce <- readH5AD(example_h5ad)
example_h5ad_milo <- Milo(example_h5ad_sce)

4.3 From Seurat object

The Seurat package includes a converter to SingleCellExperiment.

library(Seurat)
data("pbmc_small")
pbmc_small_sce <- as.SingleCellExperiment(pbmc_small)
pbmc_small_milo <- Milo(pbmc_small_sce)

5 Construct KNN graph

We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.

traj_milo <- buildGraph(traj_milo, k = 10, d = 30)
## Constructing kNN graph with k:10
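
To make the idea of an undirected KNN graph concrete, here is a minimal base R sketch on toy data. It is only an illustration and not how buildGraph is implemented; the names toy_pcs and knn_adjacency are hypothetical, and the rule used to symmetrise the graph (keep an edge if either cell lists the other) is an assumption.

```r
# Toy example: 6 "cells" described by 2 reduced dimensions (hypothetical data)
set.seed(42)
toy_pcs <- matrix(rnorm(12), nrow = 6, ncol = 2)

# Build a symmetric 0/1 adjacency matrix linking each cell to its k nearest neighbours
knn_adjacency <- function(x, k) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  diag(d) <- Inf                     # a cell is not its own neighbour
  adj <- matrix(0L, nrow(x), nrow(x))
  for (i in seq_len(nrow(x))) {
    nn <- order(d[i, ])[seq_len(k)]  # indices of the k closest cells
    adj[i, nn] <- 1L
  }
  # Make the graph undirected: keep an edge if either cell lists the other
  pmax(adj, t(adj))
}

toy_adj <- knn_adjacency(toy_pcs, k = 2)
toy_adj
```

Each row of toy_adj has at least k non-zero entries, and can have more, because a cell may be listed as a neighbour by cells that are not among its own k nearest.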

In progress: we are perfecting the functionality to add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix.

6 Defining representative neighbourhoods

We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.

For sampling you need to define a few parameters:

  • prop: the proportion of cells to randomly sample to start with (usually 0.1 - 0.2 is sufficient)
  • k: the k to use for KNN refinement (we recommend using the same k used for KNN graph building)
  • d: the number of reduced dimensions to use for KNN refinement (we recommend using the same d used for KNN graph building)
  • refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended option is to use refinement. The only situation in which you might consider random sampling instead is if you have batch corrected your data with a graph-based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.

traj_milo <- makeNhoods(traj_milo, prop = 0.1, k = 10, d=30, refined = TRUE)
## Checking valid object
## Warning in .refined_sampling(random_vertices, X_reduced_dims, k): Rownames not
## set on reducedDims - setting to row indices

Once we have defined neighbourhoods, it’s good to take a look at how big they are (i.e. how many cells form each neighbourhood), because this affects the power of DA testing. We can check this using the plotNhoodSizeHist function. Empirically, we found it’s best to have a distribution peaking between 50 and 100. If the distribution peaks at smaller sizes, consider rerunning makeNhoods with a larger k and/or prop (here the distribution looks atypical because the dataset is small).

plotNhoodSizeHist(traj_milo)
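
The neighbourhood size is simply the number of cells assigned to each neighbourhood. The base R sketch below illustrates this on a toy graph; the adjacency matrix is hypothetical, the index cells are picked by hand rather than by the refined sampling, and counting the index cell as a member of its own neighbourhood is an assumption made for the illustration.

```r
# Toy undirected KNN graph on 5 cells (hypothetical adjacency matrix)
toy_adj <- matrix(c(0, 1, 1, 0, 0,
                    1, 0, 1, 0, 0,
                    1, 1, 0, 1, 0,
                    0, 0, 1, 0, 1,
                    0, 0, 0, 1, 0),
                  nrow = 5, byrow = TRUE)

# Index cells picked by hand for illustration
index_cells <- c(1, 4)

# A neighbourhood holds every cell sharing an edge with the index cell,
# plus (by assumption) the index cell itself
membership <- toy_adj[, index_cells, drop = FALSE]
membership[cbind(index_cells, seq_along(index_cells))] <- 1
colnames(membership) <- paste0("nhood_", index_cells)

# Number of cells per neighbourhood: the quantity a size histogram summarises
nhood_sizes <- colSums(membership)
nhood_sizes
```

In this toy graph cell 3 belongs to both neighbourhoods, which is the kind of overlap that the spatial FDR correction later has to account for.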

7 Counting cells in neighbourhoods

Now we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.

traj_milo <- countCells(traj_milo, meta.data = data.frame(colData(traj_milo)), samples="Sample")
## Checking meta.data validity
## Counting cells in neighbourhoods

This adds to the Milo object an \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

head(nhoodCounts(traj_milo))
## 6 x 6 sparse Matrix of class "dgCMatrix"
##   B_R1 A_R1 A_R2 B_R2 B_R3 A_R3
## 1    3    6    3    6    7    2
## 2   10    2    1    4    8    3
## 3    9    3    3   10   11    9
## 4    9    6    7    7    6    9
## 5    5    3    3    5    6    7
## 6    4    4    1    2    4    3
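
Conceptually, a count matrix of this shape can be obtained by multiplying a cell-by-neighbourhood membership matrix with a cell-by-sample indicator matrix. The following base R sketch shows this on toy data; it is not the countCells implementation, and membership, cell_sample and toy_counts are hypothetical names.

```r
# Hypothetical membership matrix: 5 cells (rows) x 2 neighbourhoods (columns)
membership <- matrix(c(1, 0,
                       1, 0,
                       1, 1,
                       0, 1,
                       0, 1),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(NULL, c("nhood_1", "nhood_2")))

# Sample of origin for each cell (hypothetical labels)
cell_sample <- factor(c("A_R1", "B_R1", "A_R1", "B_R1", "B_R1"))

# One indicator column per sample: 5 cells x 2 samples
sample_indicator <- model.matrix(~ 0 + cell_sample)
colnames(sample_indicator) <- levels(cell_sample)

# Count matrix: neighbourhoods in rows, samples in columns
toy_counts <- t(membership) %*% sample_indicator
toy_counts
```

Because neighbourhoods overlap, a single cell can contribute to several rows of the count matrix, so row totals can sum to more than the number of cells.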

8 Differential abundance testing

Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.

We first need to think about our experimental design. The design matrix should match samples to a condition of interest. In this case the Condition is the covariate we are going to test for.

traj_design <- data.frame(colData(traj_milo))[,c("Sample", "Condition")]
traj_design <- distinct(traj_design)
rownames(traj_design) <- traj_design$Sample
## Reorder rownames to match columns of nhoodCounts(milo)
traj_design <- traj_design[colnames(nhoodCounts(traj_milo)), , drop=FALSE]

traj_design
##      Sample Condition
## B_R1   B_R1         B
## A_R1   A_R1         A
## A_R2   A_R2         A
## B_R2   B_R2         B
## B_R3   B_R3         B
## A_R3   A_R3         A
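
To see how a design formula such as ~ Condition is turned into numbers, the design table above can be expanded with base R's model.matrix. This is a small sketch on a hand-built copy of the table (toy_design and toy_mm are hypothetical names); it assumes R's default treatment contrasts, under which the alphabetically first level, A, is the reference.

```r
# Hand-built copy of the design table printed above
toy_design <- data.frame(
  Sample = c("B_R1", "A_R1", "A_R2", "B_R2", "B_R3", "A_R3"),
  Condition = c("B", "A", "A", "B", "B", "A")
)
rownames(toy_design) <- toy_design$Sample

# Expand the formula into a numeric model matrix: one row per sample
toy_mm <- model.matrix(~ Condition, data = toy_design)
toy_mm
```

The ConditionB column is 1 for samples from condition B and 0 otherwise, so under these assumptions its coefficient compares B against A.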

Milo uses an adaptation of the Spatial FDR correction introduced by cydar, which accounts for the overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the k-th nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbours in the Milo object.

traj_milo <- calcNhoodDistance(traj_milo, d=30)
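
The weighting described above can be sketched as a weighted Benjamini-Hochberg adjustment. The base R function below is an illustration under that assumption, not the code used inside miloR or cydar; weighted_bh, toy_p and toy_kth_dist are hypothetical names and the distances are made-up values.

```r
# Weighted Benjamini-Hochberg adjustment (illustrative sketch)
weighted_bh <- function(pvalues, weights) {
  stopifnot(length(pvalues) == length(weights), all(weights > 0))
  o <- order(pvalues)                  # most significant test first
  p_sorted <- pvalues[o]
  w_sorted <- weights[o]
  # Weighted analogue of n * p / rank
  raw <- sum(w_sorted) * p_sorted / cumsum(w_sorted)
  # Enforce monotonicity from the largest p-value downwards, then cap at 1
  adj_sorted <- pmin(rev(cummin(rev(raw))), 1)
  adjusted <- numeric(length(pvalues))
  adjusted[o] <- adj_sorted            # restore the original order
  adjusted
}

# Toy p-values and hypothetical k-th nearest neighbour distances
toy_p <- c(0.01, 0.04, 0.03, 0.20)
toy_kth_dist <- c(0.5, 1.0, 2.0, 1.0)

# Weight each test by the reciprocal of its k-th nearest neighbour distance
toy_fdr <- weighted_bh(toy_p, weights = 1 / toy_kth_dist)
toy_fdr
```

With equal weights this reduces to the ordinary Benjamini-Hochberg adjustment, i.e. weighted_bh(toy_p, rep(1, 4)) agrees with p.adjust(toy_p, method = "BH"). Tests with a small k-th nearest neighbour distance, which sit in dense regions of the graph, receive a larger weight.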

Now we can do the test, explicitly defining our experimental design.

rownames(traj_design) <- traj_design$Sample
da_results <- testNhoods(traj_milo, design = ~ Condition, design.df = traj_design)
## Using TMM normalisation
## Performing spatial FDR correction withk-distance weighting

This calculates a log fold-change and a corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions.

da_results %>%
  arrange(- SpatialFDR) %>%
  head() 
##          logFC   logCPM          F    PValue       FDR Nhood SpatialFDR
## 3  -0.08085219 15.79198 0.04317858 0.8574082 0.8574082     3  0.8574082
## 21 -0.16013710 15.20704 0.07749603 0.7831219 0.8101261    21  0.8100769
## 15  0.29971578 15.84167 0.56740535 0.5107248 0.5472052    15  0.5470124
## 1  -0.52770715 15.24452 0.88064014 0.3499114 0.3887904     1  0.3912721
## 6  -0.73135864 14.86601 2.28022748 0.2738419 0.3159714     6  0.3198554
## 29 -0.55597157 15.48269 1.83712283 0.2657550 0.3159714    29  0.3198554
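
Note that arrange(- SpatialFDR) sorts in descending order, so the rows above are the neighbourhoods with the weakest evidence for DA. As a small sketch, the base R code below re-sorts a hand-copied subset of those six rows in ascending order and applies a threshold; toy_results is a hypothetical name and the 10% cut-off is an assumption chosen for illustration.

```r
# Six rows copied from the table above (subset of columns)
toy_results <- data.frame(
  Nhood = c(3, 21, 15, 1, 6, 29),
  logFC = c(-0.08085219, -0.16013710, 0.29971578,
            -0.52770715, -0.73135864, -0.55597157),
  SpatialFDR = c(0.8574082, 0.8100769, 0.5470124,
                 0.3912721, 0.3198554, 0.3198554)
)

# Sort ascending so that the strongest evidence for DA comes first
toy_sorted <- toy_results[order(toy_results$SpatialFDR), ]

# Keep neighbourhoods passing a 10% spatial FDR threshold (assumed cut-off)
toy_sig <- toy_results[toy_results$SpatialFDR < 0.1, ]
nrow(toy_sig)
```

None of these six rows passes the threshold, which is expected given that they were selected as the largest SpatialFDR values.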

9 Visualize neighbourhoods displaying DA

To visualize DA results, we build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding.

traj_milo <- buildNhoodGraph(traj_milo)

plotUMAP(traj_milo) + plotNhoodGraphDA(traj_milo, da_results, alpha=0.05) +
  plot_layout(guides="collect")

Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.1.1             dplyr_1.0.7                
##  [3] scran_1.22.0                scater_1.22.0              
##  [5] ggplot2_3.3.5               scuttle_1.4.0              
##  [7] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
##  [9] Biobase_2.54.0              GenomicRanges_1.46.0       
## [11] GenomeInfoDb_1.30.0         IRanges_2.28.0             
## [13] S4Vectors_0.32.0            BiocGenerics_0.40.0        
## [15] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [17] miloR_1.2.0                 edgeR_3.36.0               
## [19] limma_3.50.0                BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RColorBrewer_1.1-2       
##  [3] tools_4.1.1               bslib_0.3.1              
##  [5] utf8_1.2.2                R6_2.5.1                 
##  [7] irlba_2.3.3               vipor_0.4.5              
##  [9] uwot_0.1.10               DBI_1.1.1                
## [11] colorspace_2.0-2          withr_2.4.2              
## [13] tidyselect_1.1.1          gridExtra_2.3            
## [15] compiler_4.1.1            BiocNeighbors_1.12.0     
## [17] DelayedArray_0.20.0       labeling_0.4.2           
## [19] bookdown_0.24             sass_0.4.0               
## [21] scales_1.1.1              stringr_1.4.0            
## [23] digest_0.6.28             rmarkdown_2.11           
## [25] XVector_0.34.0            pkgconfig_2.0.3          
## [27] htmltools_0.5.2           sparseMatrixStats_1.6.0  
## [29] highr_0.9                 fastmap_1.1.0            
## [31] rlang_0.4.12              FNN_1.1.3                
## [33] DelayedMatrixStats_1.16.0 jquerylib_0.1.4          
## [35] farver_2.1.0              generics_0.1.1           
## [37] jsonlite_1.7.2            gtools_3.9.2             
## [39] BiocParallel_1.28.0       RCurl_1.98-1.5           
## [41] magrittr_2.0.1            BiocSingular_1.10.0      
## [43] GenomeInfoDbData_1.2.7    Matrix_1.3-4             
## [45] ggbeeswarm_0.6.0          Rcpp_1.0.7               
## [47] munsell_0.5.0             fansi_0.5.0              
## [49] viridis_0.6.2             lifecycle_1.0.1          
## [51] stringi_1.7.5             yaml_2.2.1               
## [53] ggraph_2.0.5              MASS_7.3-54              
## [55] zlibbioc_1.40.0           grid_4.1.1               
## [57] dqrng_0.3.0               parallel_4.1.1           
## [59] ggrepel_0.9.1             crayon_1.4.1             
## [61] lattice_0.20-45           splines_4.1.1            
## [63] graphlayouts_0.7.1        cowplot_1.1.1            
## [65] beachmat_2.10.0           magick_2.7.3             
## [67] locfit_1.5-9.4            metapod_1.2.0            
## [69] knitr_1.36                pillar_1.6.4             
## [71] igraph_1.2.7              ScaledMatrix_1.2.0       
## [73] glue_1.4.2                evaluate_0.14            
## [75] BiocManager_1.30.16       vctrs_0.3.8              
## [77] tweenr_1.0.2              gtable_0.3.0             
## [79] purrr_0.3.4               polyclip_1.10-0          
## [81] tidyr_1.1.4               assertthat_0.2.1         
## [83] xfun_0.27                 ggforce_0.3.3            
## [85] rsvd_1.0.5                tidygraph_1.2.0          
## [87] RSpectra_0.16-0           viridisLite_0.4.0        
## [89] tibble_3.1.5              beeswarm_0.4.0           
## [91] cluster_2.1.2             statmod_1.4.36           
## [93] bluster_1.4.0             ellipsis_0.3.2