1 Basics

1.1 Install spatialLIBD

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. spatialLIBD (Pardo, Spangler, Weber, Hicks, Jaffe, Martinowich, Maynard, and Collado-Torres, 2022) is an R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install spatialLIBD by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("spatialLIBD")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

To run all the code in this vignette, you might need to install other R/Bioconductor packages, which you can do with:

BiocManager::install("spatialLIBD", dependencies = TRUE, force = TRUE)

If you want to use the development version of spatialLIBD, you will need to use the R version corresponding to the current Bioconductor-devel branch as described in more detail on the Bioconductor website. Then you can install spatialLIBD from GitHub using the following command.

BiocManager::install("LieberInstitute/spatialLIBD")

1.2 Required knowledge

Please first check the Introduction to spatialLIBD vignette available through GitHub or Bioconductor.

1.3 Citing spatialLIBD

We hope that spatialLIBD will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("spatialLIBD")
#> 
#> To cite package 'spatialLIBD' in publications use:
#> 
#>   Pardo B, Spangler A, Weber LM, Hicks SC, Jaffe AE, Martinowich K,
#>   Maynard KR, Collado-Torres L (2022). "spatialLIBD: an R/Bioconductor
#>   package to visualize spatially-resolved transcriptomics data." _BMC
#>   Genomics_. doi:10.1186/s12864-022-08601-w
#>   <https://doi.org/10.1186/s12864-022-08601-w>,
#>   <https://doi.org/10.1186/s12864-022-08601-w>.
#> 
#>   Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK,
#>   Williams SR, II JLC, Tran MN, Besich Z, Tippani M, Chew J, Yin Y,
#>   Kleinman JE, Hyde TM, Rao N, Hicks SC, Martinowich K, Jaffe AE
#>   (2021). "Transcriptome-scale spatial gene expression in the human
#>   dorsolateral prefrontal cortex." _Nature Neuroscience_.
#>   doi:10.1038/s41593-020-00787-0
#>   <https://doi.org/10.1038/s41593-020-00787-0>,
#>   <https://www.nature.com/articles/s41593-020-00787-0>.
#> 
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.

2 Download data from 10x Genomics

In this vignette we’ll show you how you can use spatialLIBD (Pardo, Spangler, Weber et al., 2022) for exploring spatially resolved transcriptomics data from the Visium platform by 10x Genomics. That is, you will learn how to use spatialLIBD for data beyond the one it was initially developed for (Maynard, Collado-Torres, Weber, Uytingco, Barry, Williams, II, Tran, Besich, Tippani, Chew, Yin, Kleinman, Hyde, Rao, Hicks, Martinowich, and Jaffe, 2021). To illustrate these steps, we will use data that 10x Genomics made publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets. We will use files from the human lymph node example publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node.

2.1 Load packages

To get started, lets load the different packages we’ll need for this vignette. Here’s a brief summary of why we need these packages:

  • BiocFileCache: for downloading and saving a local cache of the data
  • SpatialExperiment: for reading the spaceranger files provided by 10x Genomics
  • rtracklayer: for importing a gene annotation GTF file
  • lobstr: for checking how much memory our object is using
  • spatialLIBD: for launching an interactive website to explore the data
## Load packages in the order we'll use them next
library("BiocFileCache")
library("SpatialExperiment")
library("rtracklayer")
library("lobstr")
library("spatialLIBD")

2.2 Download spaceranger output files

Next we download data from 10x Genomics available from the human lymph node example, available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node. We don’t need to download all the files listed there since SpatialExperiment doesn’t need all of them for importing the data into R. These files are part of the output that gets generated by spaceranger which is the processing pipeline provided by 10x Genomics for Visium data.

We’ll use BiocFileCache to keep the data in a local cache in case we want to run this example again and don’t want to re-download the data from the web.

## Download and save a local cache of the data provided by 10x Genomics
bfc <- BiocFileCache::BiocFileCache()
lymph.url <-
    paste0(
        "https://cf.10xgenomics.com/samples/spatial-exp/",
        "1.1.0/V1_Human_Lymph_Node/",
        c(
            "V1_Human_Lymph_Node_filtered_feature_bc_matrix.tar.gz",
            "V1_Human_Lymph_Node_spatial.tar.gz",
            "V1_Human_Lymph_Node_analysis.tar.gz"
        )
    )
lymph.data <- sapply(lymph.url, BiocFileCache::bfcrpath, x = bfc)

10x Genomics provides the files in compressed tarballs (.tar.gz file extension). Which is why we’ll need to use utils::untar() to decompress the files. This will create new directories and we will use list.files() to see what files these directories contain.

## Extract the files to a temporary location
## (they'll be deleted once you close your R session)
xx <- sapply(lymph.data, utils::untar, exdir = file.path(tempdir(), "outs"))
## The names are the URLs, which are long and thus too wide to be shown here,
## so we shorten them to only show the file name prior to displaying the
## utils::untar() output status
names(xx) <- basename(names(xx))
xx
#> V1_Human_Lymph_Node_filtered_feature_bc_matrix.tar.gz.BFC37 
#>                                                           0 
#>                    V1_Human_Lymph_Node_spatial.tar.gz.BFC38 
#>                                                           0 
#>                  V1_Human_Lymph_Node_analysis.tar.gz.BFC111 
#>                                                           0

## List the files we downloaded and extracted
## These files are typically SpaceRanger outputs
lymph.dirs <- file.path(
    tempdir(), "outs",
    c("filtered_feature_bc_matrix", "spatial", "raw_feature_bc_matrix", "analysis")
)
list.files(lymph.dirs)
#>  [1] "aligned_fiducials.jpg"     "barcodes.tsv.gz"          
#>  [3] "clustering"                "detected_tissue_image.jpg"
#>  [5] "diffexp"                   "features.tsv.gz"          
#>  [7] "matrix.mtx.gz"             "pca"                      
#>  [9] "scalefactors_json.json"    "tissue_hires_image.png"   
#> [11] "tissue_lowres_image.png"   "tissue_positions_list.csv"
#> [13] "tsne"                      "umap"

Now that we have the files that we need, we can import the data into R using read10xVisium() from SpatialExperiment. We’ll import the low resolution histology images produced by spaceranger using the images = "lowres" and load = TRUE arguments. We’ll also load the filtered gene expression data using the data = "filtered" argument. The count matrix can still be quite large, which is why we’ll use the type = "sparse" argument to load the data into an R object that is memory-efficient for sparse data.

## Import the data as a SpatialExperiment object
spe <- SpatialExperiment::read10xVisium(
    samples = tempdir(),
    sample_id = "lymph",
    type = "sparse", data = "filtered",
    images = "lowres", load = TRUE
)
#> as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
## Inspect the R object we just created: class, memory, and how it looks in
## general
class(spe)
#> [1] "SpatialExperiment"
#> attr(,"package")
#> [1] "SpatialExperiment"
lobstr::obj_size(spe) / 1024^2 ## Convert to MB
#> 281.90 B
spe
#> class: SpatialExperiment 
#> dim: 36601 4035 
#> metadata(0):
#> assays(1): counts
#> rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
#>   ENSG00000277196
#> rowData names(1): symbol
#> colnames(4035): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
#>   TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
#> colData names(4): in_tissue array_row array_col sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor

## The counts are saved in a sparse matrix R object
class(counts(spe))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

3 Modify spe for spatialLIBD

Now that we have an SpatialExperiment R object (spe) with the data from 10x Genomics for the human lymph node example, we need to add a few features to the R object in order to create the interactive website using spatialLIBD::run_app(). These additional elements power features in the interactive website that you might be interested in.

First we start with adding a few variables to the sample information table (colData()) of our spe object. We add:

  • key: this labels each spot with a unique identifier. We combine the sample ID with the spot barcode ID to create this unique identifier.
  • sum_umi: this continuous variable contains the total number of counts for each sample prior to filtering any genes.
  • sum_gene: this continuous variable contains the number of genes that have at least 1 count.
## Add some information used by spatialLIBD
spe <- add_key(spe)
spe$sum_umi <- colSums(counts(spe))
spe$sum_gene <- colSums(counts(spe) > 0)

3.1 Add gene annotation information

The files SpatialExperiment::read10xVisium() uses to read in the spaceranger outputs into R do not include much information about the genes, such as their chromosomes, coordinates, and other gene annotation information. We thus recommend that you read in this information from a gene annotation file: typically a gtf file. For a real case scenario, you’ll mostly likely have access to the GTF file provided by 10x Genomics. However, we cannot download that file without downloading other files for this example. Thus we’ll show you the code you would use if you had access to the GTF file from 10x Genomics and also show a second approach that works for this vignette.

## Initially we don't have much information about the genes
rowRanges(spe)
#> GRangesList object of length 36601:
#> $ENSG00000243485
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: no sequences
#> 
#> $ENSG00000237613
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: no sequences
#> 
#> $ENSG00000186092
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: no sequences
#> 
#> ...
#> <36598 more elements>

3.1.1 From 10x

Depending on the version of spaceranger you used, you might have used different GTF files 10x Genomics has made available at https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest and described at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build. These files are too big though and we won’t download them in this example. For instance, References - 2020-A (July 7, 2020) for Human reference (GRCh38) is 11 GB in size and contains files we do not need for this vignette. If you did have the file locally, you could use the following code to read in the GTF file prepared by 10x Genomics and add the information into your spe object that SpatialExperiment::read10xVisium() does not include.

For example, in our computing cluster this GTF file is located at the following path and is 1.4 GB in size:

$ cd /dcs04/lieber/lcolladotor/annotationFiles_LIBD001/10x/refdata-gex-GRCh38-2020-A
$ du -sh --apparent-size genes/genes.gtf
1.4G    genes/genes.gtf

If you have the GTF file from 10x Genomics, we show next how you can read the information into R, match it appropriately with the information in the spe object and add it back into the spe object.

## You could:
## * download the 11 GB file from
## https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
## * decompress it

## Read in the gene information from the annotation GTF file provided by 10x
gtf <-
    rtracklayer::import(
        "/path/to/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
    )

## Subject to genes only
gtf <- gtf[gtf$type == "gene"]

## Set the names to be the gene IDs
names(gtf) <- gtf$gene_id

## Match the genes
match_genes <- match(rownames(spe), gtf$gene_id)

## They should all be present if you are using the correct GTF file from 10x
stopifnot(all(!is.na(match_genes)))

## Keep only some columns from the gtf (you could keep all of them if you want)
mcols(gtf) <-
    mcols(gtf)[, c(
        "source",
        "type",
        "gene_id",
        "gene_version",
        "gene_name",
        "gene_type"
    )]

## Add the gene info to our SPE object
rowRanges(spe) <- gtf[match_genes]

## Inspect the gene annotation data we added
rowRanges(spe)

3.1.2 From Gencode

In this vignette, we’ll use the GTF file from Gencode v32. That’s because the build notes from References - 2020-A (July 7, 2020) and Human reference, GRCh38 (GENCODE v32/Ensembl 98) at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A show that 10x Genomics used Gencode v32. They also used Ensembl version 98 which is why a few genes we have in our object are going to be missing. We show next how you can read the information into R, match it appropriately with the information in the spe object and add it back into the spe object.

## Download the Gencode v32 GTF file and cache it
gtf_cache <- BiocFileCache::bfcrpath(
    bfc,
    paste0(
        "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/",
        "release_32/gencode.v32.annotation.gtf.gz"
    )
)

## Show the GTF cache location
gtf_cache
#>                                                                                 BFC39 
#> "/home/biocbuild/.cache/R/BiocFileCache/24ef9b1c465177_gencode.v32.annotation.gtf.gz"

## Import into R (takes ~1 min)
gtf <- rtracklayer::import(gtf_cache)

## Subset to genes only
gtf <- gtf[gtf$type == "gene"]

## Remove the .x part of the gene IDs
gtf$gene_id <- gsub("\\..*", "", gtf$gene_id)

## Set the names to be the gene IDs
names(gtf) <- gtf$gene_id

## Match the genes
match_genes <- match(rownames(spe), gtf$gene_id)
table(is.na(match_genes))
#> 
#> FALSE  TRUE 
#> 36572    29

## Drop the few genes for which we don't have information
spe <- spe[!is.na(match_genes), ]
match_genes <- match_genes[!is.na(match_genes)]

## Keep only some columns from the gtf
mcols(gtf) <- mcols(gtf)[, c("source", "type", "gene_id", "gene_name", "gene_type")]

## Add the gene info to our SPE object
rowRanges(spe) <- gtf[match_genes]

## Inspect the gene annotation data we added
rowRanges(spe)
#> GRanges object with 36572 ranges and 5 metadata columns:
#>                   seqnames       ranges strand |   source     type
#>                      <Rle>    <IRanges>  <Rle> | <factor> <factor>
#>   ENSG00000243485     chr1  29554-31109      + |   HAVANA     gene
#>   ENSG00000237613     chr1  34554-36081      - |   HAVANA     gene
#>   ENSG00000186092     chr1  65419-71585      + |   HAVANA     gene
#>   ENSG00000238009     chr1 89295-133723      - |   HAVANA     gene
#>   ENSG00000239945     chr1  89551-91105      - |   HAVANA     gene
#>               ...      ...          ...    ... .      ...      ...
#>   ENSG00000212907     chrM  10470-10766      + |  ENSEMBL     gene
#>   ENSG00000198886     chrM  10760-12137      + |  ENSEMBL     gene
#>   ENSG00000198786     chrM  12337-14148      + |  ENSEMBL     gene
#>   ENSG00000198695     chrM  14149-14673      - |  ENSEMBL     gene
#>   ENSG00000198727     chrM  14747-15887      + |  ENSEMBL     gene
#>                           gene_id   gene_name      gene_type
#>                       <character> <character>    <character>
#>   ENSG00000243485 ENSG00000243485 MIR1302-2HG         lncRNA
#>   ENSG00000237613 ENSG00000237613     FAM138A         lncRNA
#>   ENSG00000186092 ENSG00000186092       OR4F5 protein_coding
#>   ENSG00000238009 ENSG00000238009  AL627309.1         lncRNA
#>   ENSG00000239945 ENSG00000239945  AL627309.3         lncRNA
#>               ...             ...         ...            ...
#>   ENSG00000212907 ENSG00000212907     MT-ND4L protein_coding
#>   ENSG00000198886 ENSG00000198886      MT-ND4 protein_coding
#>   ENSG00000198786 ENSG00000198786      MT-ND5 protein_coding
#>   ENSG00000198695 ENSG00000198695      MT-ND6 protein_coding
#>   ENSG00000198727 ENSG00000198727      MT-CYB protein_coding
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome; no seqlengths

3.2 Filter the spe object

We can now continue with some filtering steps since this can help reduce the object size in memory as well as make it ready to use for downstream processing tools such as those from the scran and scuttle packages. Though these steps are not absolutely necessary.

## Remove genes with no data
no_expr <- which(rowSums(counts(spe)) == 0)

## Number of genes with no counts
length(no_expr)
#> [1] 11397

## Compute the percent of genes with no counts
length(no_expr) / nrow(spe) * 100
#> [1] 31.16318
spe <- spe[-no_expr, , drop = FALSE]

## Remove spots without counts
summary(spe$sum_umi)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      23   15917   20239   20738   25252   54931

## If we had spots with no counts, we would remove them
if (any(spe$sum_umi == 0)) {
    spots_no_counts <- which(spe$sum_umi == 0)
    ## Number of spots with no counts
    print(length(spots_no_counts))
    ## Percent of spots with no counts
    print(length(spots_no_counts) / ncol(spe) * 100)
    spe <- spe[, -spots_no_counts, drop = FALSE]
}

3.3 Check object

Next, we add the ManualAnnotation variable to the sample information table (colData()) with "NA". That variable is used by the interactive website to store any manual annotations.

## Add a variable for saving the manual annotations
spe$ManualAnnotation <- "NA"

Finally, we can now check the final object using spatialLIBD::check_spe(). This is a helper function that will warn us if some important element is missing in spe that we use later for the interactive website. If it all goes well, it will return the original spe object.

## Check the final dimensions and object size
dim(spe)
#> [1] 25175  4035
lobstr::obj_size(spe) / 1024^2 ## Convert to MB
#> 283.86 B

## Run check_spe() function
check_spe(spe)
#> class: SpatialExperiment 
#> dim: 25175 4035 
#> metadata(0):
#> assays(1): counts
#> rownames(25175): ENSG00000238009 ENSG00000241860 ... ENSG00000198695
#>   ENSG00000198727
#> rowData names(6): source type ... gene_type gene_search
#> colnames(4035): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
#>   TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
#> colData names(10): in_tissue array_row ... expr_chrM_ratio
#>   ManualAnnotation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor

4 Explore the data

With our complete spe object, we can now use spatialLIBD for visualizing our data. We can do so using functions such as vis_gene() and vis_clus() that are described in more detail at the Introduction to spatialLIBD vignette available through GitHub or Bioconductor.

## Example visualizations. Let's start with a continuous variable.
spatialLIBD::vis_gene(
    spe = spe,
    sampleid = "lymph",
    geneid = "sum_umi",
    assayname = "counts"
)