spatialLIBD 1.10.1
spatialLIBD
R
is an open-source statistical environment which can be easily modified to enhance its functionality via packages. spatialLIBD (Pardo, Spangler, Weber, Hicks, Jaffe, Martinowich, Maynard, and Collado-Torres, 2022) is an R
package available via the Bioconductor repository for packages. R
can be installed on any operating system from CRAN after which you can install spatialLIBD by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("spatialLIBD")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
To run all the code in this vignette, you might need to install other R/Bioconductor packages, which you can do with:
BiocManager::install("spatialLIBD", dependencies = TRUE, force = TRUE)
If you want to use the development version of spatialLIBD
, you will need to use the R version corresponding to the current Bioconductor-devel branch as described in more detail on the Bioconductor website. Then you can install spatialLIBD
from GitHub using the following command.
BiocManager::install("LieberInstitute/spatialLIBD")
Please first check the Introduction to spatialLIBD vignette available through GitHub or Bioconductor.
spatialLIBD
We hope that spatialLIBD will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation("spatialLIBD")
#>
#> To cite package 'spatialLIBD' in publications use:
#>
#> Pardo B, Spangler A, Weber LM, Hicks SC, Jaffe AE, Martinowich K,
#> Maynard KR, Collado-Torres L (2022). "spatialLIBD: an R/Bioconductor
#> package to visualize spatially-resolved transcriptomics data." _BMC
#> Genomics_. doi:10.1186/s12864-022-08601-w
#> <https://doi.org/10.1186/s12864-022-08601-w>,
#> <https://doi.org/10.1186/s12864-022-08601-w>.
#>
#> Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK,
#> Williams SR, II JLC, Tran MN, Besich Z, Tippani M, Chew J, Yin Y,
#> Kleinman JE, Hyde TM, Rao N, Hicks SC, Martinowich K, Jaffe AE
#> (2021). "Transcriptome-scale spatial gene expression in the human
#> dorsolateral prefrontal cortex." _Nature Neuroscience_.
#> doi:10.1038/s41593-020-00787-0
#> <https://doi.org/10.1038/s41593-020-00787-0>,
#> <https://www.nature.com/articles/s41593-020-00787-0>.
#>
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.
In this vignette we’ll show you how you can use spatialLIBD (Pardo, Spangler, Weber et al., 2022) for exploring spatially resolved transcriptomics data from the Visium platform by 10x Genomics. That is, you will learn how to use spatialLIBD for data beyond the one it was initially developed for (Maynard, Collado-Torres, Weber, Uytingco, Barry, Williams, II, Tran, Besich, Tippani, Chew, Yin, Kleinman, Hyde, Rao, Hicks, Martinowich, and Jaffe, 2021). To illustrate these steps, we will use data that 10x Genomics made publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets. We will use files from the human lymph node example publicly available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node.
To get started, lets load the different packages we’ll need for this vignette. Here’s a brief summary of why we need these packages:
spaceranger
files provided by 10x Genomics## Load packages in the order we'll use them next
library("BiocFileCache")
library("SpatialExperiment")
library("rtracklayer")
library("lobstr")
library("spatialLIBD")
Next we download data from 10x Genomics available from the human lymph node example, available at https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Lymph_Node. We don’t need to download all the files listed there since SpatialExperiment doesn’t need all of them for importing the data into R. These files are part of the output that gets generated by spaceranger
which is the processing pipeline provided by 10x Genomics for Visium data.
We’ll use BiocFileCache to keep the data in a local cache in case we want to run this example again and don’t want to re-download the data from the web.
## Download and save a local cache of the data provided by 10x Genomics
bfc <- BiocFileCache::BiocFileCache()
lymph.url <-
paste0(
"https://cf.10xgenomics.com/samples/spatial-exp/",
"1.1.0/V1_Human_Lymph_Node/",
c(
"V1_Human_Lymph_Node_filtered_feature_bc_matrix.tar.gz",
"V1_Human_Lymph_Node_spatial.tar.gz",
"V1_Human_Lymph_Node_analysis.tar.gz"
)
)
lymph.data <- sapply(lymph.url, BiocFileCache::bfcrpath, x = bfc)
10x Genomics provides the files in compressed tarballs (.tar.gz
file extension). Which is why we’ll need to use utils::untar()
to decompress the files. This will create new directories and we will use list.files()
to see what files these directories contain.
## Extract the files to a temporary location
## (they'll be deleted once you close your R session)
xx <- sapply(lymph.data, utils::untar, exdir = file.path(tempdir(), "outs"))
## The names are the URLs, which are long and thus too wide to be shown here,
## so we shorten them to only show the file name prior to displaying the
## utils::untar() output status
names(xx) <- basename(names(xx))
xx
#> V1_Human_Lymph_Node_filtered_feature_bc_matrix.tar.gz.BFC37
#> 0
#> V1_Human_Lymph_Node_spatial.tar.gz.BFC38
#> 0
#> V1_Human_Lymph_Node_analysis.tar.gz.BFC111
#> 0
## List the files we downloaded and extracted
## These files are typically SpaceRanger outputs
lymph.dirs <- file.path(
tempdir(), "outs",
c("filtered_feature_bc_matrix", "spatial", "raw_feature_bc_matrix", "analysis")
)
list.files(lymph.dirs)
#> [1] "aligned_fiducials.jpg" "barcodes.tsv.gz"
#> [3] "clustering" "detected_tissue_image.jpg"
#> [5] "diffexp" "features.tsv.gz"
#> [7] "matrix.mtx.gz" "pca"
#> [9] "scalefactors_json.json" "tissue_hires_image.png"
#> [11] "tissue_lowres_image.png" "tissue_positions_list.csv"
#> [13] "tsne" "umap"
Now that we have the files that we need, we can import the data into R using read10xVisium()
from SpatialExperiment. We’ll import the low resolution histology images produced by spaceranger
using the images = "lowres"
and load = TRUE
arguments. We’ll also load the filtered gene expression data using the data = "filtered"
argument. The count matrix can still be quite large, which is why we’ll use the type = "sparse"
argument to load the data into an R object that is memory-efficient for sparse data.
## Import the data as a SpatialExperiment object
spe <- SpatialExperiment::read10xVisium(
samples = tempdir(),
sample_id = "lymph",
type = "sparse", data = "filtered",
images = "lowres", load = TRUE
)
#> as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
## Inspect the R object we just created: class, memory, and how it looks in
## general
class(spe)
#> [1] "SpatialExperiment"
#> attr(,"package")
#> [1] "SpatialExperiment"
lobstr::obj_size(spe) / 1024^2 ## Convert to MB
#> 281.90 B
spe
#> class: SpatialExperiment
#> dim: 36601 4035
#> metadata(0):
#> assays(1): counts
#> rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
#> ENSG00000277196
#> rowData names(1): symbol
#> colnames(4035): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
#> TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
#> colData names(4): in_tissue array_row array_col sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor
## The counts are saved in a sparse matrix R object
class(counts(spe))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Now that we have an SpatialExperiment
R object (spe
) with the data from 10x Genomics for the human lymph node example, we need to add a few features to the R object in order to create the interactive website using spatialLIBD::run_app()
. These additional elements power features in the interactive website that you might be interested in.
First we start with adding a few variables to the sample information table (colData()
) of our spe
object. We add:
key
: this labels each spot with a unique identifier. We combine the sample ID with the spot barcode ID to create this unique identifier.sum_umi
: this continuous variable contains the total number of counts for each sample prior to filtering any genes.sum_gene
: this continuous variable contains the number of genes that have at least 1 count.## Add some information used by spatialLIBD
spe <- add_key(spe)
spe$sum_umi <- colSums(counts(spe))
spe$sum_gene <- colSums(counts(spe) > 0)
The files SpatialExperiment::read10xVisium()
uses to read in the spaceranger
outputs into R do not include much information about the genes, such as their chromosomes, coordinates, and other gene annotation information. We thus recommend that you read in this information from a gene annotation file: typically a gtf
file. For a real case scenario, you’ll mostly likely have access to the GTF file provided by 10x Genomics. However, we cannot download that file without downloading other files for this example. Thus we’ll show you the code you would use if you had access to the GTF file from 10x Genomics and also show a second approach that works for this vignette.
## Initially we don't have much information about the genes
rowRanges(spe)
#> GRangesList object of length 36601:
#> $ENSG00000243485
#> GRanges object with 0 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> -------
#> seqinfo: no sequences
#>
#> $ENSG00000237613
#> GRanges object with 0 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> -------
#> seqinfo: no sequences
#>
#> $ENSG00000186092
#> GRanges object with 0 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> -------
#> seqinfo: no sequences
#>
#> ...
#> <36598 more elements>
Depending on the version of spaceranger
you used, you might have used different GTF files 10x Genomics has made available at https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest and described at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build. These files are too big though and we won’t download them in this example. For instance, References - 2020-A (July 7, 2020) for Human reference (GRCh38) is 11 GB in size and contains files we do not need for this vignette. If you did have the file locally, you could use the following code to read in the GTF file prepared by 10x Genomics and add the information into your spe
object that SpatialExperiment::read10xVisium()
does not include.
For example, in our computing cluster this GTF file is located at the following path and is 1.4 GB in size:
$ cd /dcs04/lieber/lcolladotor/annotationFiles_LIBD001/10x/refdata-gex-GRCh38-2020-A
$ du -sh --apparent-size genes/genes.gtf
1.4G genes/genes.gtf
If you have the GTF file from 10x Genomics, we show next how you can read the information into R, match it appropriately with the information in the spe
object and add it back into the spe
object.
## You could:
## * download the 11 GB file from
## https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
## * decompress it
## Read in the gene information from the annotation GTF file provided by 10x
gtf <-
rtracklayer::import(
"/path/to/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
)
## Subject to genes only
gtf <- gtf[gtf$type == "gene"]
## Set the names to be the gene IDs
names(gtf) <- gtf$gene_id
## Match the genes
match_genes <- match(rownames(spe), gtf$gene_id)
## They should all be present if you are using the correct GTF file from 10x
stopifnot(all(!is.na(match_genes)))
## Keep only some columns from the gtf (you could keep all of them if you want)
mcols(gtf) <-
mcols(gtf)[, c(
"source",
"type",
"gene_id",
"gene_version",
"gene_name",
"gene_type"
)]
## Add the gene info to our SPE object
rowRanges(spe) <- gtf[match_genes]
## Inspect the gene annotation data we added
rowRanges(spe)
In this vignette, we’ll use the GTF file from Gencode v32. That’s because the build notes from References - 2020-A (July 7, 2020) and Human reference, GRCh38 (GENCODE v32/Ensembl 98) at https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A show that 10x Genomics used Gencode v32. They also used Ensembl version 98 which is why a few genes we have in our object are going to be missing. We show next how you can read the information into R, match it appropriately with the information in the spe
object and add it back into the spe
object.
## Download the Gencode v32 GTF file and cache it
gtf_cache <- BiocFileCache::bfcrpath(
bfc,
paste0(
"ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/",
"release_32/gencode.v32.annotation.gtf.gz"
)
)
## Show the GTF cache location
gtf_cache
#> BFC39
#> "/home/biocbuild/.cache/R/BiocFileCache/24ef9b1c465177_gencode.v32.annotation.gtf.gz"
## Import into R (takes ~1 min)
gtf <- rtracklayer::import(gtf_cache)
## Subset to genes only
gtf <- gtf[gtf$type == "gene"]
## Remove the .x part of the gene IDs
gtf$gene_id <- gsub("\\..*", "", gtf$gene_id)
## Set the names to be the gene IDs
names(gtf) <- gtf$gene_id
## Match the genes
match_genes <- match(rownames(spe), gtf$gene_id)
table(is.na(match_genes))
#>
#> FALSE TRUE
#> 36572 29
## Drop the few genes for which we don't have information
spe <- spe[!is.na(match_genes), ]
match_genes <- match_genes[!is.na(match_genes)]
## Keep only some columns from the gtf
mcols(gtf) <- mcols(gtf)[, c("source", "type", "gene_id", "gene_name", "gene_type")]
## Add the gene info to our SPE object
rowRanges(spe) <- gtf[match_genes]
## Inspect the gene annotation data we added
rowRanges(spe)
#> GRanges object with 36572 ranges and 5 metadata columns:
#> seqnames ranges strand | source type
#> <Rle> <IRanges> <Rle> | <factor> <factor>
#> ENSG00000243485 chr1 29554-31109 + | HAVANA gene
#> ENSG00000237613 chr1 34554-36081 - | HAVANA gene
#> ENSG00000186092 chr1 65419-71585 + | HAVANA gene
#> ENSG00000238009 chr1 89295-133723 - | HAVANA gene
#> ENSG00000239945 chr1 89551-91105 - | HAVANA gene
#> ... ... ... ... . ... ...
#> ENSG00000212907 chrM 10470-10766 + | ENSEMBL gene
#> ENSG00000198886 chrM 10760-12137 + | ENSEMBL gene
#> ENSG00000198786 chrM 12337-14148 + | ENSEMBL gene
#> ENSG00000198695 chrM 14149-14673 - | ENSEMBL gene
#> ENSG00000198727 chrM 14747-15887 + | ENSEMBL gene
#> gene_id gene_name gene_type
#> <character> <character> <character>
#> ENSG00000243485 ENSG00000243485 MIR1302-2HG lncRNA
#> ENSG00000237613 ENSG00000237613 FAM138A lncRNA
#> ENSG00000186092 ENSG00000186092 OR4F5 protein_coding
#> ENSG00000238009 ENSG00000238009 AL627309.1 lncRNA
#> ENSG00000239945 ENSG00000239945 AL627309.3 lncRNA
#> ... ... ... ...
#> ENSG00000212907 ENSG00000212907 MT-ND4L protein_coding
#> ENSG00000198886 ENSG00000198886 MT-ND4 protein_coding
#> ENSG00000198786 ENSG00000198786 MT-ND5 protein_coding
#> ENSG00000198695 ENSG00000198695 MT-ND6 protein_coding
#> ENSG00000198727 ENSG00000198727 MT-CYB protein_coding
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
Regardless of which method you used to obtain the gene annotation information, we can now proceed by adding the gene symbol and gene ID information that helps users search for genes in the shiny app produced by spatialLIBD
. This will enable users to search genes by gene symbol or gene ID. If you didn’t do this, users would only be able to search genes by gene ID which makes the web application harder to use.
We also compute the total expression for the mitochondrial chromosome (chrM) as well as the ratio of chrM expression. Both of these continuous variables are interesting to explore and in some situations could be useful for biological interpretations. For instance, in our pilot data (Maynard, Collado-Torres, Weber et al., 2021), we noticed that the expr_chrM_ratio
was associated to DLPFC layers. That is, spots with high expr_chrM_ratio
were not randomly located in our Visium slides.
## Add information used by spatialLIBD
rowData(spe)$gene_search <- paste0(
rowData(spe)$gene_name, "; ", rowData(spe)$gene_id
)
## Compute chrM expression and chrM expression ratio
is_mito <- which(seqnames(spe) == "chrM")
spe$expr_chrM <- colSums(counts(spe)[is_mito, , drop = FALSE])
spe$expr_chrM_ratio <- spe$expr_chrM / spe$sum_umi
We can now continue with some filtering steps since this can help reduce the object size in memory as well as make it ready to use for downstream processing tools such as those from the scran and scuttle packages. Though these steps are not absolutely necessary.
## Remove genes with no data
no_expr <- which(rowSums(counts(spe)) == 0)
## Number of genes with no counts
length(no_expr)
#> [1] 11397
## Compute the percent of genes with no counts
length(no_expr) / nrow(spe) * 100
#> [1] 31.16318
spe <- spe[-no_expr, , drop = FALSE]
## Remove spots without counts
summary(spe$sum_umi)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 23 15917 20239 20738 25252 54931
## If we had spots with no counts, we would remove them
if (any(spe$sum_umi == 0)) {
spots_no_counts <- which(spe$sum_umi == 0)
## Number of spots with no counts
print(length(spots_no_counts))
## Percent of spots with no counts
print(length(spots_no_counts) / ncol(spe) * 100)
spe <- spe[, -spots_no_counts, drop = FALSE]
}
Next, we add the ManualAnnotation
variable to the sample information table (colData()
) with "NA"
. That variable is used by the interactive website to store any manual annotations.
## Add a variable for saving the manual annotations
spe$ManualAnnotation <- "NA"
Finally, we can now check the final object using spatialLIBD::check_spe()
. This is a helper function that will warn us if some important element is missing in spe
that we use later for the interactive website. If it all goes well, it will return the original spe
object.
## Check the final dimensions and object size
dim(spe)
#> [1] 25175 4035
lobstr::obj_size(spe) / 1024^2 ## Convert to MB
#> 283.86 B
## Run check_spe() function
check_spe(spe)
#> class: SpatialExperiment
#> dim: 25175 4035
#> metadata(0):
#> assays(1): counts
#> rownames(25175): ENSG00000238009 ENSG00000241860 ... ENSG00000198695
#> ENSG00000198727
#> rowData names(6): source type ... gene_type gene_search
#> colnames(4035): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
#> TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
#> colData names(10): in_tissue array_row ... expr_chrM_ratio
#> ManualAnnotation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor
With our complete spe
object, we can now use spatialLIBD for visualizing our data. We can do so using functions such as vis_gene()
and vis_clus()
that are described in more detail at the Introduction to spatialLIBD vignette available through GitHub or Bioconductor.
## Example visualizations. Let's start with a continuous variable.
spatialLIBD::vis_gene(
spe = spe,
sampleid = "lymph",
geneid = "sum_umi",
assayname = "counts"
)