aggregateReference {SingleR} | R Documentation |
Aggregate reference samples for a given label by averaging their count profiles. This can be done with varying degrees of resolution to preserve the within-label heterogeneity.
aggregateReference(
  ref,
  labels,
  power = 0.5,
  assay.type = "logcounts",
  check.missing = TRUE,
  BPPARAM = SerialParam()
)
ref |
A numeric matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix. |
labels |
A character vector or factor of known labels for all cells in ref. |
power |
Numeric scalar between 0 and 1 indicating how much aggregation should be performed, see Details. |
assay.type |
An integer scalar or string specifying the assay of ref containing the relevant expression matrix, if ref is a SummarizedExperiment object. |
check.missing |
Logical scalar indicating whether rows should be checked for missing values (and if found, removed). |
BPPARAM |
A BiocParallelParam object indicating how parallelization should be performed.
Only used if |
With single-cell reference datasets, it is often useful to aggregate individual cells into pseudo-bulk samples to serve as a reference.
This improves speed (and to some extent, reduces noise) in downstream assignment with classifySingleR or SingleR.
The most obvious aggregation is to simply average all counts for all cells in a label to obtain a single pseudo-bulk profile.
This can be achieved by setting power=0.
However, this discards information about the within-label heterogeneity (e.g., the “shape” and spread of the population in expression space)
that may be informative for assignment, especially for closely related labels.
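The single-profile case can be illustrated with a minimal base R sketch. It is not the package's implementation; the toy matrix and the object name pseudo are made up for demonstration, and the code only shows what averaging all cells within each label produces.

```r
# Toy log-expression matrix: 3 genes (rows) x 6 cells (columns).
# matrix() fills column by column, so each triple below is one cell.
ref <- matrix(
    c(1, 2, 3,
      3, 2, 1,
      2, 2, 2,
      5, 5, 5,
      7, 5, 3,
      6, 5, 4),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
labels <- c("A", "A", "A", "B", "B", "B")

# Average the columns belonging to each label: one profile per label.
pseudo <- vapply(
    split(seq_len(ncol(ref)), labels),
    function(idx) rowMeans(ref[, idx, drop = FALSE]),
    numeric(nrow(ref))
)

# 'pseudo' is a genes x labels matrix with columns "A" and "B".
pseudo
```

Each label collapses to a single column, so any difference between the cells of a label is no longer visible in the result.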
Instead, the default approach in this function is to create a series of pseudo-bulk samples to represent each label.
This is achieved by performing vector quantization using k-means clustering on all cells in a particular label.
Cells in each cluster are subsequently averaged to create one pseudo-bulk sample.
We set the number of clusters to be ncol(ref)^power, where ncol(ref) counts only the cells assigned to that label, so that labels with more cells have more resolved representatives.
If power=1, no aggregation is performed.
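The per-label clustering described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the function's actual code: aggregate_label is a hypothetical helper, the number of clusters is rounded down (the package's rounding rule may differ), and kmeans() from the stats package stands in for whatever clustering routine is used internally.

```r
set.seed(100)

# Toy reference: 5 genes x 50 cells, with two labels of unequal size.
ref <- matrix(rnorm(5 * 50), nrow = 5,
    dimnames = list(paste0("gene", 1:5), NULL))
labels <- rep(c("A", "B"), c(36, 14))
power <- 0.5

# Hypothetical helper: aggregate the cells of one label into centroids.
aggregate_label <- function(x, power) {
    # Assumed rounding rule; the package may round differently.
    k <- max(1L, as.integer(floor(ncol(x)^power)))
    if (k >= ncol(x)) {
        return(x) # as many clusters as cells: nothing to aggregate
    }
    if (k == 1L) {
        # A single cluster is just the average of all cells in the label.
        return(matrix(rowMeans(x), ncol = 1,
            dimnames = list(rownames(x), NULL)))
    }
    # kmeans() clusters rows, so cells are transposed into rows.
    km <- kmeans(t(x), centers = k)
    # Each centroid is the per-gene mean of the cells in that cluster.
    t(km$centers)
}

by.label <- split(seq_len(ncol(ref)), labels)
aggregated <- lapply(by.label, function(idx) {
    aggregate_label(ref[, idx, drop = FALSE], power)
})

# Pseudo-bulk samples per label: 6 for A (36^0.5), 3 for B (floor(14^0.5)).
vapply(aggregated, ncol, integer(1))
```

In this sketch, label A (36 cells) is represented by twice as many pseudo-bulk samples as label B (14 cells), which is the sense in which larger labels get more resolved representatives.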
We use the average rather than the sum in order to be compatible with trainSingleR's internal marker detection.
Moreover, unlike counts, the sum of transformed and normalized expression values generally has little meaning.
We do not use the median to avoid consistently obtaining zeros for lowly expressed genes.
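A small made-up example shows why the median is a poor summary for lowly expressed genes. The vector below is hypothetical and stands for the log-expression of one gene across ten cells of the same label, most of which have no detected expression.

```r
# Hypothetical log-expression of one lowly expressed gene in ten cells.
lowly <- c(0, 0, 0, 1, 0, 2, 0, 0, 0, 1)

mean(lowly)   # 0.4: retains the signal from the three expressing cells
median(lowly) # 0: more than half of the cells are zero
sum(lowly)    # 4: depends on how many cells happen to be in the group
```

The mean stays on the same scale as the per-cell values regardless of group size, whereas the sum grows with the number of cells being combined.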
A SummarizedExperiment object with a "logcounts" assay containing a matrix of aggregated expression values, and a label column metadata field specifying the label corresponding to each column.
Aaron Lun
library(SingleR)
library(scater)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Making up some labels for demonstration purposes:
labels <- sample(LETTERS, ncol(sce), replace=TRUE)

# Aggregation at different resolutions:
(aggr <- aggregateReference(sce, labels, power=0.5))
(aggr <- aggregateReference(sce, labels, power=0))

# No aggregation:
(aggr <- aggregateReference(sce, labels, power=1))