denoisePCA {scran}R Documentation

Denoise expression with PCA

Description

Denoise log-expression data by removing principal components corresponding to technical noise.

Usage

getDenoisedPCs(x, ...)

## S4 method for signature 'ANY'
getDenoisedPCs(x, technical, subset.row = NULL,
  min.rank = 5, max.rank = 50, fill.missing = FALSE,
  BSPARAM = bsparam(), BPPARAM = SerialParam())

## S4 method for signature 'SingleCellExperiment'
getDenoisedPCs(x, ...,
  assay.type = "logcounts")

denoisePCA(x, ..., subset.row = NULL, value = c("pca", "lowrank"),
  assay.type = "logcounts", get.spikes = FALSE, sce.out = TRUE)

denoisePCANumber(var.exp, var.tech, var.total)

Arguments

x

A numeric matrix of log-expression values.

Alternatively, a SingleCellExperiment object containing such values.

...

For the getDenoisedPCs generic, further arguments to pass to specific methods. For the SingleCellExperiment method, further arguments to pass to the ANY method.

For the denoisePCA function, further arguments to pass to the getDenoisedPCs function.

technical

An object containing the technical components of variation for each gene in x. This can be:

  • a function that computes the technical component of the variance for a gene with a given mean log-expression, as generated by fitTrendVar.

  • a numeric vector of length equal to the number of rows in x, containing the technical component for each gene.

  • a DataFrame of variance decomposition results generated by modelGeneVar or related functions.

subset.row

See ?"scran-gene-selection".

min.rank, max.rank

Integer scalars specifying the minimum and maximum number of PCs to retain.

fill.missing

Logical scalar indicating whether entries in the rotation matrix should be imputed for genes that were not used in the PCA.

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA.

BPPARAM

A BiocParallelParam object to use for parallel processing.

assay.type

A string specifying which assay values to use.

value

String specifying the type of value to return. "pca" will return the PCs, "n" will return the number of retained components, and "lowrank" will return a low-rank approximation.

get.spikes

See ?"scran-gene-selection".

sce.out

Deprecated, a logical scalar specifying whether a modified SingleCellExperiment object should be returned.

var.exp

A numeric vector of the variances explained by successive PCs, starting from the first (but not necessarily containing all PCs).

var.tech

A numeric scalar containing the variance attributable to technical noise.

var.total

A numeric scalar containing the total variance in the data.

Details

This function performs a principal components analysis to eliminate random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological processes should be captured by earlier PCs as more variance can be explained by the correlated behavior of sets of genes in a particular pathway. The idea is to discard later PCs to remove noise and improve resolution of population structure. This also has the benefit of reducing computational work for downstream steps.

The choice of the number of PCs to discard is based on the estimates of technical variance in technical. This argument accepts a number of different values, depending on how the technical noise is calculated. The percentage of variance explained by technical noise is estimated by summing the technical components across genes and dividing by the summed total variance. Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the overall technical estimate.

Now, consider the retention of the first X PCs. For a given value of X, we compute the variance explained by all of the later PCs. We aim to find the largest value of X such that the sum of variances explained by the later PCs is still less than the variance attributable to technical noise. This X represents a lower bound on the number of PCs that can be retained before biological variation is definitely lost. We use this value to obtain a “reasonable” dimensionality for the PCA output.

Note that X will be coerced to lie between min.rank and max.rank. This mitigates the effect of occasional extreme results when the percentage of noise is very high or low.

Value

For getDenoisedPCs, a list is returned containing:

denoisePCA will return a modified x with:

denoisePCANumber will return an integer scalar specifying the number of PCs to retain. This is equivalent to the output from getDenoisedPCs after setting value="n", but ignoring any setting of min.rank or max.rank.

Effects of gene selection

We can use subset.row to perform the PCA on a subset of genes, e.g., HVGs. This can be used to only perform the PCA on genes with the largest biological components, thus increasing the signal-to-noise ratio of downstream analyses. Note that only rows with positive components are actually used in the PCA, even if we explicitly specified them in subset.row.

(It naturally follows that, if our HVG selection strategy involves picking all genes with positive biological components, there is no need to set subset.row explicitly in this function, because that selection would be redundant with the filtering performed internally by getDenoisedPCs.)

If fill.missing=TRUE, entries of the rotation matrix are imputed for all genes in x. This includes “unselected” genes, i.e., with negative biological components or that were not selected with subset.row. Rotation vectors are extrapolated to these genes by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes. This is useful for guaranteeing that any low-rank approximation has the same dimensions as the input x. For example, denoisePCA will only ever use fill.missing=TRUE when value="lowrank".

Caveats with interpretation

In reality, the choice of X will only be optimal if the early PCs capture all the biological variation with minimal noise. This is unlikely to be true as the PCA cannot distinguish between technical noise and weak biological signal in the later PCs. Thus, from a mathematical perspective, X will usually be underestimated if our aim was to retain all signal.

On the other hand, many aspects of biological variation are not that interesting in most applications (e.g., transcriptional bursting, metabolic fluctuations). It is often the case that we do not actually need to retain all signal, in which case X is likely a gross overestimate in the context of the wider analysis. This can be mitigated by using modelGeneVar rather than modelGeneVarWithSpikes, as the former attempts to remove “uninteresting” biological variation; and by using a more stringent HVG subset in subset.row, to focus on the stronger aspects of biological variation.

Author(s)

Aaron Lun

References

Lun ATL (2018). Discussion of PC selection methods for scRNA-seq data. https://github.com/LTLA/PCSelection2018

See Also

fitTrendVar and modelGeneVar, for methods of computing technical components.

runSVD, for the underlying SVD algorithm(s).

Examples

library(scater)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Modelling the variance:
var.stats <- modelGeneVar(sce)

# Denoising:
pcs <- getDenoisedPCs(sce, technical=var.stats)
head(pcs$components)
head(pcs$rotation)
head(pcs$percent.var)

# Automatically storing the results.
sce <- denoisePCA(sce, technical=var.stats)
reducedDimNames(sce)

[Package scran version 1.14.6 Index]