Denoise with PCA {scran}    R Documentation

Denoise expression with PCA

Description

Denoise log-expression data by removing principal components corresponding to technical noise.

Usage

## S4 method for signature 'ANY'
denoisePCA(x, technical, design=NULL, subset.row=NULL,
    value=c("pca", "n", "lowrank"), min.rank=5, max.rank=100, 
    preserve.dim=FALSE, approximate=FALSE, rand.seed=1000)

## S4 method for signature 'SingleCellExperiment'
denoisePCA(x, ..., subset.row=NULL, 
    value=c("pca", "n", "lowrank"), assay.type="logcounts", 
    get.spikes=FALSE, sce.out=TRUE)

Arguments

x

A numeric matrix of log-expression values for denoisePCA,ANY-method, or a SingleCellExperiment object containing such values for denoisePCA,SingleCellExperiment-method.

technical

A function that computes the technical component of the variance for a gene with a given mean (log-)expression, see ?trendVar.

design

A numeric matrix containing the experimental design. If NULL, all cells are assumed to belong to a single group.

subset.row

A logical, integer or character vector indicating the rows of x to use. All genes are used by default.

value

A string specifying the type of value to return: the PCs ("pca"), the number of retained components ("n"), or a low-rank approximation ("lowrank").

min.rank, max.rank

Integer scalars specifying the minimum and maximum number of PCs to retain.

preserve.dim

A logical scalar indicating whether the dimensions of the output should be preserved when subset.row is not NULL and value="lowrank".

approximate

A logical scalar indicating whether approximate SVD should be performed via irlba.

rand.seed

A numeric scalar specifying the seed for approximate PCA when approximate=TRUE. This can be set to NA to use the existing session seed.

...

Further arguments to pass to denoisePCA,ANY-method.

assay.type

A string specifying which assay values to use.

get.spikes

A logical scalar specifying whether spike-in transcripts should be used. The resulting selection of genes is intersected with subset.row if the latter is specified.

sce.out

A logical scalar specifying whether a modified SingleCellExperiment object should be returned.

Details

This function performs a principal components analysis to reduce random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological substructure involves correlated changes across sets of genes and should be captured by earlier PCs, as such coordinated changes explain more variance. The idea is to discard later PCs to remove technical noise and improve the resolution of substructure.
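
The reasoning above can be illustrated with a small simulation. This is a minimal sketch in base R, not scran code; the simulated matrix, the two-group structure and all variable names are assumptions made for illustration.

```r
# Hypothetical simulation: correlated signal versus uncorrelated noise.
set.seed(100)
ngenes <- 200
ncells <- 100

# Two groups of cells; the first 50 genes share a group effect (correlated signal).
group <- rep(c(-1, 1), each = ncells / 2)
signal <- matrix(0, nrow = ngenes, ncol = ncells)
signal[1:50, ] <- matrix(group * 2, nrow = 50, ncol = ncells, byrow = TRUE)

# Independent unit-variance noise for every gene and cell.
noise <- matrix(rnorm(ngenes * ncells), nrow = ngenes)
y <- signal + noise

# prcomp() expects observations (cells) in rows, so transpose.
pc <- prcomp(t(y))
var.explained <- pc$sdev^2

# The first PC absorbs the shared signal; each later PC explains far less.
head(round(var.explained, 1))
```

In this setup the first PC should separate the two groups, while the remaining PCs are dominated by noise.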

The choice of the number of PCs to discard is based on the estimates of technical variance in technical. This uses the trend function obtained from trendVar to compute the technical component for each gene, based on its mean abundance. The overall technical variance is estimated by summing the values across genes. Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the technical variance.
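
The summation described above might look roughly like the following. This is a sketch under stated assumptions: the trend function here is a made-up stand-in for the function returned by trendVar, and the per-gene means and variances are simulated rather than computed from real data.

```r
# Hypothetical inputs: per-gene means and a stand-in trend function.
set.seed(42)
ngenes <- 500
gene.means <- runif(ngenes, 0, 10)

# Assumed shape for the mean-variance trend; a real trend comes from trendVar().
trend <- function(mu) 0.5 + 2 / (1 + mu)

# Technical component for each gene, based on its mean abundance.
tech.var <- trend(gene.means)

# Simulated total variances; some biological components come out negative.
total.var <- tech.var + rnorm(ngenes, mean = 0.2, sd = 0.2)
bio.var <- total.var - tech.var

# Ignore genes with negative biological components, then sum across genes.
keep <- bio.var > 0
total.tech <- sum(tech.var[keep])
total.all <- sum(total.var[keep])

c(technical = total.tech, total = total.all)
```

Restricting both sums to the retained genes guarantees that the total variance exceeds the technical variance.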

The function works by assuming that the first X PCs contain all of the biological signal, while the remainder contains technical noise. For a given value of X, the technical variance implied by that choice is calculated as the sum of the variance explained by all of the later PCs. A value of X is found such that this implied technical variance equals the technical variance estimated from technical. Note that X will be coerced to lie between min.rank and max.rank.
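
The search for X can be sketched as follows. The helper choose_rank is hypothetical and is not scran's internal code. Exact equality is rarely achievable with a discrete number of PCs, so this sketch assumes that the smallest X whose later PCs explain no more than the technical variance is an acceptable stand-in.

```r
# Hypothetical helper, written for illustration only.
choose_rank <- function(var.explained, total.tech, min.rank = 5, max.rank = 100) {
    npcs <- length(var.explained)
    # later.var[X + 1] is the variance explained by PCs X+1, ..., npcs,
    # for X = 0, ..., npcs (the final entry is zero).
    later.var <- c(rev(cumsum(rev(var.explained))), 0)
    # Smallest X whose later PCs explain no more than the technical variance.
    X <- which(later.var <= total.tech)[1] - 1L
    # Coerce into [min.rank, max.rank], never exceeding the number of PCs.
    X <- max(X, min.rank)
    X <- min(X, max.rank, npcs)
    as.integer(X)
}

# Made-up variances explained by eight PCs, with a technical variance of 8.
v <- c(50, 20, 5, 3, 2, 1, 1, 1)
choose_rank(v, total.tech = 8, min.rank = 1)  # rank chosen from the data
choose_rank(v, total.tech = 8)                # default min.rank takes over
```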

Only the first X PCs are reported if value="pca". If value="n", the value of X is directly reported, which avoids computing the PCs if only the rank is desired. If value="lowrank", a low-rank approximation of the original matrix is computed using only the first X components. This is useful for denoising prior to downstream applications that expect gene-wise expression profiles.

If value="lowrank", genes with negative biological components are still reported but are assigned expression values of zero for all cells. If subset.row is not NULL, genes that are not in the selected set are removed by default. However, if preserve.dim=TRUE, the unselected genes will still be reported and are assigned all-zero expression profiles.
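
A low-rank approximation from the first X components can be sketched with base R's svd(). This is an illustration, not scran's implementation: the input matrix is simulated, X is assumed to have been chosen already, and adding the gene means back to the reconstruction is an assumption of this sketch.

```r
# Hypothetical data: genes in rows, cells in columns.
set.seed(1)
ngenes <- 50
ncells <- 20
y <- matrix(rnorm(ngenes * ncells), nrow = ngenes)

X <- 5  # number of retained components (assumed already chosen)

# Centre each gene, as PCA operates on centred data.
gene.means <- rowMeans(y)
centred <- y - gene.means

# SVD of the cells-by-genes matrix, keeping X left and right singular vectors.
s <- svd(t(centred), nu = X, nv = X)

# PC scores for each cell (cells x X).
pcs <- s$u %*% diag(s$d[seq_len(X)], nrow = X)

# Rank-X reconstruction on the original scale (genes x cells).
lowrank <- s$v %*% t(pcs) + gene.means
dim(lowrank)
```

The result has one row per gene, which is why this form suits downstream applications that expect gene-wise expression profiles.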

If design is specified, the residuals of a linear model fitted to each gene are computed. Because variances computed from residuals are usually underestimated, the residuals are scaled up so that their variance is equal to the residual variance of the model fit. This ensures that the sum of variances is not understated, which would lead to more PCs being discarded than appropriate.
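
The rescaling of residuals can be sketched for a single gene as follows. This is a minimal base R illustration with simulated data; the scaling factor shown is an assumption derived from the description above, and scran's internal code may differ.

```r
# Hypothetical single-gene example with a two-group design.
set.seed(7)
ncells <- 30
design <- model.matrix(~factor(rep(0:1, length.out = ncells)))
expr <- rnorm(ncells, mean = 5) + design[, 2]

# Residuals after regressing out the design.
fit <- lm.fit(design, expr)
res <- fit$residuals
resid.df <- fit$df.residual  # n - p residual degrees of freedom

# var() divides by n - 1, which understates the residual variance.
naive.var <- var(res)
resid.var <- sum(res^2) / resid.df

# Scale the residuals so that their sample variance equals the residual variance.
scaled <- res * sqrt((ncells - 1) / resid.df)
c(naive = naive.var, residual = resid.var, scaled = var(scaled))
```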

Value

For denoisePCA,ANY-method, a numeric matrix is returned containing the selected PCs (columns) for all cells (rows) if value="pca". If value="n", it will return an integer scalar specifying the number of retained components. If value="lowrank", it will return a low-rank approximation of x with the same dimensions.

For denoisePCA,SingleCellExperiment-method, the return value is the same as denoisePCA,ANY-method if sce.out=FALSE or value="n". Otherwise, a SingleCellExperiment object is returned that is a modified version of x. If value="pca", the modified object will contain the PCs as the "PCA" entry in the reducedDims slot. If value="lowrank", it will contain a low-rank approximation in the assays slot, named "lowrank".

In all cases, the fraction of variance explained by each PC will be stored as the "percentVar" attribute in the return value. This is directly compatible with functions such as plotPCA. Note that only the percentages for the first max.rank PCs will be recorded when approximate=TRUE.

Author(s)

Aaron Lun

See Also

trendVar, decomposeVar

Examples

# Mocking up some data.
ngenes <- 1000
is.spike <- 1:100
means <- 2^runif(ngenes, 6, 10)
dispersions <- 10/means + 0.2
nsamples <- 50
counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples)
rownames(counts) <- paste0("Gene", seq_len(ngenes))

# Fitting a trend.
lcounts <- log2(counts + 1)
fit <- trendVar(lcounts, subset.row=is.spike)

# Denoising (not including the spike-ins in the PCA;
# spike-ins are automatically removed with the SingleCellExperiment method). 
pcs <- denoisePCA(lcounts, technical=fit$trend, subset.row=-is.spike)
dim(pcs)

# With a design matrix.
design <- model.matrix(~factor(rep(0:1, length.out=nsamples)))
fit3 <- trendVar(lcounts, design=design, subset.row=is.spike)
pcs3 <- denoisePCA(lcounts, technical=fit3$trend, design=design, subset.row=-is.spike)
dim(pcs3)

[Package scran version 1.6.9 Index]