trendVar {scran} | R Documentation |
Fit a mean-dependent trend to the gene-specific variances in single-cell RNA-seq data.
## S4 method for signature 'ANY' trendVar(x, method=c("loess", "spline", "semiloess"), span=0.3, family="symmetric", degree=1, df=4, parametric=FALSE, start=NULL, min.mean=0.1, design=NULL, subset.row=NULL) ## S4 method for signature 'SingleCellExperiment' trendVar(x, subset.row=NULL, ..., assay.type="logcounts", use.spikes=TRUE)
x |
A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SingleCellExperiment object that contains such values. |
method |
A string specifying the algorithm to use for smooth trend fitting. |
span, family, degree |
Arguments to pass to |
df |
Arguments to pass to |
parametric |
A logical scalar indicating whether a parametric curve should be fitted prior to smoothing. |
start |
A named list of numeric scalars, containing starting values for parametric fitting with |
min.mean |
A numeric scalar specifying the minimum mean log-expression in order for a gene to be used for trend fitting. |
design |
A numeric matrix describing the uninteresting factors contributing to expression in each cell. |
subset.row |
A logical, integer or character scalar indicating the rows of |
... |
Additional arguments to pass to |
assay.type |
A string specifying which assay values to use, e.g., |
use.spikes |
A logical scalar specifying whether the trend should be fitted to variances for spike-in transcripts or endogenous genes. |
This function fits an abundance-dependent trend to the variance of the log-normalized expression for the spike-in transcripts.
For SingleCellExperiment objects, these expression values can be computed by normalize
after setting the size factors, e.g., with computeSpikeFactors
.
Log-transformed values are used as these are more robust to genes/transcripts with strong expression in only one or two outlier cells.
The mean and variance of the normalized log-counts is calculated for each spike-in transcript, and a trend is fitted to the variance against the mean for all transcripts.
The fitted value of this trend represents technical variability due to sequencing, drop-outs during capture, etc.
Variance decomposition to biological and technical components for endogenous genes can then be performed later with decomposeVar
.
The design matrix can be set if there are factors that should be blocked, e.g., batch effects, known (and uninteresting) clusters. Otherwise, it will default to an all-ones matrix, effectively treating all cells as part of the same group.
A named list is returned, containing:
mean
:A numeric vector of mean log-CPMs for all spike-in transcripts.
var
:A numeric vector of the variances of log-CPMs for all spike-in transcripts.
trend
:A function that returns the fitted value of the trend at any mean log-CPM.
design
:A numeric matrix, containing the design matrix that was used.
df2
:A numeric scalar, specifying the second degrees of freedom for a scaled F-distribution describing the variability of variance estimates around the trend.
If parametric=FALSE
, smoothing is performed directly on the log-variances.
This is the default as it provides the most stable performance on arbitrary mean-variance relationships.
If parametric=TRUE
, wa non-linear curve of the form
y = ax/(x^n + b)
is fitted to the variances against the means using nls
.
The specified smoothing algorihtm is applied to the log-ratios of the variance to the fitted value.
The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship[for easier smoothing.
Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.
The method
argument specifies the smoothing algorithm to be applied on the log-ratios/variances.
By default, a robust loess curve is used for trend fitting via loess
.
This provides a fairly flexible fit while protecting against genes with very large or very small variances.
Some experimentation with span
, degree
or family
may be required to obtain satisfactory results.
If method="spline"
, a smoothing spline of degree df
will be used instead, fitted using the smooth.spline
function.
The trendVar
function will produce an output trend
function with which fitted values can be computed.
When extrapolating to values below the smallest observed mean, the output function will approach zero.
When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.
Note that method="semiloess"
is the same as parametric=TRUE
with method="loess"
.
Spike-in transcripts can be selected in trendVar,SingleCellExperiment-method
using the use.spikes
method.
By default, use.spikes=TRUE
which means that only rows labelled as spike-ins with isSpike(x)
will be used.
If use.spikes=FALSE
, only the rows not labelled as spike-ins will be used.
If use.spikes=NA
, every row will be used for trend fitting, regardless of whether it corresponds to a spike-in transcript or not.
If use.spikes=FALSE
, this implies that trendVar
will be applied to the endogenous genes in the SingleCellExperiment object.
For trendVar,ANY-method
, it is equivalent to manually supplying a matrix of normalized expression for endogenous genes.
This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population.
Users can also directly specify which rows to use with subset.row
, which will interact as expected with any non-NA
value of use.spikes
.
If subset.row
is specified and use.spikes=TRUE
, only the spike-in transcripts in subset.row
are used.
Otherwise, if use.spikes=FALSE
, only the non-spike in transcripts in subset.row
are used.
Low-abundance genes with mean log-expression below min.mean
are not used in trend fitting, to preserve the sensitivity of span-based smoothers at moderate-to-high abundances.
It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend.
The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts.
For heterogeneous droplet data, a lower threshold of 0.001-0.01 should be used.
Obviously, the usefulness of the trend is dependent on the quality of the features to which it is fitted. For example, the trend will not provide accurate estimates of the technical component if the coverage of all spike-ins is lower than that of endogenous genes.
If assay.type="logcounts"
, trendVar,SingleCellExperiment-method
will attempt to determine if the expression values were computed from counts via normalize
.
If so, a warning will be issued if the size factors are not centred at unity.
This is because different size factors are typically used for endogenous genes and spike-in transcripts.
If these size factor sets are not centred at the same value, there will be systematic differences in abundance between these features.
This precludes the use of a spike-in fitted trend with abundances for endogenous genes in decomposeVar
.
For other expression values and in trendVar,ANY-method
, the onus is on the user to ensure that normalization preserves differences in abundance.
In other words, the scaling factors used to normalize each feature should have the same mean.
This ensures that spurious differences in abundance are not introduced by the normalization process.
Aaron Lun
Lun ATL, McCarthy DJ and Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5:2122
nls
,
loess
,
decomposeVar
,
computeSpikeFactors
,
computeSumFactors
,
normalize
example(computeSpikeFactors) # Using the mocked-up data 'y' from this example. # Normalizing (gene-based factors for genes, spike-in factors for spike-ins) y <- computeSumFactors(y) y <- computeSpikeFactors(y, general.use=FALSE) y <- normalize(y) # Fitting a trend to the spike-ins. fit <- trendVar(y) plot(fit$mean, fit$var) curve(fit$trend(x), col="red", lwd=2, add=TRUE) # Fitting a trend to the endogenous genes. fit.g <- trendVar(y, use.spikes=FALSE) plot(fit.g$mean, fit.g$var) curve(fit.g$trend(x), col="red", lwd=2, add=TRUE)