combineMarkers {scran} | R Documentation |
Combine multiple pairwise differential expression comparisons between groups or clusters into a single ranked list of markers for each cluster.
combineMarkers(de.lists, pairs, pval.field = "p.value", effect.field = "logFC", pval.type = c("any", "some", "all"), min.prop = NULL, log.p.in = FALSE, log.p.out = log.p.in, output.field = NULL, full.stats = FALSE, sorted = TRUE)
de.lists |
A list-like object where each element is a data.frame or DataFrame. Each element should represent the results of a pairwise comparison between two groups/clusters, in which each row should contain the statistics for a single gene/feature. Rows should be named by the feature name in the same order for all elements. |
pairs |
A matrix, data.frame or DataFrame with two columns and number of rows equal to the length of |
pval.field |
A string specifying the column name of each element of |
effect.field |
A string specifying the column name of each element of |
pval.type |
A string specifying how p-values are to be combined across pairwise comparisons for a given group/cluster. |
min.prop |
Numeric scalar specifying the minimum proportion of significant comparisons per gene,
Defaults to 0.5 when |
log.p.in |
A logical scalar indicating if the p-values in |
log.p.out |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
output.field |
A string specifying the prefix of the field names containing the effect sizes.
Defaults to |
full.stats |
A logical scalar indicating whether all statistics in |
sorted |
Logical scalar indicating whether each output DataFrame should be sorted by a statistic relevant to |
An obvious strategy to characterizing differences between clusters is to look for genes that are differentially expressed (DE) between them.
However, this entails a number of comparisons between all pairs of clusters to comprehensively identify genes that define each cluster.
For all pairwise comparisons involving a single cluster, we would like to consolidate the DE results into a single list of candidate marker genes.
Doing so is the purpose of the combineMarkers
function.
DE statistics from any testing regime can be supplied to this function - see the Examples for how this is done with t-tests from pairwiseTTests
.
A named List of DataFrames where each DataFrame contains the consolidated marker statistics for each gene (row) for the cluster of the same name. The DataFrame for cluster X contains the fields:
Top
:Integer, the minimum rank across all pairwise comparisons.
This is only reported if pval.type="any"
.
p.value
:Numeric, the p-value across all comparisons if log.p.out=FALSE
.
This is a Simes' p-value if pval.type="any"
or "some"
, otherwise it is an IUT p-value.
FDR
:Numeric, the BH-adjusted p-value for each gene if log.p.out=FALSE
.
log.p.value
:Numeric, the (natural) log-transformed version p-value.
Replaces the p.value
field if log.p.out=TRUE
.
log.FDR
:Numeric, the (natural) log-transformed adjusted p-value.
Replaces the FDR
field if log.p.out=TRUE
.
<OUTPUT>.Y
:Numeric, where the value of <OUTPUT>
is set to output.field
.
One of these fields is present for every other cluster Y in clusters
.
It contains the effect size of the comparison of X to Y, with typical values depending on the chosen method:
logFC.Y
from pairwiseTTests
, containing log-fold changes in mean expression (usually in base 2).
AUC.Y
from pairwiseWilcox
, containing the area under the curve, i.e., the concordance probability.
logFC.Y
from pairwiseBinom
, containing log2-fold changes in the expressing proportion.
Only reported when full.stats=FALSE
and effect.field
is not NULL
.
stats.y
:A nested DataFrame containing the same fields in the corresponding entry of de.lists
for the X versus Y comparison.
Only reported when full.stats=TRUE
.
Ordering rules are:
Within each DataFrame, if sorted=TRUE
, genes are ranked by the Top
column (if available) and then the p.value
column.
Otherwise, the input order of the genes is preserved.
For the DataFrame corresponding to cluster X, the <OUTPUT>.Y
columns are sorted according to the order of cluster IDs in pairs[,2]
for all rows where pairs[,1]
is X.
In the output List, the DataFrames themselves are sorted according to the order of cluster IDs in pairs[,1]
.
Note that DataFrames are only created for clusters present in pairs[,1]
.
Clusters unique to pairs[,2]
will only be present within a DataFrame as Y.
By default, each DataFrame is sorted by the Top
value when pval.type="any"
.
(For genes with the same Top
, ranking is performed based on the Simes combined p-value - see the comments for pval.type="some"
.)
Taking all rows with Top
values less than or equal to T yields a marker set containing the top T genes (ranked by significance) from each pairwise comparison.
This guarantees the inclusion of genes that can distinguish between any two clusters.
To demonstrate, let us define a marker set with an T of 1 for a given cluster.
The set of genes with Top <= 1
will contain the top gene from each pairwise comparison to every other cluster.
If T is instead, say, 5, the set will consist of the union of the top 5 genes from each pairwise comparison.
Obviously, multiple genes can have the same Top
as different genes may have the same rank across different pairwise comparisons.
Conversely, the marker set may be smaller than the product of Top
and the number of other clusters, as the same gene may be shared across different comparisons.
This approach does not explicitly favour genes that are uniquely expressed in a cluster.
Rather, it focuses on combinations of genes that - together - drive separation of a cluster from the others.
This is more general and robust but tends to yield a less focused marker set compared to the other pval.type
settings.
Note that this only relates to the default setting of min.prop=0
with pval.type="any"
.
For user-specified values of min.prop
, see the comments in the “some clusters” section below.
If pval.type="all"
, the null hypothesis is that the gene is not DE in all contrasts, and the IUT p-value is computed for each gene.
Ranking based on the IUT p-value will focus on genes that are DE in that cluster compared to all other clusters.
This strategy is particularly effective when dealing with distinct clusters that have a unique expression profile.
In such cases, it yields a highly focused marker set that concisely captures the differences between clusters.
However, it can be too stringent if the cluster's separation is driven by combinations of gene expression.
For example, consider a situation involving four clusters expressing each combination of two marker genes A and B.
With pval.type="all"
, neither A nor B would be detected as markers as it is not uniquely defined in any one cluster.
This is especially detrimental with overclustering where an otherwise acceptable marker is discarded if it is not DE between two adjacent clusters.
The pval.type="some"
setting serves as a compromise between "all"
and "any"
.
A combined p-value is calculated by taking the middlemost value of the Holm-corrected p-values for each gene.
(By default, this the median for odd numbers of contrasts and one-after-the-median for even numbers, but the exact proportion can be changed by setting min.prop
- see ?combinePValues
.)
Here, the null hypothesis is that the gene is not DE in at least half of the contrasts.
Genes are then ranked by the combined p-value.
The aim is to provide a more focused marker set without being overly stringent, though obviously it loses the theoretical guarantees of the more extreme settings.
For example, there is no guarantee that the top set contains genes that can distinguish a cluster from any other cluster, which would have been possible with pval.type="any"
.
A slightly different flavor of this approach is achieved by setting method="any"
with min.prop=0.5
.
Genes will then only be high-ranked if they are significant in at least some min.prop
proportion of the comparisons.
For example, if min.prop=0.3
, any gene with a value of Top
less than or equal to 5 will be in the top 5 DEGs of at least 30
This increases the stringency of the "any"
setting but with similar loss of theoretical guarantees.
The BH method is then applied on the Simes/IUT p-values across all genes to obtain the FDR
field.
The reported FDRs are intended only as a rough measure of significance.
Properly correcting for multiple testing is not generally possible when clusters
is determined from the same x
used for DE testing.
If log.p=TRUE
, log-transformed p-values and FDRs will be reported.
This may be useful in over-powered studies with many cells, where directly reporting the raw p-values would result in many zeroes due to the limits of machine precision.
Aaron Lun
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11, 283-319.
pairwiseTTests
and pairwiseWilcox
, for functions that can generate de.lists
and pairs
.
findMarkers
, which automatically performs combineMarkers
on the t-test or Wilcoxon test results.
library(scater) sce <- mockSCE() sce <- logNormCounts(sce) # Any clustering method is okay. kout <- kmeans(t(logcounts(sce)), centers=3) clusters <- paste0("Cluster", kout$cluster) out <- pairwiseTTests(logcounts(sce), groups=clusters) comb <- combineMarkers(out$statistics, out$pairs) comb[["Cluster1"]] out <- pairwiseWilcox(logcounts(sce), groups=clusters) comb <- combineMarkers(out$statistics, out$pairs, effect.field="AUC") comb[["Cluster2"]] out <- pairwiseBinom(logcounts(sce), groups=clusters) comb <- combineMarkers(out$statistics, out$pairs) comb[["Cluster3"]]