Method benchmarking is a core part of computational biology research, with an intrinsic power to establish best practices in method selection and application, and to help identify gaps and possibilities for improvement. A typical benchmark evaluates a set of methods using multiple metrics, intended to capture different aspects of their performance. The best method to choose in any given situation can then be found, e.g., by averaging the different performance metrics, possibly putting more emphasis on those that are more important to the specific situation.
Inspired by the OECD ‘Better Life Index’, the bettr package was developed to
provide support for this last step. It allows users to easily create
performance summaries emphasizing the aspects that are most important to them.
bettr can be used interactively, via an R/shiny application, or
programmatically by calling the underlying functions. In this vignette, we
illustrate both alternatives, using example data provided with the package.
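As a minimal sketch of the weighted averaging idea described above (plain base R, not bettr's own implementation), the toy scores and weights below are hypothetical and assume that larger values are better for both metrics:

```r
## Hypothetical toy scores: three methods, two metrics (larger = better)
scores <- data.frame(Method = c("M1", "M2", "M3"),
                     metric1 = c(1, 2, 3),
                     metric2 = c(3, 1, 2))

## Hypothetical weights expressing that metric1 matters more to this user
weights <- c(metric1 = 0.8, metric2 = 0.2)

## Weighted mean of the metrics for each method
mat <- as.matrix(scores[, names(weights)])
rownames(mat) <- scores$Method
weightedScore <- drop(mat %*% weights) / sum(weights)

## Rank the methods from best to worst under these weights
ranking <- sort(weightedScore, decreasing = TRUE)
ranking
```

Changing the weights changes the ranking, which is the kind of exploration the interactive application is meant to make easy.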
Given the abundance of methods available for computational analysis of
biological data, both within and beyond Bioconductor, and the importance of
careful, adaptive benchmarking, we believe that bettr will be a useful
complement to currently available Bioconductor infrastructure related to
benchmarking and performance estimation.
Other packages (e.g., pipeComp or SummarizedBenchmark) provide frameworks for
executing benchmarks by applying and recording pre-defined workflows to data.
Packages such as iCOBRA and ROCR instead provide functionality for calculating
well-established evaluation metrics. In contrast, bettr focuses on visual
exploration of benchmark results, represented by the values of several
evaluation metrics.
bettr can be installed from Bioconductor (from release 3.19 onwards):
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("bettr")
suppressPackageStartupMessages({
    library("bettr")
    library("SummarizedExperiment")
    library("tibble")
    library("dplyr")
})
The main input to bettr is a data.frame containing values of several metrics
for several methods. The user can also provide annotations and characteristics
for the methods and metrics, which can be used to group and filter them in the
interactive application.
## Data for two metrics (metric1, metric2) for three methods (M1, M2, M3)
df <- data.frame(Method = c("M1", "M2", "M3"),
                 metric1 = c(1, 2, 3),
                 metric2 = c(3, 1, 2))
## More information for metrics
## (metric3 has no column in df; the assembled object shown further down
## contains only metric1 and metric2)
metricInfo <- data.frame(Metric = c("metric1", "metric2", "metric3"),
                         Group = c("G1", "G2", "G2"))
## More information for methods ('IDs')
idInfo <- data.frame(Method = c("M1", "M2", "M3"),
                     Type = c("T1", "T1", "T2"))
To simplify handling and sharing, the data can be combined into a
SummarizedExperiment (with methods as rows and metrics as columns) as follows:
se <- assembleSE(df = df, idCol = "Method", metricInfo = metricInfo,
                 idInfo = idInfo)
se
#> class: SummarizedExperiment
#> dim: 3 2
#> metadata(1): bettrInfo
#> assays(1): values
#> rownames(3): M1 M2 M3
#> rowData names(2): Method Type
#> colnames(2): metric1 metric2
#> colData names(2): Metric Group
The interactive application to explore the rankings can then be launched with
the bettr() function. The input can be either the assembled
SummarizedExperiment object or the individual components.
## Alternative 1
bettr(bettrSE = se)
## Alternative 2
bettr(df = df, idCol = "Method", metricInfo = metricInfo, idInfo = idInfo)
Next, we show a more elaborate example, visualizing data from the benchmark of
single-cell clustering methods performed by Duo et al. (2018). The values for a
set of evaluation metrics applied to results obtained by several clustering
methods are provided in a .csv file in the package:
res <- read.csv(system.file("extdata", "duo2018_results.csv",
                            package = "bettr"))
dim(res)
#> [1] 14 49
tibble(res)
#> # A tibble: 14 × 49
#> method ARI_Koh ARI_KohTCC ARI_Kumar ARI_KumarTCC ARI_SimKumar4easy
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 CIDR 0.672 0.805 0.989 0.977 1
#> 2 FlowSOM 0.699 0.855 0.512 0.563 1
#> 3 PCAHC 0.869 0.891 1 1 1
#> 4 PCAKmeans 0.836 0.903 0.989 0.978 1
#> 5 RaceID2 0.280 0.276 0.949 1 0.644
#> 6 RtsneKmeans 0.966 0.967 0.989 1 1
#> 7 SAFE 0.613 0.950 0.989 1 0.952
#> 8 SC3 0.939 0.939 1 1 1
#> 9 SC3svm 0.927 0.929 1 1 1
#> 10 Seurat 0.862 0.902 0.988 0.989 1
#> 11 TSCAN 0.639 0.618 1 1 1
#> 12 monocle 0.855 0.963 0.988 1 0.995
#> 13 pcaReduce 0.935 0.979 1 1 1
#> 14 ascend NA NA 1 0.988 1
#> # ℹ 43 more variables: ARI_SimKumar4hard <dbl>, ARI_SimKumar8hard <dbl>,
#> # ARI_Trapnell <dbl>, ARI_TrapnellTCC <dbl>, ARI_Zhengmix4eq <dbl>,
#> # ARI_Zhengmix4uneq <dbl>, ARI_Zhengmix8eq <dbl>, elapsed_Koh <dbl>,
#> # elapsed_KohTCC <dbl>, elapsed_Kumar <dbl>, elapsed_KumarTCC <dbl>,
#> # elapsed_SimKumar4easy <dbl>, elapsed_SimKumar4hard <dbl>,
#> # elapsed_SimKumar8hard <dbl>, elapsed_Trapnell <dbl>,
#> # elapsed_TrapnellTCC <dbl>, elapsed_Zhengmix4eq <dbl>, …
As we can see, we have 14 methods (rows) and 49 columns. The first column
provides the name of the clustering method, and the remaining 48 columns hold
the metric values. More precisely, these 48 columns correspond to four
different metrics, each of which was applied to clustering output from 12 data
sets. We encode this “grouping” of metrics in a data frame, in such a way that
we can later collapse performance across data sets in bettr:
metricInfo <- tibble(Metric = colnames(res)[-1]) |>
    mutate(Class = sub("_.*", "", Metric))
head(metricInfo)
#> # A tibble: 6 × 2
#> Metric Class
#> <chr> <chr>
#> 1 ARI_Koh ARI
#> 2 ARI_KohTCC ARI
#> 3 ARI_Kumar ARI
#> 4 ARI_KumarTCC ARI
#> 5 ARI_SimKumar4easy ARI
#> 6 ARI_SimKumar4hard ARI
table(metricInfo$Class)
#>
#> ARI elapsed nclust.vs.true s.norm.vs.true
#> 12 12 12 12
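If one also wants to group or filter metrics by data set, the same naming convention can be used to pull out the part after the first underscore. The following is a small base R sketch on a few of the column names shown above; the Dataset column and the metricInfoSketch object are illustrative additions and are not part of the metricInfo used in the rest of this vignette:

```r
## A few metric column names following the <metric>_<dataset> convention
metricNames <- c("ARI_Koh", "elapsed_KohTCC", "nclust.vs.true_Kumar",
                 "s.norm.vs.true_Zhengmix4eq")

metricInfoSketch <- data.frame(
    Metric = metricNames,
    ## everything before the first underscore: the metric class
    Class = sub("_.*", "", metricNames),
    ## everything after the first underscore: the data set
    Dataset = sub("^[^_]*_", "", metricNames)
)
metricInfoSketch
```

This works here because the metric class names contain dots but no underscores, so the first underscore always separates the class from the data set name.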
In order to make different metrics comparable, we next define the
transformation that should be applied to each of them within bettr. First, we
need to make sure that the metrics are consistent in terms of whether large
values indicate “good” or “bad” performance. In our case, for elapsed (elapsed
run time), nclust.vs.true (difference between estimated and true number of
clusters) and s.norm.vs.true (difference between estimated and true normalized
Shannon entropy for a clustering), a small value indicates “better”
performance, while for the ARI (adjusted Rand index), larger values are better.
Hence, we will flip the sign of the first three before doing additional
analyses. Moreover, the different metrics clearly live in different numeric
ranges: the maximal value of the ARI is 1, while the other metrics can have
much larger values. As an example, here we therefore scale the three other
metrics linearly to the interval [0, 1] to make them more comparable to the ARI
values. We record these transformations in a list that will be passed to bettr:
## Initialize list
initialTransforms <- lapply(
    res[, grep("elapsed|nclust.vs.true|s.norm.vs.true",
               colnames(res), value = TRUE)],
    function(i) {
        list(flip = TRUE, transform = '[0,1]')
    })
length(initialTransforms)
#> [1] 36
names(initialTransforms)
#> [1] "elapsed_Koh" "elapsed_KohTCC"
#> [3] "elapsed_Kumar" "elapsed_KumarTCC"
#> [5] "elapsed_SimKumar4easy" "elapsed_SimKumar4hard"
#> [7] "elapsed_SimKumar8hard" "elapsed_Trapnell"
#> [9] "elapsed_TrapnellTCC" "elapsed_Zhengmix4eq"
#> [11] "elapsed_Zhengmix4uneq" "elapsed_Zhengmix8eq"
#> [13] "s.norm.vs.true_Koh" "s.norm.vs.true_KohTCC"
#> [15] "s.norm.vs.true_Kumar" "s.norm.vs.true_KumarTCC"
#> [17] "s.norm.vs.true_SimKumar4easy" "s.norm.vs.true_SimKumar4hard"
#> [19] "s.norm.vs.true_SimKumar8hard" "s.norm.vs.true_Trapnell"
#> [21] "s.norm.vs.true_TrapnellTCC" "s.norm.vs.true_Zhengmix4eq"
#> [23] "s.norm.vs.true_Zhengmix4uneq" "s.norm.vs.true_Zhengmix8eq"
#> [25] "nclust.vs.true_Koh" "nclust.vs.true_KohTCC"
#> [27] "nclust.vs.true_Kumar" "nclust.vs.true_KumarTCC"
#> [29] "nclust.vs.true_SimKumar4easy" "nclust.vs.true_SimKumar4hard"
#> [31] "nclust.vs.true_SimKumar8hard" "nclust.vs.true_Trapnell"
#> [33] "nclust.vs.true_TrapnellTCC" "nclust.vs.true_Zhengmix4eq"
#> [35] "nclust.vs.true_Zhengmix4uneq" "nclust.vs.true_Zhengmix8eq"
head(initialTransforms)
#> $elapsed_Koh
#> $elapsed_Koh$flip
#> [1] TRUE
#>
#> $elapsed_Koh$transform
#> [1] "[0,1]"
#>
#>
#> $elapsed_KohTCC
#> $elapsed_KohTCC$flip
#> [1] TRUE
#>
#> $elapsed_KohTCC$transform
#> [1] "[0,1]"
#>
#>
#> $elapsed_Kumar
#> $elapsed_Kumar$flip
#> [1] TRUE
#>
#> $elapsed_Kumar$transform
#> [1] "[0,1]"
#>
#>
#> $elapsed_KumarTCC
#> $elapsed_KumarTCC$flip
#> [1] TRUE
#>
#> $elapsed_KumarTCC$transform
#> [1] "[0,1]"
#>
#>
#> $elapsed_SimKumar4easy
#> $elapsed_SimKumar4easy$flip
#> [1] TRUE
#>
#> $elapsed_SimKumar4easy$transform
#> [1] "[0,1]"
#>
#>
#> $elapsed_SimKumar4hard
#> $elapsed_SimKumar4hard$flip
#> [1] TRUE
#>
#> $elapsed_SimKumar4hard$transform
#> [1] "[0,1]"
We can specify four different aspects of the desired transform, which will be
applied in the following order:
- flip (TRUE or FALSE, whether to flip the sign of the values). The default is
  FALSE.
- offset (a numeric value to add to the observed values, possibly after
  applying the sign flip). The default is 0.
- transform (one of None, [0,1], [-1,1], z-score, or Rank). The default is
  None.
- cuts (a numeric vector of cuts that will be used to turn a numeric variable
  into a categorical one). The default is NULL.
Only values that deviate from the defaults need to be specified.
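To make the effect of these settings concrete, the following base R sketch applies a sign flip, an offset and a linear rescaling to [0, 1], in that order. It is an illustration under stated assumptions, not bettr's internal code: transformSketch is a hypothetical helper, the min-max and z-score formulas are the standard textbook definitions, and the remaining options ([-1,1], Rank, cuts) are deliberately left out.

```r
## Hypothetical helper illustrating flip -> offset -> transform
transformSketch <- function(x, flip = FALSE, offset = 0, transform = "None") {
    if (flip) {
        x <- -x                      # small values become large
    }
    x <- x + offset                  # shift after the optional flip
    switch(transform,
        "None" = x,
        "[0,1]" = (x - min(x, na.rm = TRUE)) /
            (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)),
        "z-score" = (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE),
        stop("transform not covered by this sketch: ", transform)
    )
}

## Made-up run times (seconds); NA represents a method that did not finish
elapsed <- c(fast = 10, medium = 20, slow = 40, failed = NA)
scaled <- transformSketch(elapsed, flip = TRUE, transform = "[0,1]")
scaled
```

After flipping and rescaling, the fastest method receives the value 1 and the slowest the value 0, so that larger values are better, as for the ARI. Note that this simple min-max formula is undefined when all observed values are identical.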
Finally, we can define a set of colors that we would like to use for
visualizing the methods and metrics in bettr.
metricColors <- list(
    Class = c(ARI = "purple", elapsed = "forestgreen",
              nclust.vs.true = "blue",
              s.norm.vs.true = "orange"))
idColors <- list(
    method = c(
        CIDR = "#332288", FlowSOM = "#6699CC", PCAHC = "#88CCEE",
        PCAKmeans = "#44AA99", pcaReduce = "#117733",
        RtsneKmeans = "#999933", Seurat = "#DDCC77", SC3svm = "#661100",
        SC3 = "#CC6677", TSCAN = "grey34", ascend = "orange", SAFE = "black",
        monocle = "red", RaceID2 = "blue"
    ))
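Since named colors and hex codes can be mixed, a typo in either is easy to miss. As an optional sanity check (a base R sketch; isValidColor is a hypothetical helper, not a bettr function), each specification can be passed through grDevices::col2rgb(), which raises an error for values it cannot interpret:

```r
## Hypothetical helper: TRUE for each color specification that R can interpret
isValidColor <- function(x) {
    vapply(x, function(cl) {
        tryCatch({
            grDevices::col2rgb(cl)
            TRUE
        }, error = function(e) FALSE)
    }, logical(1))
}

## Two valid names, one valid hex code and one deliberate typo
colorCheck <- isValidColor(c(CIDR = "#332288", TSCAN = "grey34",
                             ascend = "orange", typo = "orangee"))
colorCheck
```

In the same spirit, comparing the names of a color vector with the method or metric class names in the data helps catch entries that were left without a color.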
All the information defined so far can be combined in a SummarizedExperiment
object, as shown above for the small example data:
duo2018 <- assembleSE(df = res, idCol = "method", metricInfo = metricInfo,
                      initialTransforms = initialTransforms,
                      metricColors = metricColors, idColors = idColors)
duo2018
#> class: SummarizedExperiment
#> dim: 14 48
#> metadata(1): bettrInfo
#> assays(1): values
#> rownames(14): CIDR FlowSOM ... pcaReduce ascend
#> rowData names(0):
#> colnames(48): ARI_Koh ARI_KohTCC ... nclust.vs.true_Zhengmix4uneq
#> nclust.vs.true_Zhengmix8eq
#> colData names(2): Metric Class
The assay of the SummarizedExperiment object contains the values for the 48
performance measures for the 14 clustering methods. The metricInfo is stored in
the colData, and the lists of colors and the initial transforms in the
metadata:
## Display the whole performance table
tibble(assay(duo2018, "values"))
#> # A tibble: 14 × 48
#> ARI_Koh ARI_KohTCC ARI_Kumar ARI_KumarTCC ARI_SimKumar4easy ARI_SimKumar4hard
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.672 0.805 0.989 0.977 1 1
#> 2 0.699 0.855 0.512 0.563 1 1
#> 3 0.869 0.891 1 1 1 1
#> 4 0.836 0.903 0.989 0.978 1 1
#> 5 0.280 0.276 0.949 1 0.644 0.194
#> 6 0.966 0.967 0.989 1 1 1
#> 7 0.613 0.950 0.989 1 0.952 NA
#> 8 0.939 0.939 1 1 1 1
#> 9 0.927 0.929 1 1 1 1
#> 10 0.862 0.902 0.988 0.989 1 1
#> 11 0.639 0.618 1 1 1 1
#> 12 0.855 0.963 0.988 1 0.995 0.992
#> 13 0.935 0.979 1 1 1 1
#> 14 NA NA 1 0.988 1 1
#> # ℹ 42 more variables: ARI_SimKumar8hard <dbl>, ARI_Trapnell <dbl>,
#> # ARI_TrapnellTCC <dbl>, ARI_Zhengmix4eq <dbl>, ARI_Zhengmix4uneq <dbl>,
#> # ARI_Zhengmix8eq <dbl>, elapsed_Koh <dbl>, elapsed_KohTCC <dbl>,
#> # elapsed_Kumar <dbl>, elapsed_KumarTCC <dbl>, elapsed_SimKumar4easy <dbl>,
#> # elapsed_SimKumar4hard <dbl>, elapsed_SimKumar8hard <dbl>,
#> # elapsed_Trapnell <dbl>, elapsed_TrapnellTCC <dbl>,
#> # elapsed_Zhengmix4eq <dbl>, elapsed_Zhengmix4uneq <dbl>, …
## Showing the first metric, evaluated on all datasets
head(colData(duo2018), 12)
#> DataFrame with 12 rows and 2 columns
#> Metric Class
#> <character> <character>
#> ARI_Koh ARI_Koh ARI
#> ARI_KohTCC ARI_KohTCC ARI
#> ARI_Kumar ARI_Kumar ARI
#> ARI_KumarTCC ARI_KumarTCC ARI
#> ARI_SimKumar4easy ARI_SimKumar4easy ARI
#> ... ... ...
#> ARI_Trapnell ARI_Trapnell ARI
#> ARI_TrapnellTCC ARI_TrapnellTCC ARI
#> ARI_Zhengmix4eq ARI_Zhengmix4eq ARI
#> ARI_Zhengmix4uneq ARI_Zhengmix4uneq ARI
#> ARI_Zhengmix8eq ARI_Zhengmix8eq ARI
## These are the color definitions (can mix character and hex values)
metadata(duo2018)$bettrInfo$idColors
#> $method
#> CIDR FlowSOM PCAHC PCAKmeans pcaReduce RtsneKmeans
#> "#332288" "#6699CC" "#88CCEE" "#44AA99" "#117733" "#999933"
#> Seurat SC3svm SC3 TSCAN ascend SAFE
#> "#DDCC77" "#661100" "#CC6677" "grey34" "orange" "black"
#> monocle RaceID2
#> "red" "blue"
metadata(duo2018)$bettrInfo$metricColors
#> $Class
#> ARI elapsed nclust.vs.true s.norm.vs.true
#> "purple" "forestgreen" "blue" "orange"
names(metadata(duo2018)$bettrInfo$initialTransforms)
#> [1] "elapsed_Koh" "elapsed_KohTCC"
#> [3] "elapsed_Kumar" "elapsed_KumarTCC"
#> [5] "elapsed_SimKumar4easy" "elapsed_SimKumar4hard"
#> [7] "elapsed_SimKumar8hard" "elapsed_Trapnell"
#> [9] "elapsed_TrapnellTCC" "elapsed_Zhengmix4eq"
#> [11] "elapsed_Zhengmix4uneq" "elapsed_Zhengmix8eq"
#> [13] "s.norm.vs.true_Koh" "s.norm.vs.true_KohTCC"
#> [15] "s.norm.vs.true_Kumar" "s.norm.vs.true_KumarTCC"
#> [17] "s.norm.vs.true_SimKumar4easy" "s.norm.vs.true_SimKumar4hard"
#> [19] "s.norm.vs.true_SimKumar8hard" "s.norm.vs.true_Trapnell"
#> [21] "s.norm.vs.true_TrapnellTCC" "s.norm.vs.true_Zhengmix4eq"
#> [23] "s.norm.vs.true_Zhengmix4uneq" "s.norm.vs.true_Zhengmix8eq"
#> [25] "nclust.vs.true_Koh" "nclust.vs.true_KohTCC"
#> [27] "nclust.vs.true_Kumar" "nclust.vs.true_KumarTCC"
#> [29] "nclust.vs.true_SimKumar4easy" "nclust.vs.true_SimKumar4hard"
#> [31] "nclust.vs.true_SimKumar8hard" "nclust.vs.true_Trapnell"
#> [33] "nclust.vs.true_TrapnellTCC" "nclust.vs.true_Zhengmix4eq"
#> [35] "nclust.vs.true_Zhengmix4uneq" "nclust.vs.true_Zhengmix8eq"
## An example of a transformation - elapsed time for the Koh dataset
metadata(duo2018)$bettrInfo$initialTransforms$elapsed_Koh
#> $flip
#> [1] TRUE
#>
#> $transform
#> [1] "[0,1]"
Now, we can launch the app for this data set:
bettr(bettrSE = duo2018, bstheme = "sandstone")
The screenshot below illustrates the default view of the interactive interface.