plyxp provides efficient abstractions to the SummarizedExperiment such that using common dplyr functions feels as natural as operating on a data.frame or tibble. plyxp uses data-masking from the rlang package to connect dplyr functions to SummarizedExperiment slots in a manner that aims to be intuitive and to avoid ambiguity in outcomes.
plyxp works on SummarizedExperiment objects, as well as most classes derived from it, including DESeqDataSet and SingleCellExperiment.
It supports the following operations:
mutate
select
summarize
pull
filter
arrange
library(airway)
data(airway)
library(dplyr)
library(plyxp)
# to use plyxp, call `new_plyxp()` on your SummarizedExperiment object
airway <- new_plyxp(airway)
# add data (mutate) to any of the three tables,
# assay, colData, rowData,
# ...using contextual helpers cols() and rows()
airway |>
  mutate(log_counts = log1p(counts),
         cols(treated = dex == "trt"),
         rows(new_id = paste0("gene-", gene_name)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
## .features .samples | counts log_counts | gene_id gene_name gene_biotype
## <chr> <chr> | <int> <dbl> | <chr> <chr> <chr>
## 1 ENSG000000000… SRR1039… | 679 6.52 | ENSG00… TSPAN6 protein_cod…
## 2 ENSG000000000… SRR1039… | 0 0 | ENSG00… TNMD protein_cod…
## 3 ENSG000000004… SRR1039… | 467 6.15 | ENSG00… DPM1 protein_cod…
## 4 ENSG000000004… SRR1039… | 260 5.56 | ENSG00… SCYL3 protein_cod…
## 5 ENSG000000004… SRR1039… | 60 4.11 | ENSG00… C1orf112 protein_cod…
## … … … … … … … …
## n-4 ENSG000002734… SRR1039… | 0 0 | ENSG00… RP11-180… antisense
## n-3 ENSG000002734… SRR1039… | 0 0 | ENSG00… TSEN34 protein_cod…
## n-2 ENSG000002734… SRR1039… | 0 0 | ENSG00… RP11-138… lincRNA
## n-1 ENSG000002734… SRR1039… | 0 0 | ENSG00… AP000230… lincRNA
## n ENSG000002734… SRR1039… | 0 0 | ENSG00… RP11-80H… lincRNA
## # ℹ n = 509,416
## # ℹ 7 more variables: new_id <chr>, `` <>, SampleName <fct>, cell <fct>,
## # dex <fct>, albut <fct>, treated <lgl>
The operations can span contexts, and only the necessary data will be extracted from each context for computation:
airway$sizeFactor <- runif(8, .9, 1.1)
# making scaled counts, then computing row means:
airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor,
         rows(ave_scl_cts = rowMeans(.assays_asis$scaled_counts)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
## .features .samples | counts scaled_counts | gene_id gene_name gene_biotype
## <chr> <chr> | <int> <dbl> | <chr> <chr> <chr>
## 1 ENSG000000… SRR1039… | 679 752. | ENSG00… TSPAN6 protein_cod…
## 2 ENSG000000… SRR1039… | 0 0 | ENSG00… TNMD protein_cod…
## 3 ENSG000000… SRR1039… | 467 517. | ENSG00… DPM1 protein_cod…
## 4 ENSG000000… SRR1039… | 260 288. | ENSG00… SCYL3 protein_cod…
## 5 ENSG000000… SRR1039… | 60 66.5 | ENSG00… C1orf112 protein_cod…
## … … … … … … … …
## n-4 ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-180… antisense
## n-3 ENSG000002… SRR1039… | 0 0 | ENSG00… TSEN34 protein_cod…
## n-2 ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-138… lincRNA
## n-1 ENSG000002… SRR1039… | 0 0 | ENSG00… AP000230… lincRNA
## n ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-80H… lincRNA
## # ℹ n = 509,416
## # ℹ 7 more variables: ave_scl_cts <dbl>, `` <>, SampleName <fct>, cell <fct>,
## # dex <fct>, albut <fct>, sizeFactor <dbl>
Calling .cols in the assay context produces an object whose size and orientation match the other assay data.
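The reshaping described above can be imitated in base R. The following is a minimal sketch on toy data, not plyxp's implementation; the matrix `counts` and the vector `size_factor` are hypothetical stand-ins for an assay and a colData column.

```r
# Toy stand-ins (hypothetical values, not the airway data):
# 3 features (rows) x 2 samples (columns).
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
size_factor <- c(2, 10)  # one value per sample

# Matrices are stored column by column, so repeating each per-sample value
# nrow(counts) times lines it up with every cell of that sample's column.
sf_long <- rep(size_factor, each = nrow(counts))
scaled <- counts / sf_long

# Reference: divide each column by its own size factor.
reference <- sweep(counts, 2, size_factor, "/")

# Naive division recycles the short vector down the rows instead,
# which pairs values with the wrong samples.
naive <- counts / size_factor
```

Here `scaled` agrees with `reference`, while `naive` does not, which is why the orientation of the replicated vector matters.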
Alternatively, we could have computed the row means with purrr:
airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor,
         # You may expect a list when accessing other contexts
         # from either the rows() or cols() contexts.
         rows(ave_scl_cts = purrr::map_dbl(.assays$scaled_counts, mean)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
## .features .samples | counts scaled_counts | gene_id gene_name gene_biotype
## <chr> <chr> | <int> <dbl> | <chr> <chr> <chr>
## 1 ENSG000000… SRR1039… | 679 752. | ENSG00… TSPAN6 protein_cod…
## 2 ENSG000000… SRR1039… | 0 0 | ENSG00… TNMD protein_cod…
## 3 ENSG000000… SRR1039… | 467 517. | ENSG00… DPM1 protein_cod…
## 4 ENSG000000… SRR1039… | 260 288. | ENSG00… SCYL3 protein_cod…
## 5 ENSG000000… SRR1039… | 60 66.5 | ENSG00… C1orf112 protein_cod…
## … … … … … … … …
## n-4 ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-180… antisense
## n-3 ENSG000002… SRR1039… | 0 0 | ENSG00… TSEN34 protein_cod…
## n-2 ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-138… lincRNA
## n-1 ENSG000002… SRR1039… | 0 0 | ENSG00… AP000230… lincRNA
## n ENSG000002… SRR1039… | 0 0 | ENSG00… RP11-80H… lincRNA
## # ℹ n = 509,416
## # ℹ 7 more variables: ave_scl_cts <dbl>, `` <>, SampleName <fct>, cell <fct>,
## # dex <fct>, albut <fct>, sizeFactor <dbl>
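Both approaches give the same row means because averaging each one-row slice of a matrix is the same as calling `rowMeans()` on the whole matrix. A small base R sketch of that equivalence, using a hypothetical toy matrix rather than the airway data:

```r
# Hypothetical toy assay: 3 features x 2 samples.
scaled_counts <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
                        dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# One 1 x ncol slice per feature, similar in spirit to the list that a
# rows() expression receives when it asks for assay data.
row_slices <- lapply(seq_len(nrow(scaled_counts)),
                     function(i) scaled_counts[i, , drop = FALSE])

# Mean of each slice, collected into a plain numeric vector.
means_from_slices <- vapply(row_slices, mean, numeric(1))

# Direct computation on the intact matrix.
means_direct <- unname(rowMeans(scaled_counts))
```

The list-based form is more flexible, since any function can be mapped over the slices, while the matrix form is usually faster for operations such as means and sums.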
See below for details on how objects are made available across contexts.
plyxp also enables common grouping and summarization routines:
summary <- airway |>
  group_by(rows(gene_biotype)) |>
  summarize(col_sums = colSums(counts),
            # may rename rows with .features
            rows(.features = unique(gene_biotype)))
# summarize returns a SummarizedExperiment here,
# retaining rowData and colData
summary |> rowData()
## DataFrame with 30 rows and 1 column
## gene_biotype
## <character>
## protein_coding protein_coding
## pseudogene pseudogene
## processed_transcript processed_transcript
## antisense antisense
## lincRNA lincRNA
## ... ...
## IG_C_pseudogene IG_C_pseudogene
## TR_D_gene TR_D_gene
## IG_J_pseudogene IG_J_pseudogene
## 3prime_overlapping_ncrna 3prime_overlapping_n..
## processed_pseudogene processed_pseudogene
# visualizing the output as a tibble:
library(tibble)
summary |>
  pull(col_sums) |>
  as_tibble(rownames = "type")
## # A tibble: 30 × 9
## type SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 protein_co… 19413626 45654 4 1188 96378 0
## 2 pseudogene 17741060 45864 4 462 38656 0
## 3 processed_… 23926011 133335 0 1049 64884 0
## 4 antisense 14360299 120060 4 1113 36267 0
## 5 lincRNA 23003444 206075 6038 626 81606 0
## 6 polymorphi… 29233398 125015 5618 803 88868 0
## 7 IG_V_pseud… 18114369 145078 7662 316 44385 0
## 8 IG_V_gene 20103401 170641 5579 256 92499 0
## 9 sense_over… 807285 147563 7869 339 491 0
## 10 sense_intr… 733916 149486 9443 202 502 0
## # ℹ 20 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
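For intuition, a grouped column sum of this kind can be reproduced on a plain matrix with base R. This is only a sketch of the computation under toy assumptions, not how plyxp performs it; `counts` and `biotype` are hypothetical stand-ins for the assay and the grouping column in rowData.

```r
# Hypothetical toy assay: 4 features x 2 samples.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))
# Hypothetical grouping variable, one label per feature.
biotype <- c("lincRNA", "protein_coding", "lincRNA", "protein_coding")

# rowsum() adds up the rows that share a group label, giving one row per
# group and one column per sample.
col_sums <- rowsum(counts, group = biotype)

# The same result, spelled out as "split the rows by group, then colSums()".
by_group <- lapply(split(seq_len(nrow(counts)), biotype),
                   function(idx) colSums(counts[idx, , drop = FALSE]))
col_sums_split <- do.call(rbind, by_group)
```

Note that `rowsum()` orders the result by group label by default, so the row order may differ from the order in which groups first appear in the data.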
plyxp
The SummarizedExperiment object contains three main components, or “contexts”, that we mask: assays(), rowData() and colData().
plyxp exposes variables as-is within their own context, enabling you to call S4 methods on S4 objects with dplyr verbs. If you require access to variables outside the current context, you may use the pronouns made available through plyxp to specify where to find those variables.
The output of the .assays, .rows and .cols pronouns depends on the evaluating context. Users should expect that the underlying data returned from the .rows or .cols pronouns in the assays context is a vector, replicated to match the size of the assay context.
Alternatively, using a pronoun in either the rows() or cols() contexts will return a list equal in length to either nrows(rowData()) or nrows(colData()), respectively.
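The two shapes described above, a replicated vector in the assays context and a list in the rows() or cols() contexts, can be sketched with base R `rep()`. This is an illustration under toy assumptions and not plyxp's own code; `gene_length` and `size_factor` are hypothetical rowData and colData columns.

```r
n_features <- 3
n_samples <- 2
gene_length <- c(100, 200, 300)  # hypothetical rowData column
size_factor <- c(2, 10)          # hypothetical colData column

# Assays context: the vector is stretched to one value per assay cell.
# Row-level data repeats as a whole block for every sample...
rows_in_assay <- rep(gene_length, times = n_samples)
# ...while column-level data repeats each value down one column.
cols_in_assay <- rep(size_factor, each = n_features)

# rows() or cols() context: a list with one element per row of the current
# context, each element holding the full vector from the other context.
cols_in_rows <- rep(list(size_factor), times = n_features)
rows_in_cols <- rep(list(gene_length), times = n_samples)
```

Reshaping `rows_in_assay` or `cols_in_assay` with `matrix(..., nrow = n_features)` recovers a features-by-samples layout, which is what makes element-wise arithmetic against an assay line up.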
.assays → contextual pronoun; returns a list of slices of the matrix, cut along the dimension it was referenced from.
Within the rowData context, .assays$foo is an alias for lapply(seq_len(nrow()), \(i, x) x[i,,drop=FALSE], x = foo)
Within the colData context, .assays$foo is an alias for lapply(seq_len(ncol()), \(i, x) x[,i,drop=FALSE], x = foo)
.assays_asis → pronoun to direct bindings in assays()
assay_ctx(expr, asis = FALSE) → shorthand to bind the assay context in front of the current context.
rows(...) → sentinel function call to indicate the evaluation context.
.rows → contextual pronoun
Within the assay context, .rows$foo is an alias for vctrs::vec_rep(foo, times = ncol())
Within the colData context, .rows$foo is an alias for vctrs::vec_rep(list(foo), times = n())
.rows_asis → pronoun to direct bindings in rowData()
row_ctx(expr, asis = FALSE) → shorthand to bind the rowData context in front of the current context.
cols(...) → sentinel function call to indicate the evaluation context.
.cols → contextual pronoun
Within the assay context, .cols$foo is an alias for vctrs::vec_rep_each(foo, times = nrow())
Within the rowData context, .cols$foo is an alias for vctrs::vec_rep(list(foo), times = n())
.cols_asis → pronoun to direct bindings in colData()
col_ctx(expr, asis = FALSE) → shorthand to bind the colData context in front of the current context.
plyxp
We can compare two ways of dividing out a vector from colData along the columns of assay data:
# here the `.cols$` pronoun reshapes the data to fit the `assays` context
airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor)
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
## .features .samples | counts scaled_counts | gene_id gene_name gene_biotype |
## <chr> <chr> | <int> <dbl> | <chr> <chr> <chr> |
## 1 ENSG0000… SRR1039… | 679 752. | ENSG00… TSPAN6 protein_cod… |
## 2 ENSG0000… SRR1039… | 0 0 | ENSG00… TNMD