Quick start

plyxp provides efficient abstractions over the SummarizedExperiment so that using common dplyr functions feels as natural as operating on a data.frame or tibble. plyxp uses data-masking from the rlang package to connect dplyr functions to SummarizedExperiment slots in a manner that aims to be intuitive and to avoid ambiguity in outcomes.

Supported data types and operations

plyxp works on SummarizedExperiment objects, as well as most classes derived from it, including DESeqDataSet and SingleCellExperiment.

It supports the following operations:

  • mutate
  • select
  • summarize
  • pull
  • filter
  • arrange

Typical use case

library(airway)
data(airway)
library(dplyr)
library(plyxp)
# to use plyxp, call `new_plyxp()` on your SummarizedExperiment object
airway <- new_plyxp(airway)
# add data (mutate) to any of the three tables,
# assay, colData, rowData,
# ...using contextual helpers cols() and rows()
airway |>
  mutate(log_counts = log1p(counts),
         cols(treated = dex == "trt"),
         rows(new_id = paste0("gene-", gene_name)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
##     .features      .samples | counts log_counts | gene_id gene_name gene_biotype
##     <chr>          <chr>    |  <int>      <dbl> | <chr>   <chr>     <chr>       
##   1 ENSG000000000… SRR1039… |    679       6.52 | ENSG00… TSPAN6    protein_cod…
##   2 ENSG000000000… SRR1039… |      0       0    | ENSG00… TNMD      protein_cod…
##   3 ENSG000000004… SRR1039… |    467       6.15 | ENSG00… DPM1      protein_cod…
##   4 ENSG000000004… SRR1039… |    260       5.56 | ENSG00… SCYL3     protein_cod…
##   5 ENSG000000004… SRR1039… |     60       4.11 | ENSG00… C1orf112  protein_cod…
##   …        …           …           …         …       …        …           …     
## n-4 ENSG000002734… SRR1039… |      0       0    | ENSG00… RP11-180… antisense   
## n-3 ENSG000002734… SRR1039… |      0       0    | ENSG00… TSEN34    protein_cod…
## n-2 ENSG000002734… SRR1039… |      0       0    | ENSG00… RP11-138… lincRNA     
## n-1 ENSG000002734… SRR1039… |      0       0    | ENSG00… AP000230… lincRNA     
## n   ENSG000002734… SRR1039… |      0       0    | ENSG00… RP11-80H… lincRNA     
## # ℹ n = 509,416
## # ℹ 7 more variables: new_id <chr>, `` <>, SampleName <fct>, cell <fct>,
## #   dex <fct>, albut <fct>, treated <lgl>

The operations can span contexts, and only the necessary data will be extracted from each context for computation:

airway$sizeFactor <- runif(8, .9, 1.1)

# making scaled counts, then computing row means:
airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor,
         rows(ave_scl_cts = rowMeans(.assays_asis$scaled_counts)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
##     .features   .samples | counts scaled_counts | gene_id gene_name gene_biotype
##     <chr>       <chr>    |  <int>         <dbl> | <chr>   <chr>     <chr>       
##   1 ENSG000000… SRR1039… |    679         732.  | ENSG00… TSPAN6    protein_cod…
##   2 ENSG000000… SRR1039… |      0           0   | ENSG00… TNMD      protein_cod…
##   3 ENSG000000… SRR1039… |    467         503.  | ENSG00… DPM1      protein_cod…
##   4 ENSG000000… SRR1039… |    260         280.  | ENSG00… SCYL3     protein_cod…
##   5 ENSG000000… SRR1039… |     60          64.7 | ENSG00… C1orf112  protein_cod…
##   …      …          …           …            …       …        …           …     
## n-4 ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-180… antisense   
## n-3 ENSG000002… SRR1039… |      0           0   | ENSG00… TSEN34    protein_cod…
## n-2 ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-138… lincRNA     
## n-1 ENSG000002… SRR1039… |      0           0   | ENSG00… AP000230… lincRNA     
## n   ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-80H… lincRNA     
## # ℹ n = 509,416
## # ℹ 7 more variables: ave_scl_cts <dbl>, `` <>, SampleName <fct>, cell <fct>,
## #   dex <fct>, albut <fct>, sizeFactor <dbl>

Calling .cols in the assay context produces an object of the matching size and orientation to the other assay data.
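
To make this reshaping concrete, here is a minimal base R sketch that does not use plyxp at all. The toy matrix `counts` and the vector `size_factor` are made up for illustration. It mimics the documented behavior that, in the assay context, `.cols$foo` is equivalent to repeating each element `nrow()` times, so the result lines up with a column-major matrix.

```r
# Toy data: 3 features x 2 samples (illustrative values only)
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(paste0("g", 1:3), c("s1", "s2")))
size_factor <- c(s1 = 2, s2 = 10)

# Repeat each per-sample value once per row, mirroring
# vctrs::vec_rep_each(foo, times = nrow())
reshaped <- rep(unname(size_factor), each = nrow(counts))

# Element-wise division now scales every column by its own factor
scaled <- counts / reshaped

# The same result obtained by sweeping the vector over the columns manually
scaled_sweep <- sweep(counts, 2, size_factor, "/")
```

Without the reshaping step, `counts / size_factor` would recycle the vector down the rows rather than across the columns, which is the mistake the pronoun is designed to prevent.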

Alternatively we could have used purrr to compute row means:

airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor,
         # You may expect a list when accessing other contexts
         # from either the rows() or cols() contexts.
         rows(ave_scl_cts = purrr::map_dbl(.assays$scaled_counts, mean)))
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
##     .features   .samples | counts scaled_counts | gene_id gene_name gene_biotype
##     <chr>       <chr>    |  <int>         <dbl> | <chr>   <chr>     <chr>       
##   1 ENSG000000… SRR1039… |    679         732.  | ENSG00… TSPAN6    protein_cod…
##   2 ENSG000000… SRR1039… |      0           0   | ENSG00… TNMD      protein_cod…
##   3 ENSG000000… SRR1039… |    467         503.  | ENSG00… DPM1      protein_cod…
##   4 ENSG000000… SRR1039… |    260         280.  | ENSG00… SCYL3     protein_cod…
##   5 ENSG000000… SRR1039… |     60          64.7 | ENSG00… C1orf112  protein_cod…
##   …      …          …           …            …       …        …           …     
## n-4 ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-180… antisense   
## n-3 ENSG000002… SRR1039… |      0           0   | ENSG00… TSEN34    protein_cod…
## n-2 ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-138… lincRNA     
## n-1 ENSG000002… SRR1039… |      0           0   | ENSG00… AP000230… lincRNA     
## n   ENSG000002… SRR1039… |      0           0   | ENSG00… RP11-80H… lincRNA     
## # ℹ n = 509,416
## # ℹ 7 more variables: ave_scl_cts <dbl>, `` <>, SampleName <fct>, cell <fct>,
## #   dex <fct>, albut <fct>, sizeFactor <dbl>

See below for details on how objects are made available across contexts.

plyxp also enables common grouping and summarization routines:

summary <- airway |>
  group_by(rows(gene_biotype)) |>
  summarize(col_sums = colSums(counts),
            # may rename rows with .features
            rows(.features = unique(gene_biotype)))
# summarize returns a SummarizedExperiment here,
# retaining rowData and colData

summary |> rowData()
## DataFrame with 30 rows and 1 column
##                                    gene_biotype
##                                     <character>
## protein_coding                   protein_coding
## pseudogene                           pseudogene
## processed_transcript       processed_transcript
## antisense                             antisense
## lincRNA                                 lincRNA
## ...                                         ...
## IG_C_pseudogene                 IG_C_pseudogene
## TR_D_gene                             TR_D_gene
## IG_J_pseudogene                 IG_J_pseudogene
## 3prime_overlapping_ncrna 3prime_overlapping_n..
## processed_pseudogene       processed_pseudogene
# visualizing the output as a tibble:
library(tibble)
summary |>
  pull(col_sums) |>
  as_tibble(rownames = "type")
## # A tibble: 30 × 9
##    type        SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 protein_co…   19413626      45654          4       1188      96378          0
##  2 pseudogene    17741060      45864          4        462      38656          0
##  3 processed_…   23926011     133335          0       1049      64884          0
##  4 antisense     14360299     120060          4       1113      36267          0
##  5 lincRNA       23003444     206075       6038        626      81606          0
##  6 polymorphi…   29233398     125015       5618        803      88868          0
##  7 IG_V_pseud…   18114369     145078       7662        316      44385          0
##  8 IG_V_gene     20103401     170641       5579        256      92499          0
##  9 sense_over…     807285     147563       7869        339        491          0
## 10 sense_intr…     733916     149486       9443        202        502          0
## # ℹ 20 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
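
Conceptually, grouping by a rowData column and summarizing with `colSums()` collapses the rows of each group into a single row. A rough base R analogue is sketched below with `rowsum()`. It assumes a toy matrix and a hypothetical `biotype` grouping vector, and it is not meant to describe how plyxp implements grouping internally.

```r
# Toy data: 4 features x 2 samples (illustrative values only)
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(paste0("g", 1:4), c("s1", "s2")))

# Hypothetical per-feature annotation, standing in for a rowData column
biotype <- c("protein_coding", "lincRNA", "protein_coding", "lincRNA")

# One row per group, one column per sample; reorder = FALSE keeps
# the groups in order of first appearance
col_sums <- rowsum(counts, group = biotype, reorder = FALSE)
```

Here `col_sums` has one row per distinct `biotype` value, which parallels the way the summarized object above has one feature per group.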

Manipulating SummarizedExperiment with plyxp

The SummarizedExperiment object contains three main components, or "contexts", that we mask: assays(), rowData() and colData().

Simplified view of data masking structure. Figure made with [Biorender](https://biorender.com)


Within a context, plyxp exposes that context's data as-is, so you can call S4 methods on S4 objects from inside dplyr verbs. To access variables outside the current context, use the pronouns plyxp provides to specify where those variables are found.

Simplified view of reshaping pronouns. Arrows indicate where each pronoun provides access. For each pronoun listed, there is an `_asis` variant that returns the underlying data without reshaping it to fit the context. Figure made with [Biorender](https://biorender.com)


The output of the .assays, .rows and .cols pronouns depends on the evaluating context. In the assay context, the .rows and .cols pronouns return a vector that is replicated to match the size of the assay data.


Alternatively, using a pronoun in either the rows() or cols() context will return a list equal in length to nrow(rowData()) or nrow(colData()), respectively.

assay context

  • Default evaluation context
  • .assays → contextual pronoun; returns a list of matrix slices, split along the dimension of the context it was referenced from.
    • within the rowData context: .assays$foo is an alias for lapply(seq_len(nrow()), \(i, x) x[i,,drop=FALSE], x = foo)
    • within the colData context: .assays$foo is an alias for lapply(seq_len(ncol()), \(i, x) x[,i,drop=FALSE], x = foo)
  • .assays_asis → pronoun to direct bindings in assays()
  • assay_ctx(expr, asis = FALSE) → shorthand to bind the assay context in front of the current context.
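
The slicing described for the rowData context can be reproduced in base R. The sketch below uses a made-up matrix `foo` and applies the same `lapply()` expression given above, with `nrow()` called on the toy matrix since there is no data mask here.

```r
# Toy assay: 3 features x 2 samples (illustrative values only)
foo <- matrix(1:6, nrow = 3,
              dimnames = list(paste0("g", 1:3), c("s1", "s2")))

# One 1 x ncol slice per row, as in the alias for the rowData context
row_slices <- lapply(seq_len(nrow(foo)), \(i, x) x[i, , drop = FALSE], x = foo)

# Reducing each slice to a scalar yields one value per row
row_means <- vapply(row_slices, mean, numeric(1))
```

Because each element of the list is a slice for a single feature, mapping a scalar-valued function such as `mean` over it gives a vector that fits the rowData context.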

rows context

  • rows(...) → sentinel function call to indicate evaluation context.
  • .rows → contextual pronoun
    • within assay context: .rows$foo is an alias for vctrs::vec_rep(foo, times = ncol())
    • within colData context: .rows$foo is an alias for vctrs::vec_rep(list(foo), times = n())
  • .rows_asis → pronoun to direct bindings in rowData()
  • row_ctx(expr, asis = FALSE) → shorthand to bind the rowData context in front of the current context
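
The two aliases for `.rows` can be illustrated with base R's `rep()`. This is only a sketch under stated assumptions: `foo` is a hypothetical rowData column, the dimensions are made up, and `n()` is taken to mean the number of samples when evaluated in the colData context.

```r
n_row <- 3
n_col <- 2
foo <- c("a", "b", "c")  # hypothetical rowData column, one value per feature

# Assay context: repeat the whole vector once per column, so that it
# fills a column-major matrix with one copy of `foo` in each column
in_assay <- rep(foo, times = n_col)
as_matrix <- matrix(in_assay, nrow = n_row, ncol = n_col)

# colData context: every sample receives the full vector, wrapped in a list
in_cols <- rep(list(foo), times = n_col)
```

The assay-context form is a flat vector matching the assay's size, while the colData-context form is a list with one element per sample.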

cols context

  • cols(...) → sentinel function call to indicate evaluation context.
  • .cols → contextual pronoun
    • within assay context: .cols$foo is an alias for vctrs::vec_rep_each(foo, times = nrow())
    • within rowData context: .cols$foo is an alias for vctrs::vec_rep(list(foo), times = n())
  • .cols_asis → pronoun to direct bindings in colData()
  • col_ctx(expr, asis = FALSE) → shorthand to bind the colData context in front of the current context

Multiple expressions enabled via plyxp

We can compare two ways of dividing out a vector from colData along the columns of assay data:

# here the `.cols$` pronoun reshapes the data to fit the `assays` context
airway |>
  mutate(scaled_counts = counts / .cols$sizeFactor)
## # A RangedSummarizedExperiment-tibble Abstraction: 63,677 × 8
##     .features .samples | counts scaled_counts | gene_id gene_name gene_biotype |
##     <chr>     <chr>    |  <int>         <dbl> | <chr>   <chr>     <chr>        |
##   1 ENSG0000… SRR1039… |    679         732.  | ENSG00… TSPAN6    protein_cod… |
##   2 ENSG0000… SRR1039… |      0           0   | ENSG00… TNMD      protein_cod… |
##   3 ENSG0000… SRR1039… |    467         503.  | ENSG00… DPM1      protein_cod… |
##   4 ENSG0000… SRR1039… |    260         280.  | ENSG00… SCYL3     protein_cod… |
##   5 ENSG0000… SRR1039… |     60          64.7 | ENSG00… C1orf112  protein_cod… |
##   …     …         …           …