Contents

Abstract

autonomics is created to make cross-platform omics analysis intuitive and enjoyable. It contains import + preprocess + analyze + visualize one-liners for RNAseq, MS proteomics, SOMAscan and Metabolon platforms and a generic importer for data from any rectangular omics file. With a focus on not only automation but also customization, these importers have a flexible formula/ block/contrastdefs interface which allows to define any statistical formula, fit any general linear model, quantify any contrast, and use random effects or precision weights if required, building on top of the powerful limma platform for statistical analysis. It also offers exquisite support for analyzing complex designs such as the innovative contrastogram visualization to unravel and summarize the differential expression structure in complex designs. By autonomating repetitive tasks, generifying common steps, and intuifying complex designs, it makes cross-platform omics data analysis a fun experience. Try it and enjoy :-).

Autonomics offers import + preprocess + analyze + visualize one-liners for RNAseq, MS proteomics, SOMAscan and Metabolon platforms, as well a generic importer for data from any rectangular omics file. We will discuss each of these in more detail in the following sections, but all of them follow the following general steps:

1 RNAseq

Autonomics provides ready-to-use importers for both count as well as BAM files, which read / preprocess / analyze RNAseq data. Specific to RNAseq is counting reads using Rsubread::featureCounts (for read_rna_bams)), normalizing for library composition (cpm) or gene length (tpm), estimating voom precision weights, and using these in linear modeling. Let’s illustrate both of these on example datasets:

read_rnaseq_counts

Billing et al. (2019) studied the differentiation of embryonic (E00) into mesenchymal stem cells (E00.8, E01, E02, E05, E15, E30), with similar properties as bone-marrow mesenchymal stem cells (M00). Importing the RNAseq counts:

require(autonomics, quietly = TRUE)
## 
## Attaching package: 'autonomics'
## The following objects are masked from 'package:stats':
## 
##     biplot, loadings
## The following object is masked from 'package:base':
## 
##     beta
file <- system.file('extdata/billing19.rnacounts.txt', package = 'autonomics')
object <-  read_rnaseq_counts(file = file, fit = 'limma', plot = TRUE, label = 'gene')
##  Read /tmp/RtmpJJDNTi/Rinst134f727a111cd7/autonomics/extdata/billing19.rnacounts.txt
##               Infer subgroup from sample_ids
##               Infer subgroup from sample_ids
##      Preprocess
##          Keep 24/24 features: count >= 10 (~subgroup)
##          counts: add 0.5
##          cpm:    tmm scale libsizes
##              cpm
##          voom: ~subgroup
##          counts: rm 0.5
##               Add PCA
##     LinMod
##               Code subgroup
##                                level
##                     coefficient E00 E00.8 E01 E02 E05 E15 E30 M00
##                       Intercept  1   .     .   .   .   .   .   . 
##                       E00.8-E00 -1   1     .   .   .   .   .   . 
##                       E01-E00   -1   .     1   .   .   .   .   . 
##                       E02-E00   -1   .     .   1   .   .   .   . 
##                       E05-E00   -1   .     .   .   1   .   .   . 
##                       E15-E00   -1   .     .   .   .   1   .   . 
##                       E30-E00   -1   .     .   .   .   .   1   . 
##                       M00-E00   -1   .     .   .   .   .   .   1
##               lmFit(~subgroup, weights = assays(object)$weights)
##                      coefficient    fit downfdr upfdr downp   upp
##                           <fctr> <fctr>   <int> <int> <int> <int>
##                   1:   Intercept  limma       0    24     0    24
##                   2:   E00.8-E00  limma       1     2     1     4
##                   3:     E01-E00  limma       3     2     5     3
##                   4:     E02-E00  limma       2     2     3     4
##                   5:     E05-E00  limma       1     2     2     2
##                   6:     E15-E00  limma       8     9     8    10
##                   7:     E30-E00  limma       7    10     8    10
##                   8:     M00-E00  limma       6     7     6     7
##      Plot summary
## Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

read_rnaseq_bams

Billing et al. (2016) compared three types of stem cells: embryonic (E), embryonic differentiated into mesenchymal (EM), and bone-marrow mesenchymal (BM). Importing a downsized version of the RNAseq BAM files (with only a subset of all reads):

# not run to avoid issues with R CMD CHECK 
if (requireNamespace('Rsubread')){
  object <- read_rnaseq_bams(
                dir    = download_data('billing16.bam.zip'), 
                paired = TRUE, 
                genome = 'hg38', 
                pca    = TRUE, 
                fit   = 'limma', 
                plot   = TRUE)
}

cpm/tmm/voom effects on power

Proper preprocessing leads to increased power:

  file <- system.file('extdata/billing19.rnacounts.txt', package = 'autonomics')
# log2counts
  object <- read_rnaseq_counts(file, 
       cpm = FALSE, voom = FALSE, fit = 'limma', verbose = FALSE, plot = FALSE)
##     LinMod
##               Code subgroup
##                                level
##                     coefficient E00 E00.8 E01 E02 E05 E15 E30 M00
##                       Intercept  1   .     .   .   .   .   .   . 
##                       E00.8-E00 -1   1     .   .   .   .   .   . 
##                       E01-E00   -1   .     1   .   .   .   .   . 
##                       E02-E00   -1   .     .   1   .   .   .   . 
##                       E05-E00   -1   .     .   .   1   .   .   . 
##                       E15-E00   -1   .     .   .   .   1   .   . 
##                       E30-E00   -1   .     .   .   .   .   1   . 
##                       M00-E00   -1   .     .   .   .   .   .   1
  colSums(summarize_fit(fdt(object), 'limma')[-1, c(3,4)])
## downfdr   upfdr 
##       7      13
# log2cpm
  object <- read_rnaseq_counts(file,
       cpm = TRUE,  voom = FALSE, fit = 'limma', verbose = FALSE, plot = FALSE)
##     LinMod
##               Code subgroup
##                                level
##                     coefficient E00 E00.8 E01 E02 E05 E15 E30 M00
##                       Intercept  1   .     .   .   .   .   .   . 
##                       E00.8-E00 -1   1     .   .   .   .   .   . 
##                       E01-E00   -1   .     1   .   .   .   .   . 
##                       E02-E00   -1   .     .   1   .   .   .   . 
##                       E05-E00   -1   .     .   .   1   .   .   . 
##                       E15-E00   -1   .     .   .   .   1   .   . 
##                       E30-E00   -1   .     .   .   .   .   1   . 
##                       M00-E00   -1   .     .   .   .   .   .   1
  colSums(summarize_fit(fdt(object), 'limma')[-1, c(3,4)])
## downfdr   upfdr 
##      29      34
# log2cpm + voom
  object <- read_rnaseq_counts(file,  # log2 cpm + voom
       cpm = TRUE,  voom = TRUE,  fit = 'limma', verbose = FALSE, plot = FALSE)
##     LinMod
##               Code subgroup
##                                level
##                     coefficient E00 E00.8 E01 E02 E05 E15 E30 M00
##                       Intercept  1   .     .   .   .   .   .   . 
##                       E00.8-E00 -1   1     .   .   .   .   .   . 
##                       E01-E00   -1   .     1   .   .   .   .   . 
##                       E02-E00   -1   .     .   1   .   .   .   . 
##                       E05-E00   -1   .     .   .   1   .   .   . 
##                       E15-E00   -1   .     .   .   .   1   .   . 
##                       E30-E00   -1   .     .   .   .   .   1   . 
##                       M00-E00   -1   .     .   .   .   .   .   1
  colSums(summarize_fit(fdt(object), 'limma')[-1, c(3,4)])
## downfdr   upfdr 
##      28      34

2 Proteingroups/phosphosites

A popular approach in (DDA) MS proteomics data analysis is to use MaxQuant to produce proteinGroups.txt and phospho (STY)Sites.txt files. These can be imported using read_proteingroups / read_phosphosites, which :

LFQ intensities

An LFQ intensities example is the Fukuda et al. (2020) dataset, which compares the proteome of 30d zebrafish juveniles versus adults using label-free quantification. In the volcano plot measured values are shown with circles, imputed values as triangles.

file <- system.file('extdata/fukuda20.proteingroups.txt', package = 'autonomics')
object <- read_maxquant_proteingroups(file = file, plot = TRUE)
##     Read      proteingroups   /tmp/RtmpJJDNTi/Rinst134f727a111cd7/autonomics/extdata/fukuda20.proteingroups.txt
##               contaminanthdrs /tmp/RtmpJJDNTi/Rinst134f727a111cd7/autonomics/extdata/contaminants.tsv
##               maxquanthdrs
##     Annotate  Uncollapse        20 proIds into     32  proteins
##               Drop REV_                            32  proteins
##               Annotate                                     0 using uniprothdrs
##                                                            1 using contaminanthdrs
##                                                           29 using maxquanthdrs
##                                                            0 using uniprotrestapi
##               Filter                               32  proteins
##               within proId                         31  proteins  swissprot > trembl
##                                                    25  proteins  fullseq   > fragment
##                                                    20  proteins  protein   > transcript > homolog > prediction > uncertain
##               Add REV__                            20  proteins
##               Recollapse        20 proIds
##     SumExp
##               Replace 0->NA for 27/120 values (in 11/20 features of 6/6 samples)
##               Standardize snames: LFQ intensity 30dpt.R1  ->  30dpt.R1
##               Infer subgroup from sample_ids
##               Retain 19/20 features: ~reverse == ""
##               Retain 18/19 features: ~contaminant == ""
##               Add PCA
##     LinMod
##               Code subgroup
##                                   level
##                     coefficient    X30dpt Adult
##                       Intercept     1      .   
##                       Adult-X30dpt -1      1
##               lmFit(~subgroup)
##                       coefficient    fit downfdr upfdr downp   upp
##                            <fctr> <fctr>   <int> <int> <int> <int>
##                   1:    Intercept  limma       0    18     0    18
##                   2: Adult-X30dpt  limma       2     2     3     2
##      Plot summary
## Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps