1 Acknowledgments

hermes is a successor of the Roche internal rnaseqTools R package, and therefore many code ideas have been borrowed from it. Therefore we would like to thank the rnaseqTools authors for their work.

In particular, we would like to acknowledge Chendi Liao and Joe Paulson for their guidance and explanations during the development of hermes. We also discussed the class design with Valerie Obenchain, and discussed RNAseq data standards with Armen Karapetyan. We borrowed some ideas from the Roche internal biokitr R package and discussed them with its maintainer Daniel Marbach.

Finally, hermes originated as part of the NEST project. We are grateful for the entire team’s support.

Thanks a lot to everyone involved!

2 Installation

First let’s see how we can install the hermes package.

2.1 BioConductor

With the development version (3.15) of BioConductor, you can install the current package version with:

if (!require("BiocManager")) {
  install.packages("BiocManager")
}
BiocManager::install("hermes")

2.2 GitHub

You can install the unstable development version from GitHub with:

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("insightsengineering/hermes")

3 Introduction

The hermes R package provides classes, methods and functions to import, quality-check, filter, normalize, analyze RNAseq counts data. The core functionality is built on the BioConductor ecosystem, especially SummarizedExperiment. This is the vignette to read for new users of this package.

In this vignette you are going to learn how to:

  • Import RNAseq count data into the hermes ready format.
  • Annotate gene information automatically from a central database (e.g. BioMart).
  • Add quality control (QC) flags to genes and samples.
  • Filter the data set.
  • Normalize the counts.
  • Quickly produce descriptive plots.
  • Perform principal components analysis.
  • Produce a templated QC report.
  • Perform differential expression analysis.

The packages used in this vignette are:

library(hermes)
library(SummarizedExperiment)

The datasets used in this vignette are:

?expression_set
?summarized_experiment

4 Importing Data

The data for hermes needs to be imported into the HermesData or RangedHermesData format.

4.1 Importing a SummarizedExperiment

The simplest way to import data is from a SummarizedExperiment (SE) object. This is because a HermesData object is just a special SE, with few additional requirements and slots.

In a nutshell, the object needs to have a counts assay, have certain gene and sample variables, and have unique row and column names. The row names, i.e. the gene names, must start with a common prefix GeneID or ENSG to enable easy annotations. See ?HermesData for the detailed requirements.

When the SE follows the minimum conventions, we can just call the HermesData constructor on it:

object <- HermesData(summarized_experiment)

And we have a HermesData object.

object
#> class: HermesData
#> assays(1): counts
#> genes(5085): GeneID:11185 GeneID:10677 ... GeneID:9087 GeneID:9426
#> additional gene information(12): HGNC HGNCGeneName ... chromosome_name
#>   LowExpressionFlag
#> samples(20): 06520011B0023R 06520067C0018R ... 06520015C0016R
#>   06520019C0023R
#> additional sample information(74): Filename SampleID ... LowDepthFlag
#>   TechnicalFailureFlag

Note that in this case deprecated names were used for the rowData and colData variables, therefore they appear under “additional” gene and sample information. However we can still call the default constructor because the new names will be filled with missing values, e.g.:

head(annotation(object))
#> DataFrame with 6 rows and 4 columns
#>                     symbol      desc chromosome      size
#>                  <logical> <logical>  <logical> <logical>
#> GeneID:11185            NA        NA         NA        NA
#> GeneID:10677            NA        NA         NA        NA
#> GeneID:101928428        NA        NA         NA        NA
#> GeneID:100422835        NA        NA         NA        NA
#> GeneID:102466731        NA        NA         NA        NA
#> GeneID:64881            NA        NA         NA        NA

If we want to map old column names to new column names to avoid duplication with new missing value columns, we can do this using the rename() method. For example here:

object <- summarized_experiment %>%
  rename(
    row_data = c(
      symbol = "HGNC",
      desc = "HGNCGeneName",
      chromosome = "Chromosome",
      size = "WidthBP",
      low_expression_flag = "LowExpressionFlag"
    ),
    col_data = c(
      low_depth_flag = "LowDepthFlag",
      technical_failure_flag = "TechnicalFailureFlag"
    )
  ) %>%
  HermesData()

For example we can now see in the annotations that we successfully carried over the information since we mapped the old annotations to the new required names above:

head(annotation(object))
#> DataFrame with 6 rows and 4 columns
#>                        symbol                   desc  chromosome      size
#>                   <character>            <character> <character> <integer>
#> GeneID:11185             INMT indolethylamine N-me..           7      5468
#> GeneID:10677             AVIL               advillin          12     18694
#> GeneID:101928428 LOC101928428 RNA-binding protein ..  GL000220.1       138
#> GeneID:100422835      MIR3183          microRNA 3183          17        84
#> GeneID:102466731     MIR6769A         microRNA 6769a          16        73
#> GeneID:64881           PCDH20       protocadherin 20          13      5838

For a bit more details we can also call summary() on the object.

summary(object)
#> HermesData object with 20 samples of 5085 genes.
#> - Library sizes across samples: mean 5476759, median 5365970, range 4632496 to 7262374
#> - Included assays (1): counts
#> - Additional gene information (7): GeneID StartBP ... SYMBOL
#>   chromosome_name
#> - Additional sample information (73): Filename SampleID ... STDSSDY
#>   technical_failure_flag
#> - Low expression genes (3021): GeneID:10677 GeneID:101928428 ...
#>   GeneID:9084 GeneID:9426
#> - Samples with too low depth or technical failures (20): NA NA ... NA
#>   NA

For the below, let’s use the already prepared HermesData object.

object <- hermes_data

Likewise, when we receive the error “no ‘counts’ assay found”, we can use the rename() function to change the name of the assay in the SummarizedExperiment object to counts. For example, the following object of type SummarizedExperiment would have the assay name count, and would produce the assay name error:

object_exp <- summarized_experiment %>%
  rename(assays = c(count = "counts"))

And we would use the following code to convert the assay name to counts, making it able to convert into HermesData object:

object_exp <- rename(object_exp,
  assays = c(counts = "count")
)
object_exp <- HermesData(object_exp)

4.2 Importing an ExpressionSet

If we start from an ExpressionSet, we can first convert it to a RangedSummarizedExperiment and then import it to RangedHermesData:

se <- makeSummarizedExperimentFromExpressionSet(expression_set)
object2 <- HermesData(se)
object2
#> class: RangedHermesData
#> assays(1): counts
#> genes(5085): GeneID:11185 GeneID:10677 ... GeneID:9087 GeneID:9426
#> additional gene information(12): HGNC HGNCGeneName ... chromosome_name
#>   LowExpressionFlag
#> samples(20): 06520011B0023R 06520067C0018R ... 06520015C0016R
#>   06520019C0023R
#> additional sample information(74): Filename SampleID ... LowDepthFlag
#>   TechnicalFailureFlag

4.3 Importing a Matrix

In general we can also import a matrix of counts. We just have to pass the required gene and sample information as data frames to the constructor.

counts_matrix <- assay(hermes_data)
object3 <- HermesDataFromMatrix(
  counts = counts_matrix,
  rowData = rowData(hermes_data),
  colData = colData(hermes_data)
)
object3
#> class: HermesData
#> assays(1): counts
#> genes(5085): GeneID:11185 GeneID:10677 ... GeneID:9087 GeneID:9426
#> additional gene information(3): GeneID SYMBOL chromosome_name
#> samples(20): 06520011B0023R 06520067C0018R ... 06520015C0016R
#>   06520019C0023R
#> additional sample information(72): Filename SampleID ... TTYPE STDSSDY
identical(object, object3)
#> [1] TRUE

Note that we can easily access the counts assay (matrix) in the final object with counts():

cnts <- counts(object)
cnts[1:3, 1:3]
#>                  06520011B0023R 06520067C0018R 06520063C0043R
#> GeneID:11185                  3             66             35
#> GeneID:10677               1668            236             95
#> GeneID:101928428              0              0              0

5 Annotations

hermes provides a modular approach for querying gene annotations, in order to allow for future extensions in this or other downstream packages.

5.1 Connection to Database

The first step is to connect to a database. In hermes the only option is currently databases that utilize the BioMart software suite. However due to the generic function design, it is simple to extend hermes with other data base connections.

In order to save time during vignette build, we zoom in here on a subset of the original object containing only the first 10 genes.

small_object <- object[1:10, ]

The corresponding function takes the common gene ID prefix as argument to determine the format of the gene IDs and the filter variable to use in the query later on.

httr::set_config(httr::config(ssl_verifypeer = 0L))
connection <- connect_biomart(prefix(small_object))

Here we are using the prefix() method to access the prefix saved in the HermesData object.

5.2 Querying and Saving Annotations

Then the second step is to query the gene annotations and save them in the object.

annotation(small_object) <- query(genes(small_object), connection)

Here we are using the genes() method to access the gene IDs (row names) of the HermesData object. Note that not all genes might be found in the data base and the corresponding rows would then be NA in the annotations.

6 Quality Control Flags

hermes provides automatic gene and sample flagging, as well as manual sample flagging functionality.

6.1 Automatic Gene and Sample Flagging

For genes, it is counted how many samples don’t pass a minimum expression CPM (counts per million reads mapped) threshold. If too many, then this gene is flagged as a “low expression” gene.

For samples, two flags are provided. The “technical failure” flag is based on the average Pearson correlation with other samples. The “low depth” flag is based on the library size, i.e. the total sum of counts for a sample across all genes.

Thresholds for the above flags can be initialized with control_quality(), and the flags are added with add_quality_flags().

my_controls <- control_quality(min_cpm = 10, min_cpm_prop = 0.4, min_corr = 0.4, min_depth = 1e4)
#> Loading required namespace: testthat
object_flagged <- add_quality_flags(object, control = my_controls)

6.2 Manual Sample Flagging

Sometimes it is necessary to manually flag certain samples as technical failures, e.g. after looking at one of the analyses discussed below. This is possible, too.

object_flagged <- set_tech_failure(object_flagged, sample_ids = "06520011B0023R")

6.3 Accessing Flags

All flags have access functions.

head(get_tech_failure(object_flagged))
#> 06520011B0023R 06520067C0018R 06520063C0043R 06520105C0017R 06520092C0017R 
#>           TRUE          FALSE          FALSE          FALSE          FALSE 
#> 06520103C0017R 
#>          FALSE
head(get_low_depth(object_flagged))
#> 06520011B0023R 06520067C0018R 06520063C0043R 06520105C0017R 06520092C0017R 
#>          FALSE          FALSE          FALSE          FALSE          FALSE 
#> 06520103C0017R 
#>          FALSE
head(get_low_expression(object_flagged))
#>     GeneID:11185     GeneID:10677 GeneID:101928428 GeneID:100422835 
#>             TRUE            FALSE             TRUE             TRUE 
#> GeneID:102466731     GeneID:64881 
#>             TRUE             TRUE

7 Filtering Data

We can either filter based on the default QC flags, or based on custom variables from the gene or sample information.

7.1 Based on Default QC Flags

This is simple with the filter() function. It is also possible to selectively only filter the genes or the samples using the what argument.

object_flagged_filtered <- filter(object_flagged)
object_flagged_genes_filtered <- filter(object_flagged, what = "genes")

7.2 Based on Custom Variables

This can be done with the subset() function. Genes can be filtered with the subset argument via expressions using the gene information variables, and samples can be filtered with the select argument using the sample information variables. In order to see which ones are available these can be queries first.

names(rowData(object_flagged))
#> [1] "symbol"              "desc"                "GeneID"             
#> [4] "chromosome"          "size"                "SYMBOL"             
#> [7] "chromosome_name"     "low_expression_flag"
names(colData(object_flagged))
#>  [1] "Filename"          "SampleID"          "AGEGRP"           
#>  [4] "AGE18"             "STDDRS"            "STDDRSD"          
#>  [7] "STDSSDT"           "TRTDRS"            "TRTDRSD"          
#> [10] "BHDCIRC"           "BHDCIRCU"          "ADAFL"            
#> [13] "BLANP"             "BKPS"              "BLKS"             
#> [16] "BTANNER"           "FRPST"             "DURIDX"           
#> [19] "DURSAF"            "DURSUR"            "LNTHRPY"          
#> [22] "AENCIFL"           "STUDYID"           "USUBJID"          
#> [25] "RFSTDTC"           "RFENDTC"           "RFXSTDTC"         
#> [28] "RFXENDTC"          "RFICDTC"           "RFPENDTC"         
#> [31] "DTHDTC"            "DTHFL"             "SITEID"           
#> [34] "INVID"             "AGE"               "AGEU"             
#> [37] "SEX"               "RACE"              "ETHNIC"           
#> [40] "ARMCD"             "ARM"               "ACTARMCD"         
#> [43] "ACTARM"            "COUNTRY"           "DMDTC"            
#> [46] "DMDY"              "BAGE"              "BAGEU"            
#> [49] "BWT"               "BWTU"              "BHT"              
#> [52] "BHTU"              "BBMI"              "ITTFL"            
#> [55] "SAFFL"             "INFCODT"           "RANDDT"           
#> [58] "TRTSDTC"           "TRTSDTM"           "TRTSTMF"          
#> [61] "TRTEDTM"           "TRTETMF"           "TRTDUR"           
#> [64] "DISCSTUD"          "DISCDEAT"          "DISCAE"           
#> [67] "DISTRTFL"          "AEWITHFL"          "ALIVDT"           
#> [70] "COHORT"            "TTYPE"             "STDSSDY"          
#> [73] "low_depth_flag"    "tech_failure_flag"
head(rowData(object_flagged)$chromosome)
#> [1] "7"          "12"         "GL000220.1" "17"         "16"        
#> [6] "13"
head(object_flagged$ARMCD)
#> [1] "COH1"  "COH1"  "COH8"  "COH12" "COH9O" "COH9E"
object_flagged_subsetted <- subset(
  object_flagged,
  subset = chromosome == "5",
  select = ARMCD == "COH1"
)

8 Normalizing Counts

Normalizing counts within samples (CPM), genes (RPKM) or across both (TPM) can be achieved with the normalize() function. The normalize() function can also transform the counts by the variance stabilizing transformation (vst) and the regularized log transformation (rlog) as proposed in the DESeq2 package.

object_normalized <- normalize(object_flagged_filtered)
#> -- note: fitType='parametric', but the dispersion trend was not well captured by the
#>    function: y = a/x + b, and a local regression fit was automatically substituted.
#>    specify fitType='local' or 'mean' to avoid this message next time.

The corresponding assays are saved in the object and can be accessed with assay().

assay(object_normalized, "tpm")[1:3, 1:3]
#>               06520067C0018R 06520063C0043R 06520105C0017R
#> GeneID:10677        4.096418       3.323016       7.714990
#> GeneID:286205       2.985506       3.182624       3.769962
#> GeneID:8365        11.711741      12.421108      12.466491

The used control settings can be accessed afterwards from the metadata of the object:

metadata(object_normalized)
#> $control_quality_flags
#> $control_quality_flags$min_cpm
#> [1] 10
#> 
#> $control_quality_flags$min_cpm_prop
#> [1] 0.4
#> 
#> $control_quality_flags$min_corr
#> [1] 0.4
#> 
#> $control_quality_flags$min_depth
#> [1] 10000
#> 
#> 
#> $control_normalize
#> $control_normalize$log
#> [1] TRUE
#> 
#> $control_normalize$lib_sizes
#> NULL
#> 
#> $control_normalize$prior_count
#> [1] 1
#> 
#> $control_normalize$fit_type
#> [1] "parametric"

Note that also the filtering settings are saved in here. For custom normalization options, use control_normalize(). For example, to not use log scale but the original scale of the counts:

object_normalized_original <- normalize(
  object_flagged_filtered,
  control = control_normalize(log = FALSE)
)
#> -- note: fitType='parametric', but the dispersion trend was not well captured by the
#>    function: y = a/x + b, and a local regression fit was automatically substituted.
#>    specify fitType='local' or 'mean' to avoid this message next time.
assay(object_normalized_original, "tpm")[1:3, 1:3]
#>               06520067C0018R 06520063C0043R 06520105C0017R
#> GeneID:10677       16.105854       9.007544       209.1084
#> GeneID:286205       6.920033       8.079569        12.6418
#> GeneID:8365      3353.172671    5483.360511      5658.6256

9 Descriptive Plots

9.1 Simple Plots

A series of simple descriptive plots can be obtained by just calling autoplot() on an object.

autoplot(object)