1 Introduction

2 Installation and options

ISAnalytics can be installed quickly in different ways:

  • You can install it via Bioconductor
  • You can install it via GitHub using the package devtools

There are always 2 versions of the package active:

  • RELEASE is the latest stable version
  • DEVEL is the development version: the most up-to-date version, where all new features are introduced

2.1 Installation from Bioconductor

RELEASE version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ISAnalytics")

DEVEL version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("ISAnalytics")

2.2 Installation from GitHub

RELEASE:

if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "RELEASE_3_14",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

DEVEL:

if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "master",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

2.3 Setting options

ISAnalytics has a verbose option that allows some functions to print additional information to the console while they’re executing. To disable or enable this feature:

# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

Some functions also produce reports in a user-friendly HTML format. To disable or enable this feature:

# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)
library(ISAnalytics)
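
To change an option for only part of a session, you can rely on the fact that base R's options() returns the previous values, which can be restored afterwards. The following is a minimal sketch that uses only base R and the option name shown above; nothing in it is specific to ISAnalytics:

```r
# options() invisibly returns the previous values of the options it changes
old <- options(ISAnalytics.verbose = FALSE)

# Inspect the current value; `default` is only a fallback that is
# used when the option has never been set
getOption("ISAnalytics.verbose", default = TRUE)

# ... run here the calls that should stay quiet ...

# Restore whatever was set before
options(old)
```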

2.4 What is a collision and why should you care?

We’re not going into too much detail here, but we’re going to explain in a very simple way what a “collision” is and how the function in this package deals with them.

An integration (a unique combination of chromosome, integration locus and strand) is a collision if it is shared between different independent samples, where an independent sample is a unique combination of ProjectID and SubjectID (subjects usually represent patients). Observing the very same integration in two different subjects is highly improbable, so a collision may indicate contamination in the sequencing or PCR phase, and we might want to exclude it from our analysis. ISAnalytics provides the function remove_collisions(), which processes the imported data to remove or re-assign these “problematic” integrations.

The processing is done using the sequence count value, so the corresponding matrix is needed for this operation.

2.5 The logic behind the function

The remove_collisions() function follows several logical steps to decide whether an integration is a collision and, if it is, whether to re-assign it or remove it entirely, based on different criteria.

2.5.1 Identifying the collisions

As we said before, a collision is a triplet made of chr, integration locus and strand that is shared between different independent samples, each of which is a pair made of ProjectID and SubjectID. The function uses the information stored in the association file to assess which independent samples are present and counts the number of independent samples for each integration: those with a count > 1 are considered collisions.


Table 1: Example of collisions: the same integration (1, 123454, +) is found in 2 different independent samples ((SUBJ01, PJ01) & (SUBJ02, PJ01))
chr  integration_locus  strand  seqCount  CompleteAmplificationID  SubjectID  ProjectID
1    123454             +       653       SAMPLE1                  SUBJ01     PJ01
1    123454             +       456       SAMPLE2                  SUBJ02     PJ01
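
The identification step can be illustrated with a short base R sketch. It is not the package's implementation: the toy data frame mirrors Table 1 (plus one extra integration seen in a single subject), and the helper column names indep_sample and n_indep_samples are made up for this example.

```r
# Toy data mirroring Table 1, plus one integration seen in a single subject
mat <- data.frame(
  chr = c("1", "1", "2"),
  integration_locus = c(123454, 123454, 5000),
  strand = c("+", "+", "-"),
  seqCount = c(653, 456, 120),
  CompleteAmplificationID = c("SAMPLE1", "SAMPLE2", "SAMPLE3"),
  SubjectID = c("SUBJ01", "SUBJ02", "SUBJ01"),
  ProjectID = c("PJ01", "PJ01", "PJ01"),
  stringsAsFactors = FALSE
)

# An independent sample is a unique (ProjectID, SubjectID) pair
mat$indep_sample <- paste(mat$ProjectID, mat$SubjectID, sep = "_")

# Count the distinct independent samples for each integration
n_samples <- aggregate(
  indep_sample ~ chr + integration_locus + strand,
  data = mat,
  FUN = function(x) length(unique(x))
)
names(n_samples)[names(n_samples) == "indep_sample"] <- "n_indep_samples"

# Integrations found in more than one independent sample are collisions
collisions <- n_samples[n_samples$n_indep_samples > 1, ]
collisions
```

Only the integration (1, 123454, +) is reported, since it is the only one observed in two independent samples.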

2.5.2 Re-assign vs remove

Once the collisions are identified, the function follows 3 steps where it tries to re-assign the combination to a single independent sample. The criteria are:

  1. Compare dates: if it’s possible to have an absolute ordering on dates, the integration is re-assigned to the sample that has the earliest date. If two samples share the same date it’s impossible to decide, so the next criterion is tested.
  2. Compare replicate number: if a sample has the same integration in more than one replicate, it’s more probable the integration is not an artifact. If it’s possible to have an absolute ordering, the collision is re-assigned to the sample whose grouping is largest.
  3. Compare the sequence count value: if the previous criteria weren’t sufficient to make a decision, the sequence count values are summed for each independent sample, and this cumulative value is compared with those of the other groups. If there is a single group whose value is n times bigger than that of the other groups, this one is chosen for re-assignment. The factor n is passed as a parameter to the function (reads_ratio); the default value is 10.

If none of the criteria were sufficient to make a decision, the integration is simply removed from the matrix.
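
To make the decision cascade concrete, here is a small base R sketch of the three criteria applied to a single collision. It is an illustration under stated assumptions, not the code used by remove_collisions(): the function resolve_collision(), its input columns (indep_sample, date, seqCount) and the exact comparisons (strict minimum, strict maximum, "more than reads_ratio times") are all hypothetical, and each input row is assumed to be one replicate.

```r
# Hypothetical helper: decides the fate of ONE collision.
# `rows` holds every row of the matrix for that integration, with columns
# indep_sample, date (class Date) and seqCount; one row per replicate.
# Returns the winning independent sample, or NA if the integration
# should be removed.
resolve_collision <- function(rows, reads_ratio = 10) {
  # Name of the single element equal to `best`, or NA when there is a tie
  unique_best <- function(scores, best) {
    winners <- names(scores)[scores == best]
    if (length(winners) == 1) winners else NA_character_
  }
  by_sample <- split(rows, rows$indep_sample)

  # 1. Dates: the sample with the strictly earliest date wins
  first_date <- vapply(by_sample, function(d) as.numeric(min(d$date)), numeric(1))
  winner <- unique_best(first_date, min(first_date))
  if (!is.na(winner)) return(winner)

  # 2. Replicates: the sample with strictly the most replicates wins
  n_rep <- vapply(by_sample, function(d) as.numeric(nrow(d)), numeric(1))
  winner <- unique_best(n_rep, max(n_rep))
  if (!is.na(winner)) return(winner)

  # 3. Sequence count: the top cumulative value must exceed every other
  #    group's value by more than `reads_ratio` times (assumed comparison)
  sc_sum <- vapply(by_sample, function(d) as.numeric(sum(d$seqCount)), numeric(1))
  top <- names(sc_sum)[which.max(sc_sum)]
  others <- sc_sum[names(sc_sum) != top]
  if (all(sc_sum[[top]] > reads_ratio * others)) return(top)

  # No criterion was decisive: the integration is removed
  NA_character_
}

# The collision of Table 1, assuming both samples share the same date
rows <- data.frame(
  indep_sample = c("PJ01_SUBJ01", "PJ01_SUBJ02"),
  date = as.Date(c("2020-01-10", "2020-01-10")),
  seqCount = c(653, 456)
)
resolve_collision(rows)
```

With these toy values the call returns NA: the dates are tied, each sample has one replicate, and 653 is not more than 10 times 456, so the integration would be removed.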

3 Usage

data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
## Multi quantification matrix
no_coll <- remove_collisions(x = integration_matrices,
                             association_file = association_file,
                             report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!
## Matrix list
separated <- separate_quant_matrices(integration_matrices)
no_coll_list <- remove_collisions(x = separated,
                                  association_file = association_file,
                                  report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!
## Only sequence count
no_coll_single <- remove_collisions(x = separated$seqCount,
                                    association_file = association_file,
                                    quant_cols = c(seqCount = "Value"),
                                    report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!

Important notes on the association file:

  • Make sure your association file is properly filled out. The function requires you to specify a date column (by default “SequencingDate”), and you have to ensure this column doesn’t contain NA or incorrect values.
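
A quick check of the date column before calling the function can catch these problems early. The sketch below uses a toy data frame in place of a real association file; only the column name SequencingDate comes from the text above. Treating future dates as "incorrect" is just one possible sanity check, chosen here for illustration.

```r
# Toy association file; in practice this is the imported association file
af <- data.frame(
  SubjectID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  SequencingDate = as.Date(c("2020-01-10", NA, "2020-03-02")),
  stringsAsFactors = FALSE
)

date_col <- "SequencingDate"

# Rows whose date is missing and must be fixed before removing collisions
missing_dates <- af[is.na(af[[date_col]]), ]
missing_dates

# Dates in the future are one simple example of an incorrect value
suspicious <- af[!is.na(af[[date_col]]) & af[[date_col]] > Sys.Date(), ]
suspicious
```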

The function accepts different inputs, namely:

  • A multi-quantification matrix: this is always the recommended approach
  • A named list of matrices where names are quantification types in quantification_types()
  • The single sequence count matrix: this is not the recommended approach since it requires a realignment step for other quantification matrices if you have them.

If the option ISAnalytics.reports is active, an interactive report in HTML format will be produced at the specified path.

4 Re-align other matrices

If you passed the standalone sequence count matrix to remove_collisions(), you have to realign the other matrices by calling realign_after_collisions(), passing as input the processed sequence count matrix and the named list of other matrices to realign. NOTE: the names in the list must be quantification types.

other_realigned <- realign_after_collisions(
  sc_matrix = no_coll_single,
  other_matrices = list(fragmentEstimate = separated$fragmentEstimate)
)
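
Conceptually, realigning means keeping in the other matrices only the rows that survived in the processed sequence count matrix. The base R sketch below illustrates that idea on toy data; it assumes that rows are matched on chr, integration_locus, strand and CompleteAmplificationID, which is our reading of the process and not a statement about how realign_after_collisions() is implemented. The fragment estimate values are invented.

```r
# Columns assumed to identify a row across quantification matrices
keys <- c("chr", "integration_locus", "strand", "CompleteAmplificationID")

# Processed sequence count matrix: the row of SAMPLE2 was dropped
sc_processed <- data.frame(
  chr = "1", integration_locus = 123454, strand = "+",
  CompleteAmplificationID = "SAMPLE1", Value = 653,
  stringsAsFactors = FALSE
)

# Another quantification matrix that still contains both rows
fe <- data.frame(
  chr = c("1", "1"), integration_locus = c(123454, 123454),
  strand = c("+", "+"),
  CompleteAmplificationID = c("SAMPLE1", "SAMPLE2"),
  Value = c(12.4, 9.1),
  stringsAsFactors = FALSE
)

# Keep only the rows of `fe` whose key also appears in `sc_processed`
fe_realigned <- merge(fe, unique(sc_processed[keys]), by = keys)
fe_realigned
```

After the merge, only the SAMPLE1 row of the second matrix remains, so both matrices describe the same set of integrations and samples.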

5 Reproducibility

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Ubuntu 20.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-01-16
#>  pandoc   2.5 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [2] CRAN (R 4.1.2)
#>  BiocManager    1.30.16 2021-06-15 [2] CRAN (R 4.1.2)
#>  BiocParallel   1.28.3  2022-01-16 [2] Bioconductor
#>  BiocStyle    * 2.22.0  2022-01-16 [2] Bioconductor
#>  bookdown       0.24    2021-09-02 [2] CRAN (R 4.1.2)
#>  bslib          0.3.1   2021-10-06 [2] CRAN (R 4.1.2)
#>  cli            3.1.0   2021-10-27 [2] CRAN (R 4.1.2)
#>  colorspace     2.0-2   2021-06-24 [2] CRAN (R 4.1.2)
#>  crayon         1.4.2   2021-10-29 [2] CRAN (R 4.1.2)
#>  data.table     1.14.2  2021-09-27 [2] CRAN (R 4.1.2)
#>  DBI            1.1.2   2021-12-20 [2] CRAN (R 4.1.2)
#>  digest         0.6.29  2021-12-01 [2] CRAN (R 4.1.2)
#>  dplyr          1.0.7   2021-06-18 [2] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [2] CRAN (R 4.1.2)
#>  evaluate       0.14    2019-05-28 [2] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [2] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [2] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [2] CRAN (R 4.1.2)
#>  generics       0.1.1   2021-10-25 [2] CRAN (R 4.1.2)
#>  ggplot2        3.3.5   2021-06-25 [2] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [2] CRAN (R 4.1.2)
#>  glue           1.6.0   2021-12-17 [2] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [2] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [2] CRAN (R 4.1.2)
#>  hms            1.1.1   2021-09-26 [2] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [2] CRAN (R 4.1.2)
#>  httr           1.4.2   2020-07-20 [2] CRAN (R 4.1.2)
#>  ISAnalytics  * 1.4.3   2022-01-16 [1] Bioconductor
#>  jquerylib      0.1.4   2021-04-26 [2] CRAN (R 4.1.2)
#>  jsonlite       1.7.2   2020-12-09 [2] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [2] CRAN (R 4.1.2)
#>  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [2] CRAN (R 4.1.2)
#>  lubridate      1.8.0   2021-10-07 [2] CRAN (R 4.1.2)
#>  magrittr     * 2.0.1   2020-11-17 [2] CRAN (R 4.1.2)
#>  mnormt         2.0.2   2020-09-01 [2] CRAN (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [2] CRAN (R 4.1.2)
#>  nlme           3.1-155 2022-01-13 [2] CRAN (R 4.1.2)
#>  pillar         1.6.4   2021-10-18 [2] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [2] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [2] CRAN (R 4.1.2)
#>  psych          2.1.9   2021-09-22 [2] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [2] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [2] CRAN (R 4.1.2)
#>  Rcapture       1.4-3   2019-12-16 [2] CRAN (R 4.1.2)
#>  Rcpp           1.0.8   2022-01-13 [2] CRAN (R 4.1.2)
#>  readr          2.1.1   2021-11-30 [2] CRAN (R 4.1.2)
#>  RefManageR   * 1.3.0   2020-11-13 [2] CRAN (R 4.1.2)
#>  rlang          0.4.12  2021-10-18 [2] CRAN (R 4.1.2)
#>  rmarkdown      2.11    2021-09-14 [2] CRAN (R 4.1.2)
#>  sass           0.4.0   2021-05-12 [2] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [2] CRAN (R 4.1.2)
#>  sessioninfo  * 1.2.2   2021-12-06 [2] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [2] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [2] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [2] CRAN (R 4.1.2)
#>  tidyr          1.1.4   2021-09-27 [2] CRAN (R 4.1.2)
#>  tidyselect     1.1.1   2021-04-30 [2] CRAN (R 4.1.2)
#>  tmvnsim        1.0-2   2016-12-15 [2] CRAN (R 4.1.2)
#>  tzdb           0.2.0   2021-10-27 [2] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [2] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [2] CRAN (R 4.1.2)
#>  xfun           0.29    2021-12-14 [2] CRAN (R 4.1.2)
#>  xml2           1.3.3   2021-11-30 [2] CRAN (R 4.1.2)
#>  yaml           2.2.1   2020-02-01 [2] CRAN (R 4.1.2)
#>  zip            2.2.0   2021-05-31 [2] CRAN (R 4.1.2)
#> 
#>  [1] /tmp/RtmplowxUB/Rinst3081361f546c34
#>  [2] /home/biocbuild/bbs-3.14-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

6 Bibliography

This vignette was generated using BiocStyle (Oleś, 2022) with knitr (Xie, 2021) and rmarkdown (Allaire, Xie, McPherson, et al., 2021) running behind the scenes.

Citations made with RefManageR (McLean, 2017).

[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.11. 2021. URL: https://github.com/rstudio/rmarkdown.

[2] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.

[3] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.22.0. 2022. URL: https://github.com/Bioconductor/BiocStyle.

[4] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.37. 2021. URL: https://yihui.org/knitr/.