tuberculosis

Lucas Schiffer, MPH

Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, U.S.A.

30 October 2021

Abstract

The tuberculosis R/Bioconductor package features tuberculosis gene expression data for machine learning. All human samples from GEO that did not come from cell lines, were not taken postmortem, and did not feature recombination have been included. The package has more than 10,000 samples from both microarray and sequencing studies that have been processed from raw data through a hyper-standardized, reproducible pipeline.

Package

tuberculosis 1.0.0

1 The Pipeline
2 Installation
3 Load Package
4 Finding Data
5 Getting Data
6 No Metadata?
7 ML Analysis?
8 Session Info

1 The Pipeline

To fully understand the provenance of data in the tuberculosis R/Bioconductor package, please see the tuberculosis.pipeline GitHub repository. All but the most curious users can ignore these details without consequence, but a brief summary of data processing is appropriate here. Microarray data were processed from raw files (e.g. CEL files) and background corrected using the normal-exponential method and the saddle-point approximation to maximum likelihood as implemented in the limma R/Bioconductor package. No normalization of expression values was done. Where platforms necessitated it, the RMA (robust multichip average) algorithm without background correction or normalization was used to generate an expression matrix. Sequencing data were processed from raw files (i.e. fastq files) using the nf-core/rnaseq pipeline inside a Singularity container; the GRCh38 genome build was used for alignment. Gene names for both microarray and sequencing data are HGNC-approved GRCh38 gene names from the genenames.org REST API.
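For readers who want a feel for the microarray background correction described above, the following is a minimal sketch on simulated intensities. It assumes that limma::backgroundCorrect.matrix with method = "normexp" and normexp.method = "saddle" corresponds to the normal-exponential, saddle-point approach; the exact calls used in tuberculosis.pipeline are not shown in this vignette and may differ.

```r
set.seed(1)

# Simulated raw intensities (not real data): exponential signal plus
# normally distributed background, for 1000 probes on 3 arrays.
n_probes <- 1000
raw <- matrix(
    stats::rexp(n_probes * 3, rate = 1 / 200) +
        stats::rnorm(n_probes * 3, mean = 100, sd = 10),
    nrow = n_probes,
    dimnames = list(
        base::paste0("probe", base::seq_len(n_probes)),
        base::paste0("array", 1:3)
    )
)

# Normal-exponential background correction, fitted per array with the
# saddle-point approximation; no normalization is applied afterwards.
corrected <- limma::backgroundCorrect.matrix(
    raw,
    method = "normexp",
    normexp.method = "saddle",
    verbose = FALSE
)

base::dim(corrected)
```

The corrected matrix keeps the dimensions and names of the input, and every value remains positive, which is convenient if a log transformation is applied later.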

2 Installation

To install tuberculosis from Bioconductor, use BiocManager as follows.

BiocManager::install("tuberculosis")

To install tuberculosis from GitHub, use BiocManager as follows.

BiocManager::install("schifferl/tuberculosis", dependencies = TRUE, build_vignettes = TRUE)

Most users should simply install tuberculosis from Bioconductor.

3 Load Package

To use the package without double colon syntax, it should be loaded as follows.

library(tuberculosis)

The package is lightweight, with few dependencies, and contains no data itself.

4 Finding Data

To find data, users will use the tuberculosis function with a regular expression pattern to list available resources. The resources are organized by GEO series accession numbers. If multiple platforms were used in a single study, the platform accession number follows the series accession number and is separated by a dash. The date before the series accession number denotes the date the resource was created.

tuberculosis("GSE103147")

## 2021-09-15.GSE103147

The function will print the names of matching resources as a message and return them invisibly as a character vector. To see all available resources use "." for the pattern argument.
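The naming scheme described above can be illustrated with base R alone. The sketch below is not the package's implementation; it only shows how a regular expression selects resource names and how a name breaks down into creation date, series accession, and platform accession. Three names are copied from this vignette, and the last two are hypothetical, added purely to illustrate a second creation date and a platform suffix.

```r
resource_names <- c(
    "2021-09-15.GSE103147",
    "2021-09-15.GSE107991",
    "2021-09-15.GSE107992",
    "2021-10-01.GSE103147",        # hypothetical later creation date
    "2021-09-15.GSE000000-GPL000"  # hypothetical multi-platform resource
)

# "." matches any single character, so this selects GSE107991 and GSE107992.
matched <- base::grep("GSE10799.", resource_names, value = TRUE)

# Characters 1-10 hold the date, character 11 is the separating period,
# and the accession (series, optionally followed by "-" and platform) follows.
created <- base::as.Date(base::substr(resource_names, 1, 10))
accession <- base::substring(resource_names, 12)
series <- base::sub("-.*$", "", accession)
platform <- base::ifelse(
    base::grepl("-", accession, fixed = TRUE),
    base::sub("^[^-]*-", "", accession),
    NA_character_
)

# Keep only the most recently created resource for each accession.
by_date <- resource_names[base::order(created, decreasing = TRUE)]
latest <- by_date[!base::duplicated(base::substring(by_date, 12))]
```

Here matched contains the two GSE10799x names, and latest keeps the hypothetical 2021-10-01 entry for GSE103147 while dropping the older one.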

5 Getting Data

To get data, users will also use the tuberculosis function, but with an additional argument, dryrun = FALSE. This will either download resources from ExperimentHub or load them from the user’s local cache. If a resource has multiple creation dates, the most recent is selected by default; add a date to override this behavior.

tuberculosis("GSE103147", dryrun = FALSE)

## snapshotDate(): 2021-10-18

## $`2021-09-15.GSE103147`
## class: SummarizedExperiment 
## dim: 24353 1649 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(1649): SRR5980424 SRR5980425 ... SRR5982072 SRR5982073
## colData names(0):

The function returns a list of SummarizedExperiment objects, each with a single assay, exprs, where the rows are features (genes) and the columns are observations (samples). If multiple resources are requested, multiple resources will be returned, each as a list element.

tuberculosis("GSE10799.", dryrun = FALSE)

## snapshotDate(): 2021-10-18

## $`2021-09-15.GSE107991`
## class: SummarizedExperiment 
## dim: 24353 54 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(54): SRR6369879 SRR6369880 ... SRR6369931 SRR6369932
## colData names(0):
## 
## $`2021-09-15.GSE107992`
## class: SummarizedExperiment 
## dim: 24353 47 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(47): SRR6369945 SRR6369946 ... SRR6369990 SRR6369991
## colData names(0):
## 
## $`2021-09-15.GSE107993`
## class: SummarizedExperiment 
## dim: 24353 138 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(138): SRR6370167 SRR6370168 ... SRR6370303 SRR6370304
## colData names(0):
## 
## $`2021-09-15.GSE107994`
## class: SummarizedExperiment 
## dim: 24353 175 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(175): SRR6369992 SRR6369993 ... SRR6370165 SRR6370166
## colData names(0):
## 
## $`2021-09-15.GSE107995`
## class: SummarizedExperiment 
## dim: 24353 414 
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(414): SRR6369879 SRR6369880 ... SRR6370303 SRR6370304
## colData names(0):

The assay of each SummarizedExperiment object is named exprs rather than counts because it can come from either a microarray or a sequencing platform. If colnames begin with GSM (GEO sample accessions), the data come from a microarray platform; if colnames begin with SRR, the data come from a sequencing platform.
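As a minimal sketch of working with the returned list, the code below builds a toy stand-in (simulated values, a made-up resource name, and two sample names borrowed from the output above), extracts the exprs matrix from each element, and flags sequencing resources by their SRR column-name prefix. The object names toy_exprs, toy_list, exprs_list, and is_sequencing are illustrative only.

```r
# Toy expression matrix: 3 genes by 2 samples, values are simulated.
toy_exprs <- base::matrix(
    c(5, 0, 12, 7, 3, 9),
    nrow = 3,
    dimnames = list(
        c("A1BG", "A1BG-AS1", "ZZZ3"),
        c("SRR5980424", "SRR5980425")
    )
)

# Stand-in for the list returned by tuberculosis(..., dryrun = FALSE).
toy_list <- list(
    "toy-resource" = SummarizedExperiment::SummarizedExperiment(
        assays = list(exprs = toy_exprs)
    )
)

# Extract the expression matrix from every element by assay name.
exprs_list <- base::lapply(toy_list, SummarizedExperiment::assay, "exprs")

# TRUE when every column name of a resource starts with "SRR".
is_sequencing <- base::vapply(
    toy_list,
    function(se) base::all(base::startsWith(base::colnames(se), "SRR")),
    base::logical(1)
)
```

The same lapply and vapply calls apply unchanged to a real list with several resources, since each element is handled independently.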

6 No Metadata?

The SummarizedExperiment objects do not have sample metadata as colData, and this limits their use to unsupervised analyses for the time being. Sample metadata are currently undergoing manual curation, with the same level of diligence that was applied in data processing, and will be included in the package when they are ready.
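To make concrete what an unsupervised analysis can look like without sample labels, here is a small sketch using hierarchical clustering from base R on a simulated genes-by-samples matrix. The data, the two simulated groups, and the choice of k = 2 are assumptions made for illustration and say nothing about the structure of the real resources.

```r
set.seed(1)

# Simulated expression matrix (not real data): 200 genes by 10 samples,
# where the last 5 samples are shifted upwards relative to the first 5.
n_genes <- 200
toy_exprs <- base::cbind(
    base::matrix(stats::rnorm(n_genes * 5, mean = 0), nrow = n_genes),
    base::matrix(stats::rnorm(n_genes * 5, mean = 3), nrow = n_genes)
)
base::colnames(toy_exprs) <- base::paste0("sample", 1:10)

# Samples are columns, so transpose before computing distances between them.
sample_tree <- stats::hclust(stats::dist(base::t(toy_exprs)))

# Cut the tree into two groups; no labels are needed at any step.
clusters <- stats::cutree(sample_tree, k = 2)
```

With real data, the matrix returned by SummarizedExperiment::assay could take the place of toy_exprs.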

7 ML Analysis?

No Bioconductor package is complete without at least a miniature demonstration analysis, but it is difficult to provide any substantial machine learning analysis without the necessary labels. Therefore, only a dimension reduction, which is by no means machine learning, is provided here as an example, with the expectation that it will be replaced in the future.

The largest resource available in the tuberculosis package comes from GEO series accession GSE103147, data that was originally published by Zak et al. in 2016 (Zak, D. E. et al. A blood RNA signature for tuberculosis disease risk: a prospective cohort study. Lancet 387, 2312–2322 (2016)). To download this data for use in dimension reduction, the tuberculosis function is used; then, magrittr::use_series is used to select the SummarizedExperiment object from the list that was returned.

zak_data <-
    tuberculosis("GSE103147", dryrun = FALSE) |>
    magrittr::use_series("2021-09-15.GSE103147")

## snapshotDate(): 2021-10-18
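Selecting one element from the returned list does not require magrittr. As a small sketch on a toy list (the placeholder string stands in for a SummarizedExperiment object), base R double-bracket indexing by name or by position retrieves the same element.

```r
# Toy named list standing in for the list returned by the tuberculosis function.
toy_list <- list(
    "2021-09-15.GSE103147" = "placeholder for a SummarizedExperiment object"
)

# Exact-name lookup; unlike `$`, `[[` does not partially match names.
by_name <- toy_list[["2021-09-15.GSE103147"]]

# Positional lookup, convenient when only one resource was requested.
by_position <- toy_list[[1]]
```

Positional lookup avoids hard-coding the creation date, at the cost of depending on the order of the returned list.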

Even though they are not used in the plot, the sample identifiers (i.e. column names) of zak_data will become the row names of the UMAP data.frame, so they are stored below for use in setting row names later.

row_names <-
    base::colnames(zak_data)

The column names are stored in the same way, except that they are created using purrr::map_chr. The embedding will be in two dimensions; therefore, two axis labels, UMAP1 and UMAP2, are created.

col_names <-
    purrr::map_chr(1:2, ~ base::paste("UMAP", .x, sep = ""))
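For readers who prefer to avoid the purrr dependency, the same two axis labels can be produced with vectorized base R. This is an equivalent sketch, not a change to the analysis.

```r
# paste0 is vectorized over 1:2, so no explicit mapping is needed.
col_names <-
    base::paste0("UMAP", 1:2)
```

The result is the character vector "UMAP1", "UMAP2", matching what the purrr::map_chr call returns.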

The scater package is used to calculate UMAP coordinates, which are piped to magrittr to set the row and column names. Once the matrix returned by scater::calculateUMAP is coerced to a data.frame, ggplot2 is used to plot the embedding and hrbrthemes is used for theming.

scater::calculateUMAP(zak_data, exprs_values = "exprs") |>
    magrittr::set_rownames(row_names) |>
    magrittr::set_colnames(col_names) |>
    base::as.data.frame() |>
    ggplot2::ggplot(mapping = ggplot2::aes(UMAP1, UMAP2)) +
    ggplot2::geom_point() +
    hrbrthemes::theme_ipsum_rc()

The embedding displays four distinct clusters, perhaps pertaining to stages of progression of tuberculosis infection as distinct classes, although definitive conclusions are difficult to make without sufficient labeling of clinical sequelae. As stated above, such labels are currently being curated and will be included in the package when they are ready.

8 Session Info

utils::sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] tuberculosis_1.0.0          SummarizedExperiment_1.24.0
##  [3] Biobase_2.54.0              GenomicRanges_1.46.0       
##  [5] GenomeInfoDb_1.30.0         IRanges_2.28.0             
##  [7] S4Vectors_0.32.0            BiocGenerics_0.40.0        
##  [9] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [11] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_2.0-2             
##   [3] ellipsis_0.3.2                scuttle_1.4.0                
##   [5] XVector_0.34.0                BiocNeighbors_1.12.0         
##   [7] farver_2.1.0                  ggrepel_0.9.1                
##   [9] bit64_4.0.5                   interactiveDisplayBase_1.32.0
##  [11] AnnotationDbi_1.56.1          RSpectra_0.16-0              
##  [13] fansi_0.5.0                   sparseMatrixStats_1.6.0      
##  [15] extrafont_0.17                cachem_1.0.6                 
##  [17] knitr_1.36                    scater_1.22.0                
##  [19] jsonlite_1.7.2                Rttf2pt1_1.3.9               
##  [21] dbplyr_2.1.1                  png_0.1-7                    
##  [23] uwot_0.1.10                   shiny_1.7.1                  
##  [25] BiocManager_1.30.16           compiler_4.1.1               
##  [27] httr_1.4.2                    assertthat_0.2.1             
##  [29] Matrix_1.3-4                  fastmap_1.1.0                
##  [31] later_1.3.0                   BiocSingular_1.10.0          
##  [33] hrbrthemes_0.8.0              htmltools_0.5.2              
##  [35] tools_4.1.1                   rsvd_1.0.5                   
##  [37] gtable_0.3.0                  glue_1.4.2                   
##  [39] GenomeInfoDbData_1.2.7        dplyr_1.0.7                  
##  [41] rappdirs_0.3.3                Rcpp_1.0.7                   
##  [43] jquerylib_0.1.4               vctrs_0.3.8                  
##  [45] Biostrings_2.62.0             ExperimentHub_2.2.0          
##  [47] extrafontdb_1.0               DelayedMatrixStats_1.16.0    
##  [49] xfun_0.27                     stringr_1.4.0                
##  [51] beachmat_2.10.0               mime_0.12                    
##  [53] lifecycle_1.0.1               irlba_2.3.3                  
##  [55] AnnotationHub_3.2.0           zlibbioc_1.40.0              
##  [57] scales_1.1.1                  promises_1.2.0.1             
##  [59] parallel_4.1.1                SingleCellExperiment_1.16.0  
##  [61] yaml_2.2.1                    curl_4.3.2                   
##  [63] memoise_2.0.0                 gridExtra_2.3                
##  [65] ggplot2_3.3.5                 gdtools_0.2.3                
##  [67] sass_0.4.0                    stringi_1.7.5                
##  [69] RSQLite_2.2.8                 highr_0.9                    
##  [71] BiocVersion_3.14.0            ScaledMatrix_1.2.0           
##  [73] filelock_1.0.2                BiocParallel_1.28.0          
##  [75] systemfonts_1.0.3             rlang_0.4.12                 
##  [77] pkgconfig_2.0.3               bitops_1.0-7                 
##  [79] evaluate_0.14                 lattice_0.20-45              
##  [81] purrr_0.3.4                   labeling_0.4.2               
##  [83] bit_4.0.4                     tidyselect_1.1.1             
##  [85] magrittr_2.0.1                bookdown_0.24                
##  [87] R6_2.5.1                      magick_2.7.3                 
##  [89] generics_0.1.1                DelayedArray_0.20.0          
##  [91] DBI_1.1.1                     pillar_1.6.4                 
##  [93] withr_2.4.2                   KEGGREST_1.34.0              
##  [95] RCurl_1.98-1.5                tibble_3.1.5                 
##  [97] crayon_1.4.1                  utf8_1.2.2                   
##  [99] BiocFileCache_2.2.0           rmarkdown_2.11               
## [101] viridis_0.6.2                 grid_4.1.1                   
## [103] blob_1.2.2                    FNN_1.1.3                    
## [105] digest_0.6.28                 xtable_1.8-4                 
## [107] tidyr_1.1.4                   httpuv_1.6.3                 
## [109] munsell_0.5.0                 beeswarm_0.4.0               
## [111] viridisLite_0.4.0             vipor_0.4.5                  
## [113] bslib_0.3.1