3. Examples

3.1 Import a `phyloseq` object

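The HITChip Atlas dataset contains genus-level microbiota profiles, measured with HITChip, for 1006 Western adults with no reported health complications (Lahti et al. 2014). The dataset is available in phyloseq format (McMurdie and Holmes 2013) through the microbiome R package (Lahti et al. 2017). The object printed below holds 1151 samples rather than 1006 because some subjects were sampled at more than one time point (see the `subject` and `time` variables in the sample metadata).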

data(atlas1006, package = "microbiome")

atlas1006

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 1151 samples ]
sample_data() Sample Data:       [ 1151 samples by 10 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]

List the taxonomic levels available for data aggregation.

phyloseq::rank_names(atlas1006)

[1] "Phylum" "Family" "Genus"

List the variables available in the sample metadata.

colnames(microbiome::meta(atlas1006))

 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"

Run the data sanity and integrity check, first with a `group` variable and then without one.

# With `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)

Checking the input data type ...

The input data is of type: phyloseq

PASS

Checking the sample metadata ...

The specified variables in the formula: age, sex, bmi_group

The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample

PASS

Checking other arguments ...

The number of groups of interest is: 6

The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22

PASS

# Without `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = NULL,
                                  struc_zero = FALSE,
                                  global = FALSE,
                                  verbose = TRUE)

Checking the input data type ...

The input data is of type: phyloseq

PASS

Checking the sample metadata ...

The specified variables in the formula: age, sex, bmi_group

The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample

PASS

Checking other arguments ...

PASS
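The "Checking the sample metadata" step above compares the variables named in `fix_formula` with the columns of the sample metadata. A minimal base-R sketch of that kind of comparison follows. It is not the package's implementation: the helper `missing_formula_vars()` and the toy metadata are hypothetical and serve only to illustrate the idea.

```r
# Toy metadata standing in for microbiome::meta(atlas1006); values are made up.
meta_toy <- data.frame(
  age = c(28, 45, 61),
  sex = c("female", "male", "female"),
  bmi_group = c("lean", "obese", "lean")
)

# Hypothetical helper (not part of ANCOMBC): return the formula variables
# that are absent from the metadata columns.
missing_formula_vars <- function(fix_formula, meta) {
  # all.vars() extracts the variable names from the one-sided formula
  vars <- all.vars(stats::as.formula(paste("~", fix_formula)))
  setdiff(vars, colnames(meta))
}

# Every variable is present, so nothing is reported as missing
missing_formula_vars("age + sex + bmi_group", meta_toy)

# "smoker" is not a metadata column, so it is reported
missing_formula_vars("age + smoker", meta_toy)
```

Running a comparison like this before calling `data_sanity_check()` can make a misspelled variable name easier to spot.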

3.2 Import a `tse` object

Convert the phyloseq object to a TreeSummarizedExperiment (tse) object.

tse = mia::convertFromPhyloseq(atlas1006)

List the taxonomic levels available for data aggregation.

mia::taxonomyRanks(tse)

[1] "Phylum" "Family" "Genus"

List the variables available in the sample metadata.

colnames(SummarizedExperiment::colData(tse))

 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"

Run the data sanity and integrity check.

check_results = data_sanity_check(data = tse,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)

Checking the input data type ...

The input data is of type: TreeSummarizedExperiment

PASS

Checking the sample metadata ...

The specified variables in the formula: age, sex, bmi_group

The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample

PASS

Checking other arguments ...

The number of groups of interest is: 6

The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22

PASS
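The group summary printed above (number of groups and sample size per group) can be reproduced by hand with `table()`. The sketch below uses a toy grouping vector, and the threshold `min_n` is a hypothetical value chosen for illustration. It is not a cutoff taken from the package.

```r
# Toy grouping variable; the labels mirror bmi_group but the counts are made up.
group_toy <- factor(c("lean", "lean", "lean", "obese", "obese", "underweight"))

# Count samples per group, keeping any missing values visible
group_sizes <- table(group_toy, useNA = "ifany")

# Number of groups of interest
n_groups <- length(group_sizes)

# Hypothetical threshold used only to flag small groups in this sketch
min_n <- 2
small_groups <- names(group_sizes)[group_sizes < min_n]

group_sizes
n_groups
small_groups
```

With the real data, the same tabulation could be applied to `SummarizedExperiment::colData(tse)$bmi_group`.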

3.3 Import a `matrix` or `data.frame`

Both abundance data and sample metadata are required for this import method.

This import method does not support aggregating taxa to higher taxonomic levels, so the data must already be aggregated to the desired level before the check is run. If aggregation is needed, consider importing a phyloseq or tse object instead.
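If only a plain count matrix and a taxon-to-rank lookup are at hand, the aggregation can also be done manually before import. The following is a minimal base-R sketch, assuming taxa are in rows and samples in columns. The toy counts and the `family_of` lookup are hypothetical.

```r
# Toy genus-level counts (taxa in rows, samples in columns); values are made up.
genus_counts <- matrix(
  c(10, 0, 5,
     3, 7, 1,
     0, 2, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GenusA", "GenusB", "GenusC"), c("S1", "S2", "S3"))
)

# Hypothetical genus-to-family lookup, indexed by genus name
family_of <- c(GenusA = "Family1", GenusB = "Family1", GenusC = "Family2")

# rowsum() adds up the rows that share the same family label
family_counts <- rowsum(genus_counts, group = family_of[rownames(genus_counts)])

family_counts
```

The resulting `family_counts` matrix has one row per family and could then be passed as the abundance data.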

abundance_data = microbiome::abundances(atlas1006)
meta_data = microbiome::meta(atlas1006)

Ensure that every row name of the metadata appears among the column names of the abundance data.

all(rownames(meta_data) %in% colnames(abundance_data))

[1] TRUE
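The `%in%` test above only confirms that each metadata row name occurs among the abundance column names. A stricter variant, sketched below with toy inputs, also checks the reverse direction and reorders the metadata to follow the abundance columns. Whether `data_sanity_check()` requires matching order is not stated here, so treat the reordering as a precaution rather than a requirement.

```r
# Toy inputs; sample IDs are made up and deliberately in different orders.
abundance_toy <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("TaxonA", "TaxonB"), c("S1", "S2", "S3"))
)
meta_toy <- data.frame(age = c(52, 34, 41), row.names = c("S3", "S1", "S2"))

# The two sides share exactly the same set of sample IDs (both directions)
same_ids <- setequal(rownames(meta_toy), colnames(abundance_toy))

# Reorder the metadata rows to follow the abundance columns
meta_aligned <- meta_toy[colnames(abundance_toy), , drop = FALSE]

same_ids
identical(rownames(meta_aligned), colnames(abundance_toy))
```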

List the variables available in the sample metadata.

colnames(meta_data)

 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"

Run the data sanity and integrity check.

check_results = data_sanity_check(data = abundance_data,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  meta_data = meta_data,
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)

Checking the input data type ...

The input data is of type: matrix

The imported data is in a generic 'matrix'/'data.frame' format.

PASS

Checking the sample metadata ...

The specified variables in the formula: age, sex, bmi_group

The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample

PASS

Checking other arguments ...

The number of groups of interest is: 6

The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22

PASS

Session information

sessionInfo()

R version 4.5.0 RC (2025-04-04 r88126)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doRNG_1.8.6.2   rngtools_1.5.2  foreach_1.5.2   DT_0.33        
 [5] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [9] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
[13] ggplot2_3.5.2   tidyverse_2.0.0 ANCOMBC_2.10.1 

loaded via a namespace (and not attached):
  [1] ggtext_0.1.2                    fs_1.6.6                       
  [3] matrixStats_1.5.0               DirichletMultinomial_1.50.0    
  [5] httr_1.4.7                      RColorBrewer_1.1-3             
  [7] doParallel_1.0.17               numDeriv_2016.8-1.1            
  [9] tools_4.5.0                     backports_1.5.0                
 [11] R6_2.6.1                        vegan_2.6-10                   
 [13] lazyeval_0.2.2                  mgcv_1.9-3                     
 [15] rhdf5filters_1.20.0             permute_0.9-7                  
 [17] withr_3.0.2                     gridExtra_2.3                  
 [19] textshaping_1.0.1               cli_3.6.5                      
 [21] Biobase_2.68.0                  sandwich_3.1-1                 
 [23] labeling_0.4.3                  slam_0.1-55                    
 [25] sass_0.4.10                     mvtnorm_1.3-3                  
 [27] proxy_0.4-27                    systemfonts_1.2.3              
 [29] yulab.utils_0.2.0               foreign_0.8-90                 
 [31] dichromat_2.0-0.1               scater_1.36.0                  
 [33] decontam_1.28.0                 parallelly_1.44.0              
 [35] readxl_1.4.5                    fillpattern_1.0.2              
 [37] rstudioapi_0.17.1               generics_0.1.4                 
 [39] gtools_3.9.5                    crosstalk_1.2.1                
 [41] rbiom_2.2.0                     Matrix_1.7-3                   
 [43] biomformat_1.36.0               ggbeeswarm_0.7.2               
 [45] DescTools_0.99.60               S4Vectors_0.46.0               
 [47] DECIPHER_3.4.0                  abind_1.4-8                    
 [49] lifecycle_1.0.4                 multcomp_1.4-28                
 [51] yaml_2.3.10                     SummarizedExperiment_1.38.1    
 [53] rhdf5_2.52.0                    SparseArray_1.8.0              
 [55] Rtsne_0.17                      grid_4.5.0                     
 [57] crayon_1.5.3                    lattice_0.22-7                 
 [59] haven_2.5.4                     beachmat_2.24.0                
 [61] pillar_1.10.2                   knitr_1.50                     
 [63] GenomicRanges_1.60.0            boot_1.3-31                    
 [65] gld_2.6.7                       estimability_1.5.1             
 [67] codetools_0.2-20                glue_1.8.0                     
 [69] data.table_1.17.2               MultiAssayExperiment_1.34.0    
 [71] vctrs_0.6.5                     treeio_1.32.0                  
 [73] Rdpack_2.6.4                    cellranger_1.1.0               
 [75] gtable_0.3.6                    cachem_1.1.0                   
 [77] xfun_0.52                       rbibutils_2.3                  
 [79] S4Arrays_1.8.0                  coda_0.19-4.1                  
 [81] reformulas_0.4.1                survival_3.8-3                 
 [83] SingleCellExperiment_1.30.1     iterators_1.0.14               
 [85] bluster_1.18.0                  gmp_0.7-5                      
 [87] TH.data_1.1-3                   nlme_3.1-168                   
 [89] phyloseq_1.52.0                 bit64_4.6.0-1                  
 [91] GenomeInfoDb_1.44.0             bslib_0.9.0                    
 [93] irlba_2.3.5.1                   vipor_0.4.7                    
 [95] rpart_4.1.24                    colorspace_2.1-1               
 [97] BiocGenerics_0.54.0             DBI_1.2.3                      
 [99] Hmisc_5.2-3                     nnet_7.3-20                    
[101] ade4_1.7-23                     Exact_3.3                      
[103] tidyselect_1.2.1                emmeans_1.11.1                 
[105] bit_4.6.0                       compiler_4.5.0                 
[107] microbiome_1.30.0               htmlTable_2.4.3                
[109] BiocNeighbors_2.2.0             expm_1.0-0                     
[111] xml2_1.3.8                      DelayedArray_0.34.1            
[113] checkmate_2.3.2                 scales_1.4.0                   
[115] digest_0.6.37                   minqa_1.2.8                    
[117] rmarkdown_2.29                  XVector_0.48.0                 
[119] htmltools_0.5.8.1               pkgconfig_2.0.3                
[121] base64enc_0.1-3                 lme4_1.1-37                    
[123] sparseMatrixStats_1.20.0        MatrixGenerics_1.20.0          
[125] fastmap_1.2.0                   rlang_1.1.6                    
[127] htmlwidgets_1.6.4               UCSC.utils_1.4.0               
[129] DelayedMatrixStats_1.30.0       farver_2.1.2                   
[131] jquerylib_0.1.4                 zoo_1.8-14                     
[133] jsonlite_2.0.0                  energy_1.7-12                  
[135] BiocParallel_1.42.0             BiocSingular_1.24.0            
[137] magrittr_2.0.3                  Formula_1.2-5                  
[139] scuttle_1.18.0                  GenomeInfoDbData_1.2.14        
[141] patchwork_1.3.0                 Rhdf5lib_1.30.0                
[143] Rcpp_1.0.14                     ape_5.8-1                      
[145] ggnewscale_0.5.1                viridis_0.6.5                  
[147] CVXR_1.0-15                     stringi_1.8.7                  
[149] rootSolve_1.8.2.4               MASS_7.3-65                    
[151] plyr_1.8.9                      parallel_4.5.0                 
[153] ggrepel_0.9.6                   lmom_3.2                       
[155] Biostrings_2.76.0               splines_4.5.0                  
[157] gridtext_0.1.5                  multtest_2.64.0                
[159] hms_1.1.3                       igraph_2.1.4                   
[161] reshape2_1.4.4                  stats4_4.5.0                   
[163] ScaledMatrix_1.16.0             evaluate_1.0.3                 
[165] nloptr_2.2.1                    tzdb_0.5.0                     
[167] BiocBaseUtils_1.10.0            rsvd_1.0.5                     
[169] xtable_1.8-4                    Rmpfr_1.1-0                    
[171] e1071_1.7-16                    tidytree_0.4.6                 
[173] ragg_1.4.0                      viridisLite_0.4.2              
[175] class_7.3-23                    gsl_2.1-8                      
[177] lmerTest_3.1-3                  beeswarm_0.4.0                 
[179] IRanges_2.42.0                  cluster_2.1.8.1                
[181] TreeSummarizedExperiment_2.16.1 timechange_0.3.0               
[183] mia_1.16.0

Tutorial on Data Sanity and Integrity Checks

Menglu Liang\(^1\), Huang Lin\(^1\)

\(^1\)University of Maryland, College Park, MD 20742

May 25, 2025

1. Introduction

2. Installation

3. Examples

3.1 Import a `phyloseq` object

3.2 Import a `tse` object

3.3 Import a `matrix` or `data.frame`

Session information

References
