1 Motivation & Introduction

The purpose of this vignette is to explore the file manifests available from the Human Cell Atlas project.

These files provide a tabular metadata summary for a collection of files. The summary includes, but is not limited to, information about the process and workflow used to generate each file, information about the specimens from which the file data were derived, and identifiers that connect specific projects, files, and specimens.

The WARP (WDL Analysis Research Pipelines) repository contains information on a variety of pipelines, and can be used alongside a manifest to better understand the metadata.

1.1 Installation and getting started

Evaluate the following code chunk to install packages required for this vignette.

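## install from Bioconductor if you haven't already
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
pkgs <- c("dplyr", "SummarizedExperiment", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
## only call install() when something is actually missing
if (length(pkgs_needed))
    BiocManager::install(pkgs_needed)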

Load the packages into your R session.

library(dplyr)
library(SummarizedExperiment)
library(LoomExperiment)
library(hca)

2 Example: manifests

The manifest for all available files can be obtained with the following call (this can take several minutes to complete).

default_manifest_tbl <- hca::manifest()
default_manifest_tbl

This is seldom useful; instead, create a filter identifying the files of interest.

manifest_filter <- hca::filters(
    projectId = list(is = "4a95101c-9ffc-4f30-a809-f04518a23803"),
    fileFormat = list(is = "loom"),
    workflow = list(is = c("optimus_v4.2.2", "optimus_v4.2.3"))
)

Retrieve the manifest for the files matching the filter.

manifest_tibble <- hca::manifest(filters = manifest_filter)
manifest_tibble
## # A tibble: 20 × 56
##    source_id        source_spec bundle_uuid bundle_version      file_document_id
##    <chr>            <chr>       <chr>       <dttm>              <chr>           
##  1 b2c7b0d5-d26a-4… tdr:datare… b593b66a-d… 2020-02-03 01:00:00 131ea511-25f7-5…
##  2 b2c7b0d5-d26a-4… tdr:datare… 5a63dd0b-5… 2021-02-02 23:50:00 1bb375a5-d22b-5…
##  3 b2c7b0d5-d26a-4… tdr:datare… 40733888-3… 2021-02-02 23:55:00 1f8ff0fa-6892-5…
##  4 b2c7b0d5-d26a-4… tdr:datare… 1a41ebe6-e… 2021-02-02 23:50:00 2fffe225-ba6c-5…
##  5 b2c7b0d5-d26a-4… tdr:datare… c12a6ca2-3… 2021-02-02 23:50:00 31aa5a18-2a4e-5…
##  6 b2c7b0d5-d26a-4… tdr:datare… f58d690c-b… 2021-02-02 23:50:00 48eea299-8823-5…
##  7 b2c7b0d5-d26a-4… tdr:datare… 21c4e2de-e… 2021-02-02 23:50:00 51458973-404c-5…
##  8 b2c7b0d5-d26a-4… tdr:datare… 50620c50-2… 2021-02-02 23:50:00 5bbebef4-9b14-5…
##  9 b2c7b0d5-d26a-4… tdr:datare… e3ecdfc2-4… 2021-02-02 23:55:00 5bc232f2-b77c-5…
## 10 b2c7b0d5-d26a-4… tdr:datare… ae338c4e-6… 2021-02-02 23:50:00 6326b602-0f63-5…
## 11 b2c7b0d5-d26a-4… tdr:datare… d62c4599-4… 2020-02-03 01:00:00 7848d80b-6b1d-5…
## 12 b2c7b0d5-d26a-4… tdr:datare… 81df106e-e… 2021-02-02 23:55:00 9f8bc032-6276-5…
## 13 b2c7b0d5-d26a-4… tdr:datare… 2838323c-c… 2020-02-03 01:00:00 b98cfaac-64f5-5…
## 14 b2c7b0d5-d26a-4… tdr:datare… c3f672ad-e… 2021-02-02 23:50:00 bf7751ae-ac9d-5…
## 15 b2c7b0d5-d26a-4… tdr:datare… a9c90392-c… 2020-02-03 01:00:00 c7b6470c-e2f0-5…
## 16 b2c7b0d5-d26a-4… tdr:datare… 9d0f5cd1-0… 2020-02-03 01:00:00 d0b95f2c-98ae-5…
## 17 b2c7b0d5-d26a-4… tdr:datare… 59de15e1-f… 2021-02-02 23:50:00 d18759a6-2a95-5…
## 18 b2c7b0d5-d26a-4… tdr:datare… 54fb0e25-5… 2021-02-02 23:55:00 dfd9905b-d6c9-5…
## 19 b2c7b0d5-d26a-4… tdr:datare… 7516565a-e… 2021-02-02 23:55:00 e07ca731-b20a-5…
## 20 b2c7b0d5-d26a-4… tdr:datare… 8e850d2d-0… 2021-02-02 23:50:00 fd41f3d6-7664-5…
## # ℹ 51 more variables: file_type <chr>, file_name <chr>, file_format <chr>,
## #   read_index <lgl>, file_size <dbl>, file_uuid <chr>, file_version <dttm>,
## #   file_crc32c <chr>, file_sha256 <chr>, file_content_type <chr>,
## #   file_drs_uri <chr>, file_url <chr>,
## #   cell_suspension.provenance.document_id <chr>,
## #   cell_suspension.biomaterial_core.biomaterial_id <chr>,
## #   cell_suspension.estimated_cell_count <lgl>, …

The manifest can then be summarized further, e.g., to identify the specimen organs represented in the files.

manifest_tibble |>
    dplyr::count(specimen_from_organism.organ)
## # A tibble: 4 × 2
##   specimen_from_organism.organ     n
##   <chr>                        <int>
## 1 blood                            5
## 2 hematopoietic system             5
## 3 lung                             5
## 4 mediastinal lymph node           5
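
The same manifest columns can also be used to narrow down to individual files. The following is a minimal sketch on a made-up stand-in for manifest_tibble: the identifiers and the toy_manifest object are hypothetical, and base R is used only to keep the sketch free of dependencies.

```r
## Toy stand-in for `manifest_tibble`; all values are invented for illustration.
toy_manifest <- data.frame(
    file_uuid = c("uuid-a", "uuid-b", "uuid-c", "uuid-d"),
    file_format = c("loom", "loom", "loom", "loom"),
    specimen_from_organism.organ = c("blood", "lung", "lung", "blood"),
    stringsAsFactors = FALSE
)

## Keep only the rows for one organ, then extract the file identifiers.
is_lung <- toy_manifest$specimen_from_organism.organ == "lung"
lung_rows <- toy_manifest[is_lung, ]
lung_file_uuids <- lung_rows$file_uuid
lung_file_uuids
```

With a real manifest, an identifier selected this way could be supplied to a fileId filter, as is done with file_uuid in the next section.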

3 Example: Using manifest data to select files

manifest_tibble
## # A tibble: 20 × 56
##    source_id        source_spec bundle_uuid bundle_version      file_document_id
##    <chr>            <chr>       <chr>       <dttm>              <chr>           
##  1 b2c7b0d5-d26a-4… tdr:datare… b593b66a-d… 2020-02-03 01:00:00 131ea511-25f7-5…
##  2 b2c7b0d5-d26a-4… tdr:datare… 5a63dd0b-5… 2021-02-02 23:50:00 1bb375a5-d22b-5…
##  3 b2c7b0d5-d26a-4… tdr:datare… 40733888-3… 2021-02-02 23:55:00 1f8ff0fa-6892-5…
##  4 b2c7b0d5-d26a-4… tdr:datare… 1a41ebe6-e… 2021-02-02 23:50:00 2fffe225-ba6c-5…
##  5 b2c7b0d5-d26a-4… tdr:datare… c12a6ca2-3… 2021-02-02 23:50:00 31aa5a18-2a4e-5…
##  6 b2c7b0d5-d26a-4… tdr:datare… f58d690c-b… 2021-02-02 23:50:00 48eea299-8823-5…
##  7 b2c7b0d5-d26a-4… tdr:datare… 21c4e2de-e… 2021-02-02 23:50:00 51458973-404c-5…
##  8 b2c7b0d5-d26a-4… tdr:datare… 50620c50-2… 2021-02-02 23:50:00 5bbebef4-9b14-5…
##  9 b2c7b0d5-d26a-4… tdr:datare… e3ecdfc2-4… 2021-02-02 23:55:00 5bc232f2-b77c-5…
## 10 b2c7b0d5-d26a-4… tdr:datare… ae338c4e-6… 2021-02-02 23:50:00 6326b602-0f63-5…
## 11 b2c7b0d5-d26a-4… tdr:datare… d62c4599-4… 2020-02-03 01:00:00 7848d80b-6b1d-5…
## 12 b2c7b0d5-d26a-4… tdr:datare… 81df106e-e… 2021-02-02 23:55:00 9f8bc032-6276-5…
## 13 b2c7b0d5-d26a-4… tdr:datare… 2838323c-c… 2020-02-03 01:00:00 b98cfaac-64f5-5…
## 14 b2c7b0d5-d26a-4… tdr:datare… c3f672ad-e… 2021-02-02 23:50:00 bf7751ae-ac9d-5…
## 15 b2c7b0d5-d26a-4… tdr:datare… a9c90392-c… 2020-02-03 01:00:00 c7b6470c-e2f0-5…
## 16 b2c7b0d5-d26a-4… tdr:datare… 9d0f5cd1-0… 2020-02-03 01:00:00 d0b95f2c-98ae-5…
## 17 b2c7b0d5-d26a-4… tdr:datare… 59de15e1-f… 2021-02-02 23:50:00 d18759a6-2a95-5…
## 18 b2c7b0d5-d26a-4… tdr:datare… 54fb0e25-5… 2021-02-02 23:55:00 dfd9905b-d6c9-5…
## 19 b2c7b0d5-d26a-4… tdr:datare… 7516565a-e… 2021-02-02 23:55:00 e07ca731-b20a-5…
## 20 b2c7b0d5-d26a-4… tdr:datare… 8e850d2d-0… 2021-02-02 23:50:00 fd41f3d6-7664-5…
## # ℹ 51 more variables: file_type <chr>, file_name <chr>, file_format <chr>,
## #   read_index <lgl>, file_size <dbl>, file_uuid <chr>, file_version <dttm>,
## #   file_crc32c <chr>, file_sha256 <chr>, file_content_type <chr>,
## #   file_drs_uri <chr>, file_url <chr>,
## #   cell_suspension.provenance.document_id <chr>,
## #   cell_suspension.biomaterial_core.biomaterial_id <chr>,
## #   cell_suspension.estimated_cell_count <lgl>, …
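
## build a filter that selects a single file by its identifier
file_uuid <- "24a8a323-7ecd-504e-a253-b0e0892dd730"
file_filter <- hca::filters(
    fileId = list(is = file_uuid)
)

## look up the file record matching the filter
file_tbl <- hca::files(filters = file_filter)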

file_tbl
## # A tibble: 1 × 8
##   fileId            name  fileFormat   size version projectTitle projectId url  
##   <chr>             <chr> <chr>       <int> <chr>   <chr>        <chr>     <chr>
## 1 24a8a323-7ecd-50… t-ce… loom       3.90e8 2021-0… A single-ce… 4a95101c… http…

## download the file and record where it was saved locally
file_location <-
    file_tbl |>
    hca::files_download()
file_location
##  24a8a323-7ecd-504e-a253-b0e0892dd730-2021-02-11T19:00:05.000000Z 
## "/home/biocbuild/.cache/R/hca/125444707d163d_125444707d163d.loom"

## import the .loom file and inspect its metadata
loom <- LoomExperiment::import(file_location)
metadata(loom) |>
    dplyr::glimpse()
## List of 15
##  $ last_modified                                             : chr "20210211T185949.186062Z"
##  $ CreationDate                                              : chr "20210211T185658.758915Z"
##  $ LOOM_SPEC_VERSION                                         : chr "3.0.0"
##  $ donor_organism.genus_species                              : chr "Homo sapiens"
##  $ expression_data_type                                      : chr "exonic"
##  $ input_id                                                  : chr "58a18a4c-5423-4c59-9b3c-50b7f30b1ca5, c763f679-e13d-4f81-844f-c2c80fc90f46, c76d90b8-c190-4c58-b9bc-b31f586ec7f"| __truncated__
##  $ input_id_metadata_field                                   : chr "sequencing_process.provenance.document_id"
##  $ input_name                                                : chr "PP012_suspension, PP003_suspension, PP004_suspension, PP011_suspension"
##  $ input_name_metadata_field                                 : chr "sequencing_input.biomaterial_core.biomaterial_id"
##  $ library_preparation_protocol.library_construction_approach: chr "10X v2 sequencing"
##  $ optimus_output_schema_version                             : chr "1.0.0"
##  $ pipeline_version                                          : chr "Optimus_v4.2.2"
##  $ project.project_core.project_name                         : chr "HumanTissueTcellActivation"
##  $ project.provenance.document_id                            : chr "4a95101c-9ffc-4f30-a809-f04518a23803"
##  $ specimen_from_organism.organ                              : chr "hematopoietic system"
colData(loom) |>
    dplyr::as_tibble() |>
    dplyr::glimpse()
## Rows: 91,713
## Columns: 43
## $ CellID                                                 <chr> "GCTTCCATCACCGT…
## $ antisense_reads                                        <int> 0, 0, 0, 0, 0, …
## $ cell_barcode_fraction_bases_above_30_mean              <dbl> 0.9846281, 0.98…
## $ cell_barcode_fraction_bases_above_30_variance          <dbl> 0.003249023, 0.…
## $ cell_names                                             <chr> "GCTTCCATCACCGT…
## $ duplicate_reads                                        <int> 0, 0, 0, 0, 0, …
## $ emptydrops_FDR                                         <dbl> 1.000000000, 0.…
## $ emptydrops_IsCell                                      <raw> 00, 01, 00, 00,…
## $ emptydrops_Limited                                     <raw> 00, 01, 00, 00,…
## $ emptydrops_LogProb                                     <dbl> -689.6831, -120…
## $ emptydrops_PValue                                      <dbl> 0.91840816, 0.0…
## $ emptydrops_Total                                       <int> 255, 16705, 681…
## $ fragments_per_molecule                                 <dbl> 1.693252, 8.453…
## $ fragments_with_single_read_evidence                    <int> 504, 139828, 58…
## $ genes_detected_multiple_observations                   <int> 82, 2873, 1552,…
## $ genomic_read_quality_mean                              <dbl> 36.62988, 36.87…
## $ genomic_read_quality_variance                          <dbl> 25.99015, 20.19…
## $ genomic_reads_fraction_bases_quality_above_30_mean     <dbl> 0.8584288, 0.86…
## $ genomic_reads_fraction_bases_quality_above_30_variance <dbl> 0.03981779, 0.0…
## $ input_id                                               <chr> "58a18a4c-5423-…
## $ molecule_barcode_fraction_bases_above_30_mean          <dbl> 0.9820324, 0.98…
## $ molecule_barcode_fraction_bases_above_30_variance      <dbl> 0.005782884, 0.…
## $ molecules_with_single_read_evidence                    <int> 276, 5028, 2041…
## $ n_fragments                                            <int> 552, 202060, 84…
## $ n_genes                                                <int> 227, 3381, 1826…
## $ n_mitochondrial_genes                                  <int> 5, 22, 17, 5, 2…
## $ n_mitochondrial_molecules                              <int> 8, 3528, 2928, …
## $ n_molecules                                            <int> 326, 23902, 998…
## $ n_reads                                                <int> 679, 341669, 13…
## $ noise_reads                                            <int> 0, 0, 0, 0, 0, …
## $ pct_mitochondrial_molecules                            <dbl> 1.1782032, 1.03…
## $ perfect_cell_barcodes                                  <int> 667, 336674, 13…
## $ perfect_molecule_barcodes                              <int> 384, 227854, 89…
## $ reads_mapped_exonic                                    <int> 343, 210716, 84…
## $ reads_mapped_intergenic                                <int> 39, 19439, 8042…
## $ reads_mapped_intronic                                  <int> 175, 58833, 276…
## $ reads_mapped_multiple                                  <int> 162, 90968, 365…
## $ reads_mapped_too_many_loci                             <int> 0, 0, 0, 0, 0, …
## $ reads_mapped_uniquely                                  <int> 450, 227309, 93…
## $ reads_mapped_utr                                       <int> 55, 29289, 1039…
## $ reads_per_fragment                                     <dbl> 1.230072, 1.690…
## $ reads_unmapped                                         <int> 67, 23392, 9340…
## $ spliced_reads                                          <int> 99, 73817, 2987…
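
The emptydrops_IsCell column is stored as raw bytes (00 and 01) rather than as logical values, so it needs converting before it can be used for subsetting. The sketch below illustrates this on an invented per-cell table. It assumes that 01 marks a barcode called as a cell; toy_cells and its values are hypothetical.

```r
## Toy per-cell table; column names follow the colData() output above,
## values are invented for illustration.
toy_cells <- data.frame(
    CellID = c("AAAC", "AAAG", "AAAT", "AACA"),
    n_molecules = c(326L, 23902L, 9980L, 120L),
    stringsAsFactors = FALSE
)
toy_cells$emptydrops_IsCell <- as.raw(c(0, 1, 1, 0))

## Assumption: a raw value of 01 flags a barcode called as a cell.
## as.logical() maps raw 00 to FALSE and any non-zero byte to TRUE.
is_cell <- as.logical(toy_cells$emptydrops_IsCell)
called_cells <- toy_cells[is_cell, ]
called_cells$CellID
```

On the imported object, the same kind of logical vector could in principle be used to subset columns, e.g. loom[, as.logical(loom$emptydrops_IsCell)], although that call is not evaluated in this vignette.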

4 Example: Using manifest data to annotate a .loom file

The function optimus_loom_annotation() takes the path to a .loom file generated by the Optimus pipeline and returns a LoomExperiment object whose colData has been annotated with additional specimen data extracted from a manifest. The relevant manifest rows are also stored in the object's metadata.

annotated_loom <- optimus_loom_annotation(file_location)
annotated_loom
## class: SingleCellLoomExperiment 
## dim: 58347 91713 
## metadata(16): last_modified CreationDate ...
##   specimen_from_organism.organ manifest
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
##   spliced_reads
## colnames: NULL
## colData names(98): input_id CellID ...
##   sequencing_input.biomaterial_core.biomaterial_id
##   sequencing_input_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL


## new metadata
setdiff(
    names(metadata(annotated_loom)),
    names(metadata(loom))
)
## [1] "manifest"
metadata(annotated_loom)$manifest
## # A tibble: 4 × 56
##   source_id         source_spec bundle_uuid bundle_version      file_document_id
##   <chr>             <chr>       <chr>       <dttm>              <chr>           
## 1 b2c7b0d5-d26a-4b… tdr:datare… e3ecdfc2-4… 2021-02-02 23:55:00 5bc232f2-b77c-5…
## 2 b2c7b0d5-d26a-4b… tdr:datare… 81df106e-e… 2021-02-02 23:55:00 9f8bc032-6276-5…
## 3 b2c7b0d5-d26a-4b… tdr:datare… 54fb0e25-5… 2021-02-02 23:55:00 dfd9905b-d6c9-5…
## 4 b2c7b0d5-d26a-4b… tdr:datare… 7516565a-e… 2021-02-02 23:55:00 e07ca731-b20a-5…
## # ℹ 51 more variables: file_type <chr>, file_name <chr>, file_format <chr>,
## #   read_index <chr>, file_size <dbl>, file_uuid <chr>, file_version <dttm>,
## #   file_crc32c <chr>, file_sha256 <chr>, file_content_type <chr>,
## #   file_drs_uri <chr>, file_url <chr>,
## #   cell_suspension.provenance.document_id <chr>,
## #   cell_suspension.biomaterial_core.biomaterial_id <chr>,
## #   cell_suspension.estimated_cell_count <lgl>, …

## new colData columns
setdiff(
    names(colData(annotated_loom)),
    names(colData(loom))
)
##  [1] "source_id"                                                      
##  [2] "source_spec"                                                    
##  [3] "bundle_uuid"                                                    
##  [4] "bundle_version"                                                 
##  [5] "file_document_id"                                               
##  [6] "file_type"                                                      
##  [7] "file_name"                                                      
##  [8] "file_format"                                                    
##  [9] "read_index"                                                     
## [10] "file_size"                                                      
## [11] "file_uuid"                                                      
## [12] "file_version"                                                   
## [13] "file_crc32c"                                                    
## [14] "file_sha256"                                                    
## [15] "file_content_type"                                              
## [16] "file_drs_uri"                                                   
## [17] "file_url"                                                       
## [18] "cell_suspension.provenance.document_id"                         
## [19] "cell_suspension.biomaterial_core.biomaterial_id"                
## [20] "cell_suspension.estimated_cell_count"                           
## [21] "cell_suspension.selected_cell_type"                             
## [22] "sequencing_protocol.instrument_manufacturer_model"              
## [23] "sequencing_protocol.paired_end"                                 
## [24] "library_preparation_protocol.library_construction_approach"     
## [25] "library_preparation_protocol.nucleic_acid_source"               
## [26] "project.provenance.document_id"                                 
## [27] "project.contributors.institution"                               
## [28] "project.contributors.laboratory"                                
## [29] "project.project_core.project_short_name"                        
## [30] "project.project_core.project_title"                             
## [31] "project.estimated_cell_count"                                   
## [32] "specimen_from_organism.provenance.document_id"                  
## [33] "specimen_from_organism.diseases"                                
## [34] "specimen_from_organism.organ"                                   
## [35] "specimen_from_organism.organ_part"                              
## [36] "specimen_from_organism.preservation_storage.preservation_method"
## [37] "donor_organism.sex"                                             
## [38] "donor_organism.biomaterial_core.biomaterial_id"                 
## [39] "donor_organism.provenance.document_id"                          
## [40] "donor_organism.genus_species"                                   
## [41] "donor_organism.development_stage"                               
## [42] "donor_organism.diseases"                                        
## [43] "donor_organism.organism_age"                                    
## [44] "cell_line.provenance.document_id"                               
## [45] "cell_line.biomaterial_core.biomaterial_id"                      
## [46] "organoid.provenance.document_id"                                
## [47] "organoid.biomaterial_core.biomaterial_id"                       
## [48] "organoid.model_organ"                                           
## [49] "organoid.model_organ_part"                                      
## [50] "_entity_type"                                                   
## [51] "sample.provenance.document_id"                                  
## [52] "sample.biomaterial_core.biomaterial_id"                         
## [53] "sequencing_input.provenance.document_id"                        
## [54] "sequencing_input.biomaterial_core.biomaterial_id"               
## [55] "sequencing_input_type"
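
The effect shown above, where manifest columns are appended to the per-cell colData, resembles a left join between a per-cell table and a manifest. The following sketch mimics that idea on toy data. It is not the implementation of optimus_loom_annotation(): the toy objects are hypothetical, and so is the choice of input_id as the key column on both sides.

```r
## Toy per-cell table; `input_id` links each cell to a sequencing input.
toy_coldata <- data.frame(
    CellID = c("AAAC", "AAAG", "AAAT"),
    input_id = c("id-1", "id-2", "id-1"),
    stringsAsFactors = FALSE
)

## Toy manifest with one row per input; using `input_id` as key is hypothetical.
toy_manifest_rows <- data.frame(
    input_id = c("id-1", "id-2"),
    specimen_from_organism.organ = c("lung", "blood"),
    donor_organism.sex = c("female", "male"),
    stringsAsFactors = FALSE
)

## Left join: every cell keeps its row and gains the manifest columns.
annotated <- merge(toy_coldata, toy_manifest_rows, by = "input_id", all.x = TRUE)
## merge() may reorder rows, so restore the original cell order.
annotated <- annotated[match(toy_coldata$CellID, annotated$CellID), ]

## Columns contributed by the manifest
new_columns <- setdiff(names(annotated), names(toy_coldata))
new_columns
```

Restoring the original row order matters because the columns of the assay matrix are matched to colData rows by position.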

5 Session info

sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hca_1.13.0                  LoomExperiment_1.23.0      
##  [3] BiocIO_1.15.0               rhdf5_2.49.0               
##  [5] SingleCellExperiment_1.27.0 SummarizedExperiment_1.35.0
##  [7] Biobase_2.65.0              GenomicRanges_1.57.0       
##  [9] GenomeInfoDb_1.41.0         IRanges_2.39.0             
## [11] S4Vectors_0.43.0            BiocGenerics_0.51.0        
## [13] MatrixGenerics_1.17.0       matrixStats_1.3.0          
## [15] dplyr_1.1.4                 BiocStyle_2.33.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        blob_1.2.4              filelock_1.0.3         
##  [4] fastmap_1.1.1           BiocFileCache_2.13.0    promises_1.3.0         
##  [7] digest_0.6.35           mime_0.12               lifecycle_1.0.4        
## [10] RSQLite_2.3.6           magrittr_2.0.3          compiler_4.4.0         
## [13] rlang_1.1.3             sass_0.4.9              tools_4.4.0            
## [16] utf8_1.2.4              yaml_2.3.8              knitr_1.46             
## [19] S4Arrays_1.5.0          htmlwidgets_1.6.4       bit_4.0.5              
## [22] curl_5.2.1              DelayedArray_0.31.0     abind_1.4-5            
## [25] miniUI_0.1.1.1          HDF5Array_1.33.0        withr_3.0.0            
## [28] purrr_1.0.2             grid_4.4.0              fansi_1.0.6            
## [31] xtable_1.8-4            Rhdf5lib_1.27.0         cli_3.6.2              
## [34] rmarkdown_2.26          crayon_1.5.2            generics_0.1.3         
## [37] httr_1.4.7              tzdb_0.4.0              DBI_1.2.2              
## [40] cachem_1.0.8            stringr_1.5.1           zlibbioc_1.51.0        
## [43] parallel_4.4.0          BiocManager_1.30.22     XVector_0.45.0         
## [46] vctrs_0.6.5             Matrix_1.7-0            jsonlite_1.8.8         
## [49] bookdown_0.39           hms_1.1.3               bit64_4.0.5            
## [52] archive_1.1.8           jquerylib_0.1.4         tidyr_1.3.1            
## [55] glue_1.7.0              DT_0.33                 stringi_1.8.3          
## [58] later_1.3.2             UCSC.utils_1.1.0        tibble_3.2.1           
## [61] pillar_1.9.0            htmltools_0.5.8.1       rhdf5filters_1.17.0    
## [64] GenomeInfoDbData_1.2.12 R6_2.5.1                dbplyr_2.5.0           
## [67] vroom_1.6.5             evaluate_0.23           shiny_1.8.1.1          
## [70] lattice_0.22-6          readr_2.1.5             memoise_2.0.1          
## [73] httpuv_1.6.15           bslib_0.7.0             Rcpp_1.0.12            
## [76] SparseArray_1.5.0       xfun_0.43               pkgconfig_2.0.3