HMP16SData 1.18.0
Schiffer, L. et al. HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor. Am. J. Epidemiol. (2019).
Griffith, J. C. & Morgan, X. C. Invited Commentary: Improving accessibility of the Human Microbiome Project data through integration with R/Bioconductor. Am. J. Epidemiol. (2019).
Waldron, L. et al. Waldron et al. Reply to “Commentary on the HMP16SData Bioconductor Package”. Am. J. Epidemiol. (2019).
The following knitr options will be used in this vignette to provide the most useful and concise output.
knitr::opts_chunk$set(message = FALSE)
The following packages will be used in this vignette to provide demonstrative examples of what a user might do with HMP16SData.
library(HMP16SData)
library(phyloseq)
library(magrittr)
library(ggplot2)
library(tibble)
library(dplyr)
library(dendextend)
library(circlize)
library(ExperimentHub)
library(gridExtra)
library(cowplot)
library(readr)
library(haven)
Pipe operators from the magrittr package are used in this vignette to provide the most elegant and concise syntax. See the magrittr vignette if the syntax is unclear.
HMP16SData is a Bioconductor ExperimentData package of
the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions
1–3 and 3–5. Raw data files are provided in the package as downloaded from the
HMP Data Analysis and Coordination Center.
Processed data is provided as SummarizedExperiment
class objects via
ExperimentHub.
HMP16SData can be installed using BiocManager as follows.
BiocManager::install("HMP16SData")
Once installed, HMP16SData provides two functions to
access data – one for variable region 1–3 and another for variable region 3–5.
When called, as follows, the functions will download data from an
ExperimentHub Amazon S3 (Simple Storage Service)
bucket over https
or load data from a local cache.
V13()
## class: SummarizedExperiment
## dim: 43140 2898
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
## rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
## colnames(2898): 700013549 700014386 ... 700114963 700114965
## colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID
V35()
## class: SummarizedExperiment
## dim: 45383 4743
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(45383): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
## rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
## colnames(4743): 700013549 700014386 ... 700114717 700114750
## colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID
The two data sets are represented as SummarizedExperiment
objects, a standard
Bioconductor class that is amenable to subsetting and analysis. To maintain
brevity, details of the SummarizedExperiment
class are not outlined here but
the SummarizedExperiment package provides an excellent
vignette.
Sometimes it is desirable to provide a quick summary of key demographic
variables and to make the process easier HMP16SData
provides a function, table_one
, to do so. It returns a data.frame
or a
list
of data.frame
objects that have been transformed to make a publication
ready table.
V13() %>%
table_one() %>%
head()
## Visit Number Sex Run Center
## 700013549 One Female Baylor College of Medicine
## 700014386 One Male Baylor College of Medicine, Broad Institute
## 700014403 One Male Baylor College of Medicine, Broad Institute
## 700014409 One Male Baylor College of Medicine, Broad Institute
## 700014412 One Male Baylor College of Medicine, Broad Institute
## 700014415 One Male Baylor College of Medicine, Broad Institute
## HMP Body Site HMP Body Subsite
## 700013549 Gastrointestinal Tract Stool
## 700014386 Gastrointestinal Tract Stool
## 700014403 Oral Saliva
## 700014409 Oral Tongue Dorsum
## 700014412 Oral Hard Palate
## 700014415 Oral Buccal Mucosa
If a list
is passed to table_one
, its elements must be named so that the
named elements can be used by the kable_one
function. The kable_one
function
will produce an HTML
table for vignettes such as the one shown below.
list(V13 = V13(), V35 = V35()) %>%
table_one() %>%
kable_one()
N | % | N | % | |
---|---|---|---|---|
Visit Number | ||||
One | 1,642 | 56.66 | 2,822 | 59.50 |
Two | 1,244 | 42.93 | 1,897 | 40.00 |
Three | 12 | 0.41 | 24 | 0.51 |
Sex | ||||
Female | 1,521 | 52.48 | 2,188 | 46.13 |
Male | 1,377 | 47.52 | 2,555 | 53.87 |
Run Center | ||||
Genome Sequencing Center at Washington University | 1,717 | 59.25 | 1,539 | 32.45 |
J. Craig Venter Institute | 506 | 17.46 | 1,009 | 21.27 |
Broad Institute | 0 | 0.00 | 1,078 | 22.73 |
Baylor College of Medicine | 289 | 9.97 | 696 | 14.67 |
J. Craig Venter Institute, Genome Sequencing Center at Washington University | 97 | 3.35 | 93 | 1.96 |
Baylor College of Medicine, Genome Sequencing Center at Washington University | 91 | 3.14 | 90 | 1.90 |
Baylor College of Medicine, Broad Institute | 76 | 2.62 | 103 | 2.17 |
J. Craig Venter Institute, Broad Institute | 86 | 2.97 | 75 | 1.58 |
Baylor College of Medicine, J. Craig Venter Institute | 15 | 0.52 | 40 | 0.84 |
Broad Institute, Baylor College of Medicine | 13 | 0.45 | 13 | 0.27 |
Genome Sequencing Center at Washington University, Baylor College of Medicine | 6 | 0.21 | 6 | 0.13 |
Genome Sequencing Center at Washington University, J. Craig Venter Institute | 1 | 0.03 | 1 | 0.02 |
Missing | 1 | 0.03 | 0 | 0.00 |
HMP Body Site | ||||
Oral | 1,622 | 55.97 | 2,774 | 58.49 |
Skin | 664 | 22.91 | 990 | 20.87 |
Urogenital Tract | 264 | 9.11 | 391 | 8.24 |
Gastrointestinal Tract | 187 | 6.45 | 319 | 6.73 |
Airways | 161 | 5.56 | 269 | 5.67 |
HMP Body Subsite | ||||
Tongue Dorsum | 190 | 6.56 | 316 | 6.66 |
Stool | 187 | 6.45 | 319 | 6.73 |
Supragingival Plaque | 189 | 6.52 | 313 | 6.60 |
Right Retroauricular Crease | 187 | 6.45 | 297 | 6.26 |
Attached Keratinized Gingiva | 181 | 6.25 | 313 | 6.60 |
Left Retroauricular Crease | 186 | 6.42 | 285 | 6.01 |
Buccal Mucosa | 183 | 6.31 | 312 | 6.58 |
Palatine Tonsils | 186 | 6.42 | 312 | 6.58 |
Subgingival Plaque | 183 | 6.31 | 309 | 6.51 |
Throat | 170 | 5.87 | 307 | 6.47 |
Hard Palate | 178 | 6.14 | 302 | 6.37 |
Saliva | 162 | 5.59 | 290 | 6.11 |
Anterior Nares | 161 | 5.56 | 269 | 5.67 |
Right Antecubital Fossa | 146 | 5.04 | 207 | 4.36 |
Left Antecubital Fossa | 145 | 5.00 | 201 | 4.24 |
Mid Vagina | 89 | 3.07 | 133 | 2.80 |
Posterior Fornix | 88 | 3.04 | 133 | 2.80 |
Vaginal Introitus | 87 | 3.00 | 125 | 2.64 |
The SummarizedExperiment
container provides for straightforward subsetting by
data or metadata variables using either the subset
function or [
methods –
see the SummarizedExperiment vignette for additional
details. Shown below, the variable region 3–5 data set is subset to include only
stool samples.
V35_stool <-
V35() %>%
subset(select = HMP_BODY_SUBSITE == "Stool")
V35_stool
## class: SummarizedExperiment
## dim: 45383 319
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(45383): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
## rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
## colnames(319): 700013549 700014386 ... 700114717 700114750
## colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID
Most participant data from the HMP study is controlled through the National
Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes
(dbGaP). HMP16SData provides a data dictionary
translated from dbGaP XML
files for the seven different controlled-access data
tables related to the HMP. See ?HMP16SData::dictionary
for details of these
source data tables, and View(dictionary)
to view the complete data dictionary.
Several steps are required to access the data tables, but the process is greatly
simplified by HMP16SData.
You must make a controlled-access application through https://dbgap.ncbi.nlm.nih.gov for:
HMP Core Microbiome Sampling Protocol A (HMP-A) (phs000228.v4.p1)
Once approved, browse to https://dbgap.ncbi.nlm.nih.gov, sign in, and select
the option “get dbGaP repository key” to download your *.ngc
repository key.
This is all you need from the dbGaP website.
You must also install the NCBI SRA Toolkit, which will be used in the background for downloading and decrypting controlled-access data.
There are shortcuts for common platforms:
apt install sra-toolkit
brew install sratoolkit
For macOS, the brew
command does not come installed by default and requires
installation of the homebrew package manager. Instructions are available at
https://tinyurl.com/ybeqwl8f.
For Windows, binary installation is necessary and instructions are available at https://tinyurl.com/y845ppaa.
The attach_dbGap()
function takes a HMP16SData
SummarizedExperiment
object as its first argument and the path to a dbGaP
repository key as its second argument. It performs download, decryption, and
merging of all available controlled-access participant data with a single
function call.
V35_stool_protected <-
V35_stool %>%
attach_dbGaP("~/prj_12146.ngc")
The returned V35_stool_protected
object contains controlled-access participant
data as additional columns in its colData
slot.
colData(V35_stool_protected)
The phyloseq package provides an extensive suite of methods to analyze microbiome data.
For those familiar with both the HMP and phyloseq, you
may recall that an alternative phyloseq
class object containing the HMP
variable region 3–5 data has been made available by Joey McMurdie at
https://joey711.github.io/phyloseq-demo/HMPv35.RData. However, this object is
not compatible with the methods documented here for integration with dbGaP
controlled-access participant data, shotgun metagenomic data, or variable region
1–3 data. For that reason, we would encourage the use of the
HMP16SData SummarizedExperiment
class objects with
the phyloseq package.
To demonstrate how HMP16SData could be used as a
control or comparison cohort in microbime data analyses, we will demonstrate
basic comparisons of the palatine tonsils and stool body subsites using the
phyloseq package. We first create and subset two
SummarizedExperiment
objects from HMP16SData to
include only the relevant body subsites.
V13_tonsils <-
V13() %>%
subset(select = HMP_BODY_SUBSITE == "Palatine Tonsils")
V13_stool <-
V13() %>%
subset(select = HMP_BODY_SUBSITE == "Stool")
While these objects are both from the HMP16SData package, a user would potentially be comparing to their own data and only need a single object from the package.
The SummarizedExperiment
class objects can then be coerced to phyloseq
class
objects containing count data, sample (participant) data, taxonomy, and
phylogenetic trees using the as_phyloseq
function.
V13_tonsils_phyloseq <-
as_phyloseq(V13_tonsils)
V13_stool_phyloseq <-
as_phyloseq(V13_stool)
The analysis of all the samples in these two phyloseq
objects would be rather
computationally intensive. So to preform the analysis in a more timely manner, a
function, sample_samples
, is written here to take a sample of the samples
available in each phyloseq
object.
sample_samples <- function(x, size) {
sampled_names <-
sample_names(x) %>%
sample(size)
prune_samples(sampled_names, x)
}
Each phyloseq
object is then sampled to contain only twenty-five samples.
V13_tonsils_phyloseq %<>%
sample_samples(25)
V13_stool_phyloseq %<>%
sample_samples(25)
A “Study” identifier is then added to the sample_data
of each phyloseq
object to be used for stratification in analysis. In the case that a user were
comparing the HMP samples to their own data, an identifier would be added in the
same manner.
sample_data(V13_tonsils_phyloseq)$Study <- "Tonsils"
sample_data(V13_stool_phyloseq)$Study <- "Stool"
Once the two phyloseq
objects have been sampled and their sample_data
has
been augmented, they can be merged into a single phyloseq
object using the
merge_phyloseq
command.
V13_phyloseq <-
merge_phyloseq(V13_tonsils_phyloseq, V13_stool_phyloseq)
Finally, because the V13 data were subset and sampled, taxa with no relative
abundance are present in the merged object. These are removed using the
prune_taxa
command to avoid warnings during analysis.
V13_phyloseq %<>%
taxa_sums() %>%
is_greater_than(0) %>%
prune_taxa(V13_phyloseq)
The resulting V13_phyloseq
object can then be analyzed quickly and easily.
Alpha diversity measures the taxonomic variation within a sample and
phyloseq provides a method, plot_richness
, to plot
various alpha diversity measures.
First a vector of richness (i.e. alpha diversity) measures is created to be
passed to the plot_richness
method.
richness_measures <-
c("Observed", "Shannon", "Simpson")
The V13_phyloseq
object and the richness_measures
vector are then passed to
the plot_richness
method to construct a box plot of the three alpha diversity
measures. Additional ggplot2 syntax is used to control
the presentational aspects of the plot.
V13_phyloseq %>%
plot_richness(x = "Study", color = "Study", measures = richness_measures) +
stat_boxplot(geom ="errorbar") +
geom_boxplot() +
theme_bw() +
theme(axis.title.x = element_blank(), legend.position = "none")
Beta diversity measures the taxonomic variation between samples by calculating
the dissimilarity of clade relative abundances. The
phyloseq package provides a method, distance
, to
calculate various dissimilarity measures, such as Bray–Curtis dissimilarity.
Once dissimilarity has been calculated, samples can then be clustered and
represented as a dendrogram.
V13_dendrogram <-
distance(V13_phyloseq, method = "bray") %>%
hclust() %>%
as.dendrogram()
However, coercion to a dendrogram
object results in the lost of sample_data
present in the phyloseq
object which is needed for plotting. A data.frame
of
this sample_data
can be extracted from the phyloseq
object as follows.
V13_sample_data <-
sample_data(V13_phyloseq) %>%
data.frame()
Samples in the the plots will be identified by “PSN”" (Primary Sample Number)
and “Study”. So, additional columns to denote the colors and shapes of leaves
and labels are added to the data.frame
using dplyr
syntax.
V13_sample_data %<>%
rownames_to_column(var = "PSN") %>%
mutate(labels_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
mutate(leaves_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
mutate(leaves_pch = if_else(Study == "Stool", 16, 17))
Additionally, the order of samples in the dendrogram
and data.frame
objects
is different and a vector to sort samples is constructed as follows.
V13_sample_order <-
labels(V13_dendrogram) %>%
match(V13_sample_data$PSN)
The label and leaf color and shape columns of the data.frame
object can then
be coerced to vectors and sorted according to the sample order of the
dendrogram
object.
labels_col <- V13_sample_data$labels_col[V13_sample_order]
leaves_col <- V13_sample_data$leaves_col[V13_sample_order]
leaves_pch <- V13_sample_data$leaves_pch[V13_sample_order]
The dendextend package is then used to add these
vectors to the dendrogram
object as metadata which will be used for plotting.
V13_dendrogram %<>%
set("labels_col", labels_col) %>%
set("leaves_col", leaves_col) %>%
set("leaves_pch", leaves_pch)
Finally, the dendextend package provides a method,
circlize_dendrogram
, to produce a circular dendrogram, using the
circlize package, with a single line of code.
V13_dendrogram %>%
circlize_dendrogram(labels_track_height = 0.2)