Paul J. McMurdie and Susan Holmes
If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:
phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217
To post feature requests or ask for help, try the phyloseq Issue Tracker.
The analysis of microbiological communities brings many challenges: the integration of many different types of data with methods from ecology, genetics, phylogenetics, network analysis, visualization and testing. The data itself may originate from widely different sources, such as the microbiomes of humans, soils, surface and ocean waters, wastewater treatment plants, industrial facilities, and so on. As a result, these varied sample types may have very different forms and scales of related data, which depend heavily on the experiment and its question(s). The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, a phylogenetic tree, and/or a taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as a single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.
code font - The font for code, usually Courier-like, but depends on the theme.
myFun() - A code-font word with () attached at the right end is a function name.
The class structure in the phyloseq package follows the inheritance diagram shown in the figure below.
Currently, phyloseq uses 4 core data classes. They are (1) the OTU abundance table (otu_table), (2) a table of sample data (sample_data), (3) a table of taxonomic descriptors (taxonomyTable), and (4) a phylogenetic tree ("phylo"-class, ape package).
The otu_table class can be considered the central data type, as it directly represents the number and type of sequences observed in each sample. otu_table extends the numeric matrix class of base R, and has a few additional feature slots. The most important of these feature slots is the taxa_are_rows slot, which holds a single logical that indicates whether the table is oriented with taxa as rows (as in the genefilter package in Bioconductor) or with taxa as columns (as in the vegan and picante packages). In phyloseq methods, as well as its extensions of methods in other packages, the taxa_are_rows value is checked to ensure proper orientation of the otu_table. A phyloseq user is only required to specify the otu_table orientation during initialization, following which all handling is internal.
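As a minimal sketch of the orientation flag described above, the following builds an otu_table from a small matrix. The counts and the OTU and sample names are hypothetical values invented for illustration; only otu_table(), taxa_are_rows(), ntaxa(), nsamples(), and t() come from phyloseq.

```r
library(phyloseq)

# Hypothetical toy counts: 3 OTUs (rows) by 2 samples (columns).
counts <- matrix(
  c(10, 0,
     3, 7,
     0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("SampleA", "SampleB"))
)

# The orientation is declared once, at initialization.
otu <- otu_table(counts, taxa_are_rows = TRUE)

taxa_are_rows(otu)  # the orientation flag stored with the table
ntaxa(otu)          # number of OTUs, independent of orientation
nsamples(otu)       # number of samples, independent of orientation

# Transposing the table also flips the stored orientation flag.
otu_t <- t(otu)
taxa_are_rows(otu_t)
```

Because the flag travels with the table, ntaxa() and nsamples() should report the same values for otu and otu_t.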
The sample_data class directly inherits R's data.frame class, and thus effectively stores both categorical and numerical data about each sample. The orientation of a data.frame in this context requires that samples/trials are rows, and variables are columns (consistent with vegan and other packages). The taxonomyTable class directly inherits the matrix class, and is oriented such that rows are taxa/OTUs and columns are taxonomic levels (e.g. Phylum).
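The two orientations can be illustrated with a short sketch. All sample names, variables, and taxonomic assignments below are hypothetical; sample_data() and tax_table() are the phyloseq constructors for these classes.

```r
library(phyloseq)

# Hypothetical sample observations: samples are rows, variables are columns.
sam <- sample_data(data.frame(
  Site  = c("gut", "soil"),
  Depth = c(1200, 950),
  row.names = c("SampleA", "SampleB")
))

# Hypothetical taxonomy: taxa/OTUs are rows, taxonomic levels are columns.
tax <- tax_table(matrix(
  c("Firmicutes",     "Clostridia",
    "Bacteroidetes",  "Bacteroidia",
    "Proteobacteria", "Gammaproteobacteria"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("Phylum", "Class"))
))

is(sam, "data.frame")  # sample_data extends data.frame
is(tax, "matrix")      # taxonomyTable extends matrix

sample_names(sam)      # taken from the row names of the sample data
taxa_names(tax)        # taken from the row names of the taxonomy table
colnames(tax)          # the taxonomic levels
```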
The phyloseq-class can be considered an “experiment-level class” and should contain two or more of the previously-described core data classes. We assume that phyloseq users will be interested in analyses that utilize their abundance counts derived from the phylogenetic sequencing data, and so the phyloseq() constructor will stop with an error if the arguments do not include an otu_table. There are a number of common methods that require either an otu_table and sample_data combination, or an otu_table and phylogenetic tree combination. These methods can operate on instances of the phyloseq-class, and will stop with an error if the required component data is missing.
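A minimal sketch of building an experiment-level object from components follows. The data are hypothetical toy values; the sample and OTU names are chosen to match across components, since this sketch assumes the components describe the same samples and taxa.

```r
library(phyloseq)

# Hypothetical toy components with matching OTU and sample names.
otu <- otu_table(
  matrix(c(10, 0, 3, 7, 0, 5), nrow = 3, byrow = TRUE,
         dimnames = list(c("OTU1", "OTU2", "OTU3"), c("SampleA", "SampleB"))),
  taxa_are_rows = TRUE
)
sam <- sample_data(data.frame(
  Site = c("gut", "soil"),
  row.names = c("SampleA", "SampleB")
))

# Combine the components into one experiment-level object.
ps <- phyloseq(otu, sam)
ps

# No tree was supplied, so the tree accessor has nothing to return.
# With errorIfNULL = FALSE it returns NULL instead of stopping.
tree <- phy_tree(ps, errorIfNULL = FALSE)
is.null(tree)
```

Methods that need a tree would be expected to stop on this object, which is the behavior the paragraph above describes for missing component data.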
Classes and inheritance in the phyloseq package. The class name and its slots are shown with red- or blue-shaded text, respectively. Coercibility is indicated graphically by arrows with the coercion function shown. Lines without arrows indicate that the more complex class (“phyloseq”) contains a slot with the associated data class as its components.
Now let’s get started by loading phyloseq, and describing some methods for importing data.
To use phyloseq in a new R session, it will have to be loaded. This can be done in your package manager, or at the command line using the library() command.
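For example, assuming phyloseq is already installed, loading it at the command line looks like this:

```r
# Attach phyloseq for the current R session.
library("phyloseq")

# Optionally confirm which version was loaded.
packageVersion("phyloseq")
```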
An important feature of phyloseq is its set of methods for importing phylogenetic sequencing data from common taxonomic clustering pipelines. These methods take file pathnames as input, read and parse those files, and return a single object that contains all of the data.
Some additional background details are provided below. The best reproducible examples on importing data with phyloseq can be found on the official data import tutorial page:
New versions of QIIME (see below) produce a file in version 2 of the BIOM file format, which is a specialized definition of the HDF5 format. The phyloseq package provides the import_biom() function, which can import both Version 1 (JSON) and Version 2 (HDF5) of the BIOM file format.
The phyloseq package fully supports both taxa and sample observations of the biom format standard, and works with the BIOM files output from QIIME, RDP, MG-RAST, etc.
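A minimal sketch of a BIOM import, assuming the small example file rich_sparse_otu_table.biom that ships in the extdata directory of the phyloseq package. The variable names are illustrative only.

```r
library(phyloseq)

# Locate an example BIOM file distributed with the package.
biom_file <- system.file("extdata", "rich_sparse_otu_table.biom",
                         package = "phyloseq")

# Read and parse the file into a single object.
biom_data <- import_biom(biom_file)

biom_data               # summary of the components that were imported
sample_names(biom_data) # sample identifiers found in the file
```

For your own data, replace biom_file with the path to your BIOM file.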
The default output from modern versions of QIIME is a BIOM-format file (among others). This is supported in phyloseq.
Sometimes inaccurately referred to as metadata, the additional observations on samples that are provided to QIIME as a mapping file have not typically been output in the BIOM files, even though the BIOM format supports them. This failure to support the full capability of the BIOM format means that you’ll have to provide sample observations as a separate file. There are many ways to do this, but the QIIME sample map is supported.
Two QIIME output files (.biom, .tre) are recognized by the import_biom() function.
One QIIME input file (sample map, tab-delimited) is recognized by the import_qiime_sample_data() function.
The objects created by each of the import functions above should be merged using merge_phyloseq to create one coordinated, self-consistent object. Before applying merge_phyloseq, the output from these import activities is the three separate objects listed in the previous table.
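The merge step can be sketched with example files from the extdata directory of the phyloseq package, used here as stand-ins for QIIME output. This is an assumption made for illustration: rich_sparse_otu_table.biom and biom-tree.phy are small package examples, not the Moving Pictures files, and the variable names are hypothetical.

```r
library(phyloseq)

# Stand-in files shipped with phyloseq (not actual QIIME tutorial output).
biom_file <- system.file("extdata", "rich_sparse_otu_table.biom",
                         package = "phyloseq")
tree_file <- system.file("extdata", "biom-tree.phy", package = "phyloseq")

# Import each piece separately, as if from separate files.
biom_data <- import_biom(biom_file)
tree      <- read_tree(tree_file)

# Merge the separately imported pieces into one coordinated object.
merged <- merge_phyloseq(biom_data, tree)
merged
```

A sample map imported with import_qiime_sample_data() could be passed to merge_phyloseq() in the same way, provided its sample names match those in the other components.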
QIIME’s “Moving Pictures” example tutorial output is a little too large to include within the phyloseq package (and thus is not directly included in this vignette). However, the phyloseq home page includes a full reproducible example of the import procedure described above:
For reference, or if you want to try it yourself, the following are the relative paths within the QIIME tutorial directory for each of the files you will need.
QIIME is a free, open-source OTU clustering and analysis pipeline written for Unix (mostly Linux). It is distributed in a number of different forms (including a pre-installed virtual machine). See the QIIME home page for details.
One QIIME input file (sample map) and two QIIME output files (otu_table.txt, .tre) are recognized by the import_qiime() function. Only one of the three input files is required to run, although an "otu_table.txt" file is required if import_qiime() is to return a complete experiment object.
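A minimal sketch of a legacy QIIME import, assuming the small example files that ship in the extdata directory of the phyloseq package (a shortened OTU table, a sample map, and a tree). The variable names are illustrative only.

```r
library(phyloseq)

# Example legacy-QIIME files distributed with the package.
otu_file <- system.file("extdata", "GP_otu_table_rand_short.txt.gz",
                        package = "phyloseq")
map_file <- system.file("extdata", "master_map.txt", package = "phyloseq")
tre_file <- system.file("extdata", "GP_tree_rand_short.newick.gz",
                        package = "phyloseq")

# Import all three files into one experiment-level object.
qiime_data <- import_qiime(otufilename  = otu_file,
                           mapfilename  = map_file,
                           treefilename = tre_file)
qiime_data
```

Supplying only some of the three files is also possible, in which case the returned object contains only the corresponding components.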
In practice, you will have to find the relevant QIIME files among a number of other files created by the QIIME pipeline. A screenshot of the directory structure created during a typical QIIME run is shown in the QIIME Directory Figure.