This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as MaxQuant or IsobarQuant. Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report (see the chapter on Workflow functions). Even easier to use are the interactive Shiny apps that are provided by the package (see the chapter on Shiny apps).
The most recent version of R can be downloaded from CRAN. Additional system requirements are Pandoc, as well as Rtools for Windows and NetCDF for Linux.
Start R and install the DEP package:
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("DEP") library("DEP")
The package contains shiny apps, which allow for the interactive analysis entailing the entire workflow described below. These apps are especially relevant for users with limited or no experience in R. Currently, there are two different apps. One for label-free quantification (LFQ)-based analysis (output from MaxQuant) and the second for tandem-mass-tags (TMT)-based analysis (output from IsobarQuant). To run the apps, simply run this single command:
# For LFQ analysis run_app("LFQ") # For TMT analysis run_app("TMT")
We analyze a proteomics dataset in which Ubiquitin-protein interactors were characterized (Zhang et al. Mol Cell 2017). The raw mass spectrometry data were first analyzed using MaxQuant (Cox and Mann, Nat Biotech 2007) and the resulting “proteinGroups.txt” file is used as input for the downstream analysis.
# Loading a package required for data handling library("dplyr") # The data is provided with the package data <- UbiLength # We filter for contaminant proteins and decoy database hits, which are indicated by "+" in the columns "Potential.contaminants" and "Reverse", respectively. data <- filter(data, Reverse != "+", Potential.contaminant != "+")
The data.frame has the following dimensions:
##  2941 23
The data.frame has the following column names:
##  "Protein.IDs" "Majority.protein.IDs" ##  "Protein.names" "Gene.names" ##  "Fasta.headers" "Peptides" ##  "Razor...unique.peptides" "Unique.peptides" ##  "LFQ.intensity.Ubi4_1" "LFQ.intensity.Ubi4_2" ##  "LFQ.intensity.Ubi4_3" "LFQ.intensity.Ubi6_1" ##  "LFQ.intensity.Ubi6_2" "LFQ.intensity.Ubi6_3" ##  "LFQ.intensity.Ctrl_1" "LFQ.intensity.Ctrl_2" ##  "LFQ.intensity.Ctrl_3" "LFQ.intensity.Ubi1_1" ##  "LFQ.intensity.Ubi1_2" "LFQ.intensity.Ubi1_3" ##  "Only.identified.by.site" "Reverse" ##  "Potential.contaminant"
The “LFQ.intensity” columns will be used for subsequent analysis.
The dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique.
# Are there any duplicated gene names? data$Gene.names %>% duplicated() %>% any()
##  TRUE
# Make a table of duplicated gene names data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>% arrange(desc(frequency)) %>% filter(frequency > 1)
## # A tibble: 51 × 2 ## Gene.names frequency ## <chr> <int> ## 1 "" 7 ## 2 "ATXN2" 4 ## 3 "ATXN2L" 4 ## 4 "SF1" 4 ## 5 "HSPA8" 3 ## 6 "RBM33" 3 ## 7 "UGP2" 3 ## 8 "ACTL6A" 2 ## 9 "BCLAF1" 2 ## 10 "BRAP" 2 ## # ℹ 41 more rows
For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.
# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name. data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";") # Are there any duplicated names? data$name %>% duplicated() %>% any()
##  FALSE
Many Bioconductor packages use SummarizedExperiment objects as input and/or output. This class of objects contains and coordinates the actual (assay) data, information on the samples as well as feature annotation. We can generate the SummarizedExperiment object from our data using two different approaches. We can extract sample information directly from the column names or we add sample information using an experimental design template. An experimental design is included in the package for our example dataset.
The experimental design must contain ‘label’, ‘condition’ and ‘replicate’ columns. The ‘label’ column contains the identifiers of the different samples and they should correspond to the column names containing the assay data. The ‘condition’ and ‘replicate’ columns contain the annotation of these samples as defined by the user.
# Generate a SummarizedExperiment object using an experimental design LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers experimental_design <- UbiLength_ExpDesign data_se <- make_se(data_unique, LFQ_columns, experimental_design) # Generate a SummarizedExperiment object by parsing condition information from the column names LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers data_se_parsed <- make_se_parse(data_unique, LFQ_columns) # Let's have a look at the SummarizedExperiment object data_se
## class: SummarizedExperiment ## dim: 2941 12 ## metadata(0): ## assays(1): '' ## rownames(2941): RBM47 UBA6 ... ATXN2.3 X6RHB9 ## rowData names(13): Protein.IDs Majority.protein.IDs ... name ID ## colnames(12): Ubi4_1 Ubi4_2 ... Ubi1_2 Ubi1_3 ## colData names(4): label ID condition replicate
The make_se and make_se_parse functions generate a SummarizedExperiment object that has a couple of specifications. The assay data is log2-transformed and its rownames depict the protein names. The rowData contains, amongst others, the ‘name’ and ‘ID’ columns that were generated by make_unique. The colData contains the experimental design and thereby the sample annotation. Thereby the colData includes the ‘label’, ‘condition’ and ‘replicate’ columns as well as a newly generated ‘ID’ column. The log2-transformed assay data and the specified rowData and colData columns are prerequisites for the subsequent analysis steps.
The dataset contains proteins which are not quantified in all replicates. Some proteins are even only quantified in a single replicate.
# Plot a barplot of the protein identification overlap between samples plot_frequency(data_se)
This leaves our dataset with missing values, which need to be imputed. However, this should not be done for proteins that contain too many missing values. Therefore, we first filter out proteins that contain too many missing values. This is done by setting the threshold for the allowed number of missing values per condition in the filter_missval function.
# Filter for proteins that are identified in all replicates of at least one condition data_filt <- filter_missval(data_se, thr = 0) # Less stringent filtering: # Filter for proteins that are identified in 2 out of 3 replicates of at least one condition data_filt2 <- filter_missval(data_se, thr = 1)
After filtering, the number of identified proteins per sample can be plotted as well as the overlap in identifications between samples.
# Plot a barplot of the number of identified proteins per samples plot_numbers(data_filt)
# Plot a barplot of the protein identification overlap between samples plot_coverage(data_filt)
The data is background corrected and normalized by variance stabilizing transformation (vsn).
# Normalize the data data_norm <- normalize_vsn(data_filt)
The normalization can be inspected by checking the distributions of the samples before and after normalization.
# Visualize normalization by boxplots for all samples before and after normalization plot_normalization(data_filt, data_norm)
The remaining missing values in the dataset need to be imputed. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others. Data can also be missing not at random (MNAR), for example if proteins are not quantified in specific conditions (e.g. in the control samples). MNAR can indicate that proteins are below the detection limit in specific samples, which could be very well the case in proteomics experiments. For these different conditions, different imputation methods have to be used, as described in the MSnbase vignette and more specifically in the impute function descriptions.
To explore the pattern of missing values in the data, a heatmap is plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.
# Plot a heatmap of proteins with missing values plot_missval(data_filt)
This heatmap indicates that missing values are highly biased to specific samples. The example dataset is an affinity enrichment dataset of ubiquitin interactors, which is likely to have proteins which are below the detection limit in specific samples. These can be proteins that are specifically enriched in the ubiquitin purifications, but are not enriched in the controls samples, or vice versa. To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.
# Plot intensity distributions and cumulative fraction of proteins with and without missing values plot_detect(data_filt)
Indeed the proteins with missing values have on average low intensities. This data (MNAR and close to the detection limit) should be imputed by a left-censored imputation method, such as the quantile regression-based left-censored function (“QRILC”) or random draws from a left-shifted distribution (“MinProb” and “man”). In contrast, MAR data should be imputed with methods such as k-nearest neighbor (“knn”) or maximum likelihood (“MLE”) functions. See the MSnbase vignette and more specifically the impute function description for more information.
# All possible imputation methods are printed in an error, if an invalid function name is given. impute(data_norm, fun = "")
## Error in match.arg(fun): 'arg' should be one of "bpca", "knn", "QRILC", "MLE", "MinDet", "MinProb", "man", "min", "zero", "mixed", "nbavg"
# Impute missing data using random draws from a Gaussian distribution centered around a minimal value (for MNAR) data_imp <- impute(data_norm, fun = "MinProb", q = 0.01) # Impute missing data using random draws from a manually defined left-shifted Gaussian distribution (for MNAR) data_imp_man <- impute(data_norm, fun = "man", shift = 1.8, scale = 0.3) # Impute missing data using the k-nearest neighbour approach (for MAR) data_imp_knn <- impute(data_norm, fun = "knn", rowmax = 0.9)
The effect of the imputation on the distributions can be visualized.
# Plot intensity distributions before and after imputation plot_imputation(data_norm, data_imp)