Foreword

MSnbase and pRoloc are under active developed; current functionality is evolving and new features are added on a regular basis. This software is free and open-source software. If you use it, please support the project by citing it in publications:

Gatto L. and Lilley K.S. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 28, 288-289 (2011).

Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS. Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014 May 1;30(9):1322-4..

Breckels LM, Mulvey CM, Lilley KS and Gatto L. A Bioconductor workflow for processing and analysing spatial proteomics data. F1000Research 2016, 5:2926 doi: 10.12688/f1000research.10411.1.

If you are using the phenoDisco function, please also cite

Breckels L.M., Gatto L., Christoforou A., Groen A.J., Kathryn Lilley K.S. and Trotter M.W. The effect of organelle discovery upon sub-cellular protein localisation. J Proteomics, S1874-3919(13)00094-8 (2013).

If you are using any of the transfer learning functions, please also cite

Breckels LM, Holden S, Wonjar D, Mulvey CM, Christoforou A, Groen A, Trotter MW, Kohlbacker O, Lilley KS and Gatto L (2016). Learning from heterogeneous data sources: an application in spatial proteomics. PLoS Comput Biol 13;12(5):e1004920. doi: 10.1371/journal.pcbi.1004920.

If you are using any of the Bayesian generative models, please also cite

A Bayesian Mixture Modelling Approach For Spatial Proteomics Oliver M Crook, Claire M Mulvey, Paul D. W. Kirk, Kathryn S Lilley, Laurent Gatto bioRxiv 282269; doi: https://doi.org/10.1101/282269

For an introduction to spatial proteomics data analysis:

Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell C, Nikolovski N, Mulvey CM, Christoforou A, Ferro M, Lilley KS. A foundation for reliable spatial proteomics data analysis. Mol Cell Proteomics. 2014 Aug;13(8):1937-52. doi: 10.1074/mcp.M113.036350.

The pRoloc package contains additional vignettes and reference material:

  • pRoloc-tutorial: pRoloc tutorial (this vignette).
  • pRoloc-ml: Machine learning techniques available in pRoloc.
  • pRoloc-transfer-learning: A transfer learning algorithm for spatial proteomics.
  • pRoloc-goannotations: Annotating spatial proteomics data.
  • pRoloc-bayesian: Bayesian spatial proteomics with pRoloc.

Questions and bugs

You are welcome to contact me directly about pRoloc. For bugs, typos, suggestions or other questions, please file an issue in our issue tracking system (https://github.com/lgatto/pRoloc/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction

1.1 Spatial proteomics

Spatial (or organelle) proteomics is the study of the localisation of proteins inside cells. The sub-cellular compartment can be organelles, i.e. structures defined by lipid bi-layers,macro-molecular assemblies of proteins and nucleic acids or large protein complexes. In this document, we will focus on mass-spectrometry based approaches that assay a population of cells, as opposed as microscopy based techniques that monitor single cells, as the former is the primary concern of pRoloc, although the techniques described below and the infrastructure in place could also be applied the processed image data. The typical experimental use-case for using pRoloc is a set of fractions, originating from a total cell lysate. These fractions can originate from a continuous gradient, like in the LOPIT (Dunkley et al. 2006) or PCP (Foster et al. 2006) approaches, or can be discrete fractions. The content of the fractions is then identified and quantified (using labelled or un-labelled quantitation techniques). Using relative quantitation of known organelle residents, termed organelle markers, organelle-specific profiles along the gradient are determined and new residents are identified based on matching of these distribution profiles. See for example (Gatto et al. 2010) and references therein for a detailed review on organelle proteomics.

It should be noted that large protein complexes, that are not necessarily separately enclosed within their own lipid bi-layer, can be detected by such techniques, as long as a distinct profile can be defined across the fractions.

1.2 About R and pRoloc

R (R Development Core Team 2011) is a statistical programming language and interactive working environment. It can be expanded by so-called packages to confer new functionality to users. Many such packages have been developed for the analysis of high-throughput biology, notably through the Bioconductor project (Gentleman et al. 2004). Two packages are of particular interest here, namely MSnbase (Gatto and Lilley 2012) and pRoloc. The former provides flexible infrastructure to store and manipulate quantitative proteomics data and the associated meta-data and the latter implements specific algorithmic technologies to analyse organelle proteomics data.

Among the advantages of R are robust statistical procedures, good visualisation capabilities, excellent documentation, reproducible research1 The content of this document is compiled (the code is executed and its output, text and figures, is displayed dynamically) to generate the pdf file., power and flexibility of the R language and environment itself and a rich environment for specialised functionality in many domains of bioinformatics: tools for many omics technologies, including proteomics, bio-statistics, gene ontology and biological pathway analysis, … Although there exists some specific graphical user interfaces (GUI), interaction with R is executed through a command line interface. While this mode of interaction might look alien to new users, experience has proven that after a first steep learning curve, great results can be achieved by non-programmers. Furthermore, specific and general documentation is plenty and beginners and advanced course material are also widely available.

Once R is started, the first step to enable functionality of a specific packages is to load them using the library function, as shown in the code chunk below:

library("MSnbase")
library("pRoloc")
library("pRolocdata")

MSnbase implements the data containers that are used by pRoloc. pRolocdata is a data package that supplies several published organelle proteomics data sets.

As a final setup step, we set the default colour palette for some of our custom plotting functionality to use semi-transparent colours in the code chunk below (see ?setStockcol for details). This facilitates visualisation of overlapping points.

setStockcol(NULL) ## reset first
setStockcol(paste0(getStockcol(), 70))

2 Data structures

2.1 Example data

The data used in this tutorial has been published in (Tan et al. 2009). The LOPIT technique (Dunkley et al. 2006) is used to localise integral and associated membrane proteins in Drosophila melanogaster embryos. Briefly, embryos were collected at 0 – 16 hours, homogenised and centrifuged to collect the supernatant, removing cell debris and nuclei. Membrane fractionation was performed on a iodixanol gradient and fractions were quantified using iTRAQ isobaric tags (Ross et al. 2004) as follows: fractions 4/5, 114; fractions 12/13, 115; fraction 19, 116 and fraction 21, 117. Labelled peptides were then separated using cation exchange chromatography and analysed by LS-MS/MS on a QSTAR XL quadrupole-time-of-flight mass spectrometer (Applied Biosystems). The original localisation analysis was performed using partial least square discriminant analysis (PLS-DA). Relative quantitation data was retrieved from the supplementary file pr800866n_si_004.xls (http://pubs.acs.org/doi/suppl/10.1021/pr800866n/suppl_file/pr800866n_si_004.xls) and imported into R as described below. We will concentrate on the first replicate.

2.2 Importing and loading data

This section illustrates how to import data in comma-separated value (csv) format into an appropriate R data structure. The first section shows the original csv (comma separated values) spreadsheet, as published by the authors, and how one can read such a file into using the read.csv function. This spreadsheet file is similar to the output of many quantitation software.

In the next section, we show 2 csv files containing a subset of the columns of original pr800866n_si_004-rep1.csv file and another short file, created manually, that will be used to create the appropriate R data.

2.2.1 The original data file

## The original data for replicate 1, available
## from the pRolocdata package
f0 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE,
      pattern = "pr800866n_si_004-rep1.csv")
csv <- read.csv(f0)

The three first lines of the original spreadsheet, containing the data for replicate one, are illustrated below (using the function head). It contains 888 rows (proteins) and 16 columns, including protein identifiers, database accession numbers, gene symbols, reporter ion quantitation values, information related to protein identification, …

head(csv, n=3)
##   Protein.ID        FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1    CG10060 FBgn0001104    G-ialpha65A               3       179.86
## 2    CG10067 FBgn0000044         Act57B               5       222.40
## 3    CG10077 FBgn0035720        CG10077               5       219.65
##   No..peptides.quantified area.114 area.115 area.116 area.117
## 1                       1 0.379000 0.281000 0.225000 0.114000
## 2                       9 0.420000 0.209667 0.206111 0.163889
## 3                       3 0.187333 0.167333 0.169667 0.476000
##   PLS.DA.classification Peptide.sequence Precursor.ion.mass
## 1                    PM                                    
## 2                    PM                                    
## 3                                                          
##   Precursor.ion.charge pd.2013 pd.markers
## 1                           PM    unknown
## 2                           PM    unknown
## 3                      unknown    unknown

2.2.2 From csv files to R data

There are several ways to create the desired R data object, termed MSnSet, that will be used to perform the actual sub-cellular localisation prediction. Here, we illustrate a method that uses separate spreadsheet files for quantitation data, feature meta-data and sample (fraction) meta-data and the readMSnSet constructor function, that will hopefully be the most straightforward for new users.

## The quantitation data, from the original data
f1 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "exprsFile.csv")
exprsCsv <- read.csv(f1)
## Feature meta-data, from the original data
f2 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "fdataFile.csv")
fdataCsv <- read.csv(f2)
## Sample meta-data, a new file
f3 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "pdataFile.csv")
pdataCsv <- read.csv(f3)
  • exprsFile.csv containing the quantitation (expression) data for the 888 proteins and 4 reporter tags.
head(exprsCsv, n=3)
##          FBgn     X114     X115     X116     X117
## 1 FBgn0001104 0.379000 0.281000 0.225000 0.114000
## 2 FBgn0000044 0.420000 0.209667 0.206111 0.163889
## 3 FBgn0035720 0.187333 0.167333 0.169667 0.476000
  • fdataFile.csv containing meta-data for the 888 features (here proteins).
head(fdataCsv, n=3)
##          FBgn ProteinID FlybaseSymbol NoPeptideIDs MascotScore
## 1 FBgn0001104   CG10060   G-ialpha65A            3      179.86
## 2 FBgn0000044   CG10067        Act57B            5      222.40
## 3 FBgn0035720   CG10077       CG10077            5      219.65
##   NoPeptidesQuantified PLSDA
## 1                    1    PM
## 2                    9    PM
## 3                    3
  • pdataFile.csv containing samples (here fractions) meta-data. This simple file has been created manually.
pdataCsv
##   sampleNames Fractions
## 1        X114       4/5
## 2        X115     12/13
## 3        X116        19
## 4        X117        21

A self-contained data structure, called MSnSet (defined in the MSnbase package) can now easily be generated using the readMSnSet constructor, providing the respective csv file names shown above and specifying that the data is comma-separated (with sep = ","). Below, we call that object tan2009r1 and display its content.

tan2009r1 <- readMSnSet(exprsFile = f1,
            featureDataFile = f2,
            phenoDataFile = f3,
            sep = ",")
tan2009r1
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: X114 X115 X116 X117
##   varLabels: Fractions
##   varMetadata: labelDescription
## featureData
##   featureNames: FBgn0001104 FBgn0000044 ... FBgn0001215 (888 total)
##   fvarLabels: ProteinID FlybaseSymbol ... PLSDA (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Quantitation data loaded: Tue May  2 17:55:25 2023  using readMSnSet. 
##  MSnbase version: 2.26.0

2.3 A shorter input work flow

The readMSnSet2 function provides a simplified import pipeline. It takes a single spreadsheet as input (default is csv) and extract the columns identified by ecol to create the expression data, while the others are used as feature meta-data. ecol can be a character with the respective column labels or a numeric with their indices. In the former case, it is important to make sure that the names match exactly. Special characters like '-' or '(' will be transformed by R into '.' when the csv file is read in. Optionally, one can also specify a column to be used as feature names. Note that these must be unique to guarantee the final object validity.

ecol <- paste("area", 114:117, sep = ".")
fname <- "Protein.ID"
eset <- readMSnSet2(f0, ecol, fname)
eset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: CG10060 CG10067 ... CG9983 (888 total)
##   fvarLabels: Protein.ID FBgn ... pd.markers (12 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.26.0

The ecol columns can also be queried interactively from R using the getEcols and grepEcols function. The former return a character with all column names, given a splitting character, i.e. the separation value of the spreadsheet (typically "," for csv, "\t" for tsv, …). The latter can be used to grep a pattern of interest to obtain the relevant column indices.

getEcols(f0, ",")
##  [1] "\"Protein ID\""              "\"FBgn\""                   
##  [3] "\"Flybase Symbol\""          "\"No. peptide IDs\""        
##  [5] "\"Mascot score\""            "\"No. peptides quantified\""
##  [7] "\"area 114\""                "\"area 115\""               
##  [9] "\"area 116\""                "\"area 117\""               
## [11] "\"PLS-DA classification\""   "\"Peptide sequence\""       
## [13] "\"Precursor ion mass\""      "\"Precursor ion charge\""   
## [15] "\"pd.2013\""                 "\"pd.markers\""
grepEcols(f0, "area", ",")
## [1]  7  8  9 10
e <- grepEcols(f0, "area", ",")
readMSnSet2(f0, e)
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: 1 2 ... 888 (888 total)
##   fvarLabels: Protein.ID FBgn ... pd.markers (12 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.26.0

The phenoData slot can now be updated accordingly using the replacement functions phenoData<- or pData<- (see ?MSnSet for details).

2.3.1 The MSnSet class

Although there are additional specific sub-containers for additional meta-data (for instance to make the object MIAPE compliant), the feature (the sub-container, or slot featureData) and sample (the phenoData slot) are the most important ones. They need to meet the following validity requirements (see figure below):

  • the number of row in the expression/quantitation data and feature data must be equal and the row names must match exactly, and

  • the number of columns in the expression/quantitation data and number of row in the sample meta-data must be equal and the column/row names must match exactly.

It is common, in the context of pRoloc to update the feature meta-data (described in section 5) by adding new columns, without breaking the objects validity. Similarly, the sample meta-data can also be updated by adding new sample variables. A detailed description of the MSnSet class is available by typing ?MSnSet in the R console.

Dimension requirements for the respective expression, feature and sample meta-data slots.

The individual parts of this data object can be accessed with their respective accessor methods:

  • the quantitation data can be retrieved with exprs(tan2009r1),
  • the feature meta-data with fData(tan2009r1) and
  • the sample meta-data with pData(tan2009r1).

The advantage of this structure is that it can be manipulated as a whole and the respective parts of the data object will remain compatible. The code chunk below, for example, shows how to extract the first 5 proteins and 2 first samples:

smallTan <- tan2009r1[1:5, 1:2]
dim(smallTan)
## [1] 5 2
exprs(smallTan)
##                 X114     X115
## FBgn0001104 0.379000 0.281000
## FBgn0000044 0.420000 0.209667
## FBgn0035720 0.187333 0.167333
## FBgn0003731 0.247500 0.253000
## FBgn0029506 0.216000 0.183000

Several data sets, including the 3 replicates from (Tan et al. 2009), are distributed as MSnSet instances in the pRolocdata package. Others include, among others, the Arabidopsis thaliana LOPIT data from (Dunkley et al. 2006) (dunkley2006) and the mouse PCP data from (Foster et al. 2006) (foster2006). Each data set can be loaded with the data function, as show below for the first replicate from (Tan et al. 2009).

data(tan2009r1)

The original marker proteins are available as a feature meta-data variables called markers.orig and the output of the partial least square discriminant analysis, applied in the original publication, in the PLSDA feature variable. The most up-to-date marker list for the experiment can be found in markers. This feature meta-data column can be added from a simple csv markers files using the addMarkers function - see ?addMarkers for details.

The organelle markers are illustrated below using the convenience function getMarkers, but could also be done manually by accessing feature variables directly using fData().

getMarkers(tan2009r1, fcol = "markers.orig")
## organelleMarkers
##            ER         Golgi            PM mitochondrion       unknown 
##            20             6            15            14           833
getMarkers(tan2009r1, fcol = "PLSDA")
## organelleMarkers
##      ER/Golgi            PM mitochondrion       unknown 
##           235           180            74           399

Important As can be seen above, some proteins are labelled "unknown", defining non marker proteins. This is a convention in many pRoloc functions. Missing annotations (empty string) will not be considered as of unknown localisation; we prefer to avoid empty strings and make the absence of known localisation explicit by using the "unknown" tag. This information will be used to separate marker and non-marker (unlabelled) proteins when proceeding with data visualisation and clustering (sections 3 and 5.1) and classification analysis (section 5.2).

2.4 pRoloc’s organelle markers

The pRoloc package distributes a set of markers that have been obtained by mining the pRolocdata datasets and curation by various members of the Cambridge Centre for Proteomics. The available marker sets can be obtained and loaded using the pRolocmarkers function:

pRolocmarkers()
## 7 marker lists available:
## Arabidopsis thaliana [atha]:
##  Ids: TAIR, 543 markers
## Drosophila melanogaster [dmel]:
##  Ids: Uniprot, 179 markers
## Gallus gallus [ggal]:
##  Ids: IPI, 102 markers
## Homo sapiens [hsap]:
##  Ids: Uniprot, 872 markers
## Mus musculus [mmus]:
##  Ids: Uniprot, 937 markers
## Saccharomyces cerevisiae [scer_sgd]:
##  Ids: SGD, 259 markers
## Saccharomyces cerevisiae [scer_uniprot]:
##  Ids: Uniprot Accession, 259 markers
head(pRolocmarkers("dmel"))
##       Q7JZN0       Q7KLV9       Q9VIU7       P15348       Q7KMP8       O01367 
##         "ER" "Proteasome"         "ER"    "Nucleus" "Proteasome"    "Nucleus"
table(pRolocmarkers("dmel"))
## 
##  Cytoskeleton            ER         Golgi      Lysosome       Nucleus 
##             7            24             7             8            21 
##            PM    Peroxisome    Proteasome  Ribosome 40S  Ribosome 60S 
##            25             4            14            22            32 
## mitochondrion 
##            15

These markers can then be added to a new MSnSet using the addMarkers function by matching the marker names (protein identifiers) and the feature names of the MSnSet. See ?addMarkers for examples.

2.5 Data processing

The quantitation data obtained in the supplementary file is normalised to the sum of intensities of each protein; the sum of fraction quantitation for each protein equals 1 (considering rounding errors). This can quickly be verified by computing the row sums of the expression data.

summary(rowSums(exprs(tan2009r1)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9990  0.9999  1.0000  1.0000  1.0001  1.0010

The normalise method (also available as normalize) from the MSnbase package can be used to obtain relative quantitation data, as illustrated below on another iTRAQ test data set, available from MSnbase. Several normalisation methods are available and described in ?normalise. For many algorithms, including classifiers in general and support vector machines in particular, it is important to properly per-process the data. Centering and scaling of the data is also available with the scale method.

In the code chunk below, we first create a test MSnSet instance2 Briefly, the itraqdata raw iTRAQ4-plex data is quantified by trapezoidation using MSnbase functionality. See the MSnbase-demo vignette for details. and illustrate the effect of normalise(..., method = "sum").

## create a small illustrative test data
data(itraqdata)
itraqdata <- quantify(itraqdata, method = "trap",
              reporters = iTRAQ4)
## the quantification data
head(exprs(itraqdata), n = 3)
##     iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## X1   1347.6158  2247.3097  3927.6931  7661.1463
## X10   739.9861   799.3501   712.5983   940.6793
## X11 27638.3582 33394.0252 32104.2879 26628.7278
summary(rowSums(exprs(itraqdata)))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##     59.06   5638.09  15344.43  38010.87  42256.61 305739.04         1
## normalising to the sum of feature intensitites
itraqnorm <- normalise(itraqdata, method = "sum")
processingData(itraqnorm)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Tue May  2 17:55:30 2023 
## Normalised (sum): Tue May  2 17:55:30 2023 
##  MSnbase version: 1.1.22
head(exprs(itraqnorm), n = 3)
##     iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## X1  0.08875373  0.1480074  0.2586772  0.5045617
## X10 0.23178064  0.2503748  0.2232022  0.2946424
## X11 0.23077081  0.2788287  0.2680598  0.2223407
summary(rowSums(exprs(itraqnorm)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1       1       1       1       1       1       1

Note above how the processing undergone by the MSnSet instances itraqdata and itraqnorm is stored in another such specific sub-container, the processingData slot.

The different features (proteins in the tan2009r1 data above, but these could also represent peptides or MS\(^2\) spectra) are characterised by unique names. These can be retrieved with the featureNames function.

head(featureNames(tan2009r1))
## [1] "P20353" "P53501" "Q7KU78" "P04412" "Q7KJ73" "Q7JZN0"

If we look back at section 2.2.2, we see that these have been automatically assigned using the first columns in the exprsFile.csv and fdataFile.csv files. It is thus crucial for these respective first columns to be identical. Similarly, the sample names can be retrieved with sampleNames(tan2009r1).

3 Data visualisation

The following sections will focus on two closely related aspects, data visualisation and data analysis (i.e. organelle assignments). Data visualisation is used in the context on quality control, to convince ourselves that the data displays the expected properties so that the output of further processing can be trusted. Visualising results of the localisation prediction is also essential, to control the validity of these results, before proceeding with orthogonal (and often expensive) dry or wet validation.

3.1 Profile plots

The underlying principle of gradient approaches is that we have separated organelles along the gradient and by doing so, generated organelle-specific protein distributions along the gradient fractions. The most natural visualisation is shown on figure 1, obtained using the sub-setting functionality of MSnSet instances and the plotDist function, as illustrated below.

## indices of the mito markers
j <- which(fData(tan2009r1)$markers.orig == "mitochondrion")
## indices of all proteins assigned to the mito
i <- which(fData(tan2009r1)$PLSDA == "mitochondrion")
plotDist(tan2009r1[i, ],
     markers = featureNames(tan2009r1)[j])