1. Introduction

1.1 Motivation for Developing Oncomix

The advent of large, well-curated databases, such as the genomic data commons, that contain RNA sequencing data from hundreds of patient tumors has made it possible to identify oncogene candidates solely based off of patterns present in mRNA expression data. Oncomix is the first method developed to identify oncogenes in a visually-interpretable manner from RNA-sequencing data in large cohorts of patients.

Oncomix is an R package for identifying oncogene candidates based off of 2-component Gaussian mixture models. It estimates parameters using the expectation maximization procedure as implemented in the R package mclust. This tutorial will demonstrate how to identify oncogene candidates from a set of mRNA sequencing data. We start by loading the package:

#devtools::install_github("dpique/oncomix", build_vignettes=T)
library(oncomix)

1.2 Distribution of Oncogene mRNA Expression

We first explore the idea of what the distribution of gene expression values for a oncogene should look like. It is known that oncogenes such as ERBB2 are overexpressed in 15-20% of all breast cancer patients. In addition, oncogenes should not be expressed in normal tissue. Based on this line of reasoning, we formulate a model for the distribution of oncogene mRNA expression values in a population of both tumor (teal curves) and normal (red-orange curves) tissue:

library(ggplot2)
oncoMixIdeal()

The x-axis represents mRNA expression values, with lower values toward the left and larger values (i.e. higher expression) toward the right. The y axis represents density. The teal curves represent the best-fitting Gaussian probability distribution (PD) over expression values from a single gene obtained from multiple tumor samples. The red-orange curves represent the PD over expression values from the same gene obtained from multiple adjacent normal tissue samples. This mixture model is applied once to the tumor data and again (separately) to the adjacent normal data, hence the 4 curves.

The advantage of applying a 2-component mixture model is that we are able to capture biologically-relevant clusters of gene expression that may naturally exist in the data. Otherwise, we might represent our data with just a single curve sitting in the middle of what really are 2 distinct clusters. Visually, we see that for a theoretical oncogene, there is a subgroup of tumors that overexpresses this gene relative to normal tissue.

1.3 Comparison of Oncomix to Existing Differential Expression Methods

We now conceptually compare oncomix to the techniques employed by traditional differential expression analysis (e.g. Student’s t-test, as employed by limma, or DESeq2). These existing approaches make strong assumptions – namely, that the data from a particular group are well-described by distributions with mass concentrated around a central value (such as a ‘mean’). If we were to use one of these approaches on a large dataset, our assumption would be that oncogenes are overexpressed in every tumor sample compared to normal tissue. This assumption can be visualized below:

oncoMixTraditionalDE()

The red-orange curve represents the gene expression values from the adjacent normal data, and the teal curve represents the gene expression values from the tumor data. Note, however, that the goal for a DE analysis using existing methods (such as limma) would be to find genes that maximize the difference between these two curves. This approach does not represent our knowledge of how oncogenes are expressed in a population of individuals – that is, highly expressed in a subset of patient tumors, and lowly expressed in adjacent normal tissue.

2. Identifying Oncogene Candidates

2.1 Loading Example Data and Exploring the mixModelParams Object

Now, we will load an example dataset that contains expression values for 700 mRNA isoforms obtained from paired samples of breast tumor (exprTumIsof) and adjacent normal(exprNmlIsof) breast tissue from 113 patients in the Genomic Data Commons. We will fit the mixture model using the oncomix function mixModelParams, which takes dataframes that contain patients as rows and mRNA isoforms/genes as columns. The number of columns (genes) should be the same between both dataframes, though the number of rows can vary.

data(exprNmlIsof, exprTumIsof, package="oncomix")

##look at the matrix of mRNA expression data from adjacent normal samples
dim(exprNmlIsof)
## [1] 700 113
exprNmlIsof[1:5, 1:5] 
##            TCGA-A7-A0CE-11A TCGA-A7-A0CH-11A TCGA-A7-A0D9-11A TCGA-A7-A0DB-11A
## uc001qoa.2         1.972887        2.4877508         1.534429         2.787777
## uc002jyg.1         1.488236        0.9886728         3.363712         3.301014
## uc011lsc.1         5.939032        6.7117641         5.210523         5.664385
## uc004cpd.1         1.816706        1.6479322         1.295284         1.537309
## uc011mgy.1         1.573477        1.3371375         2.027953         2.070035
##            TCGA-A7-A0DC-11A
## uc001qoa.2         1.821897
## uc002jyg.1         3.127924
## uc011lsc.1         5.433401
## uc004cpd.1         1.533831
## uc011mgy.1         2.431077
##look at the matrix of mRNA expression data from tumors
dim(exprTumIsof)
## [1] 700 113
exprTumIsof[1:5, 1:5] 
##            TCGA-A7-A0CE-01A TCGA-A7-A0CH-01A TCGA-A7-A0D9-01A TCGA-A7-A0DB-01A
## uc001qoa.2        1.2399803         1.780444         1.704665         2.390934
## uc002jyg.1        3.1547083         3.503748         2.432818         2.638793
## uc011lsc.1        5.2938539         5.622610         9.029952         6.077302
## uc004cpd.1        0.8499161         1.803169         2.518515         1.735324
## uc011mgy.1        0.6849480         2.081580         3.343087         1.546985
##            TCGA-A7-A0DC-01A
## uc001qoa.2         2.310899
## uc002jyg.1         1.893164
## uc011lsc.1         7.636221
## uc004cpd.1         1.882874
## uc011mgy.1         2.415372
##fits the mixture models, will take a few minutes
mmParams <- mixModelParams(exprNmlIsof, exprTumIsof) 
head(mmParams)
##                   nMu1       nMu2        nVar      nPi1      tMu1     tMu2
## uc002jxc.2 1.250068025 1.83147135 0.116400229 0.7967701 1.7371897 4.285259
## uc001jrg.2 0.006952803 0.01324544 0.005531848 0.5003037 0.1113493 1.625321
## uc003aab.2 0.985454390 1.59242262 0.087878646 0.4896663 2.0035574 4.522943
## uc004cmb.2 0.385483412 0.87309050 0.046609881 0.6731112 1.1691973 3.389940
## uc002jfu.2 0.382007855 0.45789679 0.558312339 0.5189438 0.5591093 2.645615
## uc001lcs.1 0.600357282 1.45153880 0.178300764 0.6373414 0.3463322 2.934114
##                 tVar      tPi1 deltaMu2   deltaMu1        SI     score
## uc002jxc.2 0.4368186 0.5943046 2.453788  0.4871217 1.0000000 1.4134475
## uc001jrg.2 0.1074420 0.7913964 1.612076  0.1043965 1.0000000 1.3947057
## uc003aab.2 0.6200912 0.7359520 2.930521  1.0181030 1.0000000 1.2044477
## uc004cmb.2 0.5000264 0.7992164 2.516849  0.7837139 1.0000000 1.1864992
## uc002jfu.2 0.3309116 0.6385413 2.187718  0.1771014 0.9911504 1.1114688
## uc001lcs.1 0.5330532 0.5976110 1.482575 -0.2540251 0.8849558 0.9072978

The object returned by mixModelParams is a dataframe with rows corresponding to genes and 12 columns containing mixture model parameters. The rows are sorted according to the score column, with the first row containing the highest oncomix score (defined below). The meaning of the dataframe columns are described below:

  • nMu1 = the mean (\(\mu\)) of the Gaussian curve with the smaller mean fit to the adjacent normal expression data (referred to as Mode 1).

  • nMu2 = the mean (\(\mu\)) of the Gaussian curve with the larger mean fit to the adjacent normal expression data (referred to as Mode 2).

  • nVar = the variance (\(\sigma\)) of the two Gaussian curves fit to the adjacent normal expression data (fixed to be equal between the two curves)

  • nPi1 = the proportion of adjacent normal samples assigned to the Gaussian curve with mean nMu1

  • tMu1 = the mean (\(\mu\)) of the Gaussian curve with the smaller mean fit to the tumor expression data (referred to as Mode 1).

  • tMu2 = the mean (\(\mu\)) of the Gaussian curve with the larger mean fit to the tumor expression data (referred to as Mode 2).

  • tVar = the variance (\(\sigma\)) of the two Gaussian curves fit to the tumor expression data (fixed to be equal between the two curves).

  • tPi1 = the proportion of tumor samples assigned to the Gaussian curve with mean tMu1.

  • deltaMu2 = the difference between the means of the two curves between groups. tMu2 - nMu2. May be negative or positive.

  • deltaMu1 = the difference between the means of the two curves between groups. tMu1 - nMu1. May be negative or positive.

  • SI = the selectivity index, or the proportion of adjacent normal samples with expression values less than the boundary defined by tMu2 - tMu1. The selectivity index for the ith gene is computed as:

\[SI_i = \frac{1}{N}\sum_{j=1}^N \Bigg\{ \begin{array}{ll} 1,~ if~x_{ij} < \frac{\mu_{iLT}+\mu_{iHT}}{2} \\ 0, ~ otherwise \end{array}, \]

  • where \(N\) is the number of adjacent normal samples, and \(x_{ij}\) is the expression value of the ith gene in the jth adjacent normal sample. \(\mu_{iHT}\) is the mean of higher/larger Gaussian from the ith gene in tumor samples, and \(\mu_{iLT}\) is the mean of the smaller/lower Gaussian from the ith gene in the tumor samples.

  • score = The score for the \(i^{th}\) gene is calculated as follows: \[score_i = SI * [(\Delta\mu_{2i} - \Delta\mu_{1i}) - (var_{Ni} - var_{Ti})] ~~~~~, \]

  • where \(SI\) is the selectivity index described above, \(\Delta\mu_{2i}\) and \(\Delta\mu_{1i}\) are as described above (equivalent to deltaMu2 and deltaMu1), \(var_{Ni}\) is the common variance across the adjacent normal samples (equivalent to n.var), and \(var_{Ti}\) is the common variance across the tumor samples (equivalent to tVar).

2.2 Selecting Genes that Appear Most Like the Idealized Oncogene

We can now use the mmParams dataframe to figure out which of our isoforms in our gene set are most similar in terms of their distribution to our ideal oncogene candidate.

For example, lets say that we wanted to select a subset of gene isoforms that most resembled the theoretically ideal oncogene. We can capture all of the genes meeting or exceeding specified thresholds using the topGeneQuant function. Here, we want to maximize deltMu2, minimize deltMu1, and maximize the SI.

topGeneQuant <- topGeneQuants(mmParams, deltMu2Thr=99, 
    deltMu1Thr=10, siThr=.99)
print(topGeneQuant)
##                 nMu1      nMu2       nVar      nPi1     tMu1     tMu2      tVar
## uc002jxc.2 1.2500680 1.8314714 0.11640023 0.7967701 1.737190 4.285259 0.4368186
## uc004cmb.2 0.3854834 0.8730905 0.04660988 0.6731112 1.169197 3.389940 0.5000264
##                 tPi1 deltaMu2  deltaMu1 SI    score
## uc002jxc.2 0.5943046 2.453788 0.4871217  1 1.413447
## uc004cmb.2 0.7992164 2.516849 0.7837139  1 1.186499

The results of the topGeneQuants function is to return a subsetted dataframe containing only those isoforms that met or exceeded the specified threshold. Here, there were 2 isoforms.

We can also select the top \(N\) genes that most closely resemble the oncogene candidate by simply selecting the first \(N\) rows from the mmParams object (e.g. mmParams[1:N,]). This is because the mixModelParams function returns a dataframe sorted by the score, with the 1st row containing the isoform with the highest score. There is an explanation of how the score is calculated above.

mmParamsTop10 <- mmParams[1:10,]
print(mmParamsTop10)
##                   nMu1       nMu2        nVar      nPi1       tMu1      tMu2
## uc002jxc.2 1.250068025 1.83147135 0.116400229 0.7967701  1.7371897  4.285259
## uc001jrg.2 0.006952803 0.01324544 0.005531848 0.5003037  0.1113493  1.625321
## uc003aab.2 0.985454390 1.59242262 0.087878646 0.4896663  2.0035574  4.522943
## uc004cmb.2 0.385483412 0.87309050 0.046609881 0.6731112  1.1691973  3.389940
## uc002jfu.2 0.382007855 0.45789679 0.558312339 0.5189438  0.5591093  2.645615
## uc001lcs.1 0.600357282 1.45153880 0.178300764 0.6373414  0.3463322  2.934114
## uc010tgs.1 0.893283005 1.05416810 0.179040709 0.5385148  1.3478455  2.959714
## uc003yqc.1 4.208205968 4.27383134 0.051540764 0.5289734  4.7150533  5.678098
## uc002dwc.2 9.293023652 9.41161776 0.183687289 0.4974137 10.1034656 11.302811
## uc001guh.2 1.430670482 1.92416525 0.110274426 0.4216897  1.9201509  3.423634
##                 tVar      tPi1 deltaMu2   deltaMu1        SI     score
## uc002jxc.2 0.4368186 0.5943046 2.453788  0.4871217 1.0000000 1.4134475
## uc001jrg.2 0.1074420 0.7913964 1.612076  0.1043965 1.0000000 1.3947057
## uc003aab.2 0.6200912 0.7359520 2.930521  1.0181030 1.0000000 1.2044477
## uc004cmb.2 0.5000264 0.7992164 2.516849  0.7837139 1.0000000 1.1864992
## uc002jfu.2 0.3309116 0.6385413 2.187718  0.1771014 0.9911504 1.1114688
## uc001lcs.1 0.5330532 0.5976110 1.482575 -0.2540251 0.8849558 0.9072978
## uc010tgs.1 0.4029768 0.7233166 1.905546  0.4545625 0.9911504 0.8612763
## uc003yqc.1 0.1911460 0.6652337 1.404266  0.5068473 1.0000000 0.6547324
## uc002dwc.2 0.3124464 0.7678767 1.891193  0.8104420 0.9911504 0.5794439
## uc001guh.2 0.3555573 0.7444933 1.499468  0.4894804 0.9911504 0.5393407

3. Visualize the Output

3.1 Visualize Isoforms with a High SI & Oncomix Score

Now, we will visualize the distribution of gene expression values for a particular isoform. Specifically, we will create a overlapping histogram of an single isoform’s expression values across both tumor (teal) and adjacent normal (red) samples with the best-fitting Gaussian curves superimposed. The isoform that we want to visualize here, uc002jxc.2, is one of the isoforms in the output from topGeneQuant function. It also ranks highly in terms of its oncomix score, so it should have a distributional profile that is more similar to our theoretical oncogene that most other isoforms in this dataset.

isof = "uc002jxc.2"
plotGeneHist(mmParams, exprNmlIsof, exprTumIsof, isof)

Next, we will create a scatterplot with the axes corresponding to the differences between component means. Our oncogene candidates will be those genes that appear in the upper right quadrant of this scatterplot. The x axis corresponds to the difference between the means of the curves with the larger Gaussians (deltaMu2), and the y axis corresponds to the difference between the means of the curves with the smaller Gaussians (deltaMu1) between the two treatments.

Here, \(\alpha\) (in the title) is a term that is present in the denominator of the value of the y-axis and functions as an automatic scaling parameter to set the range of the y-axis to be approximately equal to the range of the x-axis.

scatterMixPlot(mmParams)