Edited on September 28, 2020; Compiled on September 29, 2020
As the field of precision medicine progresses, we start to tailor treatments for cancer patients classified not only by their clinical, but also by their molecular features. The emerging cancer subtypes defined by these features require dedicated resources to assist the discovery of drug candidates for preclinical evaluation. Voluminous cancer patient gene expression profiles have been accumulated in public databases, enabling the creation of a cancer-specific expression signature. Meanwhile, large-scale gene expression profiles of chemical compounds became available in recent years. By matching the cancer-specific expression signature to compound-induced gene expression profiles from large drug libraries, researchers can prioritize small molecules that present high potency to reverse expression of signature genes for further experimental testing of their efficacy. This approach has proven to be an efficient and cost-effective way to identify efficacious drug candidates. However, the success of this approach requires multiscale procedures, imposing significant challenges to many labs. Therefore, we present OCTAD (http://octad.org): an open workspace for virtually screening compounds targeting precise cancer patient groups using gene expression features. We have included 19,127 patient tissue samples covering 50 cancer types, and expression profiles for 12,442 distinct compounds. We will continue to include more tissue samples. We streamline all the procedures including deep-learning based reference tissue selection, disease gene expression signature creation, drug reversal potency scoring, and in silico validation. We release OCTAD as a web portal and a standalone R package to allow experimental and computational scientists to easily navigate the tool. The code and data can also be employed to answer various biological questions.
We use Hepatocellular Carcinoma (HCC) to illustrate the utility of
the desktop pipeline. We provide the code and data for investigating
differential expression, pathway enrichment, drug prediction and hit
selection, and in silico validation. In this workflow, we will select
case tissue samples from our compiled TCGA data and compute control
tissue samples from the GTEx data. Note that the compiled data contains
adjacent normal samples which can also serve as control tissues. By
default, the octad package uses Small OCTAD dataset containing
expression values only for LINCS 978 landmark genes required for sRGES
score computation. To download the full expression values, please refer
to the link octad.counts.and.tpm.h5
(~3G). We recommend to use the full expression matrix. By default,
computated results are stored in the temporary directory. ## Select case
samples Choosing cases (tumor samples from the phenoDF data.frame) and
controls (corresponding samples treated as background such as normal
tissue, adjacent normal tissue or tumor tissue samples without a
specific mutation) is critical to achieving the best results. Several
methods are included in the provided code which demonstrates multiple
control sample selection methods. There are no built-in validation steps
to evaluate case samples. Visualization of cases in a t-SNE
(t-Distributed Stochastic Neighbor Embedding) plot could help understand
their relations with other OCTAD samples. Samples sharing similar
transcriptomic profiles tend to cluster together in the t-SNE plot. The
cases scattering in multiple clusters are not recommended to choose as a
group. Phenotype data frame phenoDF
contains tissue types
including normal tissue, adjacent normal tissue, primary cancer,
recurrent cancer, and metastatic cancer.
To list all available samples from the octad database, use the phenoDF data frame. To select HCC samples, subset the phenoDF:
#select data
library(octad)
## Loading required package: magrittr
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: edgeR
## Loading required package: limma
## Loading required package: RUVSeq
## Loading required package: Biobase
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:limma':
##
## plotMA
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: EDASeq
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: Rsamtools
## Loading required package: GenomicRanges
##
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
##
## subtract
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
##
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
##
## last
##
## Attaching package: 'ShortRead'
## The following object is masked from 'package:dplyr':
##
## id
## The following object is masked from 'package:magrittr':
##
## functions
## Loading required package: DESeq2
## Loading required package: rhdf5
## Loading required package: foreach
## Loading required package: Rfast
## Loading required package: Rcpp
## Loading required package: RcppZiggurat
## Loading required package: RcppParallel
##
## Attaching package: 'RcppParallel'
## The following object is masked from 'package:Rcpp':
##
## LdFlags
##
## Rfast: 2.1.0
## ___ __ __ __ __ __ __ __ __ __ _ _ __ __ __ __ __ __ __ __ __ __ __
## | __ __ __ __ | | __ __ __ __ _/ / \ | __ __ __ __ / /__ __ _ _ __ __\
## | | | | | | / _ \ | | / /
## | | | | | | / / \ \ | | / /
## | | | | | | / / \ \ | | / /
## | |__ __ __ __| | | |__ __ __ __ / / \ \ | |__ __ __ __ _ / /__/\
## | __ __ __ __| | __ __ __ __| / /__ _ __\ \ |_ __ __ __ _ | / ___ /
## | \ | | / _ _ _ _ _ _ \ | | \/ / /
## | |\ \ | | / / \ \ | | / /
## | | \ \ | | / / \ \ | | / /
## | | \ \ | | / / \ \ | | / /
## | | \ \__ __ _ | | / / \ \ _ __ __ __ _| | / /
## |_| \__ __ __\ |_| /_/ \_\ /_ __ __ __ ___| \/ team
##
## Attaching package: 'Rfast'
## The following objects are masked from 'package:MatrixGenerics':
##
## colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
## rowMaxs, rowMedians, rowMins, rowRanks, rowVars
## The following objects are masked from 'package:matrixStats':
##
## colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
## rowMaxs, rowMedians, rowMins, rowRanks, rowVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
## The following object is masked from 'package:dplyr':
##
## nth
## Loading required package: octad.db
## Loading required package: ExperimentHub
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
## Welcome to octad.db package. This is a database package, pipeline available via the octad package. If you want to run the pipeline on the webserver, please refer to octad.org
## Loading required package: httr
##
## Attaching package: 'httr'
## The following object is masked from 'package:Biobase':
##
## content
## Loading required package: qpdf
phenoDF=get_ExperimentHub_data("EH7274") #load data.frame with samples included in the OCTAD database.
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
head(phenoDF) #list all data included within the package
## sample.id sample.type biopsy.site cancer
## 1 GTEX-1117F-0226-SM-5GZZ7 normal ADIPOSE - SUBCUTANEOUS normal
## 2 GTEX-1117F-0426-SM-5EGHI normal MUSCLE - SKELETAL normal
## 3 GTEX-1117F-0526-SM-5EGHJ normal ARTERY - TIBIAL normal
## 4 GTEX-1117F-0626-SM-5N9CS normal ARTERY - CORONARY normal
## 5 GTEX-1117F-0726-SM-5GIEN normal HEART - ATRIAL APPENDAGE normal
## 6 GTEX-1117F-1326-SM-5EGHH normal ADIPOSE - VISCERAL (OMENTUM) normal
## data.source gender read.count.file age_in_year metastatic_site tumor_grade
## 1 GTEX Female TOIL.RData <NA> <NA> <NA>
## 2 GTEX Female TOIL.RData <NA> <NA> <NA>
## 3 GTEX Female TOIL.RData <NA> <NA> <NA>
## 4 GTEX Female TOIL.RData <NA> <NA> <NA>
## 5 GTEX Female TOIL.RData <NA> <NA> <NA>
## 6 GTEX Female TOIL.RData <NA> <NA> <NA>
## tumor_stage gain_list loss_list mutation_list subtype
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'&sample.type == 'primary') #select data
case_id=HCC_primary$sample.id #select cases
The sample ids will be stored in the character vector case_id. The code can be easily modified to select for other cancers or a set of samples based on mutations and copy numbers (e.g., TP53 mutation or MYC amplification). It is also recommended to use the R package cgdsr to select TCGA samples based on more clinical and molecular features.
Use the function computeRefTissue
to compute appropriate
normal tissues via comparing gene expression features between case
samples and normal tissue samples. Users can select adjacent normal
tissue samples if available. By default, features from the precomputed
AutoEncoder
file are used, but other features such as top
varying genes across samples can be employed as well. Pairwise Spearman
correlation is computed between every case sample and every normal
sample using these features. For each normal sample, its median
correlation with all case samples is then computed. Top correlated
normal samples (defined by control_size) are then selected as
control.
#computing top 50 reference tissues
control_id=computeRefTissue(case_id,output=FALSE,adjacent=TRUE,source = "octad",control_size = 50)
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
#please note, if \code{output = TRUE}, \code{outputFolder} variable must be specified, otherwise it will be written to \code{tempdir()}
# use adjacent normal tissue samples as control_id allow you to avoid running this function
There is also an availability to select control samples by hand:
#computing top 50 reference tissues
control_id=subset(phenoDF,biopsy.site=='LIVER'&sample.type=='normal')$sample.id[1:50] #select first 50 samples from healthy liver
# use adjacent normal tissue samples as control_id allow you to avoid running this function
The relationships among case, control and other samples can be visualised through a t-SNE matrix precomputed based on the features derived from autoencoder.
tsne=get_ExperimentHub_data("EH7276") #Download file with tsneresults for all samples in the octad.db once. After this it will be cached and no additional download required.
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
tsne$type <- "others"
tsne$type[tsne$sample.id %in% case_id] <- "case"
tsne$type[tsne$sample.id %in% control_id] <- "control"
#plot
p2 <- ggplot(tsne, aes(X, Y, color = type)) + geom_point(alpha = 0.4)+
labs(title = paste ('TNSE PLOT'), x= 'TSNE Dim1', y='TSNE Dim2', caption="OCTAD")+
theme_bw()
p2