This document provides an introduction to the TCGAbiolinksGUI.data package, which contains supporting data for TCGAbiolinksGUI (Silva et al. 2017).
This package contains the following objects:
The code below accesses the NCI’s Genomic Data Commons (GDC) and gets the list of available datasets for The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects.
# Defining parameters
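getGDCdisease <- function(){
  # Query the GDC for all projects and drop the FM-AD project
  projects <- TCGAbiolinks:::getGDCprojects()
  projects <- projects[projects$id != "FM-AD",]
  disease <- projects$project_id
  # Label each project ID as "<disease type> (<project ID>)"
  idx <- grep("disease_type",colnames(projects))
  names(disease) <- paste0(projects[[idx]], " (",disease,")")
  # Order the entries alphabetically by label
  disease <- disease[sort(names(disease))]
  return(disease)
}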
This data is saved in the GDCdisease object.
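To show the shape of the resulting vector without querying the GDC, the labelling and sorting steps can be sketched on a small mock table. The `mock_projects` data frame and its values are invented for illustration; only the column names (`id`, `project_id`, `disease_type`) follow the function above.

```r
# Mock stand-in for the table returned by getGDCprojects() (illustrative values only)
mock_projects <- data.frame(
  id           = c("TCGA-LGG", "FM-AD", "TCGA-GBM"),
  project_id   = c("TCGA-LGG", "FM-AD", "TCGA-GBM"),
  disease_type = c("Lower grade glioma", "Adenomas", "Glioblastoma"),
  stringsAsFactors = FALSE
)

# Same steps as getGDCdisease(): drop FM-AD, label each project, sort by label
mock_projects <- mock_projects[mock_projects$id != "FM-AD", ]
disease <- mock_projects$project_id
names(disease) <- paste0(mock_projects$disease_type, " (", disease, ")")
disease <- disease[sort(names(disease))]

print(disease)
```

The result is a named character vector: the names are human-readable labels and the values are the project IDs.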
The code below downloads a manifest of open TCGA MAF files available in the NCI’s Genomic Data Commons (GDC).
getMafTumors <- function(){
  # Manifest of the open-access MAF files hosted by the GDC
  maf <- data.table::fread("https://gdc-docs.nci.nih.gov/Data/Release_Notes/Manifests/GDC_open_MAFs_manifest.txt",
                           data.table = FALSE, verbose = FALSE, showProgress = FALSE)
  # The tumor code is the second dot-separated field of each file name
  tumor <- unlist(lapply(maf$filename, function(x){unlist(stringr::str_split(x,"\\."))[2]}))
  proj <- TCGAbiolinks:::getGDCprojects()
  disease <- gsub("TCGA-","",proj$project_id)
  idx <- grep("disease_type",colnames(proj))
  names(disease) <- paste0(proj[[idx]], " (",proj$project_id,")")
  disease <- sort(disease)
  # Keep only the projects that have a MAF file in the manifest
  ret <- disease[disease %in% tumor]
  return(ret)
}
This data is saved in the maf.tumor object.
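The file-name parsing and filtering in `getMafTumors` can be sketched offline as follows. The file names and the `disease` vector are hypothetical, chosen only to have the tumor code as the second dot-separated field, and base `strsplit` is used here in place of `stringr::str_split` to keep the sketch dependency-free.

```r
# Hypothetical file names with the tumor code as the second dot-separated field
filenames <- c("TCGA.BRCA.mutect.example.maf.gz",
               "TCGA.LGG.muse.example.maf.gz")

# Split on literal dots and keep the second field of each name
tumor <- vapply(strsplit(filenames, ".", fixed = TRUE),
                function(parts) parts[2], character(1))

# Hypothetical labelled tumor codes, in the same shape as `disease` above
disease <- c("Breast (TCGA-BRCA)"            = "BRCA",
             "Glioblastoma (TCGA-GBM)"       = "GBM",
             "Lower grade glioma (TCGA-LGG)" = "LGG")

# Keep only the tumor types that have at least one MAF file
ret <- disease[disease %in% tumor]
print(ret)
```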
Based on the data from the article “Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma” (www.cell.com/cell/abstract/S0092-8674(15)01692-X) (Ceccarelli et al. 2016), we created a model to predict glioma classes based on DNA methylation signatures.
First, we will load the required libraries, control random number generation by specifying a seed and register the number of cores for parallel evaluation.
library(readr)
library(readxl)
library(dplyr)
library(caret)
library(randomForest)
library(doMC)
library(e1071)
# Control random number generation
set.seed(210) # set a seed to RNG
# register number of cores to be used for parallel evaluation
registerDoMC(cores = parallel::detectCores())
The next steps will download the data from the article: the DNA methylation matrix for the glioma samples, the sample metadata, and the DNA methylation signatures.
file <- "https://tcga-data.nci.nih.gov/docs/publications/lgggbm_2016/LGG.GBM.meth.txt"
if(!file.exists(basename(file))) downloader::download(file,basename(file))
LGG.GBM <- as.data.frame(readr::read_tsv(basename(file)))
rownames(LGG.GBM) <- LGG.GBM$Composite.Element.REF
idx <- grep("TCGA",colnames(LGG.GBM))
colnames(LGG.GBM)[idx] <- substr(colnames(LGG.GBM)[idx], 1, 12) # reduce complete barcode to sample identifier (first 12 characters)
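As a small illustration of the barcode truncation, the sketch below applies the same `substr` call to two made-up barcodes. It also checks for duplicates, since shortening barcodes could in principle map several columns to the same identifier; whether that happens in this dataset is not verified here.

```r
# Made-up full-length barcodes, for illustration only
barcodes <- c("TCGA-02-0001-01C-01D-0186-05",
              "TCGA-06-0125-01A-01D-A45W-05")

# Keep the first 12 characters, which match the `Case` column of the metadata
cases <- substr(barcodes, 1, 12)
print(cases)

# anyDuplicated() returns 0 when every shortened identifier is unique
anyDuplicated(cases)
```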
We will get the metadata with the molecular subtypes of the samples from the paper (www.cell.com/cell/abstract/S0092-8674(15)01692-X) (Ceccarelli et al. 2016).
file <- "http://www.cell.com/cms/attachment/2045372863/2056783242/mmc2.xlsx"
if(!file.exists(basename(file))) downloader::download(file,basename(file))
metadata <- readxl::read_excel(basename(file), sheet = "S1A. TCGA discovery dataset", skip = 1)
DT::datatable(
metadata[,c("Case",
"Pan-Glioma DNA Methylation Cluster",
"Supervised DNA Methylation Cluster",
"IDH-specific DNA Methylation Cluster")]
)
Probe metadata is downloaded from http://zwdzwd.io/InfiniumAnnotation. It will be used to remove probes that should be masked from the training.
file <- "http://zwdzwd.io/InfiniumAnnotation/current/EPIC/EPIC.manifest.hg38.rda"
if(!file.exists(basename(file))) downloader::download(file,basename(file))
load(basename(file))
Retrieve probe signatures from the paper.
file <- "https://tcga-data.nci.nih.gov/docs/publications/lgggbm_2015/PanGlioma_MethylationSignatures.xlsx"
if(!file.exists(basename(file))) downloader::download(file,basename(file))
With the data and metadata available, we will create one model for each signature. The code below selects the DNA methylation values for a given set of signatures (probes) and uses the classification of each sample to create a random forest model. Each model is described in the next subsections.
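As a rough orientation for the modelling step, the following is a minimal sketch of fitting a random forest with `caret` on toy data. The simulated matrix, the three class labels, and the tuning values (`mtry = 2`, `ntree = 100`, 5-fold cross-validation) are assumptions chosen to keep the example small; they are not the settings used for the models in this package.

```r
library(caret)
library(randomForest)

set.seed(210)

# Toy data: 60 samples in rows, 5 probes in columns (simulated beta values)
n <- 60
x <- as.data.frame(matrix(runif(n * 5), nrow = n,
                          dimnames = list(paste0("sample", 1:n),
                                          paste0("cg", 1:5))))
y <- factor(rep(c("LGm1", "LGm2", "LGm3"), each = 20))

# Shift one probe by class so that the toy groups are separable
x$cg1 <- x$cg1 + as.integer(y)

# Fit a random forest with 5-fold cross-validation (illustrative settings)
fit <- train(x = x, y = y,
             method = "rf",
             ntree = 100,
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "cv", number = 5))

# Predict the class of each sample
pred <- predict(fit, newdata = x)
table(pred, y)
```

Samples are in rows and probes in columns, which is the orientation `train` expects for its `x` argument.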
Parameters | Values |
---|---|
trainingset | whole TCGA panglioma cohort |
probes signature | 1,300 pan-glioma tumor specific |
groups to be classified | LGm1, LGm2, LGm3, LGm4, LGm5, LGm6 |
metadata column | Pan-Glioma DNA Methylation Cluster |
We will start by preparing the training data. We will select the probe signatures for the group classification from the Excel file (in this case, the 1,300 probes) and the samples that belong to the groups for which we want to create our model (in this case, “LGm1”, “LGm2”, “LGm3”, “LGm4”, “LGm5”, and “LGm6”).
sheet <- "1,300 pan-glioma tumor specific"
trainingset <- grep("LGm",unique(metadata$`Pan-Glioma DNA Methylation Cluster`),value = TRUE)
trainingcol <- c("Pan-Glioma DNA Methylation Cluster")
The DNA methylation matrix will be subset to the DNA methylation signatures and samples with classification.
plat <- "EPIC"
signature.probes <- read_excel("PanGlioma_MethylationSignatures.xlsx", sheet = sheet) %>% pull(1)
samples <- dplyr::filter(metadata, `Pan-Glioma DNA Methylation Cluster` %in% trainingset)
RFtrain <- LGG.GBM[signature.probes, colnames(LGG.GBM) %in% as.character(samples$Case)] %>% na.omit
Probes that should be masked will be removed.
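The masking step can be sketched on mock data as shown below. The `mock_manifest` data frame and its logical `mask` column are hypothetical stand-ins; the actual object loaded from the annotation file has its own structure and column names, which should be checked before adapting this.

```r
# Mock DNA methylation matrix: probes in rows, samples in columns
beta <- matrix(c(0.1, 0.9, 0.5, 0.4, 0.8, 0.2), nrow = 3,
               dimnames = list(c("cg01", "cg02", "cg03"),
                               c("TCGA-02-0001", "TCGA-06-0125")))

# Hypothetical probe annotation with a logical flag marking probes to mask
mock_manifest <- data.frame(probe = c("cg01", "cg02", "cg03"),
                            mask  = c(FALSE, TRUE, FALSE),
                            stringsAsFactors = FALSE)

# Drop every row whose probe is flagged in the annotation
masked.probes <- mock_manifest$probe[mock_manifest$mask]
beta.kept <- beta[!rownames(beta) %in% masked.probes, , drop = FALSE]
print(beta.kept)
```

Using `drop = FALSE` keeps the result a matrix even if only one probe remains.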
We will merge the samples with their classification. In the end, we will have samples in the row, and prob