1 Introduction

1.1 The Comparative Toxicogenomics Database

The Comparative Toxicogenomics Database (CTDbase; http://ctdbase.org) is a public resource for toxicogenomic information manually curated from the peer-reviewed scientific literature. It provides key information about the interactions of environmental chemicals with gene products and their effect on human disease [1][2]. CTDbase is offered to the public through a web-based interface that includes basic and advanced query options to access data for sequences, references, and toxic agents, as well as a platform for analysing sequences.

1.2 CTDquerier R package

CTDquerier is an R package that allows R users to download basic data about genes, chemicals and diseases from CTDbase. Once the user's input is validated, the package queries CTDbase and downloads the information related to that input from the other modules.

CTDquerier can be installed from Bioconductor using BiocManager. To install CTDquerier, run the following commands in an R session:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("CTDquerier")

Once installed, CTDquerier is loaded by running the following command:

library( CTDquerier )

CTDquerier has three main functions, one for each type of input: genes, chemicals or diseases. Table 1 indicates the function to use to query CTDbase depending on the input.

Table 1: Main functions of CTDquerier, designed to accept a specific input
Input      Function
Genes      query_ctd_gene
Chemicals  query_ctd_chem
Diseases   query_ctd_dise
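
Table 1 can be read as a lookup from input type to query function. The sketch below writes that lookup in base R. The helper pick_query_function is hypothetical and is not part of CTDquerier; only the three function names are taken from Table 1.

```r
# Lookup from input type to the CTDquerier query function named in Table 1.
query_functions <- c(
    genes     = "query_ctd_gene",
    chemicals = "query_ctd_chem",
    diseases  = "query_ctd_dise"
)

# Hypothetical helper (not part of CTDquerier): returns the name of the
# query function for a given input type and stops on an unknown type.
pick_query_function <- function( input ) {
    input <- match.arg( tolower( input ), names( query_functions ) )
    unname( query_functions[ input ] )
}

pick_query_function( "Genes" )
```

With CTDquerier loaded, the returned name could then be passed to do.call together with a list holding the terms argument.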

The query functions rely on a set of functions that download the specific vocabulary of each input. Table 2 shows the functions used to download the specific vocabulary and to load it into R. This process is transparent to the user, since it is encapsulated in each of the query functions.

Table 2: Functions used to download and load specific vocabulary from CTDbase
Input      Load Function  Download Function
Genes      load_ctd_gene  download_ctd_genes
Chemicals  load_ctd_chem  download_ctd_chem
Diseases   load_ctd_dise  download_ctd_dise

The three main functions of CTDquerier return CTDdata objects. These objects can be used to plot the information available in CTDbase with plot. Moreover, the information from CTDbase can be extracted as DataFrame objects using the method get_table. Both plot and get_table need an argument index_name that indicates the table to be plotted or extracted. Table 3 shows the possible values of index_name depending on the query performed, together with the possible representations of each table.

Table 3: Accessors and representations of each table in a CTDdata object, depending on the input (- marks a table that is not available for that input)
Accessor                Genes             Chemicals         Diseases
gene interactions       -                 heat-map/network  heat-map
chemical interactions   heat-map          -                 heat-map
diseases                heat-map          heat-map          -
gene-gene interactions  heat-map/network  -                 -
kegg pathways           network           heat-map          network
go terms                network           heat-map          -

2 Querying CTDbase …

2.1 … by gene

To query CTDbase for a given gene or set of genes, we use the function query_ctd_gene:

args( query_ctd_gene )
## function (terms, verbose = FALSE) 
## NULL

The argument terms must be filled with the list of genes of interest. The argument verbose shows relevant messages about the querying process when set to TRUE. As the signature above shows, query_ctd_gene takes no filename or mode argument: the table with the specific vocabulary from CTDbase for genes is kept in a local cache, and the function uses the local version when it already exists.

A typical gene-query follows:

ctd_genes <- query_ctd_gene( 
    terms = c( "APOE", "APOEB", "APOE2", "APOE3" , "APOE4", "APOA1", "APOA5" ) )
## Warning in .get_cache(): /home/biocbuild/.cache/CTDQuery
## Using temporary cache /tmp/Rtmp9H9Jhb/BiocFileCache
## Warning in .get_cache(): /home/biocbuild/.cache/CTDQuery
## Using temporary cache /tmp/Rtmp9H9Jhb/BiocFileCache
## 1/tmp/Rtmp9H9Jhb/BiocFileCache/3e5a751f1027_CTD_genes.tsv.gz
## Warning in load_ctd_gene(): 1/tmp/Rtmp9H9Jhb/BiocFileCache/
## 3e5a751f1027_CTD_genes.tsv.gz
## 1/tmp/Rtmp9H9Jhb/BiocFileCache/3e5a751f1027_CTD_genes.tsv.gz
## Warning in load_ctd_gene(): 1/tmp/Rtmp9H9Jhb/BiocFileCache/
## 3e5a751f1027_CTD_genes.tsv.gz
## Warning in query_ctd_gene(terms = c("APOE", "APOEB", "APOE2", "APOE3",
## "APOE4", : 2/7 terms were dropped.
ctd_genes
## Object of class 'CTDdata'
## -------------------------
##  . Type: GENE 
##  . Length: 5 
##  . Items: APOE, ..., APOA5 
##  . Diseases: 2202 ( 5022 / 5627 )
##  . Gene-gene interactions: 173 ( 209 )
##  . Gene-chemical interactions: 592 ( 1487 )
##  . KEGG pathways: 59 ( 59 )
##  . GO terms: 321 ( 323 )

As can be seen, query_ctd_gene informs about the number of terms used in the query and the number of terms lost in the process. To know the exact terms that were found in CTDbase and the ones that were lost, we use the method get_terms.

get_terms( ctd_genes )
## $found
## [1] "APOE"  "APOEB" "APOE2" "APOA1" "APOA5"
## 
## $lost
## [1] "APOE3" "APOE4"
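
The found and lost vectors can be used to summarise how much of the input was recognised. The sketch below works on a plain list with the same shape as the output above, so it runs without querying CTDbase; the variable names are illustrative only.

```r
# Toy list with the same shape as the get_terms() output shown above.
terms <- list(
    found = c( "APOE", "APOEB", "APOE2", "APOA1", "APOA5" ),
    lost  = c( "APOE3", "APOE4" )
)

# Fraction of the requested terms that were found in CTDbase.
n_total  <- length( terms$found ) + length( terms$lost )
hit_rate <- length( terms$found ) / n_total

# Report the lost terms so they can be checked for typos or synonyms.
if( length( terms$lost ) > 0 ) {
    message( length( terms$lost ), "/", n_total, " terms lost: ",
        paste( terms$lost, collapse = ", " ) )
}
```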

2.1.1 Extract Tables

Now that the information about the genes of interest has been downloaded from CTDbase, we can access it with the method get_table. This method gives access to different tables depending on the origin of the object. For a CTDdata object created from genes, the accessible tables are:

Table                     Available  Accessor
Gene Interactions         NO         "gene interactions"
Chemicals Interactions    YES        "chemical interactions"
Diseases                  YES        "diseases"
Gene-Gene Interactions    YES        "gene-gene interactions"
Pathways (KEGG)           YES        "kegg pathways"
GO (Gene Ontology Terms)  YES        "go terms"

An example of how to extract one of these tables follows:

get_table( ctd_genes , index_name = "diseases" )[ 1:2, 1:3 ]
## DataFrame with 2 rows and 3 columns
##                             Disease.Name   Disease.ID  Direct.Evidence
##                              <character>  <character>      <character>
## 1 Chemical and Drug Induced Liver Injury MESH:D056486 marker/mechanism
## 2                        Atherosclerosis MESH:D050197 marker/mechanism

The information stored in each table can be seen in the following code, where the column names of each table are shown:

colnames( get_table( ctd_genes, index_name = "chemical interactions" ) )
## [1] "Chemical.Name"       "Chemical.ID"         "CAS.RN"             
## [4] "Interaction"         "Interaction.Actions" "Reference.Count"    
## [7] "Organism.Count"      "GeneSymbol"          "GeneID"
colnames( get_table( ctd_genes, index_name = "diseases" ) )
## [1] "Disease.Name"      "Disease.ID"        "Direct.Evidence"  
## [4] "Inference.Network" "Inference.Score"   "Reference.Count"  
## [7] "GeneSymbol"        "GeneID"
colnames( get_table( ctd_genes, index_name = "gene-gene interactions" ) )
##  [1] "Source.Gene.Symbol" "Source.Gene.ID"     "Target.Gene.Symbol"
##  [4] "Target.Gene.ID"     "Source.Organism"    "Target.Organism"   
##  [7] "Assay"              "Interaction.Type"   "Throughput"        
## [10] "Reference.Authors"  "Reference.Citation" "PubMed.ID"         
## [13] "GeneSymbol"         "GeneID"
colnames( get_table( ctd_genes, index_name = "kegg pathways" ) )
## [1] "Pathway"    "Pathway.ID" "GeneSymbol" "GeneID"
colnames( get_table( ctd_genes, index_name = "go terms" ) )
## [1] "Ontology"             "Qualifiers"           "GO.Term.Name"        
## [4] "GO.Term.ID"           "Organisms..Evidence." "GeneSymbol"          
## [7] "GeneID"
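
Once a table has been extracted, ordinary subsetting applies to it. The following sketch filters and sorts a toy data.frame that reuses some of the column names listed above for the "diseases" table. All values are made up, and treating an empty Direct.Evidence as "no curated evidence" is an assumption. A real table could first be converted with as.data.frame.

```r
# Toy stand-in for get_table( ctd_genes, index_name = "diseases" ).
# Column names follow the listing above; all values are made up.
diseases <- data.frame(
    Disease.Name    = c( "Atherosclerosis", "Atherosclerosis",
                         "Chemical and Drug Induced Liver Injury" ),
    Disease.ID      = c( "MESH:D050197", "MESH:D050197", "MESH:D056486" ),
    Direct.Evidence = c( "marker/mechanism", "", "marker/mechanism" ),
    Inference.Score = c( 120.5, 30.2, 80.0 ),
    GeneSymbol      = c( "APOE", "APOA1", "APOE" ),
    stringsAsFactors = FALSE
)

# Keep the associations that carry direct evidence
# (assumption: an empty string means there is none).
direct <- diseases[ diseases$Direct.Evidence != "", ]

# Order the remaining rows by decreasing inference score.
direct <- direct[ order( direct$Inference.Score, decreasing = TRUE ), ]

# Count how many rows remain for each disease.
table( direct$Disease.Name )
```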

2.1.2 Plotting Gene Created CTDdata Objects

The generic plot function works in the same way as get_table: the argument index_name selects the table to plot. The arguments subset.gene and subset.* (where * stands for chemicals, diseases, pathways or GO terms, as in subset.pathway and subset.go) filter the elements shown on the X-axis and Y-axis. Depending on the table to be plotted, the argument field.score selects the field to plot (it can take the values "Inference" or "Reference"). The argument filter.score filters the entries of the table. Finally, the argument max.length shortens the labels.

The following plot shows the number of references that cite the association between the APOE-like genes and chemicals.

plot( ctd_genes, index_name = "chemical interactions", filter.score = 3 )

The next plot shows the inference score that associates the APOE-like genes with diseases according to CTDbase.

plot( ctd_genes, index_name = "diseases", filter.score = 115 )

The plot used to explore the gene-gene interactions is based on a network representation. The genes from the original set are dark-coloured, while the other genes are light-coloured.

plot( ctd_genes, index_name = "gene-gene interactions", 
    representation = "network", main = "APOE-like gene-gene interactions" )

Finally, both the KEGG pathways and the GO terms related to the given set of genes can also be explored visually. For KEGG pathways:

plot( ctd_genes, index_name = "kegg pathways", 
    representation = "network", main = "KEGG pathways related to APOE genes" )

To let the user create clean networks, the arguments subset.gene and subset.pathway can be used to filter genes and KEGG pathways. For GO terms we use the same structure:

plot( ctd_genes, index_name = "go terms",
    representation = "network", main = "GO terms related to APOE genes",
    ontology = "Molecular Function" )

The argument ontology can take the values "Biological Process", "Cellular Component" and "Molecular Function", either one of them or any combination. This helps to filter the relations that will be plotted in the network. In the same way, the arguments subset.gene and subset.go can be used to filter genes and GO terms.

2.2 … by chemical

To query CTDbase for a given chemical or set of chemicals, we use the function query_ctd_chem:

args( query_ctd_chem )
## function (terms, filename = "CTD_chemicals.tsv.gz", mode = "auto", 
##     max.distance = 10, verbose = FALSE) 
## NULL

The argument terms must be filled with the list of chemicals of interest. The argument filename is the name given to the file that stores the specific vocabulary from CTDbase for chemicals. The function checks whether this file already exists and, if so, uses the local version. The argument mode is used when downloading the vocabulary file (for more information, see download.file in the package utils). Finally, the argument verbose shows relevant messages about the querying process when set to TRUE.

A typical chemical-query follows:

ctd_chem <- query_ctd_chem( terms = c( "Zinc", "Cadmium" ) )
ctd_chem
## Object of class 'CTDdata'
## -------------------------
##  . Type: CHEMICAL 
##  . Length: 2 
##  . Items: ZINC, ..., CADMIUM 
##  . Diseases: 2865 ( 234 / 5187 )
##  . Diseases: 2865 ( 5187 )
##  . Chemical-gene interactions: 5337 ( 8961 )
##  . KEGG pathways: 1377 ( 1377 )
##  . GO terms: 6343 ( 6343 )

As can be seen, query_ctd_chem informs about the number of terms used in the query and the number of terms lost in the process. To know the exact terms that were found in CTDbase and the ones that were lost, we use the method get_terms.

get_terms( ctd_chem )
## $found
## [1] "ZINC"    "CADMIUM"
## 
## $lost
## character(0)

2.2.1 Extract Tables

Now that the information about the chemicals of interest has been downloaded from CTDbase, we can access it with the method get_table. This method gives access to different tables depending on the origin of the object. For a CTDdata object created from chemicals, the accessible tables are:

Table                     Available  Accessor
Gene Interactions         YES        "gene interactions"
Chemicals Interactions    NO         "chemical interactions"
Diseases                  YES        "diseases"
Gene-Gene Interactions    NO         "gene-gene interactions"
Pathways (KEGG)           YES        "kegg pathways"
GO (Gene Ontology Terms)  YES        "go terms"

An example of how to extract one of these tables follows:

get_table( ctd_chem , index_name = "diseases" )[ 1:2, 1:6 ]
## DataFrame with 2 rows and 6 columns
##   Chemical.Name Chemical.ID      CAS.RN                  Disease.Name
##     <character> <character> <character>                   <character>
## 1          Zinc     D015032   7440-66-6 Liver Cirrhosis, Experimental
## 2          Zinc     D015032   7440-66-6              Breast Neoplasms
##     Disease.ID  Direct.Evidence
##    <character>      <character>
## 1 MESH:D008106 marker/mechanism
## 2 MESH:D001943 marker/mechanism

The information stored in each table can be seen in the following code, where the column names of each table are shown:

colnames( get_table( ctd_chem, index_name = "gene interactions" ) )
##  [1] "Chemical.Name"       "Chemical.ID"         "CAS.RN"             
##  [4] "Gene.Symbol"         "Gene.ID"             "Interaction"        
##  [7] "Interaction.Actions" "Reference.Count"     "Organism.Count"     
## [10] "ChemicalName"        "ChemicalID"
colnames( get_table( ctd_chem, index_name = "diseases" ) )
##  [1] "Chemical.Name"     "Chemical.ID"       "CAS.RN"           
##  [4] "Disease.Name"      "Disease.ID"        "Direct.Evidence"  
##  [7] "Inference.Network" "Inference.Score"   "Reference.Count"  
## [10] "ChemicalName"      "ChemicalID"
colnames( get_table( ctd_chem, index_name = "kegg pathways" ) )
## [1] "Pathway"                  "Pathway.ID"              
## [3] "P.value"                  "Corrected.P.value"       
## [5] "Annotated.Genes.Quantity" "Annotated.Genes"         
## [7] "Genome.Frequency"         "ChemicalName"            
## [9] "ChemicalID"
colnames( get_table( ctd_chem, index_name = "go terms" ) )
##  [1] "Ontology"                 "Highest.GO.Level"        
##  [3] "GO.Term.Name"             "GO.Term.ID"              
##  [5] "P.value"                  "Corrected.P.value"       
##  [7] "Annotated.Genes.Quantity" "Annotated.Genes"         
##  [9] "Genome.Frequency"         "ChemicalName"            
## [11] "ChemicalID"
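
Because every row of these tables carries the chemical it belongs to, results for several chemicals can be compared directly. The sketch below finds the pathways shared by two chemicals in a toy data.frame that reuses the Pathway.ID and ChemicalName column names listed above. The identifiers and their format are illustrative assumptions, not values taken from CTDbase.

```r
# Toy stand-in for get_table( ctd_chem, index_name = "kegg pathways" ).
# Column names follow the listing above; identifiers are illustrative.
kegg <- data.frame(
    Pathway.ID   = c( "KEGG:hsa04010", "KEGG:hsa04151",
                      "KEGG:hsa04010", "KEGG:hsa05200" ),
    ChemicalName = c( "ZINC", "ZINC", "CADMIUM", "CADMIUM" ),
    stringsAsFactors = FALSE
)

# One vector of pathway identifiers per chemical.
by_chemical <- split( kegg$Pathway.ID, kegg$ChemicalName )

# Pathways that appear for every chemical in the query.
shared <- Reduce( intersect, by_chemical )
shared
```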

2.2.2 Plotting Chemical Created CTDdata Objects

The generic plot function seen in the previous sections for gene queries also works with the chemical queries. The arguments that can be used are the same when plotting both types of queries.

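The following plot shows the number of references that support each chemical-gene association according to CTDbase (the "gene interactions" table listed above has a Reference.Count column but no inference score).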

plot( ctd_chem, index_name = "gene interactions", filter.score = 5 )

These associations, or relations, can be further inspected in a network representation that includes the effect of the chemical on the altered genes.

plot( ctd_chem, index_name = "gene interactions", representation = "network",
    filter.score = 3, main = "Gene-Chemical interaction for Zinc and Cadmium" )

Similarly, a heat-map with the inference score for the associations between chemicals and diseases can also be plotted.

plot( ctd_chem, index_name = "diseases" )

The association between KEGG pathways and chemicals can be seen as a heat-map. The heat-map shows the P-Value of the association between each pathway and each chemical.

plot( ctd_chem, index_name = "kegg pathways", filter.score = 1e-40 )

The argument filter.score can be used to filter associations by their P-value. Only the associations with a P-value lower than or equal to the value given to filter.score are kept. Then, the heat-map computes the terciles of the distribution of P-values to create the legend (which always has four categories).
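
The filtering and tercile steps described above can be sketched in base R on made-up P-values. This only illustrates the two operations named in the text; how CTDquerier turns the cut points into legend categories is not shown here, and the variable names are hypothetical.

```r
# Toy P-values standing in for the P.value column of a KEGG pathways table.
p_values <- c( 1e-60, 1e-55, 1e-48, 1e-45, 1e-42, 1e-30, 1e-12 )

# Keep only the associations with a P-value lower than or equal to the
# threshold, as described for filter.score.
threshold <- 1e-40
kept <- p_values[ p_values <= threshold ]

# Terciles of the kept P-values: the two cut points that split the
# distribution into three parts.
terciles <- quantile( kept, probs = c( 1/3, 2/3 ), names = FALSE )
terciles
```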

The heat-map for GO terms works in the same way:

plot( ctd_chem, index_name = "go terms",
    representation = "network", filter.score = 1e-210 )

2.3 … by disease

To query CTDbase for a given disease or set of diseases, we use the function query_ctd_dise:

args( query_ctd_dise )
## function (terms, filename = "CTD_diseases.tsv.gz", mode = "auto", 
##     verbose = FALSE) 
## NULL

The argument terms must be filled with the list of diseases of interest. The argument filename is the name given to the file that stores the specific vocabulary from CTDbase for diseases. The function checks whether this file already exists and, if so, uses the local version. The argument mode is used when downloading the vocabulary file (for more information, see download.file in the package utils). Finally, the argument verbose shows relevant messages about the querying process when set to TRUE.

A typical disease-query follows:

ctd_diseases <- query_ctd_dise( terms = c( "Dementia", "Alzheimer" ) )
## Warning in query_ctd_dise(terms = c("Dementia", "Alzheimer")): 1/2 terms were
## discarted.
ctd_diseases
## Object of class 'CTDdata'
## -------------------------
##  . Type: DISEASE 
##  . Length: 1 
##  . Items: DEMENTIA 
##  . Disease-gene interactions: 68766 ( 187 / 68766 )
##  . Gene-chemical interactions: 14428 ( 249 / 14428 )
##  . KEGG pathways: 1690 ( 1690 )
##  . GO terms: 0 (-)

As can be seen, query_ctd_dise reports the number of terms used in the query and the number of terms lost in the process. To know the exact terms that were found in CTDbase and the ones that were lost, we use the method get_terms.

get_terms( ctd_diseases )
## $found
## [1] "DEMENTIA"
## 
## $lost
## [1] "ALZHEIMER"

2.3.1 Extract Tables

Now that the information about the diseases of interest has been downloaded from CTDbase, we can access it with the method get_table. This method gives access to different tables depending on the origin of the object. For a CTDdata object created from diseases, the accessible tables are:

Table                     Available  Accessor
Gene Interactions         YES        "gene interactions"
Chemicals Interactions    YES        "chemical interactions"
Diseases                  NO         "diseases"
Gene-Gene Interactions    NO         "gene-gene interactions"
Pathways (KEGG)           YES        "kegg pathways"
GO (Gene Ontology Terms)  NO         "go terms"

An example of how to extract one of these tables follows:

get_table( ctd_diseases , index_name = "gene interactions" )[ 1:2, 1:5 ]
## DataFrame with 2 rows and 5 columns
##   Gene.Symbol   Gene.ID      Disease.Name   Disease.ID  Direct.Evidence
##   <character> <integer>       <character>  <character>      <character>
## 1         APP       351 Alzheimer Disease MESH:D000544 marker/mechanism
## 2       CASP3       836 Alzheimer Disease MESH:D000544 marker/mechanism

The information stored in each table can be seen in the following code, where the column names of each table are shown:

colnames( get_table( ctd_diseases, index_name = "gene interactions" ) )
## [1] "Gene.Symbol"       "Gene.ID"           "Disease.Name"     
## [4] "Disease.ID"        "Direct.Evidence"   "Inference.Network"
## [7] "Inference.Score"   "Reference.Count"
colnames( get_table( ctd_diseases, index_name = "chemical interactions" ) )
## [1] "Chemical.Name"     "Chemical.ID"       "CAS.RN"           
## [4] "Disease.Name"      "Disease.ID"        "Direct.Evidence"  
## [7] "Inference.Network" "Inference.Score"   "Reference.Count"
colnames( get_table( ctd_diseases, index_name = "kegg pathways" ) )
## [1] "Disease.Name"             "Disease.ID"              
## [3] "Pathway"                  "Pathway.ID"              
## [5] "Association.inferred.via"

2.3.2 Plotting Disease Created CTDdata Objects

As seen in the previous sections, CTDquerier offers basic visualization of the information retrieved from CTDbase. The function plot can also be used on disease-created CTDdata objects to display the disease-gene interaction table, the disease-chemical interaction table and the table of pathways related to the diseases.

Disease-created CTDdata objects have a heat-map visualization of the interactions between diseases and genes. This plot can use either CTDbase's inference score or the number of references to highlight the associations.

plot( ctd_diseases, index_name = "gene interactions", filter.score = 75 )

Then, the associations between diseases and chemicals can also be plotted. Usually this plot is a heat-map, as previously seen for genes. Nevertheless, the plot can turn into a bar-plot. This happens when the argument filter.score, which filters the associations (by either CTDbase's inference score or the number of references for the association), is so strict that only a single disease is kept.

plot( ctd_diseases, index_name = "chemical interactions", filter.score = 35 )

Finally, a network of the KEGG pathways inferred for the diseases can also be obtained.

plot( ctd_diseases, index_name = "kegg pathways", 
    representation = "network", subset.disease = "Dementia" )

3 Session Info.

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.20       ggplot2_3.1.0    bindrcpp_0.2.2   CTDquerier_1.2.0
## [5] BiocStyle_2.10.0
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1  tidyselect_0.2.5    xfun_0.4           
##  [4] purrr_0.2.5         colorspace_1.3-2    htmltools_0.3.6    
##  [7] stats4_3.5.1        BiocFileCache_1.6.0 yaml_2.2.0         
## [10] blob_1.1.1          rlang_0.3.0.1       pillar_1.3.0       
## [13] glue_1.3.0          withr_2.1.2         DBI_1.0.0          
## [16] rappdirs_0.3.1      BiocGenerics_0.28.0 bit64_0.9-7        
## [19] dbplyr_1.2.2        bindr_0.1.1         plyr_1.8.4         
## [22] stringr_1.3.1       munsell_0.5.0       gtable_0.2.0       
## [25] memoise_1.1.0       evaluate_0.12       labeling_0.3       
## [28] parallel_3.5.1      curl_3.2            highr_0.7          
## [31] Rcpp_0.12.19        backports_1.1.2     scales_1.0.0       
## [34] BiocManager_1.30.3  S4Vectors_0.20.0    bit_1.1-14         
## [37] gridExtra_2.3       digest_0.6.18       stringi_1.2.4      
## [40] bookdown_0.7        dplyr_0.7.7         rprojroot_1.3-2    
## [43] grid_3.5.1          tools_3.5.1         bitops_1.0-6       
## [46] magrittr_1.5        RCurl_1.95-4.11     lazyeval_0.2.1     
## [49] RSQLite_2.1.1       tibble_1.4.2        crayon_1.3.4       
## [52] pkgconfig_2.0.2     assertthat_0.2.0    rmarkdown_1.10     
## [55] httr_1.3.1          R6_2.3.0            igraph_1.2.2       
## [58] compiler_3.5.1

Bibliography

1. Mattingly CJ, Colby GT, Forrest JN, et al. The Comparative Toxicogenomics Database (CTD). 2003.

2. Davis AP, Grondin CJ, Johnson RJ, et al. The Comparative Toxicogenomics Database: update 2017. 2017.