1 Introduction

Cancer-Testis (CT) genes, also called Cancer-Germline (CG) genes, are a group of genes whose expression is normally restricted to the germline but that are aberrantly activated in many types of cancer. These genes produce cancer-specific antigens that represent ideal targets for anti-cancer vaccines. Besides their interest in immunotherapy, they can also be used as cancer biomarkers and as targets of anti-tumor therapies with limited side effects.

Many CT genes use DNA methylation as a primary mechanism of transcriptional regulation. This makes CT genes interesting for another reason: they are suitable models to study DNA demethylation in cancer.

Currently, the reference database gathering CT genes is the CTdatabase, which was published in 2009 and is based on a literature screening (Almeida et al., 2009). This database is, however, not up to date. Recently identified CT genes are not referenced (in particular CT genes identified by omics methods that did not exist at the time), while some genes referred to as CT genes turned out to be expressed in many somatic tissues. Furthermore, the database is not in an easily importable format, and some genes are not encoded properly (by synonym names rather than by their official HGNC symbols, or by a concatenation of both), resulting in poor interoperability for downstream analyses. More recent studies proposed other lists of CT genes, such as Wang’s CTatlas (Wang et al., 2016; Jamin et al., 2021; Carter et al., 2023). These lists were established using different criteria to define CT genes and hence differ substantially from each other. Moreover, these lists are usually provided as supplemental data files and are not, strictly speaking, databases. Finally, none of these studies describes the involvement of DNA methylation in the regulation of individual CT genes.

We therefore created CTexploreR, a Bioconductor R package that aims to rigorously redefine the list of CT genes based on publicly available RNAseq databases and to summarize their main characteristics. We included methylation analyses to classify these genes according to whether or not they are regulated by DNA methylation. The package also offers tools to visualize CT gene expression and promoter DNA methylation in normal and tumoral tissues. CTexploreR hence represents an up-to-date reference database for CT genes and can be used as a starting point for further investigations related to these genes.

2 Installation

To install the package:

if (!require("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("CTexploreR")

To install the package from GitHub:

if (!require("BiocManager")) {
    install.packages("BiocManager")
}

3 CT genes

The central element of CTexploreR is the list of 298 CT genes (see table below) selected based on their expression in normal and tumoral tissues (selection details in the next section). The table also summarises their main characteristics.

library(CTexploreR)
## Loading required package: CTdata
head(CT_genes, 10)

CTdata is the companion package of CTexploreR. It provides the omics data that were needed to select and characterize Cancer-Testis genes, and that are used to explore them. The data are served through the ExperimentHub infrastructure. Currently available data are summarised in the table below, and details can be found in the CTdata vignette or manuals.


4 CT gene selection

In order to generate the list of CT genes, we followed a specific selection procedure (see figure below).

4.1 Testis-specific expression

Testis-specific genes (expressed exclusively in testis) and testis-preferential genes (expressed in a few somatic tissues, but at a level at least 10 times lower than in testis) were first selected using the GTEx database (Aguet et al., 2020).
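The selection rule can be sketched on a toy TPM matrix. This is only an illustration, not the code used to build the package: it uses the criteria restated in the GTEX_expression() section (lowly expressed in all, or in at least 75%, of somatic tissues, and at least 10 times more expressed in testis). The `low_tpm` cutoff for "lowly expressed", the comparison against the highest somatic value, the function name `classify_testis` and all expression values are assumptions made for the example.

```r
# Toy TPM matrix: rows are genes, columns are tissues (made-up values).
tpm <- rbind(
    geneA = c(testis = 50, lung = 0.1, liver = 0.0, brain = 0.2, skin = 0.1),
    geneB = c(testis = 80, lung = 0.3, liver = 4.0, brain = 0.1, skin = 0.2),
    geneC = c(testis = 20, lung = 15,  liver = 9.0, brain = 12,  skin = 8.0)
)

# Hypothetical helper; low_tpm is an assumed cutoff for "lowly expressed".
classify_testis <- function(tpm, low_tpm = 1, ratio = 10, min_low_frac = 0.75) {
    somatic <- tpm[, colnames(tpm) != "testis", drop = FALSE]
    testis <- tpm[, "testis"]
    # Fraction of somatic tissues in which the gene is lowly expressed
    low_frac <- rowMeans(somatic < low_tpm)
    # Testis expression at least `ratio` times the highest somatic level
    enriched <- testis >= ratio * apply(somatic, 1, max)
    ifelse(enriched & low_frac == 1, "testis_specific",
        ifelse(enriched & low_frac >= min_low_frac,
            "testis_preferential", "other"))
}

testis_class <- classify_testis(tpm)
testis_class
```

In this toy data, geneB is expressed in one somatic tissue out of four, so it can only qualify as testis-preferential, while geneC fails the 10x enrichment test and is discarded.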

Note that some genes were undetectable in the GTEx database due to multimapping issues (these were flagged as “lowly expressed” in the GTEX_category column). A careful inspection of these genes showed that many of them are well-known Cancer-Testis genes belonging to gene families (MAGEA, SSX, CT45A, GAGE, …) whose members have identical or nearly identical sequences. This is likely the reason why these genes are not detected in the GTEx database, as the GTEx processing pipeline specifies that overlapping intervals between genes are excluded from all genes for counting. For these genes, as testis-specificity could not be assessed using the GTEx database, RNAseq data from a set of normal tissues were reprocessed in order to allow multimapping. Expression of several genes became detectable in some tissues. Genes showing a testis-specific expression (expression at least 10x higher in testis than in any somatic tissue when multimapping was allowed) were selected and flagged as testis-specific in the multimapping_analysis column.

The expression of the selected testis-specific or testis-preferential genes was further analysed in somatic cell types using scRNAseq data of normal tissues from the Human Protein Atlas (Uhlén et al., 2015). The aim was to ensure that the selected genes are not expressed in any specific somatic cell type, as the GTEx selection was based on bulk RNAseq data.

4.2 Germline-specific expression

We also used the single-cell RNA-Seq data from the adult human testis transcriptional cell atlas (Guo et al., 2018) to make sure that we selected germline-specific genes, and not genes that are specific to a somatic cell type of the testis.

For each gene, the testis cell type was defined as the cell type showing the highest mean expression for that gene. This allowed us to remove the genes for which this cell type is a testis somatic cell type (macrophage, endothelial, myoid, Sertoli or Leydig cells).
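As a minimal sketch of this filter, assuming a matrix of mean expression per cell type is already available: the matrix `mean_expr`, the cell type labels and the values below are made up for illustration and are not taken from the atlas.

```r
# Toy mean expression per testis cell type (made-up values and labels).
mean_expr <- rbind(
    geneA = c(SSC = 2, Spermatogonia = 9, Spermatocyte = 30, Sertoli = 1, Leydig = 0),
    geneB = c(SSC = 0, Spermatogonia = 1, Spermatocyte = 2, Sertoli = 12, Leydig = 3)
)

# Somatic cell types of the testis, as listed in the text (labels assumed)
somatic_types <- c("Macrophage", "Endothelial", "Myoid", "Sertoli", "Leydig")

# For each gene, keep the cell type with the highest mean expression
testis_cell_type <- colnames(mean_expr)[apply(mean_expr, 1, which.max)]
names(testis_cell_type) <- rownames(mean_expr)

# Keep only genes whose top cell type is not a somatic one
germline_genes <- names(testis_cell_type)[!testis_cell_type %in% somatic_types]
germline_genes
```

Here geneB peaks in Sertoli cells and is therefore dropped, whereas geneA peaks in a germ cell type and is kept.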

4.3 Activation in cancer cell lines and TCGA tumors

To assess activation in cancers, RNAseq data from cancer cell lines from CCLE (Barretina et al., 2012) and from TCGA cancer samples (Cancer Genome Atlas Research Network et al., 2013) were used. This allowed us to select, among testis-specific and testis-preferential genes, those that are activated in cancers.

In the CCLE_category and TCGA_category columns, genes are tagged as “activated” when they are highly expressed in at least one cancer cell line/sample (TPM >= 10). However, genes that were found to be expressed in all, or almost all, cancer cell lines/samples were removed, as this probably reflects constitutive expression rather than true activation: we filtered out genes that were not completely repressed (TPM <= 0.1) in at least 20% of cancer cell lines/samples.
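The two thresholds can be illustrated on a toy matrix. This is a sketch of the rule as described above, not the package's own code: the expression values are made up, and the labels "leaky" and "not_activated" are placeholders chosen for the example (only "activated" is named in the text).

```r
# Toy TPM matrix: rows are genes, columns are 10 cancer cell lines (made-up values).
tpm <- rbind(
    geneA = c(0.0, 0.05, 35, 0.0, 0.1, 0.0, 0.0, 12, 0.0, 0.0),
    geneB = c(25, 40, 31, 18, 22, 0.05, 27, 33, 45, 29),
    geneC = c(0.0, 0.2, 1.5, 0.0, 0.0, 0.3, 0.0, 0.0, 2.0, 0.0)
)

# Highly expressed (TPM >= 10) in at least one cell line
activated <- rowSums(tpm >= 10) >= 1
# Fraction of cell lines in which the gene is completely repressed (TPM <= 0.1)
repressed_frac <- rowMeans(tpm <= 0.1)

# Genes repressed in fewer than 20% of cell lines are filtered out first
category <- ifelse(repressed_frac < 0.20, "leaky",
    ifelse(activated, "activated", "not_activated"))
category
```

geneB is highly expressed almost everywhere and repressed in only one cell line out of ten, so it is set aside despite passing the TPM >= 10 test.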

4.4 IGV visualisation

All selected CT genes were visualised in IGV (Thorvaldsdóttir et al., 2013) using an RNA-seq alignment from testis, to ensure that expression in testis really corresponded to the canonical transcript. For genes whose canonical transcript did not correspond to the transcript observed in the testis sample, we manually modified the external_transcript_name accordingly, to ensure that the TSS and the promoter region are correctly defined. This is particularly important for methylation analyses, which must focus on true promoter regions.

4.5 Regulation by methylation

Genes flagged as TRUE in the regulated_by_methylation column correspond to:

  • Genes that are significantly induced by a demethylating agent (RNAseq analysis of cell lines treated with DAC (5-Aza-2′-Deoxycytidine)).

  • Genes that have a highly methylated promoter in normal somatic tissues but less methylated in germ cells (WGBS analysis of a set of normal tissues).

For some genes showing a strong activation in cells treated with 5-Aza-2′-Deoxycytidine, methylation analysis was not possible due to multimapping issues. In this case, genes were still considered to be regulated by methylation, unless their promoter appeared unmethylated in somatic tissues or methylated in germ cells.
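One possible reading of this rule is sketched below. It assumes that both criteria must hold when methylation data are available, and that missing methylation data (NA) do not contradict induction by DAC. Apart from regulated_by_methylation, the column names are hypothetical, and the values are made up.

```r
# Toy annotation table; NA marks promoters that could not be assessed.
genes <- data.frame(
    gene = c("geneA", "geneB", "geneC", "geneD"),
    induced_by_DAC = c(TRUE, TRUE, FALSE, TRUE),
    somatic_methylated = c(TRUE, NA, TRUE, FALSE),
    germline_methylated = c(FALSE, NA, FALSE, FALSE)
)

# Methylation data contradict the hypothesis when the promoter is
# unmethylated in somatic tissues or methylated in germ cells.
contradicted <-
    (!is.na(genes$somatic_methylated) & !genes$somatic_methylated) |
    (!is.na(genes$germline_methylated) & genes$germline_methylated)

# Assumed rule: induced by DAC and not contradicted by methylation data
genes$regulated_by_methylation <- genes$induced_by_DAC & !contradicted
genes[, c("gene", "regulated_by_methylation")]
```

geneB has no usable methylation data but is induced by DAC, so it is still flagged, while geneD is induced but has an unmethylated promoter in somatic tissues and is not.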

5 Available functions

For details about functions, see their respective manual pages. For all functions, an option values_only can be set to TRUE in order to get the values instead of the visualisation.

All expression visualisation functions can be used on all GTEx genes, not only on Cancer-Testis genes, as the data they refer to contains all genes.

5.1 Expression in normal healthy tissues

5.1.1 GTEX_expression()

This function visualises gene expression in GTEx tissues. We can, for example, see the difference in expression between testis-specific and testis-preferential genes. Testis-specific genes were selected with a stricter criterion: they are lowly expressed in all somatic tissues and expressed at least 10 times more in the testis. Testis-preferential genes tolerate some expression outside the testis: they are lowly expressed in at least 75% of somatic tissues, but are still expressed at least 10 times more in the testis.

  • Applied to testis-specific genes: expression is clearly restricted to the testis. We can also see genes that are lowly expressed in GTEx and have thus been characterized using multimapping (see below).
testis_specific <- dplyr::filter(
    CT_genes,
    testis_specificity == "testis_specific")
GTEX_expression(testis_specific$external_gene_name, units = "log_TPM")