1 Introduction

This document is a presentation of the R implementation of the tool CIMICE. It shows the main features of this software and how it is built as a modular pipeline, with the goal of making it easy to change and update.

CIMICE is a tool in the field of tumor phylogenetics. Its goal is to build a Markov Chain (called Cancer Progression Markov Chain, CPMC) that models the evolution of tumor subtypes. The input of CIMICE is a Mutational Matrix, that is, a boolean matrix representing altered genes in a collection of samples. These samples are assumed to be obtained with single-cell DNA analysis techniques, and the tool is specifically written to exploit the peculiarities of this data for the CPMC construction.

CIMICE data processing and analysis can be divided into four stages:

  • Input management
  • Graph topology reconstruction
  • Graph weight computation
  • Output presentation

These steps will be presented in the following sections.

1.1 Used libraries

This implementation of CIMICE is built as a single library on its own:

library(CIMICE)

and it requires the following libraries:

IRanges

# Dataframe manipulation
library(dplyr) 
# Plot display
library(ggplot2)
# Improved string operations
library(glue)
# Dataframe manipulation
library(tidyr)
# Graph data management
library(igraph)
# Remove transitive edges on a graph
library(relations)
# Interactive graph visualization
library(networkD3)
# Interactive graph visualization
library(visNetwork)
# Correlation plot visualization
library(ggcorrplot)
# Functional R programming
library(purrr)
# Graph Visualization
library(ggraph)
# sparse matrices
library(Matrix)

2 Input management

CIMICE requires a boolean dataframe as input, structured as follows:

  • Each column represents a gene
  • Each row represents a sample (or a genotype)
  • Each 0/1 entry indicates whether a given gene is mutated in a given sample

It is possible to load this information from a file. The default input format for CIMICE is the “CAPRI/CAPRESE” TRONCO format:

  • The file is a tab or space separated file
  • The first line starts with the string “s\g” (or any other word) followed by the list of genes (or loci) to be considered in the analysis
  • Each subsequent line starts with a sample identifier string, followed by the bit set representing its genotype

This is a scheme of CIMICE’s input format:

s\g       gene_1 gene_2 ... gene_n
sample_1    1      0    ...   0
...
sample_m    1      1    ...   1

and this is an example of how to load a dataset from the file system:

# Read input dataset in CAPRI/CAPRESE format
dataset.big <- read_CAPRI(system.file("extdata", "example.CAPRI", package = "CIMICE", mustWork = TRUE))
##                                         sfmF ulaF dapB yibT phnL ispA
## GCF_001281685.1_ASM128168v1_genomic.fna    1    1    0    0    0    0
## GCF_001281725.1_ASM128172v1_genomic.fna    0    0    0    0    0    0
## GCF_001281755.1_ASM128175v1_genomic.fna    0    0    0    0    0    0
## GCF_001281775.1_ASM128177v1_genomic.fna    0    0    0    0    0    0
## GCF_001281795.1_ASM128179v1_genomic.fna    0    1    0    0    0    0
## GCF_001281815.1_ASM128181v1_genomic.fna    0    0    0    0    0    0
## [1] "ncol: 999  - nrow: 160"
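
To make the file layout concrete, the following is a minimal sketch that writes a tiny file in the CAPRI/CAPRESE layout and parses it with base R only. It does not use read_CAPRI and is not meant to reproduce its behavior; the file content and the variable names (capri_text, toy) are hypothetical.

```r
# Hypothetical toy file following the CAPRI/CAPRESE layout described above
capri_text <- c(
  "s\\g gene_1 gene_2 gene_3",
  "sample_1 1 0 0",
  "sample_2 1 1 0",
  "sample_3 1 1 1"
)
path <- tempfile(fileext = ".CAPRI")
writeLines(capri_text, path)

# The header row lists the genes; the first column holds the sample identifiers.
# The default separator of read.table accepts both tabs and spaces.
toy <- as.matrix(
  read.table(path, header = TRUE, row.names = 1, check.names = FALSE)
)
print(toy)
```

The result is a plain integer matrix with samples as rows and genes as columns, which is the shape assumed throughout this document.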

Another option is to define the dataframe directly in R. This is made easy by the functions make_dataset and update_df, used as follows:

# genes
dataset <- make_dataset(A,B,C,D) %>%
    # samples
    update_df("S1", 0, 0, 0, 1) %>%
    update_df("S2", 1, 0, 0, 0) %>%
    update_df("S3", 1, 0, 0, 0) %>%
    update_df("S4", 1, 0, 0, 1) %>%
    update_df("S5", 1, 1, 0, 1) %>%
    update_df("S6", 1, 1, 0, 1) %>%
    update_df("S7", 1, 0, 1, 1) %>%
    update_df("S8", 1, 1, 0, 1) 

with the following outcome:

## 8 x 4 Matrix of class "dgeMatrix"
##    A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S3 1 0 0 0
## S4 1 0 0 1
## S5 1 1 0 1
## S6 1 1 0 1
## S7 1 0 1 1
## S8 1 1 0 1

If your data consists of samples with associated frequencies, you can use an alternative format that we call “CAPRIpop”:

s/g    gene_1 gene_2 ... gene_n freq
sample_1 1 0 ... 0 freq_s1
...
sample_m 1 1 ... 1 freq_sm

where the freq column is mandatory and samples must not be repeated. Frequencies in the freq column are normalized automatically. This format is meant to be used with quick_run(dataset, mode="CAPRIpop") for the full analysis and with dataset_preprocessing_population(...) for the preprocessing stage only. All subsequent operations are the same as for the default format.
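
As an illustration of what normalizing the freq column could look like, here is a minimal base-R sketch. It assumes that normalization means dividing each value by the column total; this is an assumption about the behavior, not a description of the CIMICE internals, and the table pop is hypothetical.

```r
# Hypothetical CAPRIpop-style table: one row per distinct sample plus a freq column
pop <- data.frame(
  A = c(0, 1, 1),
  B = c(0, 0, 1),
  freq = c(10, 30, 60),
  row.names = c("S1", "S2", "S3")
)

# Assumed normalization: divide by the total so that the frequencies sum to 1
pop$freq <- pop$freq / sum(pop$freq)
print(pop)
```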

Another option is to compute a mutational matrix directly from a MAF file, which can be done as follows:

# read the MAF file at the given path and show the first 5 rows and columns
read_MAF(system.file("extdata", "paac_jhu_2014_500.maf", package = "CIMICE", mustWork = TRUE))[1:5,1:5]
## -Reading
## -Validating
## -Silent variants: 49 
## -Summarizing
## --Possible FLAGS among top ten genes:
##   HMCN1
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.127s elapsed (0.154s cpu)
## 5 x 5 sparse Matrix of class "dgCMatrix"
##          CSMD2 C2orf61 DCHS2 DPYS GPR158
## ACINAR28     1       1     .    .      .
## ACINAR27     1       .     .    .      .
## ACINAR25     1       .     .    .      .
## ACINAR02     .       .     1    .      1
## ACINAR13     .       .     1    .      .

3 Preliminary check of mutations distributions

This implementation of CIMICE includes simple functions to quickly analyze the distributions of mutations among genes and samples.

The following code displays a histogram showing the distribution of the number of mutations hitting a gene:

gene_mutations_hist(dataset.big)

And this does the same but from the samples point of view:

sample_mutations_hist(dataset.big, binwidth = 10)
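
The quantities summarized by these two histograms are simply the column sums and row sums of the mutational matrix. The sketch below computes them in base R on the small A, B, C, D example from the input section, rebuilt here as a plain matrix (named toy, a hypothetical name) so that the snippet is self-contained. It does not call any CIMICE function.

```r
# Toy mutational matrix: rows are samples, columns are genes
toy <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 1, 0, 1,
    1, 1, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:8), c("A", "B", "C", "D"))
)

# Number of samples in which each gene is mutated
mutations_per_gene <- colSums(toy)
# Number of mutated genes in each sample
mutations_per_sample <- rowSums(toy)

print(mutations_per_gene)
print(table(mutations_per_sample))
```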

3.1 Simple procedures of feature selection

With very large datasets, it may be necessary to focus on a subset of the input samples or genes. The following procedures provide an easy way to do so when the goal is to keep the most (or least) mutated samples or genes.

3.1.1 By genes

Keeps the first \(n\) (=100) most mutated genes:

select_genes_on_mutations(dataset.big, 100) 
##                                         eptA yohC yedK yeeO narU rsxC
## GCF_001281685.1_ASM128168v1_genomic.fna    1    1    1    1    1    1
## GCF_001281725.1_ASM128172v1_genomic.fna    1    1    1    0    0    0
## GCF_001281755.1_ASM128175v1_genomic.fna    1    1    1    1    1    1
## GCF_001281775.1_ASM128177v1_genomic.fna    1    1    1    1    1    0
## GCF_001281795.1_ASM128179v1_genomic.fna    1    1    1    1    1    1
## GCF_001281815.1_ASM128181v1_genomic.fna    1    1    1    1    1    1
## [1] "ncol: 100  - nrow: 160"

3.1.2 By samples

Keeps the first \(n\) (=100) least mutated samples:

select_samples_on_mutations(dataset.big, 100, desc = FALSE)
##                                         sfmF ulaF dapB yibT phnL ispA
## GCF_001281725.1_ASM128172v1_genomic.fna    0    0    0    0    0    0
## GCF_001281855.1_ASM128185v1_genomic.fna    0    0    0    0    0    0
## GCF_001281815.1_ASM128181v1_genomic.fna    0    0    0    0    0    0
## GCF_001281775.1_ASM128177v1_genomic.fna    0    0    0    0    0    0
## GCF_001607735.1_ASM160773v1_genomic.fna    0    0    0    0    0    0
## GCF_001297965.1_ASM129796v1_genomic.fna    0    0    0    0    0    0
## [1] "ncol: 999  - nrow: 100"

3.2 Both selections

It is easy to combine these selections by using the pipe operator %>%:

select_samples_on_mutations(dataset.big, 100, desc = FALSE) %>% select_genes_on_mutations(100)
##                                         eptA yohC yedK argK yeeO narU
## GCF_001281725.1_ASM128172v1_genomic.fna    1    1    1    1    0    0
## GCF_001281855.1_ASM128185v1_genomic.fna    1    1    1    1    0    0
## GCF_001281815.1_ASM128181v1_genomic.fna    1    1    1    1    1    1
## GCF_001281775.1_ASM128177v1_genomic.fna    1    1    1    1    1    1
## GCF_001607735.1_ASM160773v1_genomic.fna    1    1    1    1    1    1
## GCF_001297965.1_ASM129796v1_genomic.fna    1    1    1    1    1    1
## [1] "ncol: 100  - nrow: 100"
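
The idea behind these selections can be sketched in base R as ranking genes or samples by their mutation counts and keeping the first n. The helpers keep_top_genes and keep_least_mutated_samples below are hypothetical names introduced only for this illustration; they are not part of CIMICE and may treat ties differently from select_genes_on_mutations and select_samples_on_mutations.

```r
# Toy mutational matrix: rows are samples, columns are genes
toy <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 1, 0, 1,
    1, 1, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:8), c("A", "B", "C", "D"))
)

# Hypothetical helper: keep the n genes mutated in the most samples
keep_top_genes <- function(m, n) {
  idx <- order(colSums(m), decreasing = TRUE)[seq_len(n)]
  m[, idx, drop = FALSE]
}

# Hypothetical helper: keep the n samples with the fewest mutated genes
keep_least_mutated_samples <- function(m, n) {
  idx <- order(rowSums(m))[seq_len(n)]
  m[idx, , drop = FALSE]
}

# Combine both selections: first restrict the samples, then the genes
sel <- keep_top_genes(keep_least_mutated_samples(toy, 3), 2)
print(sel)
```

Note that the order of the two selections matters: gene counts are recomputed on the samples that survive the first step.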

3.3 Correlation plot

It may be of interest to show correlations among gene or sample mutations. The library ggcorrplot provides an easy way to do so by preparing a heatmap based on the correlation matrix. We can show these plots by using the following commands:

Gene mutation correlation:

corrplot_genes(dataset)

Sample mutation correlation:

corrplot_samples(dataset)
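
The matrices behind such heatmaps can be sketched with the base R function cor. The example below assumes Pearson correlation, which is the default of cor; whether corrplot_genes and corrplot_samples use the same coefficient is an assumption. The matrix toy is the small A, B, C, D example rebuilt as a plain matrix.

```r
# Toy mutational matrix: rows are samples, columns are genes
toy <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 1, 0, 1,
    1, 1, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:8), c("A", "B", "C", "D"))
)

# Correlation among genes: cor() correlates the columns of its argument
gene_cor <- cor(toy)
# Correlation among samples: transpose so that samples become columns
sample_cor <- cor(t(toy))

print(round(gene_cor, 2))
```

A gene or sample whose row or column is constant (all 0 or all 1) has zero variance, so its correlations are undefined and cor returns NA for it.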

3.4 Group equal genotypes

The first step of the CIMICE algorithm is based on grouping the genotypes contained in the dataset to compute their observed frequencies. In this implementation we used a simple approach using the library dplyr. However, this solution is not optimal from an efficiency point of view and might be problematic with very large datasets. An Rcpp implementation is planned and, moreover, it is possible to easily modify this step by changing the algorithm as long as its output is a dataframe containing only unique genotypes with an additional column named “freq” for the observed frequencies count.

# groups and counts equal genotypes
compactedDataset <- compact_dataset(dataset)
## $matrix
## 5 x 4 Matrix of class "dgeMatrix"
##    A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
## 
## $counts
## [1] 1 2 1 1 3
## 
## $row_names
## $row_names[[1]]
## [1] "S1"
## 
## $row_names[[2]]
## [1] "S2, S3"
## 
## $row_names[[3]]
## [1] "S4"
## 
## $row_names[[4]]
## [1] "S7"
## 
## $row_names[[5]]
## [1] "S5, S6, S8"
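
For readers who want to replace this step, the grouping itself can be sketched in a few lines of base R. This is not the code used by compact_dataset: it is a minimal illustration that builds one string key per genotype and counts repeated keys. The names key, groups and unique_genotypes are hypothetical, and the row order differs from the output above because groups are kept in order of first appearance.

```r
# Toy mutational matrix: rows are samples, columns are genes
toy <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 1, 0, 1,
    1, 1, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:8), c("A", "B", "C", "D"))
)

# One string key per sample, e.g. "1101" for genotype A, B, D
key <- apply(toy, 1, paste, collapse = "")

# Sample names grouped by genotype, in order of first appearance
groups <- split(rownames(toy), factor(key, levels = unique(key)))
counts <- lengths(groups)

# Keep one representative row per genotype
unique_genotypes <- toy[!duplicated(key), , drop = FALSE]

print(unique_genotypes)
print(unname(counts))
```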

4 Graph topology construction

The goal of the next stage is to prepare the topology of the final Cancer Progression Markov Chain. We recall that this topology is assumed to be a DAG. These early steps prepare the information required for this and the following phases.

Convert dataset to matrix form:

samples <- compactedDataset$matrix
## 5 x 4 Matrix of class "dgeMatrix"
##    A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1

Extract gene names:

genes <- colnames(samples)
## [1] "A" "B" "C" "D"

Compute observed frequency of each genotype:

freqs <- compactedDataset$counts/sum(compactedDataset$counts)
## [1] 0.125 0.250 0.125 0.125 0.375

Add the “Clonal” genotype (the genotype with no mutations) to the dataset, if not already present; it will be used as the DAG root:

# prepare node labels listing the mutated genes for each node
labels <- prepare_labels(samples, genes)
# fix Clonal genotype absence, if needed
fix <- fix_clonal_genotype(samples, freqs, labels)
samples <- fix[["samples"]]
freqs <- fix[["freqs"]]
labels <- fix[["labels"]]
## 6 x 4 Matrix of class "dgeMatrix"
##    A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
##    0 0 0 0
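
The effect of this step can be sketched in base R as follows. The sketch assumes that the added genotype receives an observed frequency of zero and, for readability, gives it the row name "Clonal"; both are assumptions for illustration and may not match what fix_clonal_genotype does internally.

```r
# Unique genotypes and their observed frequencies, as computed above
samples <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S4", "S7", "S5"), c("A", "B", "C", "D"))
)
freqs <- c(0.125, 0.250, 0.125, 0.125, 0.375)

# The clonal genotype is the row with no mutated genes
has_clonal <- any(rowSums(samples) == 0)

if (!has_clonal) {
  # Assumption: the clonal genotype was never observed, so its frequency is 0
  samples <- rbind(samples, Clonal = 0)
  freqs <- c(freqs, 0)
}
print(samples)
```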

Build the topology of the graph based on the “superset” relation:

# compute edges based on subset relation
edges <- build_topology_subset(samples)
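
To illustrate the relation used here, the sketch below connects genotype i to genotype j whenever the mutated genes of i are a strict subset of those of j. It is a plain base-R illustration with an explicit double loop; the helper is_strict_subset and the matrix subset_edges are hypothetical names, and the representation returned by build_topology_subset may differ.

```r
# Unique genotypes including the clonal one (row names are illustrative)
samples <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1,
    0, 0, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S4", "S7", "S5", "Clonal"),
                  c("A", "B", "C", "D"))
)

# TRUE when every gene mutated in a is also mutated in b, and b has more
is_strict_subset <- function(a, b) all(a <= b) && any(a < b)

n <- nrow(samples)
subset_edges <- matrix(FALSE, n, n,
                       dimnames = list(rownames(samples), rownames(samples)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset_edges[i, j] <- is_strict_subset(samples[i, ], samples[j, ])
  }
}

# Rows are edge sources, columns are edge targets
print(subset_edges * 1)
```

Because the subset relation is transitive, this matrix also contains edges implied by longer paths, for example from the clonal genotype to every other genotype.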

Finally, prepare the graph object with the current topology:

# remove transitive edges and prepare igraph object
g <- build_subset_graph(edges, labels)
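
Removing transitive edges means dropping every edge that is already implied by a longer path. For a transitively closed relation such as the strict subset relation, it is enough to drop the edges that can be replaced by a two-step path. The sketch below does this with plain matrix products instead of the relations package; it is an illustration under that assumption, not the code of build_subset_graph, and the names closed, two_step and reduced are hypothetical.

```r
# Unique genotypes including the clonal one (row names are illustrative)
samples <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    1, 0, 0, 1,
    1, 0, 1, 1,
    1, 1, 0, 1,
    0, 0, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S4", "S7", "S5", "Clonal"),
                  c("A", "B", "C", "D"))
)

# Entry [i, j] of the product counts genes mutated in i but not in j,
# so a zero means that genotype i is a subset of genotype j.
# Rows are unique, hence excluding the diagonal gives the strict relation.
closed <- (samples %*% t(1 - samples) == 0) & !diag(nrow(samples))

# Entry [i, j] is TRUE when some k satisfies i -> k -> j
two_step <- (closed %*% closed) > 0

# Keep only the edges that are not implied by a two-step path
reduced <- closed & !two_step
print(which(reduced, arr.ind = TRUE))
```

On this example the 13 subset edges reduce to 6: the clonal genotype points to S1 and S2, both of which point to S4, which in turn points to S7 and S5.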

The resulting graph can be (badly) plotted using basic igraph: