# Introduction to SEPIRA

### 2021-10-26

SEPIRA (Systems EPigenomics Inference of Regulatory Activity) is a novel algorithm which estimates transcription factor activity in any given sample from its genome-wide mRNA expression or DNA methylation profile(Chen Y, Widschwendter M, and Teschendorff AE 2017). It encompasses two main steps:

1. Construction of a tissue-specific transcription factor regulatory network, consisting of transcription factors that are more highly expressed in the user-specified tissue type (the ‘tissue type of interest’) compared to other tissue types, plus an associated set of high-confidence downstream targets.
2. Estimation of transcription factor activity in this network, in any given dataset consisting of gene expression or promoter DNA methylation profiles.

To infer the network, we use the large GTEx RNA-seq data set encompassing 8555 samples from about 30 tissue types. With this inferred network, we can then estimate transcription factor activity in samples from other data sets. Due to the large size of the GTEx data set we did not include it in our package. However, in order to gain an appreciation for the SEPIRA algorithm we show the results obtained by applying it to this full dataset. Then we validate it using an RNA-seq data set from Protein Atlas and a DNA methylation data set from the Stem-cell matrix compendium-2 (SCM2).

### How to use

#### Inferring tissue-specific network

net.o <- sepiraInfNet(data=data.m, tissue=colnames(data.m), toi = "Lung", cft = "Blood",
TFs = TFeid, sdth = 0.25, sigth = 0.05, pcorth = 0.2, degth = c(0.05, 0.05),
lfcth = c(log2(1.5), 0), minNtgts = 5, ncores = 1)
## Note: data.m should be a normalized gene expression data set.
## Parameters used here are not recommended. See "?sepiraInfNet" for more info.

### Validation of correctness of tissue-specific networks

It is important to validate the regulatory network in another dataset dataVAL.m to see if the predicted targets of the transcription factors in the network do indeed faithfully measure upstream transcription factor (TF) activity. In the paper we used an independent RNA-seq dataset from the ProteinAtlas project for this purpose (see figure below). If you run sepiraRegAct on your own validation dataset, you could generate another boxplots to see if the estimated TF activity scores differ.

# estimating transcription factor activity in a validation dataset
TFact_val.m <- sepiraRegAct(dataVAL.m,type="exp",regnet=net.o\$netTOI,norm="z",ncores=1)

For both datasets, the top ranked tissue is lung, as required.

### Estimating TF activity from DNA methylation

It is also possible to estimate transcription factor activity from promoter DNA methylation profiles. For example, we verified the validity of the infered lung-specific network using a subset of the Stem-Cell Matrix compendium (SCM2) Illumina 450k DNAm dataset consisting of 479,328 probes (after quality control) and 153 samples. When assigning a DNAm value to a gene, we used the method used in (Jiao Y, Widschwendter M, and Teschendorff AE 2014) which first assigns to genes the average DNAm of probes mapping to within 200bp of TSSs (TSS200), or probes mapping to the 1st exon if no probes are mapped to TSS200, or TSS1500 if no probes mapping to the 1st exon. However, there is a little difference when you run sepiraRegAct on DNA methylation data set, promoter DNAm values must be used, in which case the binding profile is multiplied by -1 before estimation, since normally a methylated promoter is associated with gene silencing. In the paper we estimated TF activities from SCM2 DNAm dataset, as expected, their activity levels are also significantly higher in lung tissue than in other tissues (one-tailed Wilcoxon test).

## Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] SEPIRA_1.14.0
##
## loaded via a namespace (and not attached):
##  [1] corpcor_1.6.10  digest_0.6.28   R6_2.5.1        jsonlite_1.7.2
##  [5] magrittr_2.0.1  evaluate_0.14   highr_0.9       stringi_1.7.5
##  [9] rlang_0.4.12    jquerylib_0.1.4 limma_3.50.0    bslib_0.3.1
## [13] rmarkdown_2.11  tools_4.1.1     stringr_1.4.0   prettydoc_0.4.1
## [17] parallel_4.1.1  xfun_0.27       yaml_2.2.1      fastmap_1.1.0
## [21] compiler_4.1.1  htmltools_0.5.2 knitr_1.36      sass_0.4.0

## References

Chen Y, Widschwendter M, and Teschendorff AE. 2017. “Systems-Epigenomics Inference of Transcription Factor Activity Implicates Aryl-Hydrocarbon-Receptor Inactivation as a Key Event in Lung Cancer Development.” Genome Biol 18: 236.

Jiao Y, Widschwendter M, and Teschendorff AE. 2014. “A Systems-Level Integrative Framework for Genome-Wide Dna Methylation and Gene Expression Data Identifies Differential Gene Expression Modules Under Epigenetic Control.” Bioinformatics 30: 2360–6.