FindIT2 1.6.0
FindIT2
FindIT2
is available on Bioconductor repository for
packages, you can install it by:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("FindIT2")
# Check that you have a valid Bioconductor installation
BiocManager::valid()
citation("FindIT2")
#> To cite FindIT2 in publications use: Guandong Shang(2022)
#>
#> Shang, G.-D., Xu, Z.-G., Wan, M.-C., Wang, F.-X. & Wang, J.-W.
#> FindIT2: an R/Bioconductor package to identify influential
#> transcription factor and targets based on multi-omics data. BMC
#> Genomics 23, 272 (2022)
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data},
#> author = {Guandong Shang and Zhougeng Xu and Muchun Wan and Fuxiang Wang and Jiawei Wang},
#> journal = {BMC Genomics},
#> year = {2022},
#> volume = {23},
#> number = {S1},
#> pages = {272},
#> url = {https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08506-8},
#> doi = {10.1186/s12864-022-08506-8},
#> }
I benefited a lot from X. Shirley Liu lab’s tools. The integrate_ChIP_RNA
model
borrow the idea from BETA(Wang et al. 2013) and cistromeGO
(Li et al. 2019). The calcRP
model borrow the idea from regulation
potential(Wang et al. 2016). And the FindIT_regionRP
model borrow idea from
lisa(Qin et al. 2020).
I also want to thanks the Allen Lynch in Liu lab, it is please to talk with him
on the github about lisa.
I want to thanks for the memberships in our lab. Many ideas in this packages appeared when I talk with them.
The origin name of FindIT2 is MPMG(Multi Peak Multi Gene) :), which means this package origin purpose is to do mutli-peak-multi-gene annotation. But as the diversity of analysis increase, it gradually extend its function and rename into FindIT2(Find influential TF and Target). But the latter function are still build on the MPMG. Now, it have five module:
And there are also some other useful function like integrate different source information, calculate jaccard similarity for your TF. I will introduce all these function in below manual. And for each part, I will also show the file type you may need prepare, which can help you prepare the correct file format.
The ChIP and ATAC datasets in this vignettes are from (Wang et al. 2020). For the speed, I only use the data in chrosome 5.
# load packages
# If you want to run this manual, please check you have install below packages.
library(FindIT2)
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
library(TxDb.Athaliana.BioMart.plantsmart28)
library(SummarizedExperiment)
library(dplyr)
library(ggplot2)
# because of the fa I use, I change the seqlevels of Txdb to make the chrosome levels consistent
Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- c(paste0("Chr", 1:5), "M", "C")
all_geneSet <- genes(Txdb)
Because of the storage restriction, the data used here is a small data set, which may not show the deatiled information for result. The user can read the FindIT2 paper and related github repo to see more detailed information and practical example.
The multi-peak multi-gene annotation(mmPeakAnno) is the basic of this package. Most function will use the result of mmPeakAnno. So I explain them first.
The object you may need
FindIT2 provides loadPeakFile
to load peak and store in GRanges
object.
Meanwhile, it also rename one of your GRange column name into feature_id
. The
feature_id
is the most important column in FindIT2, which will be used as a
link to join information from different source. The feature_id
column
represents your peak name, which is often the forth column in bed file and the
first column in GRange metadata column . If you have a GRange without
feature_id
column, you can rename your first metadata column or just add a
column named feature_id
like below
# when you make sure your first metadata column is peak name
colnames(mcols(yourGR))[1] <- "feature_id"
# or you just add a column
yourGR$feature_id <- paste0("peak_", seq_len(length(yourGR)))
you can see the detailed Txdb description in Making and Utilizing TxDb Objects
Here I take the ChIP-Seq data as example.
# load the test ChIP peak bed
ChIP_peak_path <- system.file("extdata", "ChIP.bed.gz", package = "FindIT2")
ChIP_peak_GR <- loadPeakFile(ChIP_peak_path)
# you can see feature_id is in your first column of metadata
ChIP_peak_GR
#> GRanges object with 4288 ranges and 2 metadata columns:
#> seqnames ranges strand | feature_id score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] Chr5 6236-6508 * | peak_14125 27
#> [2] Chr5 7627-8237 * | peak_14126 51
#> [3] Chr5 9730-10211 * | peak_14127 32
#> [4] Chr5 12693-12867 * | peak_14128 22
#> [5] Chr5 13168-14770 * | peak_14129 519
#> ... ... ... ... . ... ...
#> [4284] Chr5 26937822-26938526 * | peak_18408 445
#> [4285] Chr5 26939152-26939267 * | peak_18409 21
#> [4286] Chr5 26949581-26950335 * | peak_18410 263
#> [4287] Chr5 26952230-26952558 * | peak_18411 30
#> [4288] Chr5 26968877-26969091 * | peak_18412 26
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
The nearest mode is the most widely used annotation mode. It will link the peak
to its nearest gene, which means every peak only have one related gene. The
disadvantage is sometimes you can not link the correct gene for your peak because
of the complexity in the genomic feature. But this annotation mode is simple
enough and at most time, for most peak, the result is correct.
The skeleton function is distanceToNearest
from GenomicRanges
. I add some
modification so that it will output more human readable result.
mmAnno_nearestgene <- mm_nearestGene(peak_GR = ChIP_peak_GR,
Txdb = Txdb)
#> >> checking seqlevels match... 2023-04-25 17:05:47
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information... 2023-04-25 17:05:47
#> >> finding nearest gene and calculating distance... 2023-04-25 17:05:47
#> >> dealing with gene strand ... 2023-04-25 17:05:48
#> >> merging all info together ... 2023-04-25 17:05:48
#> >> done 2023-04-25 17:05:48
mmAnno_nearestgene
#> GRanges object with 4288 ranges and 4 metadata columns:
#> seqnames ranges strand | feature_id score gene_id
#> <Rle> <IRanges> <Rle> | <character> <numeric> <character>
#> [1] Chr5 6236-6508 * | peak_14125 27 AT5G01015
#> [2] Chr5 7627-8237 * | peak_14126 51 AT5G01020
#> [3] Chr5 9730-10211 * | peak_14127 32 AT5G01030
#> [4] Chr5 12693-12867 * | peak_14128 22 AT5G01030
#> [5] Chr5 13168-14770 * | peak_14129 519 AT5G01040
#> ... ... ... ... . ... ... ...
#> [4284] Chr5 26937822-26938526 * | peak_18408 445 AT5G67510
#> [4285] Chr5 26939152-26939267 * | peak_18409 21 AT5G67520
#> [4286] Chr5 26949581-26950335 * | peak_18410 263 AT5G67560
#> [4287] Chr5 26952230-26952558 * | peak_18411 30 AT5G67570
#> [4288] Chr5 26968877-26969091 * | peak_18412 26 AT5G67630
#> distanceToTSS
#> <numeric>
#> [1] -344
#> [2] 206
#> [3] 0
#> [4] 2823
#> [5] 1402
#> ... ...
#> [4284] 0
#> [4285] 0
#> [4286] 0
#> [4287] 0
#> [4288] 302
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
You can also use this the annotation result to check your TF type using
plot_annoDistance
. For most TF, the distance density plot maybe like below,
which means your TF is promoter-type. But for some TF, its density plot will be
different, like GATA4, MYOD1(Li et al. 2019).
plot_annoDistance(mmAnno = mmAnno_nearestgene)
Sometimes, you may interested in the number peaks of each gene linked. Or
reciprocally, how many genes of each peak link. you can use the
getAssocPairNumber
to see the deatailed number or summary.
getAssocPairNumber(mmAnno = mmAnno_nearestgene)
#> # A tibble: 2,757 × 2
#> gene_id peakNumber
#> <chr> <int>
#> 1 AT5G01015 1
#> 2 AT5G01020 1
#> 3 AT5G01030 2
#> 4 AT5G01040 2
#> 5 AT5G01050 2
#> 6 AT5G01070 1
#> 7 AT5G01090 1
#> 8 AT5G01100 1
#> 9 AT5G01170 1
#> 10 AT5G01175 3
#> # ℹ 2,747 more rows
getAssocPairNumber(mmAnno = mmAnno_nearestgene,
output_summary = TRUE)
#> # A tibble: 8 × 2
#> N gene_freq
#> <fct> <int>
#> 1 1 1793
#> 2 2 606
#> 3 3 229
#> 4 4 75
#> 5 5 38
#> 6 6 9
#> 7 7 4
#> 8 >=8 3
# you can see all peak's related gene number is 1 because I use the nearest gene mode
getAssocPairNumber(mmAnno_nearestgene, output_type = "feature_id")
#> # A tibble: 4,288 × 2
#> feature_id geneNumber
#> <chr> <int>
#> 1 peak_14125 1
#> 2 peak_14126 1
#> 3 peak_14127 1
#> 4 peak_14128 1
#> 5 peak_14129 1
#> 6 peak_14130 1
#> 7 peak_14131 1
#> 8 peak_14132 1
#> 9 peak_14133 1
#> 10 peak_14134 1
#> # ℹ 4,278 more rows
getAssocPairNumber(mmAnno = mmAnno_nearestgene,
output_type = "feature_id",
output_summary = TRUE)
#> # A tibble: 1 × 2
#> N feature_freq
#> <fct> <int>
#> 1 1 4288
And if you want the summary plot, you can use the plot_peakGeneAlias_summary
function.
plot_peakGeneAlias_summary(mmAnno_nearestgene)
plot_peakGeneAlias_summary(mmAnno_nearestgene, output_type = "feature_id")
The mm_geneBound
function is designed for finding related peak for your
input_genes
.Because we do the nearest gene mode to annotate peak, once a peak
is linked by nearest gene, it will not be linked by another gene even if another
gene is very close to your peak. So this function is very useful when you want
to plot peak heatmap or volcano plot. When ploting these plot, you often have a
interesting gene set, and want to plot related peak. If we just filter gene id
in the nearest result,many of your interesting gene will not have related peak.
After all, each peak will be assigned once.
For mm_geneBound
, it will output realted peak for all your input_gene
as long
as your input_genes
in your Txdb. The model behind mm_geneBound
is simple, it
will do mm_nearestgene
first and scan nearest peak for these genes which do not
have related peak.
# The genes_Chr5 is all gene set in Chr5
genes_Chr5 <- names(all_geneSet[seqnames(all_geneSet) == "Chr5"])
# The genes_Chr5_notAnno is gene set which is not linked by peak
genes_Chr5_notAnno <- genes_Chr5[!genes_Chr5 %in% unique(mmAnno_nearestgene$gene_id)]
# The genes_Chr5_tAnno is gene set which is linked by peak
genes_Chr5_Anno <- unique(mmAnno_nearestgene$gene_id)
# mm_geneBound will tell you there 5 genes in your input_genes not be annotated
# and it will use the distanceToNearest to find nearest peak of these genes
mmAnno_geneBound <- mm_geneBound(peak_GR = ChIP_peak_GR,
Txdb = Txdb,
input_genes = c(genes_Chr5_Anno[1:5], genes_Chr5_notAnno[1:5]))
#> >> checking seqlevels match... 2023-04-25 17:05:51
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> >> using mm_nearestGene to annotate Peak... 2023-04-25 17:05:51
#> >> checking seqlevels match... 2023-04-25 17:05:51
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information... 2023-04-25 17:05:51
#> >> finding nearest gene and calculating distance... 2023-04-25 17:05:51
#> >> dealing with gene strand ... 2023-04-25 17:05:52
#> >> merging all info together ... 2023-04-25 17:05:52
#> >> done 2023-04-25 17:05:52
#> It seems that there 5 genes have not been annotated by nearestGene mode
#> >> using distanceToNearest to find nearest peak of these genes... 2023-04-25 17:05:52
#> >> merging all anno... 2023-04-25 17:05:52
#> >> done 2023-04-25 17:05:52
# all of your input_genes have related peaks
mmAnno_geneBound
#> # A tibble: 13 × 3
#> feature_id gene_id distanceToTSS_abs
#> <chr> <chr> <dbl>
#> 1 peak_14125 AT5G01015 344
#> 2 peak_14126 AT5G01020 206
#> 3 peak_14127 AT5G01030 0
#> 4 peak_14128 AT5G01030 2823
#> 5 peak_14129 AT5G01040 1402
#> 6 peak_14130 AT5G01040 0
#> 7 peak_14131 AT5G01050 571
#> 8 peak_14132 AT5G01050 94
#> 9 peak_14125 AT5G01010 1174
#> 10 peak_14132 AT5G01060 2022
#> 11 peak_14133 AT5G01075 949
#> 12 peak_14134 AT5G01080 2339
#> 13 peak_14135 AT5G01110 6623