1 Basics

1.1 Install FindIT2

FindIT2 is available on Bioconductor repository for packages, you can install it by:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("FindIT2")

# Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Citation

citation("FindIT2")
#> To cite FindIT2 in publications use: Guandong Shang(2022)
#> 
#>   Shang, G.-D., Xu, Z.-G., Wan, M.-C., Wang, F.-X. & Wang, J.-W.
#>   FindIT2: an R/Bioconductor package to identify influential
#>   transcription factor and targets based on multi-omics data.  BMC
#>   Genomics 23, 272 (2022)
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data},
#>     author = {Guandong Shang and Zhougeng Xu and Muchun Wan and Fuxiang Wang and Jiawei Wang},
#>     journal = {BMC Genomics},
#>     year = {2022},
#>     volume = {23},
#>     number = {S1},
#>     pages = {272},
#>     url = {https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08506-8},
#>     doi = {10.1186/s12864-022-08506-8},
#>   }

1.3 Acknowledgments

I benefited a lot from X. Shirley Liu lab’s tools. The integrate_ChIP_RNA model borrow the idea from BETA(Wang et al. 2013) and cistromeGO (Li et al. 2019). The calcRP model borrow the idea from regulation potential(Wang et al. 2016). And the FindIT_regionRP model borrow idea from lisa(Qin et al. 2020). I also want to thanks the Allen Lynch in Liu lab, it is please to talk with him on the github about lisa.

I want to thanks for the memberships in our lab. Many ideas in this packages appeared when I talk with them.

1.4 Introduction

The origin name of FindIT2 is MPMG(Multi Peak Multi Gene) :), which means this package origin purpose is to do mutli-peak-multi-gene annotation. But as the diversity of analysis increase, it gradually extend its function and rename into FindIT2(Find influential TF and Target). But the latter function are still build on the MPMG. Now, it have five module:

  • Multi-peak multi-gene annotaion(mmPeakAnno module)
  • Calculate regulation potential(calcRP module)
  • Find influential Target based on ChIP-Seq and RNA-Seq data(Find influential Target module)
  • Find influential TF based on different input(Find influential TF module)
  • Calculate peak-gene or peak-peak correlation(peakGeneCor module)

And there are also some other useful function like integrate different source information, calculate jaccard similarity for your TF. I will introduce all these function in below manual. And for each part, I will also show the file type you may need prepare, which can help you prepare the correct file format.

The ChIP and ATAC datasets in this vignettes are from (Wang et al. 2020). For the speed, I only use the data in chrosome 5.

# load packages
# If you want to run this manual, please check you have install below packages.
library(FindIT2)
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
library(TxDb.Athaliana.BioMart.plantsmart28)
library(SummarizedExperiment)

library(dplyr)
library(ggplot2)

# because of the fa I use, I change the seqlevels of Txdb to make the chrosome levels consistent
Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- c(paste0("Chr", 1:5), "M", "C")

all_geneSet <- genes(Txdb)

Because of the storage restriction, the data used here is a small data set, which may not show the deatiled information for result. The user can read the FindIT2 paper and related github repo to see more detailed information and practical example.

2 Multi-peak multi-gene annotation

The multi-peak multi-gene annotation(mmPeakAnno) is the basic of this package. Most function will use the result of mmPeakAnno. So I explain them first.

The object you may need

  • peak_GR: a GRange object with a column named feature_id

FindIT2 provides loadPeakFile to load peak and store in GRanges object. Meanwhile, it also rename one of your GRange column name into feature_id. The feature_id is the most important column in FindIT2, which will be used as a link to join information from different source. The feature_id column represents your peak name, which is often the forth column in bed file and the first column in GRange metadata column . If you have a GRange without feature_id column, you can rename your first metadata column or just add a column named feature_id like below

# when you make sure your first metadata column is peak name
colnames(mcols(yourGR))[1] <- "feature_id"

# or you just add a column
yourGR$feature_id <- paste0("peak_", seq_len(length(yourGR)))
  • Txdb: a Txdb object that manages genomic locations and the relationships between genomic feature

you can see the detailed Txdb description in Making and Utilizing TxDb Objects

  • input_genes: gene id

Here I take the ChIP-Seq data as example.

# load the test ChIP peak bed
ChIP_peak_path <- system.file("extdata", "ChIP.bed.gz", package = "FindIT2")
ChIP_peak_GR <- loadPeakFile(ChIP_peak_path)

# you can see feature_id is in your first column of metadata
ChIP_peak_GR
#> GRanges object with 4288 ranges and 2 metadata columns:
#>          seqnames            ranges strand |  feature_id     score
#>             <Rle>         <IRanges>  <Rle> | <character> <numeric>
#>      [1]     Chr5         6236-6508      * |  peak_14125        27
#>      [2]     Chr5         7627-8237      * |  peak_14126        51
#>      [3]     Chr5        9730-10211      * |  peak_14127        32
#>      [4]     Chr5       12693-12867      * |  peak_14128        22
#>      [5]     Chr5       13168-14770      * |  peak_14129       519
#>      ...      ...               ...    ... .         ...       ...
#>   [4284]     Chr5 26937822-26938526      * |  peak_18408       445
#>   [4285]     Chr5 26939152-26939267      * |  peak_18409        21
#>   [4286]     Chr5 26949581-26950335      * |  peak_18410       263
#>   [4287]     Chr5 26952230-26952558      * |  peak_18411        30
#>   [4288]     Chr5 26968877-26969091      * |  peak_18412        26
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.1 annotate peak using nearest mode

The nearest mode is the most widely used annotation mode. It will link the peak to its nearest gene, which means every peak only have one related gene. The disadvantage is sometimes you can not link the correct gene for your peak because of the complexity in the genomic feature. But this annotation mode is simple enough and at most time, for most peak, the result is correct. The skeleton function is distanceToNearest from GenomicRanges. I add some modification so that it will output more human readable result.

mmAnno_nearestgene <- mm_nearestGene(peak_GR = ChIP_peak_GR,
                                     Txdb = Txdb)
#> >> checking seqlevels match...       2023-04-25 17:05:47
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information...        2023-04-25 17:05:47
#> >> finding nearest gene and calculating distance...      2023-04-25 17:05:47
#> >> dealing with gene strand ...      2023-04-25 17:05:48
#> >> merging all info together ...     2023-04-25 17:05:48
#> >> done      2023-04-25 17:05:48

mmAnno_nearestgene
#> GRanges object with 4288 ranges and 4 metadata columns:
#>          seqnames            ranges strand |  feature_id     score     gene_id
#>             <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
#>      [1]     Chr5         6236-6508      * |  peak_14125        27   AT5G01015
#>      [2]     Chr5         7627-8237      * |  peak_14126        51   AT5G01020
#>      [3]     Chr5        9730-10211      * |  peak_14127        32   AT5G01030
#>      [4]     Chr5       12693-12867      * |  peak_14128        22   AT5G01030
#>      [5]     Chr5       13168-14770      * |  peak_14129       519   AT5G01040
#>      ...      ...               ...    ... .         ...       ...         ...
#>   [4284]     Chr5 26937822-26938526      * |  peak_18408       445   AT5G67510
#>   [4285]     Chr5 26939152-26939267      * |  peak_18409        21   AT5G67520
#>   [4286]     Chr5 26949581-26950335      * |  peak_18410       263   AT5G67560
#>   [4287]     Chr5 26952230-26952558      * |  peak_18411        30   AT5G67570
#>   [4288]     Chr5 26968877-26969091      * |  peak_18412        26   AT5G67630
#>          distanceToTSS
#>              <numeric>
#>      [1]          -344
#>      [2]           206
#>      [3]             0
#>      [4]          2823
#>      [5]          1402
#>      ...           ...
#>   [4284]             0
#>   [4285]             0
#>   [4286]             0
#>   [4287]             0
#>   [4288]           302
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

You can also use this the annotation result to check your TF type using plot_annoDistance. For most TF, the distance density plot maybe like below, which means your TF is promoter-type. But for some TF, its density plot will be different, like GATA4, MYOD1(Li et al. 2019).

plot_annoDistance(mmAnno = mmAnno_nearestgene)

Sometimes, you may interested in the number peaks of each gene linked. Or reciprocally, how many genes of each peak link. you can use the getAssocPairNumber to see the deatailed number or summary.

getAssocPairNumber(mmAnno = mmAnno_nearestgene)
#> # A tibble: 2,757 × 2
#>    gene_id   peakNumber
#>    <chr>          <int>
#>  1 AT5G01015          1
#>  2 AT5G01020          1
#>  3 AT5G01030          2
#>  4 AT5G01040          2
#>  5 AT5G01050          2
#>  6 AT5G01070          1
#>  7 AT5G01090          1
#>  8 AT5G01100          1
#>  9 AT5G01170          1
#> 10 AT5G01175          3
#> # ℹ 2,747 more rows

getAssocPairNumber(mmAnno = mmAnno_nearestgene,
                   output_summary = TRUE)
#> # A tibble: 8 × 2
#>   N     gene_freq
#>   <fct>     <int>
#> 1 1          1793
#> 2 2           606
#> 3 3           229
#> 4 4            75
#> 5 5            38
#> 6 6             9
#> 7 7             4
#> 8 >=8           3

# you can see all peak's related gene number is 1 because I use the nearest gene mode
getAssocPairNumber(mmAnno_nearestgene, output_type = "feature_id")
#> # A tibble: 4,288 × 2
#>    feature_id geneNumber
#>    <chr>           <int>
#>  1 peak_14125          1
#>  2 peak_14126          1
#>  3 peak_14127          1
#>  4 peak_14128          1
#>  5 peak_14129          1
#>  6 peak_14130          1
#>  7 peak_14131          1
#>  8 peak_14132          1
#>  9 peak_14133          1
#> 10 peak_14134          1
#> # ℹ 4,278 more rows

getAssocPairNumber(mmAnno = mmAnno_nearestgene,
                   output_type = "feature_id",
                   output_summary = TRUE)
#> # A tibble: 1 × 2
#>   N     feature_freq
#>   <fct>        <int>
#> 1 1             4288

And if you want the summary plot, you can use the plot_peakGeneAlias_summary function.

plot_peakGeneAlias_summary(mmAnno_nearestgene)

plot_peakGeneAlias_summary(mmAnno_nearestgene, output_type = "feature_id")

2.2 find realted peak using gene Bound mode

The mm_geneBound function is designed for finding related peak for your input_genes.Because we do the nearest gene mode to annotate peak, once a peak is linked by nearest gene, it will not be linked by another gene even if another gene is very close to your peak. So this function is very useful when you want to plot peak heatmap or volcano plot. When ploting these plot, you often have a interesting gene set, and want to plot related peak. If we just filter gene id in the nearest result,many of your interesting gene will not have related peak. After all, each peak will be assigned once.

For mm_geneBound, it will output realted peak for all your input_gene as long as your input_genes in your Txdb. The model behind mm_geneBound is simple, it will do mm_nearestgene first and scan nearest peak for these genes which do not have related peak.

# The genes_Chr5 is all gene set in Chr5
genes_Chr5 <- names(all_geneSet[seqnames(all_geneSet) == "Chr5"])

# The genes_Chr5_notAnno is gene set which is not linked by peak
genes_Chr5_notAnno <- genes_Chr5[!genes_Chr5 %in% unique(mmAnno_nearestgene$gene_id)]

# The genes_Chr5_tAnno is gene set which is linked by peak
genes_Chr5_Anno <- unique(mmAnno_nearestgene$gene_id)

# mm_geneBound will tell you there 5 genes in your input_genes not be annotated
# and it will use the distanceToNearest to find nearest peak of these genes
mmAnno_geneBound <- mm_geneBound(peak_GR = ChIP_peak_GR,
                                 Txdb = Txdb,
                                 input_genes = c(genes_Chr5_Anno[1:5], genes_Chr5_notAnno[1:5]))
#> >> checking seqlevels match...       2023-04-25 17:05:51
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> >> using mm_nearestGene to annotate Peak...      2023-04-25 17:05:51
#> >> checking seqlevels match...       2023-04-25 17:05:51
#> >> your peak_GR seqlevel:Chr5...
#> >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 M C...
#> Good, your Chrs in peak_GR is all in Txdb
#> ------------
#> annotating Peak using nearest gene mode begins
#> >> preparing gene features information...        2023-04-25 17:05:51
#> >> finding nearest gene and calculating distance...      2023-04-25 17:05:51
#> >> dealing with gene strand ...      2023-04-25 17:05:52
#> >> merging all info together ...     2023-04-25 17:05:52
#> >> done      2023-04-25 17:05:52
#> It seems that there 5 genes have not been annotated by nearestGene mode
#> >> using distanceToNearest to find nearest peak of these genes...        2023-04-25 17:05:52
#> >> merging all anno...       2023-04-25 17:05:52
#> >> done      2023-04-25 17:05:52

# all of your input_genes have related peaks
mmAnno_geneBound
#> # A tibble: 13 × 3
#>    feature_id gene_id   distanceToTSS_abs
#>    <chr>      <chr>                 <dbl>
#>  1 peak_14125 AT5G01015               344
#>  2 peak_14126 AT5G01020               206
#>  3 peak_14127 AT5G01030                 0
#>  4 peak_14128 AT5G01030              2823
#>  5 peak_14129 AT5G01040              1402
#>  6 peak_14130 AT5G01040                 0
#>  7 peak_14131 AT5G01050               571
#>  8 peak_14132 AT5G01050                94
#>  9 peak_14125 AT5G01010              1174
#> 10 peak_14132 AT5G01060              2022
#> 11 peak_14133 AT5G01075               949
#> 12 peak_14134 AT5G01080              2339
#> 13 peak_14135 AT5G01110              6623