R version: R version 4.0.3 (2020-10-10)
Bioconductor version: 3.12
Package version: 1.4.0
Biological information encoded in the DNA sequence is often organized into independent modules or clusters. For instance, the eukaryotic system of expression relies on combinations of homotypic or heterotypic transcription factors (TFs) which play an important role in activating and repressing target genes. Identifying clusters of genomic features is a frequent task and includes application such as genomic regions enriched for the presence of a combination of transcription factor binding sites or enriched for the presence of mutations often referred as regions with increase mutational hotspots.
fcScan is designed to detect clusters of genomic features, based on user defined search criteria. Such criteria include:
fcScan is designed to handle large number of genomic features (i.e data generated by High Throughput Sequencing).
fcScan depends on the following packages:
Currently, fcScan has one main function, the
getCluster function. Additional
functionality will be added for future releases including cross-species
identification of orthologous clusters.
The input for
getCluster is given through the parameter
This function accepts input in data frame, GRanges as well as vector of files
in BED and VCF (compressed or not) formats. BED and VCF files are loaded
VariantAnnotation respectively. There is no
limit to the number of files the user can define.
When input is data frame or GRanges objects are given, they should contain the following “named” 5 columns:
The seqnames, start and end columns define the name of the chromosome, the starting and ending position of the feature in the chromosome, respectively. The strand column defines the strand. The site column contains the ID of the site and will be used for clustering. The start and end columns are numeric while the remaining columns are characters.
Note: when input is data frame, the data is considered
as in BED format.
Window size is set using
w. It defines the
maximum size of
The clustering condition
c defines the number and name of genomic features to
search for based on features names defined in the
c = c("a" = 1, "b" = 2)
This searches for clusters with 3 sites, One a site and two b sites.
Another way of writing the condition, only if input is a vector to file paths,
is the following
x = ("a.bed", "b.bed"), c = c(1,2)
Given 2 files, a.bed and b.bed, this condition states that the user is looking for clusters made from 1 “a” site and 2 “b” sites. In this case, the order of sites defined in
c is relative to the order of files.
When input is a data frame or GRanges object (instead of files), the
user needs to explicitly define the site names along with the desired
number relative to each site.
For instance, giving the condition as
c = c(1,2) for a data frame or GRanges
is not allowed.
x = dataframe, c = c("a" = 1, "b" = 2) where
b are valid site
site column in the dataframe/GRanges
Users can exclude clusters containing a specific site(s).
This is done by specifying zero
0 in the condition as
c = c("a" = 1, "b" = 2, "c" = 0). In this case, any
c site will be excluded even if it has 1
a and 2
By default, clustering will be performed on both strands and on all seqnames
unless specified by the user using the
seqnames arguments to limit
the search on a specific strand and/or seqname.
Users can choose to cluster on one specific seqname
(seqnames = "chr1"), or
on several seqnames
(seqnames = c("chr1","chr3","chr4"))
seqnames is NULL) meaning that clustering on all seqnames
will be performed.
s, the values allowed are:
The gap/overlap between adjacent clusters, and not sites, can be controled
overlap is a positive integer, adjacent clusters will be separated by
a minimum of the given value.
overlap is negative, adjacent clusters will be allowed to overlap by
a maximum of the given value. (Default is set to 0)
greedy allows the user to control the number of genomic features found
greedy = FALSE,
getCluster will build clusters having the required
window size and will label TRUE the ones that contain the exact number
of sites provided in the condition argument.
Clusters having the user defined window size but not satisfying
the condition will be labelled as FALSE.
greedy = TRUE, additional sites defined in condition will be added
to the cluster as long as the cluster size is below the defined window size.
(Default is set to FALSE)
order option defines the desired order of the sites within identified
clusters. For instance,
order = c("a","b","b")
will search for clusters containing 1
a and 2
b sites and checks if
they are in the specified order. Clusters with 1
a and 2
b sites that
do not contain the specified order will be rejected.
When greedy is set to
TRUE, order can be satisfied if a subcluster contains
the desired order. For example if a cluster has
a, a, b, b, b sites, it
satisfies the required order (a, b, b) and therefore will be considered as
a correct cluster . (Default is set to NULL)
site_orientation option defines the orientation or strandness of sites
in the found clusters. This option cannot be used if
site_orientation should be specified for each site in
For instance, if
order = c("a","b","b"), we can define
for each site respectively as follow:
site_orientation = c("+","-","-").
The cluster will be correct if it satisfies the required order and sites
orientation. (Default is set to NULL)
site_overlap option defines the maximum or minimum distance allowed
between sites in the cluster.
site_overlap is a positive integer, sites within a cluster should have
minimum distance and above, then the cluster is considered
site_overlap is a negative integer, sites within a cluster should
max distance and below, then the cluster is considered
site_overlap is zero, distance between sites is not taken into
consideration. (Default is set to 0)
verbose option allows the printing of additional messages and the
list of clusters that failed for lack of correct combination of sites.
This option is used mainly for debugging purposes. (Default is set to FALSE)
The output of
getCluster is a GRanges object with fields:
The algorithm returns all clusters containing the correct count of
sites/features, unless verbose is set to
TRUE. If the combination,
overlap and order options are satisfied, the cluster is considered a
The status of a cluster can be either PASS, excludedSites, orderFail,
siteOrientation or siteOverlap
PASS is a cluster that satisfied the desired combination, overlap, order
and sites orientation.
orderFail is a cluster that had the required combination but did not
satisfy the required order of sites.
excludedSites is a cluster that had the required combination and order
but it has one or more sites to exclude.
siteOrientation is a cluster that had the required combination and order
but it has one or more sites with different orientation than requested.
siteOverlap is a cluster that meets all the requirements
but the distance between its sites is not respected.
NOTE: If the user is using
greedy = FALSE and
order contains values more
than in the condition parameter (
c), an error will be raised.
greedy = TRUE, then using
order with more values than the
condition parameter is allowed since the cluster may contain more
sites than the required
c condition as long as the window size is satisfied.
getCluster looks for desired genomic regions of interest like
transcription factor binding sites, within a window size
and specific condition.
This function accepts a data frame and GRanges object. getCluster also accepts
BED or VCF (or mix of both) files as input.
The output of
getCluster is a GRanges object that
contains the genomic coordinates (seqnames, ranges and strand) and
three metadata columns:
sites: contains clusters of sites that conforms with the condition
TRUE if the cluster found conform with the condition
order (if indicated in condition) and
if the cluster fails to conform with the condition or order.
TRUE. However, if
FALSE, status shows why the found cluster is not a
TRUE cluster. If the
order of sites is not respected in the found cluster, status would return
OrderFail. in Addition, if the cluster found contains non desired sites,
excludedSites. Moreover, if the sites orientation is not
respected in found cluster, status would return
Finally , if the distance between sites in the cluster found does not conform
to the condition specified by the user, status would return
In this example, we ask
getCluster to look for clusters that contains
one site “s1”, one site “s2” and zero “s3” sites. In addition, we requested
clusters to have sites in the order s1,s2 and having orientation
“+”,“+” respectively with minimum distance between sites of 2bp.
x1 = data.frame(seqnames = rep("chr1", times = 17), start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132), end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135), strand = c("+","+","+","+","+","+","+","+","+", "+","+","+","+","+","+","+","-"), site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2", "s2","s1","s2","s1","s1","s2")) clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0), greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"), site_overlap = 2, verbose = TRUE) #> 17 entries loaded #> Time difference of 0.3072245 secs clusters #> GRanges object with 9 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #> <Rle> <IRanges> <Rle> | <character> <logical> <character> #>  chr1 2-20 * | s3,s1,s2 FALSE excludedSites #>  chr1 11-30 * | s1,s2,s2 TRUE PASS #>  chr1 33-48 * | s2,s1 FALSE orderFail #>  chr1 48-68 * | s1,s2 TRUE PASS #>  chr1 88-105 * | s2,s2,s1 FALSE orderFail #>  chr1 95-113 * | s2,s1,s2 FALSE siteOverlap #>  chr1 100-120 * | s1,s2,s1 FALSE siteOverlap #>  chr1 108-120 * | s2,s1 FALSE orderFail #>  chr1 122-135 * | s1,s2 FALSE siteOrientation #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Another example but using GRanges as input:
in this example, we ask
getCluster to look for clusters that contains
s1 and two sites
s2 within a window size of 25 bp. Also,
we requested clusters to be searched as
suppressMessages(library(GenomicRanges)) x = GRanges( seqnames = Rle("chr1", 16), ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L, 60L,70L,87L,94L,99L,107L,113L,121L,132L), end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L, 113L,120L,130L,135L)), strand = Rle("+",16), site = c("s1","s2","s2","s1","s2","s1","s1","s2", "s1","s2","s2","s1","s2","s1","s1","s2")) clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+") #> 16 entries loaded #> Time difference of 0.1627619 secs clusters #> GRanges object with 2 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #> <Rle> <IRanges> <Rle> | <character> <logical> <character> #>  chr1 10-30 + | s1,s2,s2 TRUE PASS #>  chr1 87-105 + | s2,s2,s1 TRUE PASS #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
sessionInfo() #> R version 4.0.3 (2020-10-10) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Ubuntu 18.04.5 LTS #> #> Matrix products: default #> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so #> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so #> #> locale: #>  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #>  LC_TIME=en_US.UTF-8 LC_COLLATE=C #>  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #>  LC_PAPER=en_US.UTF-8 LC_NAME=C #>  LC_ADDRESS=C LC_TELEPHONE=C #>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #>  parallel stats4 stats graphics grDevices utils datasets #>  methods base #> #> other attached packages: #>  GenomicRanges_1.42.0 GenomeInfoDb_1.26.0 IRanges_2.24.0 #>  S4Vectors_0.28.0 BiocGenerics_0.36.0 fcScan_1.4.0 #>  BiocStyle_2.18.0 #> #> loaded via a namespace (and not attached): #>  MatrixGenerics_1.2.0 Biobase_2.50.0 #>  httr_1.4.2 bit64_4.0.5 #>  assertthat_0.2.1 askpass_1.1 #>  highr_0.8 BiocManager_1.30.10 #>  BiocFileCache_1.14.0 blob_1.2.1 #>  BSgenome_1.58.0 GenomeInfoDbData_1.2.4 #>  Rsamtools_2.6.0 yaml_2.2.1 #>  progress_1.2.2 pillar_1.4.6 #>  RSQLite_2.2.1 lattice_0.20-41 #>  glue_1.4.2 digest_0.6.27 #>  XVector_0.30.0 htmltools_0.5.0 #>  Matrix_1.2-18 plyr_1.8.6 #>  XML_3.99-0.5 pkgconfig_2.0.3 #>  biomaRt_2.46.0 bookdown_0.21 #>  zlibbioc_1.36.0 purrr_0.3.4 #>  BiocParallel_1.24.0 tibble_3.0.4 #>  openssl_1.4.3 generics_0.0.2 #>  ellipsis_0.3.1 SummarizedExperiment_1.20.0 #>  GenomicFeatures_1.42.0 magrittr_1.5 #>  crayon_1.3.4 memoise_1.1.0 #>  evaluate_0.14 xml2_1.3.2 #>  tools_4.0.3 prettyunits_1.1.1 #>  hms_0.5.3 lifecycle_0.2.0 #>  matrixStats_0.57.0 stringr_1.4.0 #>  DelayedArray_0.16.0 AnnotationDbi_1.52.0 #>  Biostrings_2.58.0 compiler_4.0.3 #>  rlang_0.4.8 grid_4.0.3 #>  RCurl_1.98-1.2 VariantAnnotation_1.36.0 #>  rappdirs_0.3.1 bitops_1.0-6 #>  rmarkdown_2.5 DBI_1.1.0 #>  curl_4.3 R6_2.4.1 #>  GenomicAlignments_1.26.0 knitr_1.30 #>  dplyr_1.0.2 rtracklayer_1.50.0 #>  bit_4.0.4 stringi_1.5.3 #>  Rcpp_1.0.5 vctrs_0.3.4 #>  dbplyr_1.4.4 tidyselect_1.1.0 #>  xfun_0.18