Contents

1 Processing sequencing Hi-C libraries with HiCool

The HiCool R/Bioconductor package provides an end-to-end interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files.

  1. The heavy lifting (fastq mapping, pairs parsing and pairs filtering) is performed by the underlying lightweight hicstuff python library (https://github.com/koszullab/hicstuff).
  2. Pairs filering is done using the approach described in Cournac et al., 2012 and implemented in hicstuff.
  3. cooler (https://github.com/open2c/cooler) library is used to parse pairs into a multi-resolution, balanced .mcool file. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing HiC-based data. The .(m)cool file format was developed by Abdennur and Mirny and published in 2019.
  4. Internally, all these external dependencies are automatically installed and managed in R by a basilisk environment.

The main processing function offered in this package is HiCool(). To process .fastq reads into .pairs & .mcool files, one needs to provide:

x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>', 
    r2 = '<PATH-TO-R2.fq.gz>', 
    restriction = '<RE1(,RE2)>', 
    resolutions = "<resolutions of interest>", 
    genome = '<GENOME_ID>'
)

Here is a concrete example of Hi-C data processing.

library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'), 
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'), 
    restriction = 'DpnII,HinfI', 
    resolutions = c(4000, 8000, 16000), 
    genome = 'R64-1-1', 
    output = './HiCool/'
)
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'create' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/HiCool/1.0.0/env' 'python=3.7.12' '--quiet' '-c' 'conda-forge' '-c' 'bioconda'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/HiCool/1.0.0/env' 'python=3.7.12'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/HiCool/1.0.0/env' '-c' 'conda-forge' '-c' 'bioconda' 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1'
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpEMOPmj/URLX2R]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /tmp/RtmpyfHZS5/Rbuild128f3d1337b733/HiCool/vignettes/HiCool/a1d2765d5bf4b_7833^mapped-R64-1-1^URLX2R.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/a1d2765d5bf4b_7833^mapped-R64-1-1^URLX2R.mcool 
#> resolution: 4000 
#> pairs file: ./HiCool//pairs/a1d2765d5bf4b_7833^mapped-R64-1-1^URLX2R.pairs 
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/a1d2765d5bf4b_7833^mapped-R64-1-1^URLX2R.log"
#> 
#> $args
#> $args$r1
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/a1d2765d5bf4b_7833"
#> 
#> $args$r2
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/a1d276a680060_7834"
#> 
#> $args$genome
#> [1] "/tmp/RtmpEMOPmj/R64-1-1"
#> 
#> $args$resolutions
#> [1] "4000"
#> 
#> $args$resolutions
#> [1] "8000"
#> 
#> $args$resolutions
#> [1] "16000"
#> 
#> $args$restriction
#> [1] "DpnII,HinfI"
#> 
#> $args$iterative
#> [1] TRUE
#> 
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $args$threads
#> [1] 1
#> 
#> $args$output
#> [1] "./HiCool/"
#> 
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $args$keep_bam
#> [1] FALSE
#> 
#> $args$scratch
#> [1] "/tmp/RtmpEMOPmj"
#> 
#> $args$wd
#> [1] "/tmp/RtmpyfHZS5/Rbuild128f3d1337b733/HiCool/vignettes"
#> 
#> 
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#> 
#> $stats$nPairs
#> [1] 73993
#> 
#> $stats$nDangling
#> [1] 10027
#> 
#> $stats$nSelf
#> [1] 2205
#> 
#> $stats$nDumped
#> [1] 83
#> 
#> $stats$nFiltered
#> [1] 61678
#> 
#> $stats$nDups
#> [1] 719
#> 
#> $stats$nUnique
#> [1] 60959
#> 
#> $stats$threshold_uncut
#> [1] 7
#> 
#> $stats$threshold_self
#> [1] 7

2 Optional parameters

Extra optional arguments can be passed to the hicstuff workhorse library:

3 Output files

The important files generated by HiCool are the following:

The diagnosis plots illustrate how pairs were filtered during the processing, using a strategy described in Cournac et al., BMC Genomics 2012. The event_distance chart represents the frequency of ++, +-, -+ and -- pairs in the library, as a function of the number of restriction sites between each end of the pairs, and shows the inferred filtering threshold. The event_distribution chart indicates the proportion of each type of pairs (e.g. dangling, uncut, abnormal, …) and the total number of pairs retained (3D intra + 3D inter).

Notes:

4 System dependencies

Processing Hi-C sequencing libraries into .pairs and .mcool files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5)

All system dependencies are internally managed by basilisk. HiCool maintains a basilisk environment containing:

The first time HiCool() is executed, a fresh basilisk environment will be created and required dependencies automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

5 Session info

sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] HiContactsData_1.1.9 ExperimentHub_2.8.0  AnnotationHub_3.8.0 
#> [4] BiocFileCache_2.8.0  dbplyr_2.3.2         BiocGenerics_0.46.0 
#> [7] HiCool_1.0.0         HiCExperiment_1.0.0  BiocStyle_2.28.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.1.3                     bitops_1.0-7                 
#>   [3] rlang_1.1.0                   magrittr_2.0.3               
#>   [5] matrixStats_0.63.0            compiler_4.3.0               
#>   [7] RSQLite_2.3.1                 dir.expiry_1.8.0             
#>   [9] png_0.1-8                     vctrs_0.6.2                  
#>  [11] stringr_1.5.0                 pkgconfig_2.0.3              
#>  [13] crayon_1.5.2                  fastmap_1.1.1                
#>  [15] ellipsis_0.3.2                XVector_0.40.0               
#>  [17] rmdformats_1.0.4              utf8_1.2.3                   
#>  [19] promises_1.2.0.1              rmarkdown_2.21               
#>  [21] sessioninfo_1.2.2             tzdb_0.3.0                   
#>  [23] strawr_0.0.91                 purrr_1.0.1                  
#>  [25] bit_4.0.5                     xfun_0.39                    
#>  [27] zlibbioc_1.46.0               cachem_1.0.7                 
#>  [29] GenomeInfoDb_1.36.0           jsonlite_1.8.4               
#>  [31] blob_1.2.4                    later_1.3.0                  
#>  [33] rhdf5filters_1.12.0           DelayedArray_0.26.0          
#>  [35] Rhdf5lib_1.22.0               BiocParallel_1.34.0          
#>  [37] interactiveDisplayBase_1.38.0 parallel_4.3.0               
#>  [39] R6_2.5.1                      bslib_0.4.2                  
#>  [41] stringi_1.7.12                reticulate_1.28              
#>  [43] GenomicRanges_1.52.0          jquerylib_0.1.4              
#>  [45] Rcpp_1.0.10                   bookdown_0.33                
#>  [47] SummarizedExperiment_1.30.0   knitr_1.42                   
#>  [49] IRanges_2.34.0                httpuv_1.6.9                 
#>  [51] Matrix_1.5-4                  tidyselect_1.2.0             
#>  [53] yaml_2.3.7                    codetools_0.2-19             
#>  [55] curl_5.0.0                    lattice_0.21-8               
#>  [57] tibble_3.2.1                  withr_2.5.0                  
#>  [59] KEGGREST_1.40.0               InteractionSet_1.28.0        
#>  [61] Biobase_2.60.0                shiny_1.7.4                  
#>  [63] basilisk.utils_1.12.0         evaluate_0.20                
#>  [65] Biostrings_2.68.0             pillar_1.9.0                 
#>  [67] BiocManager_1.30.20           filelock_1.0.2               
#>  [69] MatrixGenerics_1.12.0         stats4_4.3.0                 
#>  [71] plotly_4.10.1                 generics_0.1.3               
#>  [73] rprojroot_2.0.3               vroom_1.6.1                  
#>  [75] RCurl_1.98-1.12               BiocVersion_3.17.1           
#>  [77] S4Vectors_0.38.0              ggplot2_3.4.2                
#>  [79] munsell_0.5.0                 scales_1.2.1                 
#>  [81] xtable_1.8-4                  glue_1.6.2                   
#>  [83] lazyeval_0.2.2                tools_4.3.0                  
#>  [85] BiocIO_1.10.0                 data.table_1.14.8            
#>  [87] rhdf5_2.44.0                  grid_4.3.0                   
#>  [89] tidyr_1.3.0                   crosstalk_1.2.0              
#>  [91] AnnotationDbi_1.62.0          colorspace_2.1-0             
#>  [93] GenomeInfoDbData_1.2.10       basilisk_1.12.0              
#>  [95] cli_3.6.1                     rappdirs_0.3.3               
#>  [97] fansi_1.0.4                   viridisLite_0.4.1            
#>  [99] dplyr_1.1.2                   gtable_0.3.3                 
#> [101] sass_0.4.5                    digest_0.6.31                
#> [103] htmlwidgets_1.6.2             memoise_2.0.1                
#> [105] htmltools_0.5.5               lifecycle_1.0.3              
#> [107] here_1.0.1                    httr_1.4.5                   
#> [109] mime_0.12                     bit64_4.0.5