Basepair-Resolution Analysis with BRGenomics

Straightforward tools for high-resolution genomics data

Mike DeBerardine

27 October 2020

Abstract

BRGenomics is designed to help users avoid code repetition by providing efficient and tested functions to accomplish common, discrete tasks in the analysis of high-throughput sequencing data. The included functions are geared toward analyzing basepair-resolution sequencing data, the properties of which are exploited to increase performance and user-friendliness. We leverage standard Bioconductor methods and classes to maximize compatibility with its rich ecoystem of bioinformatics tools, and we aim to make BRGenomics sufficient for most post-alignment data processing. Common data processing and analytical steps are turned into fast-running one-liners that can be simultaneously applied across numerous datasets. BRGenomics is fully-documented, and we aim it to be beginner-friendly.

Package

BRGenomics 1.2.0

1 Motivation

This package is designed to:

Replace the use of command-line utilities for most post-alignment processing, e.g. bedtools and deeptools
Be easy-to-use and easy-to-install, without requiring external dependencies, e.g. hitslib or the kent source utilities from the UCSC genome browser
Allow users to string together common analysis pipelines with simple, fast-running one-liners
Avoid code repetition by providing tested and validated code
Exploit the properties of basepair-resolution data to optimize performance and increase user-friendliness
Use process forking to make use of multicore processors
Maximize compatibility with Bioconductor’s rich ecosystem of analysis software, in addition to leveraging the traditional strengths of R in statistics and data visualization
Fully replace the bigWig R package

2 Features

Process and import bedGraph, bigWig, and bam files quickly and easily, with several pre-configured defaults for typical uses
Count and filter spike-in reads
Calculate spike-in normalization factors using several methods and options, including options for batch normalization
Count reads by regions of interest
Count reads at positions within regions of interest, at single-base resolution or in larger bins, and generate count matrices for heatmapping
Calculate bootstrapped signal (e.g. readcount) profiles with confidence intervals (i.e. meta-profiles)
Modify gene regions (e.g. extract promoters or genebody regions) using a single simple and straightforward function
Conveniently and efficiently call DESeq2 to calculate differential expression in a manner that is robust to global changes111 Avoid the default behavior of calculating genewise dispersion across all samples present, which is invalid if any experimental condition causes broad changes
- Use non-contiguous genes in DESeq2 analysis, e.g. to exclude of specific sites/peaks from the analysis (not usually supported by DESeq2)
- Efficiently generate results across a list of comparisons
Support for blacklisting throughout, and proper accounting of blacklisted sites in relevant calculations
Users interact with an intuitive and computationally efficient data structure (the “basepair resolution GRanges” object), which is already supported by a rich, user-friendly suite of tools that greatly simplify working with datasets and annotations

3 Coming Soon

Data processing:

Summarizing and plotting replicate correlations
Function to use random read sampling to assess if sequencing depth sufficient to stabilize arbitrary calculations (so a user can supply anonymous function to calculate things like rank expression, power analysis or differential expression by DESeq2, pausing indices, etc.)

Signal counting and analysis:

Two-stranded meta-profile calculations
Automated generation of a list of DESeq2 comparisons using all possible combinations; all possible permutations; or by defining a simple hierarchy of each-vs-one comparisons