BRGenomics 1.14.1

As an example, weâ€™ll plot PRO-seq signal in promoter-proximal regions by calculating the mean signal intensity at each base in the first 100 bases of a transcript.

First, calculate signal at each base for all promoter-proximal regions:

```
library(BRGenomics)
data("PROseq")
data("txs_dm6_chr4")
txs_pr <- promoters(txs_dm6_chr4, 0, 100)
```

```
countmatrix_pr <- getCountsByPositions(PROseq, txs_pr, ncores = 1)
dim(countmatrix_pr)
## [1] 339 100
dim(countmatrix_pr) == c(length(txs_pr), unique(width(txs_pr)))
## [1] TRUE TRUE
```

For each position (each column of the matrix), calculate the mean, and plot:

```
plot(x = 1:ncol(countmatrix_pr),
y = colMeans(countmatrix_pr),
type = "l",
xlab = "Distance to TSS (bp)",
ylab = "Mean PRO-seq Reads")
```

Â

One drawback of using the arithmetic mean is that means are not robust to outliers. In other words, the mean signal at any position is liable to be determined by a small number of highly influential points. This is especially problematic for high dynamic range data like PRO-seq.

Itâ€™s common to see this issue addressed through the use medians/quantiles in place of arithmetic means. However, observe what happens as we plot the median signal across our gene list using several different gene-filtering thresholds:

```
plot_meds <- function(sig_thresh) {
idx <- which(rowSums(countmatrix_pr) > sig_thresh)
plot(x = 1:ncol(countmatrix_pr),
y = apply(countmatrix_pr[idx, ], 2, median),
type = "l",
main = sprintf("Regions with >%s reads", sig_thresh),
xlab = "Distance to TSS (bp)",
ylab = "Median PRO-seq Reads")
}
par(mfrow = c(3, 2))
for (i in c(0, 30*2^(0:4))) {
plot_meds(i)
}
```

The paucity of transcription on Drosophila chromosome 4 makes this a somewhat extreme example, and you might find that the above situation is a non-issue