motif_pvalue {universalmotif}R Documentation

Motif P-value and scoring utility

Description

For calculating p-values/logodds scores for any number of motifs.

Usage

motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 6,
  progress = TRUE, BP = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

score

numeric Get a p-value for a motif from a logodds score.

pvalue

numeric Get a logodds score for a motif from a p-value.

bkg.probs

numeric, list If supplying individual background probabilities for each motif, a list. If missing, assumes a uniform background. Currently does not supported if use.freq > 1.

use.freq

numeric(1) By default uses the regular motif matrix; otherwise uses the corresponding multifreq matrix.

k

numeric(1) For speed, scores/p-values can be approximated after subsetting the motif every k columns. If k is a value equal or higher to the size of input motif(s), then the calculations are exact.

progress

logical(1) Show progress. Not recommended if BP = TRUE.

BP

logical(1) Allows the use of BiocParallel within motif_pvalue(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for exceptionally large jobs. Furthermore, the behaviour of progress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).

Details

Calculating p-values for motifs can be very computationally intensive. This is due to how p-values must be calculated: for a given score, all possible sequences which score equal or higher must be found, and the probability for each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences higher than a given score can be done very efficiently and quickly with a branch-and-bound algorithm, but as the motif length increases this calculation becomes impractical. To get around this, the p-value calculation can be approximated.

In order to calculate p-values for longer motifs, this function uses the approximation proposed by Hartmann et al. (2013), where the motif is subset, p-values calculated for the subsets, and finally combined for a total p-value. The smaller the size of the subsets, the faster the calculation; but also, the bigger the approximation. This can be controlled by setting k. In fact, for smaller motifs (< 13 positions) calculating exact p-values can be done individually in reasonable time by setting k = 12.

To calculate a score based on a given p-value, the function simply guesses different scores until it finds one which when used to calculate a p-value, returns a p-value reasonably close to the given p-value. For low p-values, this usually only takes a couple of guesses.

Note that the approximation is mostly on the conservative side. The higher the number of motif subsets, the worse the approximation. Furthermore, the speed of motif_pvalue() increases with a decreasing p-value.

Value

numeric A vector of scores/p-values.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Hartmann H, Guthöhrlein E, Siebert M, and S. Luehr J. Söding (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181–194.

Examples

data(examplemotif)

## p-value/score calculations are performed using the PWM version of the
## motif; these calculations do not work if any -Inf values are present
examplemotif["pseudocount"] <- 1
# or
examplemotif <- BiocGenerics::normalize(examplemotif)

## get a minimum score based on a p-value
motif_pvalue(examplemotif, pvalue = 0.001)

## get the probability of a particular sequence hit
motif_pvalue(examplemotif, score = 0)

## the calculations can be performed for multiple motifs
motif_pvalue(list(examplemotif, examplemotif), pvalue = c(0.001, 0.0001))


[Package universalmotif version 1.0.22 Index]