A test for when to use global normalization methods, such as quantile normalization.
quantro 1.34.0
Multi-sample normalization techniques such as quantile normalization Bolstad et al. (2003), Irizarry et al. (2003) have become a standard and essential part of analysis pipelines for high-throughput data. Although it was originally developed for gene expression microarrays, it is now used across many different high-throughput applications including genotyping arrays, DNA Methylation, RNA Sequencing (RNA-Seq) and Chromatin Immunoprecipitation Sequencing (ChIP-Seq). These techniques transform the original raw data to remove unwanted technical variation. However, quantile normalization and other global normalization methods rely on assumptions about the data generation process that are not appropriate in some context. Until now, it has been left to the researcher to check for the appropriateness of these assumptions.
Quantile normalization assumes that the statistical distribution of each sample is the same. Normalization is achieved by forcing the observed distributions to be the same and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference. This method has worked very well in practice but note that when the assumptions are not met, global changes in distribution, that may be of biological interest, will be wiped out and features that are not different across samples can be artificially induced. These types of assumptions are justified in many biomedical applications, for example in gene expression studies in which only a minority of genes are expected to be differentially expressed. However, if, for example, a substantially higher percentage of genes are expected to be expressed in only one group of samples, it may not be appropriate to use global adjustment methods.
The quantro
R-package can be used to test for global differences
between groups of distributions which asses whether global normalization
methods such as quantile normalization should be applied. Our method uses
the raw unprocessed high-throughput data to test for global differences in
the distributions across a set of groups. The main function quantro()
will perform two tests:
An ANOVA to test if the medians of the distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.
A test for global differences between the distributions across groups
which returns a test statistic called quantroStat
. This test
statistic is a ratio of two variances and is similar to the idea of ANOVA.
The main idea is to compare the variability of distributions within groups
relative to between groups. If the variability between groups is sufficiently
larger than the variability within groups, then this suggests global
adjustment methods may not be appropriate. As a default, we perform this test
on the median normalized data, but the user may change this option.
The R-package quantro can be installed from the Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("quantro")
After installation, the package can be loaded into R.
library(quantro)
flowSorted
Data ExampleTo explore how to use quantro()
, we use the
FlowSorted.DLPFC.450k
data package in Bioconductor
Jaffe and Kaminsky (2021).
The data in this package originally came from Guintivano, Aryee, and Kaminsky (2013).
This data set in FlowSorted.DLPFC.450k
contains raw data objects of 58
Illumina 450K DNA methylation microarrays, formatted as RGset
objects. The samples represent two different cellular populations of brain
tissues on the same 29 individuals extracted using flow sorting. For more
information on this data set, please see the FlowSorted.DLPFC.450k
User’s
Guide.For the purposes of this vignette, a MethylSet object from the
minfi
Bioconductor package Aryee et al. (2014) was created which is
a subset of the rows from the original FlowSorted.DLPFC.450k
data
set. This MethylSet
object is found in the /data folder and the script to
create the object is found in /inst.
Here we will explore the distributions of these two cellular populations of
brain tissue (NeuN_pos
and NeuN_neg
) and then test if there
are global differences in the distributions across groups. First, load the
MethylSet object (flowSorted
) and compute the Beta values using
the function getBeta()
in the minfi
Bioconductor package.
We use an offset of 100 as this is the default used by Illumina.
library(minfi)
data(flowSorted)
p <- getBeta(flowSorted, offset = 100)
pd <- pData(flowSorted)
dim(p)
## [1] 10000 58
head(pd)
## DataFrame with 6 rows and 11 columns
## Sample_Name SampleID CellType Sentrix.Barcode Sample.Section
## <character> <character> <character> <numeric> <character>
## 813_N 813_N 813 NeuN_pos 7766130090 R02C01
## 1740_N 1740_N 1740 NeuN_pos 7766130090 R02C02
## 1740_G 1740_G 1740 NeuN_neg 7766130090 R04C01
## 1228_G 1228_G 1228 NeuN_neg 7766130090 R04C02
## 813_G 813_G 813 NeuN_neg 7766130090 R06C01
## 1228_N 1228_N 1228 NeuN_pos 7766130090 R06C02
## diag sex ethnicity age PMI
## <character> <character> <character> <integer> <integer>
## 813_N Control Female Caucasian 30 14
## 1740_N Control Female African 13 17
## 1740_G Control Female African 13 17
## 1228_G Control Male Caucasian 47 13
## 813_G Control Female Caucasian 30 14
## 1228_N Control Male Caucasian 47 13
## BasePath
## <character>
## 813_N /dcs01/lieber/ajaffe..
## 1740_N /dcs01/lieber/ajaffe..
## 1740_G /dcs01/lieber/ajaffe..
## 1228_G /dcs01/lieber/ajaffe..
## 813_G /dcs01/lieber/ajaffe..
## 1228_N /dcs01/lieber/ajaffe..
quantro
contains two functions to view the distributions of the
samples of interest: matdensity()
and matboxplot()
. The function
matdensity()
computes the density for each sample (columns) and
uses the matplot()
function to plot all the densities.
matboxplot()
orders and colors the samples by a group level variable.
These two functions use the RColorBrewer
package and the brewer
palettes can be changed using the arguments brewer.n
and
brewer.name
.
The distributions of the two groups of cellular populations are shown here.
The NeuN_neg
samples are colored in green and the NeuN_pos
are
colored in red.
matdensity(p, groupFactor = pd$CellType, xlab = " ", ylab = "density",
main = "Beta Values", brewer.n = 8, brewer.name = "Dark2")
legend('top', c("NeuN_neg", "NeuN_pos"), col = c(1, 2), lty = 1, lwd = 3)