ABAData: gene expression data from Allen Brain Atlas to use with enrichment analysis package ABAEnrichment

Package: ABAData
Author: Steffi Grote
Date: October 06, 2016

Overview

This package provides the data used in the gene expression enrichment package ABAEnrichment. It contains three datasets on gene expression in adult and developing human brains which base on data provided by the Allen Brain Atlas project [1-4]. The data and its processing is described below. For usage of the data for gene expression enrichment analyses please refer to the ABAEnrichment vignette.

Overview of the datasets included in ABAData:

object description
dataset_adult microarray data from six adult donors
dataset_5_stages RNA-seq data from 42 donors grouped into five developmental stages (prenatal to adult)
dataset_dev_effect scores that describe the age effect on gene expression per gene and brain region

Data structure

All datasets in the package are represented in a data.frame with the following columns:

column description
hgnc_symbol HGNC-symbol
entrezgene Entrez-ID
ensembl_gene_id Ensembl-ID
structure brain region ID as used in the ontology from Allen Brain Atlas
signal normalized microarray data, RNA-seq data or developmental effect score
age_category developmental stage, see below

The column age_category indicates the developmental stage as follows:

age_category description
0 all developmental stages
1 prenatal
2 infant (0-2 yrs)
3 child (3-11 yrs)
4 adolescent (12-19 yrs)
5 adult (>19 yrs)

Get more information on brain region IDs

The column structure contains the brain region IDs used in the ontology from Allen Brain Atlas. However, also functions provided in the ABAEnrichment package (>= 1.3.4) can be used to retrieve the name and superstructures of a given brain region ID:

## load ABAEnrichment software package
require(ABAEnrichment)
## get name and superstructures of brain region 4679
get_name(4679)
get_superstructures(4679)
get_name(get_superstructures(4679))

Data acquisition and manipulation

Adult Human Brain

Data from Allen Human Brain Atlas

The microarray gene expression data downloaded from Allen Human Brain Atlas [3] contain expression data for 29176 genes (HGNC-symbols) from six adult human donors. 93% of known genes were hybridized with at least two probes. Gene expression was measured in a total of 414 brain regions and was normalized within and across donors as described in the Technical White Paper (March 2013 v.1) on http://human.brain-map.org/.

Data manipulation for dataset_adult

Expression data for genes obtained from multiple sources were merged, i.e. for each donor the values for one gene in one brain region were averaged (different probes and samples) and the data for the donors were pooled by taking an average for each gene in one brain region across all six donors. Entrez-ID and Ensembl-ID were added using biomaRt [5], and the gene set was reduced to protein coding genes (Ensembl v.69). The final dataset contains expression data for 15698 genes (Ensembl-IDs) in 414 brain regions:

## load package
require(ABAData)
## require averaged gene expression data (microarray) from adult human brain regions
data(dataset_adult)
## look at first lines
head(dataset_adult)
##   hgnc_symbol entrezgene ensembl_gene_id structure   signal age_category
## 1        A1BG          1 ENSG00000121410      4679 4.579497            5
## 2        A1BG          1 ENSG00000121410      4909 4.515927            5
## 3        A1BG          1 ENSG00000121410      4307 5.016984            5
## 4        A1BG          1 ENSG00000121410      4139 5.119118            5
## 5        A1BG          1 ENSG00000121410      4124 5.028861            5
## 6        A1BG          1 ENSG00000121410      4033 4.938708            5

Developing Human Brain

Data from BrainSpan Atlas of the Developing Human Brain

The RNA-seq dataset 'RNA-Seq Gencode v10 summarized to genes' was obtained from the BrainSpan Atlas of the Developing Human Brain, which contains expression data for 52376 genes (Ensembl-IDs) from 42 human donors of 31 different ages, 8 pcw to 40 yrs [4]. The downloaded dataset contained RPKM values assigned to genes and donors. A total of 26 brain regions were sampled, 10 of them in donors of five or less different ages. The remaining 16 brain regions were sampled in donors of 20 different ages or more. For details on the expression data see the documentation on http://brainspan.org/.

Data manipulation for dataset_5_stages and dataset_dev_effect

To increase the power in detecting developmental effects by using highly overlapping brain regions the dataset for the enrichment analysis was restricted to the 16 brain regions sampled in at least 20 ages. As for the dataset_adult the dataset was limited to protein coding genes leading to a set of 17259 genes (Ensembl-IDs).
For the dataset_5_stages the expression data were summarized in five major developmental stages (age_category 1-5, see above). The mean expression of all donors in a given brain region in a developmental stage is assigned to a gene:

## require averaged gene expression data (RNA-seq) for 5 age categories
data(dataset_5_stages)
## look at first lines
head(dataset_5_stages)
##   hgnc_symbol entrezgene ensembl_gene_id structure   signal age_category
## 1        A1BG          1 ENSG00000121410     10163 1.921838            1
## 2        A1BG          1 ENSG00000121410     10163 2.708304            2
## 3        A1BG          1 ENSG00000121410     10163 3.533027            3
## 4        A1BG          1 ENSG00000121410     10163 2.180276            4
## 5        A1BG          1 ENSG00000121410     10163 2.158442            5
## 6        A1BG          1 ENSG00000121410     10173 1.897469            1

The data.frame dataset_dev_effect instead of expression values contains developmental effect scores which have been computed based on the data of the Atlas of the Developing Human Brain [4]: expression values (RPKM) were log2-transformed and averaged per age across individuals. The age, which is given in post-conceptional weeks (pcw), months or years, was transformed to pcw, assuming a pregnancy of 38 weeks, a month to have 4 weeks for infants aged under 1 year, and 1 year to have 52 weeks for older individuals. For each gene and structure a linear model was fit where cumulative gene expression over time is predicted by the age in pcw. The developmental effect score is defined to be 1 - R2_adjusted for the model. Hence, higher values of that score come from lower R2 which indicate less uniform gene expression over time. Scores that result to be NAs given all expression values are 0, are set to be 0, consistent with a gene with constant expression. Again, the dataset_dev_effect has the same structure as the datasets above, but with the signal column indicating the developmental effect scores instead of gene expression:

## require developmental effect score for genes in brain regions
data(dataset_dev_effect)
## look at first lines
head(dataset_dev_effect)
##   hgnc_symbol entrezgene ensembl_gene_id structure    signal age_category
## 1        A1BG          1 ENSG00000121410     10163 0.1044852            0
## 2        A1BG          1 ENSG00000121410     10173 0.3091228            0
## 3        A1BG          1 ENSG00000121410     10185 0.2493307            0
## 4        A1BG          1 ENSG00000121410     10194 0.1867477            0
## 5        A1BG          1 ENSG00000121410     10209 0.1824356            0
## 6        A1BG          1 ENSG00000121410     10225 0.2200680            0

References

[1] Hawrylycz, M.J. et al. (2012) An anatomically comprehensive atlas of the adult human brain transcriptome, Nature 489: 391-399. [doi:10.1038/nature11405]

[2] Miller, J.A. et al. (2014) Transcriptional landscape of the prenatal human brain, Nature 508: 199-206. [doi:10.1038/nature13185]

[3] Allen Institute for Brain Science. Allen Human Brain Atlas (Internet). Available from: [http://human.brain-map.org/]

[4] Allen Institute for Brain Science. BrainSpan Atlas of the Developing Human Brain (Internet). Available from: [http://brainspan.org/]

[5] Durinck, S. et al. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt, Nature Protocols 4: 1184-1191