Contents

1 Introduction

NHGRI maintains and routinely updates a database of selected genome-wide association studies. This document describes R/Bioconductor facilities for working with contents of this database.

1.1 Installation

The package can be installed using Bioconductor’s package, with the sequence

1.2 Attachment and access to documentation

Once the package has been installed, use to obtain interactive access to all the facilities. After executing this command, use to obtain an overview. The current version of this vignette can always be accessed at www.bioconductor.org, or by suitably navigating the web pages generated with .

1.3 Using tidy methods – added August 2022

gwcat = get_cached_gwascat()
library(dplyr)
gwcat |> arrange(DATE)
## # A tibble: 539,522 × 38
##    DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
##    <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
##  1 2008-06-16             15761122 Klein RJ       2005-03-10 Science www.… Comp…
##  2 2008-06-16             16252231 Maraganore DM  2005-09-09 Am J H… www.… High…
##  3 2008-06-16             16648850 Arking DE      2006-04-30 Nat Ge… www.… A co…
##  4 2008-06-16             17052657 Fung HC        2006-09-28 Lancet… www.… Geno…
##  5 2008-06-16             17052657 Fung HC        2006-09-28 Lancet… www.… Geno…
##  6 2008-06-16             17052657 Fung HC        2006-09-28 Lancet… www.… Geno…
##  7 2008-06-16             17053108 Dewan A        2006-10-19 Science www.… HTRA…
##  8 2008-06-16             17068223 Duerr RH       2006-10-26 Science www.… A ge…
##  9 2008-06-16             17158188 Bierut LJ      2006-12-07 Hum Mo… www.… Nove…
## 10 2008-06-16             17158188 Bierut LJ      2006-12-07 Hum Mo… www.… Nove…
## # ℹ 539,512 more rows
## # ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
## # ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
## #   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <chr>,
## #   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
## #   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
## #   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

We can produce a GRanges in two forms. By default we get an mcols that has a small set of columns. Note that records that lack a CHR_POS value are omitted.
Records that have complicated CHR_POS values, including semicolons or " x " notation are kept, but only the first position is retained. The CHR_ID field may have complicated character values, these are not normalized, and are simply used as seqnames “as is”.

gg = gwcat |> as_GRanges()
## dropping 31648 records that have NA for CHR_POS
## 372 records have semicolon in CHR_POS; splitting and using first entry.
## 342 records have ' x ' in CHR_POS indicating multiple SNP effects, using first.
gg
## GRanges object with 507874 ranges and 4 metadata columns:
##            seqnames    ranges strand |  PUBMEDID       DATE
##               <Rle> <IRanges>  <Rle> | <numeric>     <Date>
##        [1]        9 121277593      * |  32888493 2020-09-01
##        [2]        9 127866860      * |  32888493 2020-09-01
##        [3]        9 129079608      * |  32888493 2020-09-01
##        [4]        9 133271182      * |  32888493 2020-09-01
##        [5]       10   3749032      * |  32888493 2020-09-01
##        ...      ...       ...    ... .       ...        ...
##   [507870]       14  35368270      * |  34315907 2021-07-27
##   [507871]        2 168424203      * |  34315907 2021-07-27
##   [507872]        4  27296111      * |  34315907 2021-07-27
##   [507873]        5 132707659      * |  34315907 2021-07-27
##   [507874]       14  99392633      * |  34315907 2021-07-27
##                     DISEASE/TRAIT        SNPS
##                       <character> <character>
##        [1] Hemoglobin concentra..   rs4837827
##        [2] Hemoglobin concentra..    rs913987
##        [3] Hemoglobin concentra..  rs10819461
##        [4] Hemoglobin concentra..    rs550057
##        [5] Hemoglobin concentra..  rs10904089
##        ...                    ...         ...
##   [507870] Phonological awarene..  rs57645874
##   [507871] Phonological awarene..   rs2601079
##   [507872] Phonological awarene..   rs2005865
##   [507873] Phonological awarene..  rs17691077
##   [507874]  Nonsense word reading   rs1257267
##   -------
##   seqinfo: 255 sequences from GRCh38 genome; no seqlengths

We can set the seqinfo as follows, retaining only records that employ standard chromosomes.

library(GenomeInfoDb)
data(si.hs.38)
gg = keepStandardChromosomes(gg, pruning="coarse")
seqlevels(gg) = seqlevels(si.hs.38)
seqinfo(gg) = si.hs.38
gg
## GRanges object with 507160 ranges and 4 metadata columns:
##            seqnames    ranges strand |  PUBMEDID       DATE
##               <Rle> <IRanges>  <Rle> | <numeric>     <Date>
##        [1]        9 121277593      * |  32888493 2020-09-01
##        [2]        9 127866860      * |  32888493 2020-09-01
##        [3]        9 129079608      * |  32888493 2020-09-01
##        [4]        9 133271182      * |  32888493 2020-09-01
##        [5]       10   3749032      * |  32888493 2020-09-01
##        ...      ...       ...    ... .       ...        ...
##   [507156]       14  35368270      * |  34315907 2021-07-27
##   [507157]        2 168424203      * |  34315907 2021-07-27
##   [507158]        4  27296111      * |  34315907 2021-07-27
##   [507159]        5 132707659      * |  34315907 2021-07-27
##   [507160]       14  99392633      * |  34315907 2021-07-27
##                     DISEASE/TRAIT        SNPS
##                       <character> <character>
##        [1] Hemoglobin concentra..   rs4837827
##        [2] Hemoglobin concentra..    rs913987
##        [3] Hemoglobin concentra..  rs10819461
##        [4] Hemoglobin concentra..    rs550057
##        [5] Hemoglobin concentra..  rs10904089
##        ...                    ...         ...
##   [507156] Phonological awarene..  rs57645874
##   [507157] Phonological awarene..   rs2601079
##   [507158] Phonological awarene..   rs2005865
##   [507159] Phonological awarene..  rs17691077
##   [507160]  Nonsense word reading   rs1257267
##   -------
##   seqinfo: 24 sequences from GRCh38 genome

1.4 Getting a recent version of the GWAS catalog

We use BiocFileCache to manage downloaded TSV from EBI’s download site. The file is provided without compression, so prepare for 200+MB download if you are not working from a cache. There is no etag set, so you have to check for updates on your own.

args(get_cached_gwascat)
## function (url = "http://www.ebi.ac.uk/gwas/api/search/downloads/alternative", 
##     cache = BiocFileCache::BiocFileCache(), refresh = FALSE, 
##     ...) 
## NULL

This is converted to a manageable extension of GRanges using process_gwas_dataframe.

args(process_gwas_dataframe)
## function (df) 
## NULL

2 Illustrations: computing

Available functions are:

library(gwascat)
## gwascat loaded.  Use makeCurrentGwascat() to extract current image.
##  from EBI.  The data folder of this package has some legacy extracts.
objects("package:gwascat")
##  [1] "as_GRanges"             "bindcadd_snv"           "getRsids"              
##  [4] "getTraits"              "get_cached_gwascat"     "gwascat_from_AHub"     
##  [7] "gwcat_snapshot"         "gwcex2gviz"             "ldtagr"                
## [10] "locs4trait"             "makeCurrentGwascat"     "obo2graphNEL"          
## [13] "process_gwas_dataframe" "riskyAlleleCount"       "subsetByChromosome"    
## [16] "subsetByTraits"         "topTraits"              "traitsManh"

An extended GRanges instance with a sample of 50000 SNP-disease associations reported on 30 April 2020 is obtained as follows, with addresses based on the GRCh38 genome build. We use gwtrunc to refer to it in the sequel.

data(ebicat_2020_04_30)
gwtrunc = ebicat_2020_04_30

To determine the most frequently occurring traits in this sample:

topTraits(gwtrunc)
## 
##                        Blood protein levels 
##                                        1941 
##                   Heel bone mineral density 
##                                        1309 
##                             Body mass index 
##                                        1283 
##                                      Height 
##                                        1238 
##                           Metabolite levels 
##                                         691 
##                     Systolic blood pressure 
##                                         654 
##                               Schizophrenia 
##                                         643 
## Educational attainment (years of education) 
##                                         642 
##          Post bronchodilator FEV1/FVC ratio 
##                                         479 
##                             Type 2 diabetes 
##                                         466

For a given trait, obtain a GRanges with all recorded associations; here only three associations are shown:

subsetByTraits(gwtrunc, tr="LDL cholesterol")[1:3]
## gwasloc instance with 3 records and 38 attributes per record.
## Extracted:  2020-04-30 23:24:51 
## metadata()$badpos includes records for which no unique locus was given.
## Genome:  GRCh38 
## Excerpt:
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames    ranges strand |   DISEASE/TRAIT        SNPS   P-VALUE
##          <Rle> <IRanges>  <Rle> |     <character> <character> <numeric>
##   [1]        7  21567734      * | LDL cholesterol  rs12670798     6e-09
##   [2]        5  75355259      * | LDL cholesterol   rs3846662     2e-11
##   [3]        2  43837951      * | LDL cholesterol   rs6756629     3e-10
##   -------
##   seqinfo: 24 sequences from GRCh38 genome

3 Some visualizations

3.1 Basic Manhattan plot

A basic Manhattan plot is easily constructed with the ggbio package facilities. Here we confine attention to chromosomes 4:6. First, we create a version of the catalog with \(-log_{10} p\) truncated at a maximum value of 25.

requireNamespace("S4Vectors")
mcols = S4Vectors::mcols
mlpv = mcols(gwtrunc)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwtrunc)$PVALUE_MLOG = mlpv
library(GenomeInfoDb)
# seqlevelsStyle(gwtrunc) = "UCSC" # no more!
seqlevels(gwtrunc) = paste0("chr", seqlevels(gwtrunc))
gwlit = gwtrunc[ which(as.character(seqnames(gwtrunc)) %in% c("chr4", "chr5", "chr6")) ]
library(ggbio)
mlpv = mcols(gwlit)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwlit)$PVALUE_MLOG = mlpv
autoplot(gwlit, geom="point", aes(y=PVALUE_MLOG), xlab="chr4-chr6")
## Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
## ℹ Please use the `rows` argument instead.
## ℹ The deprecated feature was likely used in the ggbio package.
##   Please report the issue at <https://github.com/lawremi/ggbio/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

3.2 Annotated Manhattan plot

A simple call permits visualization of GWAS results for a small number of traits. Note the defaults in this call.

traitsManh(gwtrunc)