PAST

Adam Thrash

2023-04-25

Genome-wide association study (GWAS) of complex traits in maize and other crops has become very popular to identify regions of the genome that influence these traits [1, 2, 3]. In general, hundreds of thousands of single nucleotide polymorphisms (SNPs) markers are each tested using F statistics for association with the trait, which assigns a p-value for the SNP-trait association. Individual marker-trait associations that meet the threshold set for the false discovery rate (FDR, the proportion of false positives among all significant results for some level α) are then studied in more detail to uncover hints as to the genetic architecture of the trait, and how best to improve it in the future. Many true associations may be missed in GWAS, however, because the threshold for FDR could be as low as α divided by the total number of SNPs being tested. Metabolic pathway analysis focuses on the combined effects of many genes that are grouped according to their shared biological function. Combining GWAS analysis with metabolic pathway analysis considers all genetic sequences positively associated with the trait of interest, regardless of magnitude, and jointly may highlight which sequences lead to mechanisms for crop improvement and which warrant further study and manipulation, for example, by gene editing.

While combined GWAS and pathway analyses were highly successful in uncovering associated pathways, the analyses were slow and cumbersome, as the analysis tools were written in a combination of R, Perl, and Bash, and the output of each analysis was manually input into the next analysis [1]. The Pathway Association Study Tool (PAST) was developed to facilitate easier and more efficient GWAS-based metabolic pathway analysis. PAST was tested using maize but is usable for other species as well. It tracks all SNP marker - trait associations, regardless of significance or magnitude. PAST groups SNPs into linkage blocks based on linkage disequilibrium (LD) data and identifies a tagSNP from each block. PAST then identifies genes within a user-defined distance of the tagSNPs, and transfers the attributes of the tagSNP to the gene(s), including the allele effect, R2 and p-value of the original SNP-trait association found from the GWAS analysis. Finally, PAST uses the gene effect values to calculate an enrichment score (ES) and p-value for each pathway.

PAST

PAST is an implementation of the GWAS to pathway analysis described in Tang et al. 2015.

The following blocks of code show how to analyze data with PAST from loading in the data to plotting the rugplots.

library(PAST)
#> Warning: replacing previous import 'S4Vectors::first' by 'dplyr::first' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::setequal' by 'dplyr::setequal'
#> when loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::rename' by 'dplyr::rename' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::intersect' by 'dplyr::intersect'
#> when loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::union' by 'dplyr::union' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::setdiff' by 'dplyr::setdiff'
#> when loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::findMatches' by
#> 'utils::findMatches' when loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::tail' by 'utils::tail' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::stack' by 'utils::stack' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::head' by 'utils::head' when
#> loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::complete.cases' by
#> 'stats::complete.cases' when loading 'PAST'
#> Warning: replacing previous import 'S4Vectors::sd' by 'stats::sd' when loading
#> 'PAST'
demo_association_file = system.file("extdata", "association.txt.xz", 
                                    package = "PAST", mustWork = TRUE)
demo_effects_file = system.file("extdata", "effects.txt.xz", 
                                package = "PAST", mustWork = TRUE)
demo_LD_file = system.file("extdata", "LD.txt.xz", 
                           package = "PAST", mustWork = TRUE)
demo_genes_file = system.file("extdata", "genes.gff", 
                              package = "PAST", mustWork = TRUE)
demo_pathways_file = system.file("extdata", "pathways.txt.xz", 
                                 package = "PAST", mustWork = TRUE)

Loading GWAS Data

Loading GWAS data takes the statistics and effects from a GWAS and stores them together. In the process, non-biallelic data is dropped. The two files are described below.

gwas_data <- load_GWAS_data(demo_association_file, 
                            demo_effects_file)

Loading LD Data

LD data is loaded from the linkage disequilibrium file. In the process, incomplete cases are dropped and the data is split into a data.frame for each chromomsome. Your LD data’s Chromosome information should match your GFF annotation’s chromosome column (the first column). The LD file is described below.

LD <- load_LD(demo_LD_file)

Assigning SNPs to Genes

PAST uses the linkage disequilibrium output from TASSEL between each marker SNP (denoted as the reference SNP) and its closest neighboring SNPs (50 upstream and 50 downstream). Within this window, linkages between SNPs are calculated. The threshold for linkage can be determined from a plot of linkage disequilibrium values (-log(pDiseq) against r2). Based on this plot, Tang et al. [1] defined linkage when the two SNPs being compared had R^2 > 0.8 [3]. PAST uses linkage data to determine which SNP represents the linkage group (the tagSNP), and then uses the tagSNP to determine the linked gene(s) within a window of ± 1Kb. The rationale behind the decision to use 1 Kb is that most genes are regulated within 1 Kb upstream and downstream of the start and stop codons, respectively. At this point, the association and effects data of the tagSNP are transferred to the linked gene. If more than one gene is equally linked to a tagSNP, the attributes of the tagSNP are transferred to both (or all) linked genes.

genes <-assign_SNPs_to_genes(gwas_data, 
                             LD, 
                             demo_genes_file,
                             c("gene"),
                             1000, 
                             0.8, 
                             2)

Finding Pathway Significance

Pathway scores are obtained from gene-set enrichment calculations [1, 3]. First, all genes found by tagSNPs are ranked by their effect values from negative to positive in the case of traits where a reduction is beneficial, as in disease resistance, or vice-versa where increase is beneficial, as in the case of yield. Enrichment is based on gene membership in pathways, as assigned by a pathways database supplied by the user. Only pathways with a certain number of genes (supplied by the user) or more genes are considered to reduce bias from small sample size. Next, a running sum is calculated in a manner similar to that used for a weighted Kolmogorov-Smirnov statistic. The running sum statistic increases or decreases if genes are or are not, in the pathway, respectively. The score increases by the fraction of genes in the pathway weighted by the absolute value of the gene effect value or decreases by the fraction of genes not in the pathway. It is a running sum statistic because each gene is considered sequentially in order of their rank among all genes. The final enrichment score (ES) for the pathway is the maximum positive deviation from zero and can be visualized by plotting the values of the running sum statistic against the rank order of genes (see section on rug plots). The significance of a pathway is determined by running 1000 permutations of all genes and their gene effect values to generate a null distribution for the ES. The null distribution mean (μ) and standard deviation (σ) serve to normalize the ES for the pathway. The values of p are then corrected for the false discovery rate as calculated by the QVALUE package [4] in R.

PWY-ID\tPathway Description\tGene

rugplots_data <- find_pathway_significance(genes, 
                                           demo_pathways_file, 
                                           5,
                                           "increasing", 
                                           1000, 
                                           2)

Plotting Selected Pathways

A rug plot is generated for each pathway of interest to visualize the gene-set enrichment calculation. All genes found by the tagSNPs are ranked based on their effect values and their ranks are projected along the x axis of the plot. Hatch marks along the top of the graph denote the rank position of the genes that have membership in the pathway. Values of the running sum statistic enrichment score for each pathway gene are then plotted against their rank. The highest point in the curve is the ES for the pathway and is denoted by the vertical dashed line.

plot_pathways(rugplots_data, 
              "pvalue", 
              0.02, 
              "increasing", 
              tempdir())

References

[1] Tang JD, Perkins A, Williams WP, Warburton ML. Using genome-wide associations to identify metabolic pathways involved in maize aflatoxin accumulation resistance. BMC Genomics. 2015;16. doi:10.1186/s12864-015-1874-9.

[2] Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007;23:2633–5.

[3] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102:15545–50.

[4] Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100:9440–5.