Contents

1 Introduction

Over the past years, RNA-seq data for several species have accumulated in public repositories. Additionally, genome-wide association studies (GWAS) have identified SNPs associated with phenotypes of interest, such as agronomic traits in plants, production traits in livestock, and complex human diseases. However, although GWAS can identify SNPs, they cannot identify causative genes associated with the studied phenotype. The goal of cageminer is to integrate GWAS-derived SNPs with transcriptomic data to mine candidate genes and identify high-confidence genes associated with traits of interest.

2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cageminer")
# Load package after installation
library(cageminer)
set.seed(123) # for reproducibility

3 Data description

For this vignette, we will use transcriptomic data on pepper (Capsicum annuum) response to Phytophthora root rot (Kim et al. 2018), and GWAS SNPs associated with resistance to Phytophthora root rot from Siddique et al. (2019). To ensure interoperability with other Bioconductor packages, expression data are stored as SummarizedExperiment objects, and gene/SNP positions are stored as GRanges objects.

# GRanges of SNP positions
data(snp_pos)
snp_pos
#> GRanges object with 116 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>     2    Chr02 149068682      *
#>     3    Chr03   5274098      *
#>     4    Chr05  27703815      *
#>     5    Chr05  27761792      *
#>     6    Chr05  27807397      *
#>   ...      ...       ...    ...
#>   114    Chr12 230514706      *
#>   115    Chr12 230579178      *
#>   116    Chr12 230812962      *
#>   117    Chr12 230887290      *
#>   118    Chr12 231022812      *
#>   -------
#>   seqinfo: 8 sequences from an unspecified genome; no seqlengths

# GRanges of chromosome lengths
data(chr_length) 
chr_length
#> GRanges object with 12 ranges and 0 metadata columns:
#>        seqnames      ranges strand
#>           <Rle>   <IRanges>  <Rle>
#>    [1]    Chr01 1-272704604      *
#>    [2]    Chr02 1-171128871      *
#>    [3]    Chr03 1-257900543      *
#>    [4]    Chr04 1-222584275      *
#>    [5]    Chr05 1-233468049      *
#>    ...      ...         ...    ...
#>    [8]    Chr08 1-145103255      *
#>    [9]    Chr09 1-252779264      *
#>   [10]    Chr10 1-233593809      *
#>   [11]    Chr11 1-259726002      *
#>   [12]    Chr12 1-235688241      *
#>   -------
#>   seqinfo: 12 sequences from an unspecified genome; no seqlengths

# GRanges of gene coordinates
data(gene_ranges) 
gene_ranges
#> GRanges object with 30242 ranges and 6 metadata columns:
#>           seqnames              ranges strand |   source     type     score
#>              <Rle>           <IRanges>  <Rle> | <factor> <factor> <numeric>
#>       [1]    Chr01         63209-63880      - |  PGA1.55     gene        NA
#>       [2]    Chr01       112298-112938      - |  PGA1.55     gene        NA
#>       [3]    Chr01       117979-118392      + |  PGA1.55     gene        NA
#>       [4]    Chr01       119464-119712      + |  PGA1.55     gene        NA
#>       [5]    Chr01       119892-120101      + |  PGA1.55     gene        NA
#>       ...      ...                 ...    ... .      ...      ...       ...
#>   [30238]    Chr12 235631138-235631467      - |  PGA1.55     gene        NA
#>   [30239]    Chr12 235642644-235645110      + |  PGA1.55     gene        NA
#>   [30240]    Chr12 235645483-235651927      - |  PGA1.55     gene        NA
#>   [30241]    Chr12 235652709-235655955      - |  PGA1.55     gene        NA
#>   [30242]    Chr12 235663655-235665276      - |  PGA1.55     gene        NA
#>               phase          ID          Parent
#>           <integer> <character> <CharacterList>
#>       [1]      <NA>  CA01g00010                
#>       [2]      <NA>  CA01g00020                
#>       [3]      <NA>  CA01g00030                
#>       [4]      <NA>  CA01g00040                
#>       [5]      <NA>  CA01g00050                
#>       ...       ...         ...             ...
#>   [30238]      <NA>  CA12g22890                
#>   [30239]      <NA>  CA12g22900                
#>   [30240]      <NA>  CA12g22910                
#>   [30241]      <NA>  CA12g22920                
#>   [30242]      <NA>  CA12g22930                
#>   -------
#>   seqinfo: 12 sequences from an unspecified genome; no seqlengths

# SummarizedExperiment of pepper response to Phytophthora root rot (RNA-seq)
data(pepper_se)
pepper_se
#> class: SummarizedExperiment 
#> dim: 3892 45 
#> metadata(0):
#> assays(1): ''
#> rownames(3892): CA02g23440 CA02g05510 ... CA03g35110 CA02g12750
#> rowData names(0):
#> colnames(45): PL1 PL2 ... TMV-P0-3D TMV-P0-Up
#> colData names(1): Condition

4 Visualizing SNP distribution

Before mining high-confidence candidates, you can visualize the SNP distribution in the genome to explore possible patterns. First, let’s see if SNPs are uniformly across chromosomes with plot_snp_distribution().

plot_snp_distribution(snp_pos)

As we can see, SNPs associated with resistance to Phytophthora root rot tend to co-occur in chromosome 5. Now, we can see if they are close to each other in the genome, and if they are located in gene-rich regions. We can visualize it with plot_snp_circos, which displays a circos plot of SNPs across chromosomes.

plot_snp_circos(chr_length, gene_ranges, snp_pos)

There seems to be no clustering in gene-rich regions, but we can see that SNPs in the same chromosome tend to be physically close to each other.

If you have SNP positions for multiple traits, you need to store them in GRangesList or CompressedGRangesList objects, so each element will have SNP positions for a particular trait. Then, you can visualize their distribution as you would do for a single trait. Let’s simulate multiple traits to see how it works:

# Simulate multiple traits by sampling 20 SNPs 4 times
snp_list <- GenomicRanges::GRangesList(
  Trait1 = sample(snp_pos, 20),
  Trait2 = sample(snp_pos, 20),
  Trait3 = sample(snp_pos, 20),
  Trait4 = sample(snp_pos, 20)
)

# Visualize SNP distribution across chromosomes
plot_snp_distribution(snp_list)

# Visualize SNP positions in the genome as a circos plot
plot_snp_circos(chr_length, gene_ranges, snp_list)