1 Introduction

The formation of hybrids through the fusion of distinct genomes and subsequent genome duplication (in cases of allopolyploidy) represents a significant evolutionary event with complex effects on cellular biology, particularly gene expression. The impact of such genome mergings and duplications on transcription remain incompletely understood. To bridge this gap, we introduce HybridExpress, a comprehensive package designed to facilitate the comparative transcriptomic analysis of hybrids and their progenitor species. HybridExpress is tailored for RNA-Seq data derived from a ‘hybrid triplet’: the hybrid organism and its two parental species. This package offers a suite of intuitive functions enabling researchers to perform differential expression analysis with ease, generate principal component analysis (PCA) plots to visualize sample grouping, categorize genes into 12 distinct expression pattern groups (as in Rapp, Udall, and Wendel (2009)), and conduct functional analyses. Acknowledging the potential variability in cell and transcriptome size across species and ploidy levels, HybridExpress incorporates features for rigorous normalization of count data. Specifically, it allows for the integration of spike-in controls directly into the normalization process, ensuring accurate transcriptome size adjustments when these standards are present in the RNA-Seq count data (see full methodology in Coate (2023)). By offering these capabilities, HybridExpress provides a robust tool set for unraveling the intricate effects of genome doubling and merging on gene expression, paving the way for novel insights into the cellular biology of hybrid organisms.

2 Installation

HybridExpress can be installed from Bioconductor with the following code:

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')

BiocManager::install("HybridExpress")
# Load package after installation
library(HybridExpress)

set.seed(123) # for reproducibility

3 Data description

For this vignette, we will use an example data set that comprises (unpublished) gene expression data (in counts) for Chlamydomonas reinhardtii. In our lab, we crossed a diploid line of C. reinhardtii (hereafter “P1”) with a haploid line (hereafter “P2”), thus generating a triploid line through the merging of the two parental genomes. The count matrix and sample metadata are stored in SummarizedExperiment objects, which is the standard Bioconductor data structure required by HybridExpress. For instructions on how to create a SummarizedExperiment object, check the FAQ section of this vignette.

Let’s load the example data and take a quick look at it:

library(SummarizedExperiment)

# Load data
data(se_chlamy)

# Inspect the `SummarizedExperiment` object
se_chlamy
#> class: SummarizedExperiment 
#> dim: 13058 18 
#> metadata(0):
#> assays(1): counts
#> rownames(13058): Cre01.g000050 Cre01.g000150 ... ERCC-00170 ERCC-00171
#> rowData names(0):
#> colnames(18): S1 S2 ... S17 S18
#> colData names(2): Ploidy Generation

## Take a look at the colData and count matrix
colData(se_chlamy)
#> DataFrame with 18 rows and 2 columns
#>          Ploidy Generation
#>     <character>   <factor>
#> S1      diploid         P1
#> S2      diploid         P1
#> S3      diploid         P1
#> S4      diploid         P1
#> S5      diploid         P1
#> ...         ...        ...
#> S14    triploid         F1
#> S15    triploid         F1
#> S16    triploid         F1
#> S17    triploid         F1
#> S18    triploid         F1
assay(se_chlamy) |> head()
#>                 S1  S2   S3   S4  S5   S6   S7   S8   S9 S10 S11  S12  S13  S14
#> Cre01.g000050   31  21   26   33  19   48   17    6   13   6  10    7   31   22
#> Cre01.g000150   50  29   35   99  11   58   51   46   41  33  28   14   53   25
#> Cre01.g000200   36  26   24   29  17   36   14   15   12   8  14    7   25   24
#> Cre01.g000250  440 272  394  332 283  585  272  274  255 160 225  235  405  391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350  412 264  294  336 252  478  233  221  195 155 190  201  299  272
#>                S15  S16  S17  S18
#> Cre01.g000050   29   22   21   16
#> Cre01.g000150   37   17   24   21
#> Cre01.g000200   24   26   18   21
#> Cre01.g000250  358  339  340  332
#> Cre01.g000300 1524 1720 1517 1243
#> Cre01.g000350  276  324  246  275

table(se_chlamy$Ploidy, se_chlamy$Generation)
#>           
#>            F1 P1 P2
#>   diploid   0  6  0
#>   haploid   0  0  6
#>   triploid  6  0  0

As you can see, the count matrix contains 13058 genes and 18 samples, with 6 replicates for parent 1 (P1, diploid), 6 replicates for parent 2 (P2, haploid), and 6 replicates for the progeny (F1, triploid).

4 Adding midparent expression values

First of all, you’d want to add in your count matrix in silico samples that contain the expression values of the midparent. This can be done with the function add_midparent_expression(), which takes a random sample pair (one sample from each parent, sampling without replacement) and calculates the midparent expression value in one of three ways:

  1. Mean (default): get the mean expression of the two samples.
  2. Sum: get the sum of the two samples.
  3. Weighted mean: get the weighted mean of the two samples by multiplying the expression value of each parent by a weight. Typically, this can be used if the two parents have different ploidy levels, and the weights would correspond to the ploidy level of each parent.

For this function, besides specifying the method to obtain the midparent expression values (i.e., “mean”, “sum”, or “weightedmean”), users must also specify the name of the column in colData that contains information about the generations (default: “Generation”), as well as the levels corresponding to each parent (default: “P1” and “P2” for parents 1 and 2, respectively).

# Add midparent expression using the mean of the counts
se <- add_midparent_expression(se_chlamy)
head(assay(se))
#>                 S1  S2   S3   S4  S5   S6   S7   S8   S9 S10 S11  S12  S13  S14
#> Cre01.g000050   31  21   26   33  19   48   17    6   13   6  10    7   31   22
#> Cre01.g000150   50  29   35   99  11   58   51   46   41  33  28   14   53   25
#> Cre01.g000200   36  26   24   29  17   36   14   15   12   8  14    7   25   24
#> Cre01.g000250  440 272  394  332 283  585  272  274  255 160 225  235  405  391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350  412 264  294  336 252  478  233  221  195 155 190  201  299  272
#>                S15  S16  S17  S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050   29   22   21   16         18         27         14         20
#> Cre01.g000150   37   17   24   21         32         46         38         56
#> Cre01.g000200   24   26   18   21         19         22         20         18
#> Cre01.g000250  358  339  340  332        310        372        273        284
#> Cre01.g000300 1524 1720 1517 1243       1030       1331       1072       1166
#> Cre01.g000350  276  324  246  275        242        316        242        268
#>               midparent5 midparent6
#> Cre01.g000050         18         22
#> Cre01.g000150         31         46
#> Cre01.g000200         16         24
#> Cre01.g000250        278        348
#> Cre01.g000300       1076       1182
#> Cre01.g000350        242        304

# Alternative 1: using the sum of the counts
add_midparent_expression(se_chlamy, method = "sum") |>
    assay() |> 
    head()
#>                 S1  S2   S3   S4  S5   S6   S7   S8   S9 S10 S11  S12  S13  S14
#> Cre01.g000050   31  21   26   33  19   48   17    6   13   6  10    7   31   22
#> Cre01.g000150   50  29   35   99  11   58   51   46   41  33  28   14   53   25
#> Cre01.g000200   36  26   24   29  17   36   14   15   12   8  14    7   25   24
#> Cre01.g000250  440 272  394  332 283  585  272  274  255 160 225  235  405  391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350  412 264  294  336 252  478  233  221  195 155 190  201  299  272
#>                S15  S16  S17  S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050   29   22   21   16         43         26         58         46
#> Cre01.g000150   37   17   24   21         86         25         86        140
#> Cre01.g000200   24   26   18   21         38         24         50         41
#> Cre01.g000250  358  339  340  332        666        518        810        587
#> Cre01.g000300 1524 1720 1517 1243       2557       1893       2629       2373
#> Cre01.g000350  276  324  246  275        527        453        668        531
#>               midparent5 midparent6
#> Cre01.g000050         37         27
#> Cre01.g000150         96         62
#> Cre01.g000200         51         34
#> Cre01.g000250        714        432
#> Cre01.g000300       2548       1716
#> Cre01.g000350        633        419

# Alternative 2: using the weighted mean of the counts (weights = ploidy)
w <- c(2, 1) # P1 = diploid; P2 = haploid
add_midparent_expression(se_chlamy, method = "weightedmean", weights = w) |>
    assay() |> 
    head()
#>                 S1  S2   S3   S4  S5   S6   S7   S8   S9 S10 S11  S12  S13  S14
#> Cre01.g000050   31  21   26   33  19   48   17    6   13   6  10    7   31   22
#> Cre01.g000150   50  29   35   99  11   58   51   46   41  33  28   14   53   25
#> Cre01.g000200   36  26   24   29  17   36   14   15   12   8  14    7   25   24
#> Cre01.g000250  440 272  394  332 283  585  272  274  255 160 225  235  405  391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350  412 264  294  336 252  478  233  221  195 155 190  201  299  272
#>                S15  S16  S17  S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050   29   22   21   16         25         29         39         22
#> Cre01.g000150   37   17   24   21         27         30         29         16
#> Cre01.g000200   24   26   18   21         27         32         33         21
#> Cre01.g000250  358  339  340  332        181        232        270        210
#> Cre01.g000300 1524 1720 1517 1243       1453       1576       1753       1532
#> Cre01.g000350  276  324  246  275        166        202        223        165
#>               midparent5 midparent6
#> Cre01.g000050         26         17
#> Cre01.g000150         48         15
#> Cre01.g000200         29         17
#> Cre01.g000250        202        148
#> Cre01.g000300       1705       1125
#> Cre01.g000350        186        136

We will proceed our analyses with the midparent expression values obtained from the mean of the counts, stored in the se object.

5 Normalizing count data

To normalize count data, the function add_size_factors() calculates size factors (used by DESeq2 for normalization) and adds them as an extra column in the colData slot of your SummarizedExperiment object. Such size factors are calculated using one of two methods:

  1. By library size (default) using the ‘median of ratios’ method implemented in DESeq2.

  2. By cell size/biomass using spike-in controls (if available). If spike-in controls are present in the count matrix, you can use them for normalization by setting spikein = TRUE and specifying the pattern used to indicate rows that contain spike-ins (usually they start with ERCC)1 Note: if you use spike-in normalization, the function add_size_factors() automatically removes rows containing spike-in counts from the SummarizedExperiment object after normalization. However, if you have counts for spike-in controls in your count matrix, but do not want to use spike-in normalization, you should remove them before creating the SummarizedExperiment object.. Normalization with spike-in controls is particularly useful if the amount of mRNA per cell is not equal between generations (due to, for instance, different ploidy levels, which in turn can lead to different cell sizes and/or biomass).

In our example data set, spike-in controls are available in the last rows of the count matrix. Let’s take a look at them.

# Show last rows of the count matrix
assay(se) |> 
    tail()
#>              S1   S2   S3   S4   S5   S6   S7   S8   S9  S10  S11  S12  S13
#> ERCC-00163   74   75   55   77   51   84  132  127   93  108   79  102   66
#> ERCC-00164    0    4    0    1    4    1    6    4    1    4    3    2    5
#> ERCC-00165  147  139   87  165  118  179  236  246  139  218  176  145  118
#> ERCC-00168    2    5    2    2    2    6    5    1    1    5    0    3    4
#> ERCC-00170   97   95   73  101   70  118  186  148  110  167  110  103   72
#> ERCC-00171 4644 4959 3554 5357 4170 5946 8207 7915 4992 7843 6455 5884 4537
#>             S14  S15  S16  S17  S18 midparent1 midparent2 midparent3 midparent4
#> ERCC-00163   67   62   80   88   47         67         96        101         90
#> ERCC-00164    4    3    5    4    2          2          2          4          2
#> ERCC-00165  143  115  185  192  136        132        198        192        155
#> ERCC-00168    6    2   11    5    2          1          6          3          2
#> ERCC-00170   99   64  121  103   84         92        142        122        102
#> ERCC-00171 5665 4598 6168 5771 4747       5004       6894       6437       5620
#>            midparent5 midparent6
#> ERCC-00163         92         84
#> ERCC-00164          5          0
#> ERCC-00165        177        143
#> ERCC-00168          4          2
#> ERCC-00170        128        104
#> ERCC-00171       6188       4818

As you can see, rows with spike-in counts start with “ERCC”. Let’s add size factors to our SummarizedExperiment object using spike-in controls.

# Take a look at the original colData
colData(se)
#> DataFrame with 24 rows and 2 columns
#>                 Ploidy Generation
#>            <character>   <factor>
#> S1             diploid         P1
#> S2             diploid         P1
#> S3             diploid         P1
#> S4             diploid         P1
#> S5             diploid         P1
#> ...                ...        ...
#> midparent2          NA  midparent
#> midparent3          NA  midparent
#> midparent4          NA  midparent
#> midparent5          NA  midparent
#> midparent6          NA  midparent

# Add size factors estimated from spike-in controls
se <- add_size_factors(se, spikein = TRUE, spikein_pattern = "ERCC")
#> converting counts to integer mode

# Take a look at the new colData
colData(se)
#> DataFrame with 24 rows and 3 columns
#>                 Ploidy Generation sizeFactor
#>            <character>   <factor>  <numeric>
#> S1             diploid         P1   0.856262
#> S2             diploid         P1   0.938417
#> S3             diploid         P1   0.648653
#> S4             diploid         P1   0.969369
#> S5             diploid         P1   0.718761
#> ...                ...        ...        ...
#> midparent2          NA  midparent   1.278078
#> midparent3          NA  midparent   1.201072
#> midparent4          NA  midparent   1.000892
#> midparent5          NA  midparent   1.186773
#> midparent6          NA  midparent   0.871889

6 Exploratory data analyses

Next, you can perform exploratory analyses of sample clustering to verify if samples group as expected. With HybridExpress, this can be performed using two functions:

  1. pca_plot(): creates principal component analysis (PCA) plots, with colors and shapes (optional) mapped to levels of colData variables;
  2. plot_samplecor(): plots a heatmap of hierarchically clustered pairwise sample correlations.

Let’s start with the PCA plot:

# For colData rows with missing values (midparent samples), add "midparent"
se$Ploidy[is.na(se$Ploidy)] <- "midparent"
se$Generation[is.na(se$Generation)] <- "midparent"

# Create PCA plot
pca_plot(se, color_by = "Generation", shape_by = "Ploidy", add_mean = TRUE)

In the plot above, each data point corresponds to a sample, and colors and shapes are mapped to levels of the variables specified in the arguments color_by and shape_by, respectively. Besides, by specifying add_mean = TRUE, we added a diamond shape indicating the mean PC coordinates based on the variable in color_by (here, “Generation”).

Now, let’s plot the heatmap of sample correlations:

# Plot a heatmap of sample correlations
plot_samplecor(se, coldata_cols = c("Generation", "Ploidy"))