The formation of hybrids through the fusion of distinct genomes and subsequent genome duplication (in cases of allopolyploidy) represents a significant evolutionary event with complex effects on cellular biology, particularly gene expression. The impact of such genome mergings and duplications on transcription remain incompletely understood. To bridge this gap, we introduce HybridExpress, a comprehensive package designed to facilitate the comparative transcriptomic analysis of hybrids and their progenitor species. HybridExpress is tailored for RNA-Seq data derived from a ‘hybrid triplet’: the hybrid organism and its two parental species. This package offers a suite of intuitive functions enabling researchers to perform differential expression analysis with ease, generate principal component analysis (PCA) plots to visualize sample grouping, categorize genes into 12 distinct expression pattern groups (as in Rapp, Udall, and Wendel (2009)), and conduct functional analyses. Acknowledging the potential variability in cell and transcriptome size across species and ploidy levels, HybridExpress incorporates features for rigorous normalization of count data. Specifically, it allows for the integration of spike-in controls directly into the normalization process, ensuring accurate transcriptome size adjustments when these standards are present in the RNA-Seq count data (see full methodology in Coate (2023)). By offering these capabilities, HybridExpress provides a robust tool set for unraveling the intricate effects of genome doubling and merging on gene expression, paving the way for novel insights into the cellular biology of hybrid organisms.
HybridExpress can be installed from Bioconductor with the following code:
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("HybridExpress")
# Load package after installation
library(HybridExpress)
set.seed(123) # for reproducibility
For this vignette, we will use an example data set that comprises
(unpublished) gene expression data (in counts) for Chlamydomonas reinhardtii.
In our lab, we crossed a diploid line of C. reinhardtii (hereafter “P1”)
with a haploid line (hereafter “P2”), thus generating a triploid line through
the merging of the two parental genomes. The count matrix and sample
metadata are stored in SummarizedExperiment
objects, which is the standard
Bioconductor data structure required by
HybridExpress. For instructions on how to create
a SummarizedExperiment
object, check the FAQ section of this vignette.
Let’s load the example data and take a quick look at it:
library(SummarizedExperiment)
# Load data
data(se_chlamy)
# Inspect the `SummarizedExperiment` object
se_chlamy
#> class: SummarizedExperiment
#> dim: 13058 18
#> metadata(0):
#> assays(1): counts
#> rownames(13058): Cre01.g000050 Cre01.g000150 ... ERCC-00170 ERCC-00171
#> rowData names(0):
#> colnames(18): S1 S2 ... S17 S18
#> colData names(2): Ploidy Generation
## Take a look at the colData and count matrix
colData(se_chlamy)
#> DataFrame with 18 rows and 2 columns
#> Ploidy Generation
#> <character> <factor>
#> S1 diploid P1
#> S2 diploid P1
#> S3 diploid P1
#> S4 diploid P1
#> S5 diploid P1
#> ... ... ...
#> S14 triploid F1
#> S15 triploid F1
#> S16 triploid F1
#> S17 triploid F1
#> S18 triploid F1
assay(se_chlamy) |> head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18
#> Cre01.g000050 29 22 21 16
#> Cre01.g000150 37 17 24 21
#> Cre01.g000200 24 26 18 21
#> Cre01.g000250 358 339 340 332
#> Cre01.g000300 1524 1720 1517 1243
#> Cre01.g000350 276 324 246 275
table(se_chlamy$Ploidy, se_chlamy$Generation)
#>
#> F1 P1 P2
#> diploid 0 6 0
#> haploid 0 0 6
#> triploid 6 0 0
As you can see, the count matrix contains 13058 genes and 18 samples, with 6 replicates for parent 1 (P1, diploid), 6 replicates for parent 2 (P2, haploid), and 6 replicates for the progeny (F1, triploid).
First of all, you’d want to add in your count matrix in silico samples
that contain the expression values of the midparent. This can be done
with the function add_midparent_expression()
, which takes a random
sample pair (one sample from each parent, sampling without replacement)
and calculates the midparent expression value in one of three ways:
For this function, besides specifying the method to obtain the midparent
expression values (i.e., “mean”, “sum”, or “weightedmean”), users must
also specify the name of the column in colData
that contains information
about the generations (default: “Generation”), as well as the levels
corresponding to each parent (default: “P1” and “P2” for parents 1 and 2,
respectively).
# Add midparent expression using the mean of the counts
se <- add_midparent_expression(se_chlamy)
head(assay(se))
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 18 27 14 20
#> Cre01.g000150 37 17 24 21 32 46 38 56
#> Cre01.g000200 24 26 18 21 19 22 20 18
#> Cre01.g000250 358 339 340 332 310 372 273 284
#> Cre01.g000300 1524 1720 1517 1243 1030 1331 1072 1166
#> Cre01.g000350 276 324 246 275 242 316 242 268
#> midparent5 midparent6
#> Cre01.g000050 18 22
#> Cre01.g000150 31 46
#> Cre01.g000200 16 24
#> Cre01.g000250 278 348
#> Cre01.g000300 1076 1182
#> Cre01.g000350 242 304
# Alternative 1: using the sum of the counts
add_midparent_expression(se_chlamy, method = "sum") |>
assay() |>
head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 43 26 58 46
#> Cre01.g000150 37 17 24 21 86 25 86 140
#> Cre01.g000200 24 26 18 21 38 24 50 41
#> Cre01.g000250 358 339 340 332 666 518 810 587
#> Cre01.g000300 1524 1720 1517 1243 2557 1893 2629 2373
#> Cre01.g000350 276 324 246 275 527 453 668 531
#> midparent5 midparent6
#> Cre01.g000050 37 27
#> Cre01.g000150 96 62
#> Cre01.g000200 51 34
#> Cre01.g000250 714 432
#> Cre01.g000300 2548 1716
#> Cre01.g000350 633 419
# Alternative 2: using the weighted mean of the counts (weights = ploidy)
w <- c(2, 1) # P1 = diploid; P2 = haploid
add_midparent_expression(se_chlamy, method = "weightedmean", weights = w) |>
assay() |>
head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 25 29 39 22
#> Cre01.g000150 37 17 24 21 27 30 29 16
#> Cre01.g000200 24 26 18 21 27 32 33 21
#> Cre01.g000250 358 339 340 332 181 232 270 210
#> Cre01.g000300 1524 1720 1517 1243 1453 1576 1753 1532
#> Cre01.g000350 276 324 246 275 166 202 223 165
#> midparent5 midparent6
#> Cre01.g000050 26 17
#> Cre01.g000150 48 15
#> Cre01.g000200 29 17
#> Cre01.g000250 202 148
#> Cre01.g000300 1705 1125
#> Cre01.g000350 186 136
We will proceed our analyses with the midparent expression values obtained
from the mean of the counts, stored in the se
object.
To normalize count data, the function add_size_factors()
calculates
size factors (used by DESeq2
for normalization) and adds them as an extra column in the colData slot
of your SummarizedExperiment
object. Such size factors are calculated using
one of two methods:
By library size (default) using the ‘median of ratios’ method implemented in DESeq2.
By cell size/biomass using spike-in controls (if available). If spike-in
controls are present in the count matrix, you can use them
for normalization by setting spikein = TRUE
and specifying the
pattern used to indicate rows that contain spike-ins (usually they start
with ERCC)1 Note: if you use spike-in normalization, the
function add_size_factors()
automatically removes rows containing
spike-in counts from the SummarizedExperiment
object after normalization.
However, if you have counts for spike-in controls in your count matrix,
but do not want to use spike-in normalization, you should remove them
before creating the SummarizedExperiment
object.. Normalization with spike-in controls is
particularly useful if the amount of mRNA per cell is not equal
between generations (due to, for instance, different ploidy levels, which
in turn can lead to different cell sizes and/or biomass).
In our example data set, spike-in controls are available in the last rows of the count matrix. Let’s take a look at them.
# Show last rows of the count matrix
assay(se) |>
tail()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13
#> ERCC-00163 74 75 55 77 51 84 132 127 93 108 79 102 66
#> ERCC-00164 0 4 0 1 4 1 6 4 1 4 3 2 5
#> ERCC-00165 147 139 87 165 118 179 236 246 139 218 176 145 118
#> ERCC-00168 2 5 2 2 2 6 5 1 1 5 0 3 4
#> ERCC-00170 97 95 73 101 70 118 186 148 110 167 110 103 72
#> ERCC-00171 4644 4959 3554 5357 4170 5946 8207 7915 4992 7843 6455 5884 4537
#> S14 S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> ERCC-00163 67 62 80 88 47 67 96 101 90
#> ERCC-00164 4 3 5 4 2 2 2 4 2
#> ERCC-00165 143 115 185 192 136 132 198 192 155
#> ERCC-00168 6 2 11 5 2 1 6 3 2
#> ERCC-00170 99 64 121 103 84 92 142 122 102
#> ERCC-00171 5665 4598 6168 5771 4747 5004 6894 6437 5620
#> midparent5 midparent6
#> ERCC-00163 92 84
#> ERCC-00164 5 0
#> ERCC-00165 177 143
#> ERCC-00168 4 2
#> ERCC-00170 128 104
#> ERCC-00171 6188 4818
As you can see, rows with spike-in counts start with “ERCC”. Let’s add
size factors to our SummarizedExperiment
object using
spike-in controls.
# Take a look at the original colData
colData(se)
#> DataFrame with 24 rows and 2 columns
#> Ploidy Generation
#> <character> <factor>
#> S1 diploid P1
#> S2 diploid P1
#> S3 diploid P1
#> S4 diploid P1
#> S5 diploid P1
#> ... ... ...
#> midparent2 NA midparent
#> midparent3 NA midparent
#> midparent4 NA midparent
#> midparent5 NA midparent
#> midparent6 NA midparent
# Add size factors estimated from spike-in controls
se <- add_size_factors(se, spikein = TRUE, spikein_pattern = "ERCC")
#> converting counts to integer mode
# Take a look at the new colData
colData(se)
#> DataFrame with 24 rows and 3 columns
#> Ploidy Generation sizeFactor
#> <character> <factor> <numeric>
#> S1 diploid P1 0.856262
#> S2 diploid P1 0.938417
#> S3 diploid P1 0.648653
#> S4 diploid P1 0.969369
#> S5 diploid P1 0.718761
#> ... ... ... ...
#> midparent2 NA midparent 1.278078
#> midparent3 NA midparent 1.201072
#> midparent4 NA midparent 1.000892
#> midparent5 NA midparent 1.186773
#> midparent6 NA midparent 0.871889
Next, you can perform exploratory analyses of sample clustering to verify if samples group as expected. With HybridExpress, this can be performed using two functions:
pca_plot()
: creates principal component analysis (PCA) plots, with colors
and shapes (optional) mapped to levels of colData
variables;plot_samplecor()
: plots a heatmap of hierarchically clustered pairwise
sample correlations.Let’s start with the PCA plot:
# For colData rows with missing values (midparent samples), add "midparent"
se$Ploidy[is.na(se$Ploidy)] <- "midparent"
se$Generation[is.na(se$Generation)] <- "midparent"
# Create PCA plot
pca_plot(se, color_by = "Generation", shape_by = "Ploidy", add_mean = TRUE)
In the plot above, each data point corresponds to a sample, and colors
and shapes are mapped to levels of the variables specified in the
arguments color_by
and shape_by
, respectively. Besides,
by specifying add_mean = TRUE
,
we added a diamond shape indicating the mean PC coordinates based on the variable
in color_by
(here, “Generation”).
Now, let’s plot the heatmap of sample correlations:
# Plot a heatmap of sample correlations
plot_samplecor(se, coldata_cols = c("Generation", "Ploidy"))