Individual-specific and cell-type-specific deconvolution using ISLET

Hao Feng1* and Qian Li2

1Department of Population and Quantitative Health Sciences, Case Western Reserve University
2Department of Biostatistics, St. Jude Children's Research Hospital

*hxf155@case.edu

25 April 2023

Abstract

This vignette introduces the usage of the Bioconductor package ISLET (Individual-Specific ceLl typE referencing Tool). Complementary to classic deconvolution algorithms, ISLET can take cell type proportions as input, and infer the individual-specific and cell-type-specific reference panels. ISLET also offers functions to detect cell-type specific differential expression (csDE) genes. Additionally, it can test for csDE genes change rate difference between two groups, given an additional covariate of time points or age. ISLET is based on rigorous statistical framework of Expectation–Maximization(EM) algorithm, and has parallel computing embedded to provide superior computational performance.

Package

ISLET 1.2.0

1 Install and help
- 1.1 Install ISLET
- 1.2 How to get help
2 Introduction
3 ISLET input files
4 Data preparation
5 Deconvolve individual-specific reference panel
6 Test cell-type-specific differential expression (csDE) in mean (intercept)
7 Test csDE in change-rate (slope)
Session info

1 Install and help

1.1 Install ISLET

To install the package, start R (version 4.1.0 or higher) and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ISLET")

1.2 How to get help

You may post your question on ISLET’s GitHub Issue section: https://github.com/haoharryfeng/ISLET/issues.

2 Introduction

In clinical samples, the observed bulk sequencing/microarray data are often a mixture of different cell types. Because each unique cell type has its own gene expression profile, the real sequencing/microarray data are the weighted average of signals from multiple pure cell types. In high-throughput data analysis, the mixing proportions will confound with the primary phenotype-of-interest, if not properly accounted for. Over the past several years, researchers have gained substantial interests in using computational methods to deconvolute cell compositions. Under the assumption of a commonly shared feature-by-cell-type reference panel across all samples, deconvolution methods were developed. However, this assumption may not hold. For example, when repeated samples are measured for each subject, assuming a shared reference panel across different time points for each subject is a preferred choice over assuming a shared one across all the samples.

Here, we developed a method called ISLET (Individual-Specific ceLl typE referencing Tool), to solve for the individual-specific and cell-type-specific reference panels, once the cell type proportions are given. ISLET can leverage on multiple observations or temporal measurements of the same subject. ISLET adopted a more reasonable assumption that repeated samples from the same subject would share the same reference panel. This unknown subject-specific panel, treated as missing values, are modeled by Gaussian distribution in the mixed-effect regression framework and estimated by an iterative Expectation–Maximization (EM) algorithm, when combining all samples from all subjects together. This is the first statistical framework to estimate the subject-level cell-type-specific reference panel, for repeated measures. Our modeling can effectively borrow information across samples within the same subject. ISLET can deconvolve reference panels based on the raw counts without batch effect in library size or the normalized counts such as Transcript Per Million (TPM). In the current version, ISLET performs cell-type-specific differential expression analysis for two groups of subjects. Other covariates and additional groups will be added in future versions.

Schematic overview of ISLET workflow.

ISLET depends on the following packages:

BiocParallel, for parallel computing implementation,
Matrix, for large matrices operations in R,
SummarizedExperiment, to store rectangular matrices of experimental results.

3 ISLET input files

ISLET needs one input file organized into SummarizedExperiment objects, combining cases and controls. The input file should contain a feature by sample matrix for observed values stored in the counts slot. It should also use the first column in colData slot to store the group status (i.e. case/control), the second column in the colData slot to store the subject IDs mapping to each sample. The remaining columns in the colData slot should store the cell type proportions. In other words, use the column 3 to K+2 to store the cell type proportions for all K cell types. An example dataset GE600 is included:

Step 1: Load in example data.

library(ISLET)
data(GE600)
ls()

## [1] "GE600_se"

GE600_se

## class: SummarizedExperiment 
## dim: 10 520 
## metadata(0):
## assays(1): counts
## rownames(10): gene1 gene2 ... gene9 gene10
## rowData names(0):
## colnames(520): 6454256 1716203 ... 3657905 2440389
## colData names(8): group subject_ID ... Mono Others

It contains a SummarizedExperiment object containing the following elements:

counts stores the gene expression value data frame of 10 genes by 520 sample, with 83 cases and 89 controls, and multiple repeated measurements (i.e. time points) per subject. Each row is a gene and each column is a sample.

assays(GE600_se)$counts[1:5, 1:6]

##         6454256   1716203    8125261    6264143    5640428   3764673
## gene1 51.796233 50.593053  29.870425  55.052521 193.789977 60.638259
## gene2  1.486545  1.995879   2.670510   1.915750   1.456993  1.786843
## gene3 34.213228 41.148280  50.366612  15.964298  45.800864 22.525563
## gene4  5.788845  4.489374   7.776873   1.094732   1.451257  1.388009
## gene5 67.010747 76.308153 106.936595 257.108819  86.269951 67.152534

colData stores the sample meta-data and the input cell type proportions. The first column is the group status (i.e. case/ctrl), the second column is the subject ID, shows the relationship between the 520 samples IDs and their 172 subject IDs. The remaining 6 columns (i.e. column 3-8) are the cell type proportions of all samples by their 6 cell types. The 6 cell types are: B cells, Tcells_CD4, Tcells_CD8, NK cells, Mono cells, and others cells.

colData(GE600_se)

## DataFrame with 520 rows and 8 columns
##               group subject_ID    Bcells Tcells_CD4 Tcells_CD8   NKcells
##         <character>  <integer> <numeric>  <numeric>  <numeric> <numeric>
## 6454256        case     210298  0.294597  0.0459207  0.0960261 0.0245194
## 1716203        case     210298  0.229228  0.0307202  0.0874901 0.0237722
## 8125261        case     210298  0.229506  0.0429694  0.1207701 0.0212622
## 6264143        case     223361  0.262023  0.0127117  0.0520090 0.0194373
## 5640428        case     223361  0.124125  0.0645530  0.0586977 0.0615492
## ...             ...        ...       ...        ...        ...       ...
## 5220586        ctrl     954888  0.426594 0.04046180  0.0854448 0.0184139
## 4601267        ctrl     954888  0.332744 0.04181961  0.0995010 0.0267642
## 6500466        ctrl     999257  0.311047 0.01287898  0.1226221 0.0312183
## 3657905        ctrl     999257  0.242521 0.01412359  0.1105289 0.0241399
## 2440389        ctrl     999257  0.353854 0.00908941  0.1042287 0.0192127
##              Mono    Others
##         <numeric> <numeric>
## 6454256 0.1003072  0.438630
## 1716203 0.1284324  0.500357
## 8125261 0.0736778  0.511814
## 6264143 0.0608441  0.592975
## 5640428 0.2664628  0.424613
## ...           ...       ...
## 5220586 0.1106113  0.318474
## 4601267 0.0876010  0.411570
## 6500466 0.1019383  0.420296
## 3657905 0.0509589  0.557728
## 2440389 0.0952407  0.418374

4 Data preparation

This is the first step required before using ISLET for individual-specific reference deconvolution or testing. This step will prepare your input data for the downstream reference panels deconvolution (function isletSolve) and/or testing of differential expression gene (function isletTest). During this step, the input data in SummarizedExperiment format will be further processed for ISLET. The expression values, from cases and controls respectively, will be extracted. The cell type names, number of cell types, number of cases/controls subject, number of cases/controls samples, will be obtained.

Step 2: Data preparation for downstream ISLET analysis.

study123input <- dataPrep(dat_se=GE600_se)

The output, are the extracted information in a S4 object, and can be overviewed by the function below:

study123input

## First couple of elements from cases and controls: 
##         6454256   1716203  8125261  6264143    5640428   3764673
## gene1 51.796233 50.593053 29.87042 55.05252 193.789977 60.638259
## gene2  1.486545  1.995879  2.67051  1.91575   1.456993  1.786843
## gene3 34.213228 41.148280 50.36661 15.96430  45.800864 22.525563
##         1622468   6724003   3390865   2961023   2297235   1104596
## gene1 69.360096 42.421512 72.869948 59.052826 53.256419 77.130120
## gene2  3.823171  2.414935  3.391997  3.035638  3.539411  3.906642
## gene3 48.030845 40.214322 37.962721 13.309217 11.876977 16.314866
## Design matrices hidded. 
## Total cell type number: 
## [1] 6
## Cell type categories: 
## [1] "Bcells"     "Tcells_CD4" "Tcells_CD8" "NKcells"    "Mono"      
## [6] "Others"    
## Total sample number and subject number: 
## [1] 520 172
## Total case number and ctrl number: 
## [1] 83 89
## First several subject ID for the samples: 
##  [1] 210298 210298 210298 223361 223361 223361 228055 228055 228055 228055
## Data preparation type (intercept/slope): 
## [1] "intercept"

[Attention] Here we have strict requirement for the input data. Each subject ID represents a unique participant across cases and controls. Subjects also need to be sorted.

5 Deconvolve individual-specific reference panel

Step 3: With the curated data study123input from the previous step, now we can use ISLET to conduct deconvolution and obtain the individual-specific and cell-type-specific reference panels. This process can be achieved by running:

#Use ISLET for deconvolution
res.sol <- isletSolve(input=study123input)

The res.sol is the deconvolution result list. For both case and control group, the deconvolution result is a list of length K, where K is the number of cell types. For each of the K elements, it is a matrix of dimension G by N. For each of the K cell types, it stores the deconvoluted values in a feature (G) by subject (N) matrix,

#View the deconvolution results
caseVal <- caseEst(res.sol)
ctrlVal <- ctrlEst(res.sol)
length(caseVal) #For cases, a list of 6 cell types' matrices.

## [1] 6

length(ctrlVal) #For controls, a list of 6 cell types' matrices.

## [1] 6

caseVal$Bcells[1:5, 1:4] #view the reference panels for B cells, for the first 5 genes and first 4 subjects, in Case group.

##           210298       223361    228055   229203
## gene1  0.0000000  0.004321134  1.134649  0.00000
## gene2  0.7402177  0.858061741  1.481642  0.11822
## gene3  9.3032183  4.673933671  4.991038  0.00000
## gene4 15.0234343  2.922345346  5.235796 15.70844
## gene5 26.9082246 31.428902564 15.347462 30.43319

ctrlVal$Bcells[1:5, 1:4] #view the reference panels for B cells, for the first 5 genes and first 4 subjects, in Control group.

##         225490    230198    248848    253527
## gene1  0.00000  0.000000  0.000000  0.000000
## gene2  2.73228  2.678124  2.163444  2.285876
## gene3 28.81544  2.697245 37.830728  5.347082
## gene4 13.66613 29.445485  6.849386  8.915755
## gene5 62.50238 53.985388 74.033019 52.470559

case.ind.ref A list of length K, where K is the number of cell types. For each of the K elements in this list, it is a feature by subject matrix containing all the feature values (i.e. gene expression values), for case group. It is one of the main products the individual-specific and cell-type-specific solve algorithm. ctrl.ind.ref A list of length K, where K is the number of cell types. For each of the K elements in this list, it is a feature by subject matrix containing all the feature values (i.e., gene expression values), for control group. It is one of the main products the individual-specific and cell-type-specific solve algorithm. mLLK A scalar, the optimized marginal log-likelihood for the current model. It will be used in Likelihood Ratio Test (LRT).

6 Test cell-type-specific differential expression (csDE) in mean (intercept)

Also, with the curated data study123input from the previous Step 2, now we can test the group effect on individual reference panels, i.e., identifying csDE genes in mean or intercept. In this ‘intercept test’, we assume that the individual-specific reference panel is unchanged across time points. Note that Step 3 can be skipped, if one only need to call csDE genes. This test is done by the following line of code:

#Test for csDE genes
res.test <- isletTest(input=study123input)

## csDE testing on cell type 1 
## csDE testing on cell type 2 
## csDE testing on cell type 3 
## csDE testing on cell type 4 
## csDE testing on cell type 5 
## csDE testing on cell type 6 
## csDE testing on 6 cell types finished

The result res.test is a matrix of p-values, in the dimension of feature by cell type. Each element is the LRT p-value, by contrasting case group and control group, for one feature in one cell type.

#View the test p-values
head(res.test)

##          Bcells Tcells_CD4 Tcells_CD8    NKcells       Mono     Others
## [1,] 0.06649261 0.77489318  0.6077971 0.15315973 0.69627533 0.02982063
## [2,] 0.41736594 0.05028712  0.6539022 0.39584728 0.88311825 0.79493488
## [3,] 0.58806617 0.78104069  0.4299641 0.17664435 0.05892608 0.37546502
## [4,] 0.25000763 0.09684269  0.6367174 0.90274779 0.67013276 1.00000000
## [5,] 0.40303484 0.40255103  0.7866814 0.04603692 0.62486315 0.58097702
## [6,] 0.02558441 0.55927759  0.1807293 0.85494477 0.88159601 0.94549772

7 Test csDE in change-rate (slope)

Given an additional continuous variable such as time or age, ISLET is able to compare cases and controls in the change-rate of reference profile over time. This is the ‘slope test’. Here, the assumption is that for the participants or subjects in a group, the individual reference profile could change over time, with change-rate fixed by group. At a given time point, there may be no (significant) group effect in the reference panel, but the participants still have distinct underlying reference profiles. Under this setting, it is of interest to test for such difference. Below is an example to detect reference panel change-rate difference between two groups, from data preparation to test.

We provide an additional example dataset GE600age from the initial step to illustrate this. Different from the dataset GE600 above, here GE600age has an additional age column in the colData, besides subject ID and cell type proportions. This covariate age is required for the test.

Step 1: Load example dataset.

#(1) Example dataset for 'slope' test
data(GE600age)
ls()

## [1] "GE600_se"      "GE600age_se"   "caseVal"       "ctrlVal"      
## [5] "res.sol"       "res.test"      "study123input"

Similar to previous sections, it contains one SummarizedExperiment objects containing the following elements:

counts has the gene expression value data frame of 10 genes by 520 sample, with 83 cases and 89 controls, and multiple repeated measurements (i.e. time points) per subject. Each row is a gene and each column is a sample.

assays(GE600age_se)$counts[1:5, 1:6]

##         6454256   1716203    8125261    6264143    5640428   3764673
## gene1 51.796233 50.593053  29.870425  55.052521 193.789977 60.638259
## gene2  1.486545  1.995879   2.670510   1.915750   1.456993  1.786843
## gene3 34.213228 41.148280  50.366612  15.964298  45.800864 22.525563
## gene4  5.788845  4.489374   7.776873   1.094732   1.451257  1.388009
## gene5 67.010747 76.308153 106.936595 257.108819  86.269951 67.152534

colData contains the sample meta-data. The first column is the case/ctrl group status, the second column is the subject ID, shows the relationship between the samples IDs and the corresopnding subject IDs. The third column is the age variable for each sample, which is the main variable in downstream testing. The remaining 6 columns (i.e. column 4-9) are the cell type proportions of all samples by their 6 cell types. The 6 cell types are: B cells, Tcells_CD4, Tcells_CD8, NK cells, Mono cells, and others cells.

colData(GE600age_se)

## DataFrame with 520 rows and 9 columns
##               group subject_ID       age    Bcells Tcells_CD4 Tcells_CD8
##         <character>  <integer> <numeric> <numeric>  <numeric>  <numeric>
## 6454256        case     210298   9.63333  0.294597  0.0459207  0.0960261
## 1716203        case     210298  12.26667  0.229228  0.0307202  0.0874901
## 8125261        case     210298  15.50000  0.229506  0.0429694  0.1207701
## 6264143        case     223361   8.43333  0.262023  0.0127117  0.0520090
## 5640428        case     223361  16.66667  0.124125  0.0645530  0.0586977
## ...             ...        ...       ...       ...        ...        ...
## 5220586        ctrl     954888   16.2333  0.426594 0.04046180  0.0854448
## 4601267        ctrl     954888   19.1000  0.332744 0.04181961  0.0995010
## 6500466        ctrl     999257   12.8667  0.311047 0.01287898  0.1226221
## 3657905        ctrl     999257   15.2000  0.242521 0.01412359  0.1105289
## 2440389        ctrl     999257   18.0333  0.353854 0.00908941  0.1042287
##           NKcells      Mono    Others
##         <numeric> <numeric> <numeric>
## 6454256 0.0245194 0.1003072  0.438630
## 1716203 0.0237722 0.1284324  0.500357
## 8125261 0.0212622 0.0736778  0.511814
## 6264143 0.0194373 0.0608441  0.592975
## 5640428 0.0615492 0.2664628  0.424613
## ...           ...       ...       ...
## 5220586 0.0184139 0.1106113  0.318474
## 4601267 0.0267642 0.0876010  0.411570
## 6500466 0.0312183 0.1019383  0.420296
## 3657905 0.0241399 0.0509589  0.557728
## 2440389 0.0192127 0.0952407  0.418374

[Attention] This time/age covariate must be stored in the third column in colData, to successfully execute this testing. The data must be sorted by subject ID, so that the multiple replicates per subject are close to each other.

Step 2: Data preparation.

#(2) Data preparation
study456input <- dataPrepSlope(dat_se=GE600age_se)

## Begin: working on data preparation as the input for ISLET algorithm.

## Complete: data preparation for ISLET.

Step 3: ‘Slope’ testing.

#(3) Test for slope effect(i.e. age) difference in csDE testing
age.test <- isletTest(input=study456input)

## csDE testing on cell type 1 
## csDE testing on cell type 2 
## csDE testing on cell type 3 
## csDE testing on cell type 4 
## csDE testing on cell type 5 
## csDE testing on cell type 6 
## csDE testing on 6 cell types finished

The result age.test is a matrix of p-values, in the dimension of feature by cell type. Each element is the LRT p-value, by contrasting case group and control group, for one feature in one cell type. In contrast to the (intercept) test described before, here is a test for difference of the expression CHANGE IN REFERENCE over time, between cases and controls.

#View the test p-values
head(age.test)

##         Bcells Tcells_CD4 Tcells_CD8   NKcells       Mono     Others
## [1,] 1.0000000  0.7031954 0.03177445 0.8321539 0.02039022 0.09609757
## [2,] 0.7085139  0.9329100 0.79584834 0.3971310 0.80106928 0.76964312
## [3,] 0.3392235  0.1902524 0.06803089 0.1597954 0.03443943 0.85405799
## [4,] 0.1280650  0.5198595 0.40505410 0.1427035 0.53713379 0.11159749
## [5,] 0.0851573  0.6204602 0.87195749 0.7323157 0.24655800 0.50021983
## [6,] 0.4048984  0.6680858 0.71375795 0.9476769 0.50651788 0.30298067

Session info

## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ISLET_1.2.0                 SummarizedExperiment_1.30.0
##  [3] Biobase_2.60.0              GenomicRanges_1.52.0       
##  [5] GenomeInfoDb_1.36.0         IRanges_2.34.0             
##  [7] S4Vectors_0.38.0            BiocGenerics_0.46.0        
##  [9] MatrixGenerics_1.12.0       matrixStats_0.63.0         
## [11] BiocParallel_1.34.0         Matrix_1.5-4               
## [13] BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.4          compiler_4.3.0          BiocManager_1.30.20    
##  [4] bitops_1.0-7            jquerylib_0.1.4         yaml_2.3.7             
##  [7] fastmap_1.1.1           mime_0.12               lattice_0.21-8         
## [10] R6_2.5.1                XVector_0.40.0          knitr_1.42             
## [13] DelayedArray_0.26.0     bookdown_0.33           GenomeInfoDbData_1.2.10
## [16] bslib_0.4.2             rlang_1.1.0             cachem_1.0.7           
## [19] xfun_0.39               sass_0.4.5              cli_3.6.1              
## [22] zlibbioc_1.46.0         digest_0.6.31           grid_4.3.0             
## [25] evaluate_0.20           codetools_0.2-19        RCurl_1.98-1.12        
## [28] rmarkdown_2.21          tools_4.3.0             htmltools_0.5.5