Analysing single-cell RNA-sequencing Data

Johannes Griss

2024-11-29

Introduction

The ReactomeGSA package is a client to the web-based Reactome Analysis System. Essentially, it performs a gene set analysis using the latest version of the Reactome pathway database as a backend.

This vignette shows how the ReactomeGSA package can be used to perform a pathway analysis of cell clusters in single-cell RNA-sequencing data.

Citation

To cite this package, use

Griss J. ReactomeGSA, https://github.com/reactome/ReactomeGSA (2019)

Installation

The ReactomeGSA package can be directly installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!require(ReactomeGSA))
  BiocManager::install("ReactomeGSA")

# install the ReactomeGSA.data package for the example data
if (!require(ReactomeGSA.data))
  BiocManager::install("ReactomeGSA.data")

For more information, see https://bioconductor.org/install/.

Example data

As an example we load single-cell RNA-sequencing data of B cells extracted from the dataset published by Jerby-Arnon et al. (Cell, 2018).

Note: This is not a complete Seurat object. To decrease the size, the object only contains gene expression values and cluster annotations.

library(ReactomeGSA.data)
#> Loading required package: limma
#> Loading required package: edgeR
#> Loading required package: ReactomeGSA
#> Loading required package: Seurat
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, t
data(jerby_b_cells)

jerby_b_cells
#> An object of class Seurat 
#> 23686 features across 920 samples within 1 assay 
#> Active assay: RNA (23686 features, 0 variable features)
#>  2 layers present: counts, data

Types of analyses

There are two methods to analyse scRNA-seq data using ReactomeGSA:

ReactomeGSA can generate pseudo-bulk data from scRNA-seq data and then analyse this data using the classical quantitative pathway analysis algorithms. Thereby, it is possible to directly compare, f.e. two cell types with each other or two treatment approaches. The result is a classical pathway analysis result with p-values and fold-changes attributed to each pathway.

The analyse_sc_clusters function offers a second approach using the gene set variation algorithm ssGSEA to derive pathway-level quantitative values for each cluster or cell type in the dataset. This is helpful to visualize the “expression level” of pathways accross the dataset. Statistical analyses have to be performed separately.

Comparative pathway analysis (pseudo-bulk approach)

The pathway analysis is at the very end of a scRNA-seq workflow. This means, that any Q/C was already performed, the data was normalized and cells were already clustered.

In this example we are going to compare Cluster 1 against Cluster 2.

# store the Idents as a meta.data field - this was
# removed while compressing the object as an example
jerby_b_cells$clusters <- Idents(jerby_b_cells)

table(jerby_b_cells$clusters)
#> 
#>  Cluster 4  Cluster 8  Cluster 1 Cluster 11  Cluster 3  Cluster 6  Cluster 2 
#>        106         54        140         31        109         85        114 
#>  Cluster 7  Cluster 9 Cluster 13  Cluster 5 Cluster 12 Cluster 10 
#>         55         47         24         90         25         40

As a next step, we need to create the pseudo-bulk data for the analysis. This is achieved through the generate_pseudo_bulk_data function.

library(ReactomeGSA)

# This creates a pseudo-bulk object by splitting each cluster in 4
# random bulks of data. This approach can be changed through the
# split_by and k_variable parameter.
pseudo_bulk_data <- generate_pseudo_bulk_data(jerby_b_cells, group_by = "clusters")
#> Centering and scaling data matrix

# we can immediately create the metadata data.frame for this data
pseudo_metadata <- generate_metadata(pseudo_bulk_data)

This pseudo-bulk data is directly compatible with the existing algorithms for quantitative pathway analysis and can be processed using the respective ReactomeGSA methods.

# Create a new request object using 'Camera' for the gene set analysis
my_request <- ReactomeAnalysisRequest(method = "Camera")

# set the maximum number of allowed missing values to 50%
my_request <- set_parameters(request = my_request, max_missing_values = 0.5)

# add the pseudo-bulk data as a dataset
my_request <- add_dataset(request = my_request,
                          expression_values = pseudo_bulk_data,
                          sample_data = pseudo_metadata,
                          name = "Pseudo-bulk",
                          type = "rnaseq_counts",
                          comparison_factor = "Group",
                          comparison_group_1 = "Cluster 1",
                          comparison_group_2 = "Cluster 2")
#> Converting expression data to string... (This may take a moment)
#> Conversion complete

my_request
#> ReactomeAnalysisRequestObject
#>   Method = Camera
#>   Parameters:
#>   - max_missing_values: 0.5
#>   Datasets:
#>   - Pseudo-bulk (rnaseq_counts)
#>     No parameters set.
#> ReactomeAnalysisRequest

This request object can be directly submitted to the ReactomeGSA analysis.

quant_result <- perform_reactome_analysis(my_request, compress = FALSE)
#> Submitting request to Reactome API...
#> Reactome Analysis submitted succesfully
#> Converting dataset Pseudo-bulk...
#> Mapping identifiers...
#> Performing gene set analysis using Camera
#> Analysing dataset 'Pseudo-bulk' using Camera
#> Creating REACTOME visualization
#> Retrieving result...
quant_result
#> ReactomeAnalysisResult object
#>   Reactome Release: 90
#>   Results:
#>   - Pseudo-bulk:
#>     2573 pathways
#>     10419 fold changes for genes
#>   Reactome visualizations:
#>   - Gene Set Analysis Summary
#> ReactomeAnalysisResult

This object can be used in the same fashion as any ReactomeGSA result object.

# get the pathway-level results
quant_pathways <- pathways(quant_result)
head(quant_pathways)
#>                                                                                       Name
#> R-HSA-156842                                             Eukaryotic Translation Elongation
#> R-HSA-192823                                                        Viral mRNA Translation
#> R-HSA-156902                                                      Peptide chain elongation
#> R-HSA-72764                                             Eukaryotic Translation Termination
#> R-HSA-975956  Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC)
#> R-HSA-2408557                                                     Selenocysteine synthesis
#>               Direction.Pseudo-bulk FDR.Pseudo-bulk PValue.Pseudo-bulk
#> R-HSA-156842                     Up    2.777595e-23       1.403481e-26
#> R-HSA-192823                     Up    2.777595e-23       2.159032e-26
#> R-HSA-156902                     Up    9.060867e-23       1.056456e-25
#> R-HSA-72764                      Up    2.062321e-21       3.206096e-24
#> R-HSA-975956                     Up    5.617992e-21       1.091720e-23
#> R-HSA-2408557                    Up    5.686902e-21       1.326133e-23
#>               NGenes.Pseudo-bulk av_foldchange.Pseudo-bulk sig.Pseudo-bulk
#> R-HSA-156842                  81                 0.3500341            TRUE
#> R-HSA-192823                  78                 0.3725893            TRUE
#> R-HSA-156902                  78                 0.3481714            TRUE
#> R-HSA-72764                   82                 0.3155141            TRUE
#> R-HSA-975956                  84                 0.3166279            TRUE
#> R-HSA-2408557                 81                 0.2437127            TRUE
# get the top pathways to label them
library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

top_pathways <- quant_pathways %>% 
  tibble::rownames_to_column(var = "Id") %>%
  filter(`FDR.Pseudo-bulk` < 0.001) %>%
  arrange(desc(`av_foldchange.Pseudo-bulk`))

# limit to a few pathway for aesthetic reasons
top_pathways <- top_pathways[c(1,5,6), ]

# create a simple volcano plot of the pathway results
library(ggplot2)
ggplot(quant_pathways,
       aes(x = `av_foldchange.Pseudo-bulk`,
           y = -log10(`FDR.Pseudo-bulk`))) +
  geom_vline(xintercept = c(log2(0.5), log2(2)), colour="grey", linetype = "longdash") +
  geom_hline(yintercept = -log10(0.05), colour="grey", linetype = "longdash") +
  geom_point() +
  geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, check_overlap = TRUE)
#> Warning in geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, :
#> Ignoring unknown parameters: `check_overlap`

Pathway analysis of cell clusters (analyse_sc_clusters)

The ReactomeGSA package can now be used to get pathway-level expression values for every cell cluster. This is achieved by calculating the mean gene expression for every cluster and then submitting this data to a gene set variation analysis.

All of this is wrapped in the single analyse_sc_clusters function.

gsva_result <- analyse_sc_clusters(jerby_b_cells, verbose = TRUE)
#> Calculating average cluster expression...
#> Converting expression data to string... (This may take a moment)
#> Conversion complete
#> Submitting request to Reactome API...
#> Compressing request data...
#> Reactome Analysis submitted succesfully
#> Converting dataset Seurat...
#> Mapping identifiers...
#> Performing gene set analysis using ssGSEA
#> Analysing dataset 'Seurat' using ssGSEA
#> Retrieving result...

The resulting object is a standard ReactomeAnalysisResult object.

gsva_result
#> ReactomeAnalysisResult object
#>   Reactome Release: 90
#>   Results:
#>   - Seurat:
#>     1741 pathways
#>     11759 fold changes for genes
#>   No Reactome visualizations available
#> ReactomeAnalysisResult

pathways returns the pathway-level expression values per cell cluster:

pathway_expression <- pathways(gsva_result)

# simplify the column names by removing the default dataset identifier
colnames(pathway_expression) <- gsub("\\.Seurat", "", colnames(pathway_expression))

pathway_expression[1:3,]
#>                                  Name   Cluster_1  Cluster_10 Cluster_11
#> R-HSA-1059683 Interleukin-6 signaling -0.03556065 -0.04521415 0.14931444
#> R-HSA-109703      PKB-mediated events  0.32848879 -0.19771237 0.05087568
#> R-HSA-109704             PI3K Cascade -0.12080150 -0.13450596 0.16088061
#>                Cluster_12   Cluster_13   Cluster_2   Cluster_3   Cluster_4
#> R-HSA-1059683  0.06075330 0.0004459001  0.14896349 -0.09445591 -0.11884117
#> R-HSA-109703   0.09556102 0.0274677841 -0.05142555  0.15462891 -0.16856699
#> R-HSA-109704  -0.04114424 0.0680222582  0.05924518  0.05295837 -0.02208554
#>                 Cluster_5   Cluster_6   Cluster_7    Cluster_8   Cluster_9
#> R-HSA-1059683 -0.13882420 -0.07577576 -0.06560393  0.174814091 -0.04388071
#> R-HSA-109703   0.10554980 -0.06251424 -0.30506281 -0.007670405 -0.09929333
#> R-HSA-109704   0.02008053 -0.12603775 -0.01371145  0.101810119 -0.04142070

A simple approach to find the most relevant pathways is to assess the maximum difference in expression for every pathway:

# find the maximum differently expressed pathway
max_difference <- do.call(rbind, apply(pathway_expression, 1, function(row) {
    values <- as.numeric(row[2:length(row)])
    return(data.frame(name = row[1], min = min(values), max = max(values)))
}))

max_difference$diff <- max_difference$max - max_difference$min

# sort based on the difference
max_difference <- max_difference[order(max_difference$diff, decreasing = T), ]

head(max_difference)
#>                                                                    name
#> R-HSA-140180                                              COX reactions
#> R-HSA-1296067                              Potassium transport channels
#> R-HSA-392023  Adrenaline signalling through Alpha-2 adrenergic receptor
#> R-HSA-8964540                                        Alanine metabolism
#> R-HSA-3248023                                       Regulation by TREX1
#> R-HSA-350864                     Regulation of thyroid hormone activity
#>                      min       max     diff
#> R-HSA-140180  -0.9647899 0.9841810 1.948971
#> R-HSA-1296067 -1.0000000 0.8858649 1.885865
#> R-HSA-392023  -0.9008335 0.9738051 1.874639
#> R-HSA-8964540 -0.8731077 0.9938765 1.866984
#> R-HSA-3248023 -0.9185236 0.9421670 1.860691
#> R-HSA-350864  -0.9134206 0.9394455 1.852866

Plotting the results

The ReactomeGSA package contains two functions to visualize these pathway results. The first simply plots the expression for a selected pathway:

plot_gsva_pathway(gsva_result, pathway_id = rownames(max_difference)[1])

For a better overview, the expression of multiple pathways can be shown as a heatmap using gplots heatmap.2 function:

# Additional parameters are directly passed to gplots heatmap.2 function
plot_gsva_heatmap(gsva_result, max_pathways = 15, margins = c(6,20))

The plot_gsva_heatmap function can also be used to only display specific pahtways:

# limit to selected B cell related pathways
relevant_pathways <- c("R-HSA-983170", "R-HSA-388841", "R-HSA-2132295", "R-HSA-983705", "R-HSA-5690714")
plot_gsva_heatmap(gsva_result, 
                  pathway_ids = relevant_pathways, # limit to these pathways
                  margins = c(6,30), # adapt the figure margins in heatmap.2
                  dendrogram = "col", # only plot column dendrogram
                  scale = "row", # scale for each pathway
                  key = FALSE, # don't display the color key
                  lwid=c(0.1,4)) # remove the white space on the left
#> Warning in plot_gsva_heatmap(gsva_result, pathway_ids = relevant_pathways, :
#> Warning: No results for the following pathways: R-HSA-983705

This analysis shows us that cluster 8 has a marked up-regulation of B Cell receptor signalling, which is linked to a co-stimulation of the CD28 family. Additionally, there is a gradient among the cluster with respect to genes releated to antigen presentation.

Therefore, we are able to further classify the observed B cell subtypes based on their pathway activity.

Pathway-level PCA

The pathway-level expression analysis can also be used to run a Principal Component Analysis on the samples. This is simplified through the function plot_gsva_pca:

plot_gsva_pca(gsva_result)

In this analysis, cluster 11 is a clear outlier from the other B cell subtypes and therefore might be prioritised for further evaluation.

Session Info

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.5.1           dplyr_1.1.4             tidyr_1.3.1            
#>  [4] ReactomeGSA.data_1.21.0 Seurat_5.1.0            SeuratObject_5.0.2     
#>  [7] sp_2.1-4                ReactomeGSA_1.21.2      edgeR_4.5.0            
#> [10] limma_3.63.2           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3     jsonlite_1.8.9         magrittr_2.0.3        
#>   [4] spatstat.utils_3.1-1   farver_2.1.2           rmarkdown_2.29        
#>   [7] vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.3-3
#>  [10] progress_1.2.3         htmltools_0.5.8.1      curl_6.0.1            
#>  [13] sass_0.4.9             sctransform_0.4.1      parallelly_1.39.0     
#>  [16] KernSmooth_2.23-24     bslib_0.8.0            htmlwidgets_1.6.4     
#>  [19] ica_1.0-3              plyr_1.8.9             plotly_4.10.4         
#>  [22] zoo_1.8-12             cachem_1.1.0           igraph_2.1.1          
#>  [25] mime_0.12              lifecycle_1.0.4        pkgconfig_2.0.3       
#>  [28] Matrix_1.7-1           R6_2.5.1               fastmap_1.2.0         
#>  [31] fitdistrplus_1.2-1     future_1.34.0          shiny_1.9.1           
#>  [34] digest_0.6.37          colorspace_2.1-1       patchwork_1.3.0       
#>  [37] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
#>  [40] labeling_0.4.3         progressr_0.15.1       fansi_1.0.6           
#>  [43] spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7       
#>  [46] abind_1.4-8            compiler_4.5.0         withr_3.0.2           
#>  [49] fastDummies_1.7.4      gplots_3.2.0           MASS_7.3-61           
#>  [52] gtools_3.9.5           caTools_1.18.3         tools_4.5.0           
#>  [55] lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.3   
#>  [58] goftest_1.2-3          glue_1.8.0             nlme_3.1-166          
#>  [61] promises_1.3.2         grid_4.5.0             Rtsne_0.17            
#>  [64] cluster_2.1.6          reshape2_1.4.4         generics_0.1.3        
#>  [67] gtable_0.3.6           spatstat.data_3.1-4    hms_1.1.3             
#>  [70] data.table_1.16.2      utf8_1.2.4             BiocGenerics_0.53.3   
#>  [73] spatstat.geom_3.3-4    RcppAnnoy_0.0.22       ggrepel_0.9.6         
#>  [76] RANN_2.6.2             pillar_1.9.0           stringr_1.5.1         
#>  [79] spam_2.11-0            RcppHNSW_0.6.0         later_1.4.1           
#>  [82] splines_4.5.0          lattice_0.22-6         survival_3.7-0        
#>  [85] deldir_2.0-4           tidyselect_1.2.1       locfit_1.5-9.10       
#>  [88] miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.49            
#>  [91] gridExtra_2.3          scattermore_1.2        xfun_0.49             
#>  [94] Biobase_2.67.0         statmod_1.5.0          matrixStats_1.4.1     
#>  [97] stringi_1.8.4          lazyeval_0.2.2         yaml_2.3.10           
#> [100] evaluate_1.0.1         codetools_0.2-20       tibble_3.2.1          
#> [103] cli_3.6.3              uwot_0.2.2             xtable_1.8-4          
#> [106] reticulate_1.40.0      munsell_0.5.1          jquerylib_0.1.4       
#> [109] Rcpp_1.0.13-1          globals_0.16.3         spatstat.random_3.3-2 
#> [112] png_0.1-8              spatstat.univar_3.1-1  parallel_4.5.0        
#> [115] prettyunits_1.2.0      dotCall64_1.2          bitops_1.0-9          
#> [118] listenv_0.9.1          viridisLite_0.4.2      scales_1.3.0          
#> [121] ggridges_0.5.6         crayon_1.5.3           leiden_0.4.3.1        
#> [124] purrr_1.0.2            rlang_1.1.4            cowplot_1.1.3