GSEA.Rmd
The SeuratExtend
package integrates both the GO and
Reactome databases, streamlining the GSEA analysis process. This is
primarily facilitated through the GeneSetAnalysisGO
and
GeneSetAnalysisReactome
functions, among other
supplementary functions. In this section, we’ll delve into the usage and
features of these functions.
Performing GSEA using the GO database can be resource-intensive due
to its extensive size. To make the analysis more feasible, you might
consider evaluating pathways under specific categories. For instance, in
the example below, only the pathways under the “immune_system_process”
category are evaluated. The results from this analysis are saved in the
location: seu@misc$AUCell$GO[[title]]
.
library(SeuratExtend)
library(dplyr)
options(max.print = 12, spe = "human")
pbmc <- GeneSetAnalysisGO(pbmc, parent = "immune_system_process", nCores = 4) # calculating with 4 cores
matr <- pbmc@misc$AUCell$GO$immune_system_process
matr <- RenameGO(matr)
head(matr, 2:3)
## cells
## gene sets CTATAAGATCGTTT-1 GTGATTCTGGTTCA-1
## GO:0002376 immune system process (3213g) 0.25078236 0.29242312
## GO:0001776 leukocyte homeostasis (84g) 0.05491224 0.05165926
## cells
## gene sets ACGTTGGACCGTAA-1
## GO:0002376 immune system process (3213g) 0.30178639
## GO:0001776 leukocyte homeostasis (84g) 0.05097954
For the “parent” argument, you can input any term from the GO
database, be it a GO ID or a pathway name. To get a glimpse of commonly
used GO categories, you can run GeneSetAnalysisGO()
without
any arguments:
## immune_system_process response_to_stimulus
## "GO:0002376" "GO:0050896"
## signaling metabolic_process
## "GO:0023052" "GO:0008152"
## regulation_of_vasculature_development signal_transduction
## "GO:1901342" "GO:0007165"
Here are some suggested visualization methods:
VlnPlot2(matr[1:3,], f = pbmc$cluster, ncol = 1)
WaterfallPlot(matr, f = pbmc$cluster, ident.1 = "B cell", ident.2 = "CD8 T cell", top.n = 20)
For GSEA using the Reactome database, consider assessing pathways
under certain categories to make the process more manageable. The
example below evaluates pathways under the “Immune System” category.
Results from this analysis are saved under:
seu@misc$AUCell$Reactome[[title]]
.
pbmc <- GeneSetAnalysisReactome(pbmc, parent = "Immune System")
matr <- pbmc@misc$AUCell$Reactome$`Immune System`
matr <- RenameReactome(matr)
Heatmap(CalcStats(matr, f = pbmc$cluster, order = "p", n = 4), lab_fill = "zscore")
Similar to the GO database, running
GeneSetAnalysisReactome()
without any arguments lets you
view commonly used categories in the Reactome database:
## R-HSA-109582 R-HSA-112316
## "Hemostasis" "Neuronal System"
## R-HSA-1266738 R-HSA-1430728
## "Developmental Biology" "Metabolism"
## R-HSA-1474165 R-HSA-1474244
## "Reproduction" "Extracellular matrix organization"
## R-HSA-1500931 R-HSA-162582
## "Cell-Cell communication" "Signal Transduction"
## R-HSA-1640170 R-HSA-1643685
## "Cell Cycle" "Disease"
## R-HSA-168256 R-HSA-1852241
## "Immune System" "Organelle biogenesis and maintenance"
## [ reached getOption("max.print") -- omitted 15 entries ]
To conduct a Gene Set Enrichment Analysis (GSEA) with custom gene
sets, the GeneSetAnalysis
function is the tool of choice.
For instance, one might consider utilizing the Hallmark
50 gene set, commonly employed for general screening. This set can
be accessed via the hall50
object. Upon execution, the
resulting AUCell matrix will be stored under the path:
seu@misc$AUCell[[title]]
.
pbmc <- GeneSetAnalysis(pbmc, genesets = hall50$human)
matr <- pbmc@misc$AUCell$genesets
Heatmap(CalcStats(matr, f = pbmc$cluster), lab_fill = "zscore")
For those seeking a plethora of other gene sets, the
SeuratExtendData::Genesets_data
offers an expansive
collection sourced from the GSEA
MSigDB website. Here’s how you can view the available
collections:
names(SeuratExtendData::Genesets_data$human$GSEA)
## [1] "positional gene sets"
## [2] "all curated gene sets"
## [3] "chemical and genetic perturbations"
## [4] "BioCarta gene sets"
## [5] "KEGG gene sets"
## [6] "PID gene sets"
## [7] "all canonical pathways"
## [8] "all motif gene sets"
## [9] "transcription factor targets"
## [10] "all computational gene sets"
## [11] "all immunologic signatures gene sets"
## [12] "hallmark gene sets"
Furthermore, for cluster annotations, the
SeuratExtend::PanglaoDB_data
contains a valuable resource:
marker lists for 178 distinct cell types, curated from PanglaoDB. To explore these
marker lists:
names(SeuratExtend::PanglaoDB_data$marker_list_human)
## [1] "Acinar cells" "Adipocyte progenitor cells"
## [3] "Adipocytes" "Adrenergic neurons"
## [5] "Airway epithelial cells" "Airway goblet cells"
## [7] "Airway smooth muscle cells" "Alpha cells"
## [9] "Alveolar macrophages" "Anterior pituitary gland cells"
## [11] "Astrocytes" "B cells"
## [ reached getOption("max.print") -- omitted 166 entries ]
Navigating the plethora of pathways in databases like GO and Reactome
can be overwhelming. The SearchDatabase
function simplifies
this process by offering a wide array of customizable search
parameters.
The ‘item’ parameter is highly versatile, allowing you to search by gene name, pathway ID, or even keywords within pathway names. The following example demonstrates how to find pathways containing the gene “CD3D” or pathways with names including “metabolic.”
result <- SearchDatabase(c("CD3D", "metabolic"))
names(result)
## [1] "GO:0000023 maltose metabolic process (3g)"
## [2] "GO:0000038 very long-chain fatty acid metabolic process (31g)"
## [3] "GO:0000052 citrulline metabolic process (8g)"
## [4] "GO:0000053 argininosuccinate metabolic process (1g)"
## [5] "GO:0000096 sulfur amino acid metabolic process (43g)"
## [6] "GO:0000255 allantoin metabolic process (2g)"
## [7] "GO:0000270 peptidoglycan metabolic process (6g)"
## [8] "GO:0000820 regulation of glutamine family amino acid metabolic process (6g)"
## [9] "GO:0000821 regulation of arginine metabolic process (2g)"
## [10] "GO:0000959 mitochondrial RNA metabolic process (48g)"
## [11] "GO:0001523 retinoid metabolic process (103g)"
## [12] "GO:0001573 ganglioside metabolic process (21g)"
## [ reached getOption("max.print") -- omitted 1029 entries ]
## List of 3
## $ GO:0000023 maltose metabolic process (3g) :List of 3
## ..$ SetID : chr "GO:0000023"
## ..$ SetName: chr "maltose metabolic process"
## ..$ Genes : chr [1:3] "MGAM" "GAA" "GANC"
## $ GO:0000038 very long-chain fatty acid metabolic process (31g):List of 3
## ..$ SetID : chr "GO:0000038"
## ..$ SetName: chr "very long-chain fatty acid metabolic process"
## ..$ Genes : chr [1:31] "ELOVL7" "HACD1" "SLC27A2" "ACAA1" ...
## $ GO:0000052 citrulline metabolic process (8g) :List of 3
## ..$ SetID : chr "GO:0000052"
## ..$ SetName: chr "citrulline metabolic process"
## ..$ Genes : chr [1:8] "DDAH1" "DDAH2" "OTC" "ASS1" ...
If you wish to limit your search to specific types of items such as gene names, you can utilize the ‘type’ parameter as shown below.
result <- SearchDatabase("CD3D", type = "gene")
names(result)
## [1] "GO:0001775 cell activation (1458g)"
## [2] "GO:0002250 adaptive immune response (719g)"
## [3] "GO:0002253 activation of immune response (738g)"
## [4] "GO:0002376 immune system process (3213g)"
## [5] "GO:0002429 immune response-activating cell surface receptor signaling pathway (493g)"
## [6] "GO:0002520 immune system development (967g)"
## [7] "GO:0002521 leukocyte differentiation (517g)"
## [8] "GO:0002682 regulation of immune system process (1662g)"
## [9] "GO:0002684 positive regulation of immune system process (1213g)"
## [10] "GO:0002757 immune response-activating signal transduction (656g)"
## [11] "GO:0002764 immune response-regulating signaling pathway (690g)"
## [12] "GO:0002768 immune response-regulating cell surface receptor signaling pathway (526g)"
## [ reached getOption("max.print") -- omitted 168 entries ]
To focus your search within a particular database, specify the database name using the ‘database’ parameter.
result <- SearchDatabase("CD3D", database = "Reactome")
names(result)
## [1] "R-HSA-1280218 Adaptive Immune System (817g)"
## [2] "R-HSA-168256 Immune System (2190g)"
## [3] "R-HSA-198933 Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell (191g)"
## [4] "R-HSA-199991 Membrane Trafficking (628g)"
## [5] "R-HSA-202403 TCR signaling (124g)"
## [6] "R-HSA-202424 Downstream TCR signaling (102g)"
## [7] "R-HSA-202427 Phosphorylation of CD3 and TCR zeta chains (27g)"
## [8] "R-HSA-202430 Translocation of ZAP-70 to Immunological synapse (24g)"
## [9] "R-HSA-202433 Generation of second messenger molecules (39g)"
## [10] "R-HSA-388841 Costimulation by the CD28 family (74g)"
## [11] "R-HSA-389948 PD-1 signaling (28g)"
## [12] "R-HSA-5653656 Vesicle-mediated transport (723g)"
## [ reached getOption("max.print") -- omitted 2 entries ]
You can specify either ‘human’ or ‘mouse’ using the ‘spe’ parameter.
result <- SearchDatabase("Cd3d", spe = "mouse")
glimpse(head(result, 3))
## List of 3
## $ GO:0001775 cell activation (1169g) :List of 3
## ..$ SetID : chr "GO:0001775"
## ..$ SetName: chr "cell activation"
## ..$ Genes : chr [1:1169] "Cd86" "Cd79a" "Cd80" "Serpine2" ...
## $ GO:0002250 adaptive immune response (640g):List of 3
## ..$ SetID : chr "GO:0002250"
## ..$ SetName: chr "adaptive immune response"
## ..$ Genes : chr [1:640] "Cd86" "Cd79a" "Msh2" "Traf2" ...
## $ GO:0002376 immune system process (2850g) :List of 3
## ..$ SetID : chr "GO:0002376"
## ..$ SetName: chr "immune system process"
## ..$ Genes : chr [1:2850] "Gm42543" "Cd86" "Cd79a" "Cd80" ...
The function also offers flexibility in output types. For example, if you require a list of pathway IDs for downstream analysis, you can use the ‘return’ parameter as follows.
result <- SearchDatabase("CD3D", return = "ID")
result
## [1] "GO:0001775" "GO:0002250" "GO:0002253" "GO:0002376" "GO:0002429"
## [6] "GO:0002520" "GO:0002521" "GO:0002682" "GO:0002684" "GO:0002757"
## [11] "GO:0002764" "GO:0002768"
## [ reached getOption("max.print") -- omitted 168 entries ]
Alternatively, if you need the output as a gene list formatted for
GeneSetAnalysis
, adjust the ‘return’ parameter like so:
result <- SearchDatabase("CD3D", return = "genelist")
glimpse(head(result, 5))
## List of 5
## $ GO:0001775: chr [1:1458] "IGHV3-64" "IGHV4-4" "IGHV4OR15-8" "IGHV3OR16-12" ...
## $ GO:0002250: chr [1:719] "IGKV3-7" "IGKV1D-42" "IGLV4-69" "IGLV8-61" ...
## $ GO:0002253: chr [1:738] "IGKV2-28" "IGHV3-64" "IGHV4-4" "IGKV2D-30" ...
## $ GO:0002376: chr [1:3213] "IGKV3-7" "IGKV1D-42" "IGLV4-69" "IGLV8-61" ...
## $ GO:0002429: chr [1:493] "IGKV2-28" "IGHV3-64" "IGHV4-4" "IGKV2D-30" ...
To export the result as a data frame, suitable for formats like Excel or CSV, set the ‘export.to.data.frame’ parameter to TRUE.
result <- SearchDatabase("CD3D", export.to.data.frame = TRUE)
glimpse(result)
## Rows: 180
## Columns: 3
## $ SetID <chr> "GO:0001775", "GO:0002250", "GO:0002253", "GO:0002376", "GO:00…
## $ SetName <chr> "cell activation", "adaptive immune response", "activation of …
## $ Genes <chr> "IGHV3-64,IGHV4-4,IGHV4OR15-8,IGHV3OR16-12,IGHV1OR15-1,IGHV3OR…
Lastly, you can also filter a given gene set list with the
SearchPathways
function. For instance, within the “Hallmark
50” database, you can find pathways that include the gene “CD3D” or have
names that contain “interferon.”
SearchPathways(genesets = hall50$human, item = c("CD3D", "interferon"))
## $HALLMARK_INTERFERON_ALPHA_RESPONSE
## [1] "MX1" "ISG15" "AC004551.1" "IFIT3" "IFI44"
## [6] "IFI35" "IRF7" "RSAD2" "IFI44L" "IFITM1"
## [11] "IFI27" "IRF9"
## [ reached getOption("max.print") -- omitted 85 entries ]
##
## $HALLMARK_INTERFERON_GAMMA_RESPONSE
## [1] "STAT1" "ISG15" "IFIT1" "MX1" "IFIT3" "IFI35" "IRF7"
## [8] "IFIT2" "OAS2" "TAP1" "EIF2AK2" "RSAD2"
## [ reached getOption("max.print") -- omitted 188 entries ]
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "PTPRC" "IL12B" "TGFB1" "IL12A" "CD3E" "CD3D" "CD28" "LYN" "HCLS1"
## [10] "IL18" "CRTAM" "IFNG"
## [ reached getOption("max.print") -- omitted 188 entries ]
During the course of analyses, researchers often encounter pathway
IDs from databases such as GO and Reactome. While these IDs are great
for computational tasks, they can be cryptic when it comes to
interpretability. RenameGO
and RenameReactome
functions provide a convenient means to transform these IDs into their
more descriptive pathway names.
The primary parameter these functions require is “item”. This can either be:
FilterGOTerms
or
FilterReactomeTerms
.GeneSetAnalysisGO
or
GeneSetAnalysisReactome
.Convert GO IDs to their respective pathway names for human:
## [1] "GO:0002376 immune system process (3213g)"
## [2] "GO:0050896 response to stimulus (9342g)"
Similarly, for Reactome IDs:
RenameReactome(c("R-HSA-109582","R-HSA-112316"), spe = "human")
## [1] "R-HSA-109582 Hemostasis (679g)" "R-HSA-112316 Neuronal System (411g)"
Both GO and Reactome databases contain thousands of pathways, but not
all of which may be relevant to your study. To streamline the analysis,
you can use the FilterGOTerms
and
FilterReactomeTerms
functions to subset and refine the list
of GO or Reactome pathways based on specific criteria.
Let’s start by looking at how you can filter GO pathways:
parent
parameter. For example, to get
pathways related to the immune system process:
terms <- FilterGOTerms(parent = "GO:0002376")
RenameGO(terms)
## [1] "GO:0001773 myeloid dendritic cell activation (28g)"
## [2] "GO:0001774 microglial cell activation (45g)"
## [3] "GO:0001776 leukocyte homeostasis (84g)"
## [4] "GO:0001777 T cell homeostatic proliferation (4g)"
## [5] "GO:0001779 natural killer cell differentiation (21g)"
## [6] "GO:0001780 neutrophil homeostasis (16g)"
## [7] "GO:0001782 B cell homeostasis (27g)"
## [8] "GO:0001787 natural killer cell proliferation (11g)"
## [9] "GO:0001788 antibody-dependent cellular cytotoxicity (2g)"
## [10] "GO:0001794 type IIa hypersensitivity (4g)"
## [11] "GO:0001802 type III hypersensitivity (2g)"
## [12] "GO:0001806 type IV hypersensitivity (4g)"
## [ reached getOption("max.print") -- omitted 395 entries ]
n.min
and n.max
parameters. Building upon the
pathways we selected under the “immune system process”
(terms
), to keep only those pathways that contain between
10 and 1000 genes:
terms2 <- FilterGOTerms(term = terms, n.min = 10, n.max = 1000)
RenameGO(terms2)
## [1] "GO:0001773 myeloid dendritic cell activation (28g)"
## [2] "GO:0001774 microglial cell activation (45g)"
## [3] "GO:0001776 leukocyte homeostasis (84g)"
## [4] "GO:0001779 natural killer cell differentiation (21g)"
## [5] "GO:0001780 neutrophil homeostasis (16g)"
## [6] "GO:0001782 B cell homeostasis (27g)"
## [7] "GO:0001787 natural killer cell proliferation (11g)"
## [8] "GO:0001865 NK T cell differentiation (10g)"
## [9] "GO:0001867 complement activation, lectin pathway (11g)"
## [10] "GO:0001913 T cell mediated cytotoxicity (46g)"
## [11] "GO:0002200 somatic diversification of immune receptors (77g)"
## [12] "GO:0002204 somatic recombination of immunoglobulin genes involved in immune response (50g)"
## [ reached getOption("max.print") -- omitted 202 entries ]
only.end.terms
parameter to TRUE
.
terms3 <- FilterGOTerms(term = terms, only.end.terms = TRUE)
RenameGO(terms3)
## [1] "GO:0001777 T cell homeostatic proliferation (4g)"
## [2] "GO:0001780 neutrophil homeostasis (16g)"
## [3] "GO:0001787 natural killer cell proliferation (11g)"
## [4] "GO:0001788 antibody-dependent cellular cytotoxicity (2g)"
## [5] "GO:0001802 type III hypersensitivity (2g)"
## [6] "GO:0001806 type IV hypersensitivity (4g)"
## [7] "GO:0001865 NK T cell differentiation (10g)"
## [8] "GO:0001866 NK T cell proliferation (7g)"
## [9] "GO:0001867 complement activation, lectin pathway (11g)"
## [10] "GO:0001905 activation of membrane attack complex (3g)"
## [11] "GO:0001922 B-1 B cell homeostasis (3g)"
## [12] "GO:0002223 stimulatory C-type lectin receptor signaling pathway (112g)"
## [ reached getOption("max.print") -- omitted 188 entries ]
The process for Reactome pathways is analogous. For instance, to select pathways related to the Immune System:
terms <- FilterReactomeTerms(parent = "R-HSA-168256")
RenameReactome(terms)
## [1] "R-HSA-1059683 Interleukin-6 signaling (11g)"
## [2] "R-HSA-109703 PKB-mediated events (2g)"
## [3] "R-HSA-109704 PI3K Cascade (44g)"
## [4] "R-HSA-1168372 Downstream signaling events of B Cell Receptor (BCR) (81g)"
## [5] "R-HSA-1169091 Activation of NF-kappaB in B cells (67g)"
## [6] "R-HSA-1169092 Activation of RAS in B cells (5g)"
## [7] "R-HSA-1169408 ISG15 antiviral mechanism (74g)"
## [8] "R-HSA-1169410 Antiviral mechanism by IFN-stimulated genes (82g)"
## [9] "R-HSA-1170546 Prolactin receptor signaling (15g)"
## [10] "R-HSA-1222556 ROS and RNS production in phagocytes (36g)"
## [11] "R-HSA-1236973 Cross-presentation of particulate exogenous antigens (phagosomes) (8g)"
## [12] "R-HSA-1236974 ER-Phagosome pathway (83g)"
## [ reached getOption("max.print") -- omitted 185 entries ]
The GSEAplot
function is designed to generate plots that
emulate the Gene Set Enrichment Analysis (GSEA) as developed by the
Broad Institute. This function provides a way to visualize the
enrichment of specific gene sets within different biological states or
conditions.
Here’s how you can create a GSEA plot for the
“INTERFERON_GAMMA_RESPONSE” gene set within the “Naive CD4 T” cell
population of the pbmc
dataset:
GSEAplot(
pbmc,
ident.1 = "CD4 T Naive",
title = "INTERFERON_GAMMA_RESPONSE",
geneset = hall50$human$HALLMARK_INTERFERON_GAMMA_RESPONSE
)
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Brussels
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 tidyr_1.3.1 rlist_0.4.6.2 mosaic_1.9.1
## [5] mosaicData_0.20.4 ggformula_0.12.0 lattice_0.22-6 rlang_1.1.4
## [9] scales_1.3.0 reshape2_1.4.4 ggplot2_3.5.1 doMC_1.3.8
## [ reached getOption("max.print") -- omitted 18 entries ]
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.0 later_1.3.2 tibble_3.2.1
## [5] polyclip_1.10-6 fastDummies_1.7.3 lifecycle_1.0.4 rstatix_0.7.2
## [9] globals_0.16.3 MASS_7.3-60.2 backports_1.5.0 magrittr_2.0.3
## [ reached getOption("max.print") -- omitted 117 entries ]