hello,大家好,今天給大家分享的是,擬時序軌跡分析軟件中對軌跡關鍵驅動基因的下游分析,挖掘擬時序分化軌跡開關基因的分析,這個方面大家應該都不陌生,文章在GeneSwitches: ordering gene expression and functional events in single-cell experiments,我們首先來分享一下文獻,最后看一看參考代碼。
首先認識一下什么是開關基因,開關基因指細胞分化轉化過程中表達沉默或表達激活的基因,它可能會引起或推動發育體系,在可選擇的相關細胞途徑中進行轉換,對生物過程的發生發展有著重要的意義。
Abstract
1、Based on the similarity of gene expression profiles, many tools have been developed to generate an in silico(電腦模擬) ordering of cells in the form of pseudo-time trajectories.
2、these tools do not provide a means to find the ordering of critical gene expression changes over pseudo-time.(做軌跡分析做的好的生信人員都不是很多,每個軌跡分析的官方示例都是例題,需要靈活運用,能做到靈活運用的生信人員,比例不會很高,軌跡分析有很多需要注意的地方,大家可以參考文章10X單細胞軌跡分析之回顧,但這個也是簡單的總結,遠遠不夠)。
關于軌跡轉變基因這個,以monocle2為例,主要就是計算不同state之間的差異基因,這些基因的代表了不同的發育軌跡。(這個做法是否合理,有待商榷)。
3、We present GeneSwitches, a tool that takes any single-cell pseudo-time trajectory and determines the precise(準確的,精確的) order of gene expression and functional-event changes over time.(這個很多軟件就是計算差異,不知道這個方法有什么新的思路沒有)。
4、GeneSwitches uses a statistical framework based on logistic regression to identify the order in which genes are either switched on or off along pseudo-time.(開關基因的識別,這個是我們最為關心的地方)。
5、With this information, users can identify the order in which surface markers appear, investigate how functional ontologies are gained or lost over time and compare the ordering of switching genes from two related pseudo-temporal processes(開關基因的意義,下游分析,這個是后話了~~~)
Introduction 我們提煉一下
現有軟件(monocle2、Slingshot等),A limitation of current pseudo-time methods is that extracting the underlying order of gene expression changes from these trajectories can be difficult(其實有關空間位置檢測變化基因的能力,monocle3已經可以做到~~).
being able to interpret these gene expression changes in terms of the order that they occur would allow for a fuller understanding of the underlying biological processes.(這遠不是我們簡單計算一下差異基因就能解決的問題~~~)。
To address this, here we developed GeneSwitches, a statistical framework that processes scRNA-seq data together with a pseudo-time trajectory to find the set of genes that switch during the transition.(軟件可以幫助我們找到軌跡發生過程中的開關基因)。
For each gene, we calculate a switching time and associated confidence level.(對于每個基因,我們計算轉換時間和相關的置信水平。,有點酷~~~~),能實現這樣目的的前提有兩個(i)
investigate how gene-regulatory networks or gene ontologies are gained or lost over time(這個絕對可以)(ii)
stratify selected gene sets (e.g. surface markers) by the order in which they appear(按基因出現的順序對選定的基因集(例如表面標記)進行分層 ) (iii)
identify key differences in the gene expression changes in cell transitions that bifurcate over time(分化轉變基因,這個最為關鍵)。
接下來,我們來看看真實的案例(下圖)
GeneSwitches functions and examples
workflow (看看數據處理的過程)
1、GeneSwitches requires two inputs, namely the gene expression matrix and the pseudo-time ordering of each single cell。(矩陣和時間,分析過程如下圖)。
2、First, GeneSwitches binarizes the gene expression into either an ‘on’ or ‘off’ state to enable the identification of switching events.This binarization is performed gene-wise in a data-driven manner by fitting a mixture model of two Gaussian distributions to the gene expression)The threshold separating the ‘on’ and ‘off’ gene expression states is then determined by the intersection of the two fitted Gaussian distributions.(這個地方涉及到一些數學上的知識,大家感興趣可以查一下)。
3、the pseudo-time can be partitioned to identify switching events within specific parts of the trajectory(也可以截取一段的時間序列進行開關基因的識別)。
4、Ordering and visualization of switching genes,the binarized gene expression is used as a dependent variable in a logistic regression with the pseudo-time value of each cell providing the independent variable.In doing so, the probability of expression throughout pseudo-time is calculated and the quality of fit is determined by McFadden’s Pseudo R2. 這樣就找到了很多的開關基因,開關基因以0.5為界限,然后進行表達的可視化。
我們簡單來總結一下,GeneSwitches首先通過對分化軌跡中的基因進行二值化分析,篩選出了表達特性上存在on和off兩個狀態的潛在開關基因。隨后軟件對這些潛在開關基因進行邏輯回歸分析和McFadden’s Pseudo R2擬時間相關性分析:通過邏輯回歸分析推算出每個開關基因的開關時間(Switching Time Point);通過擬時間相關性分析得到每個相關性R2值,其中表達激活的開關基因與擬時序正相關(R2>0),被定義為上調型開關基因(Up-regulation),而表達沉默的開關基因與擬時序負相關(R2<0),被定義為下調型開關基因(Down-regulation)。擬時序相關性越高代表基因與該軌跡進程的關系越密切。得到了每個潛在開關基因的開關時間和相關性R2值后,將Top開關基因按其開關時間在擬時序上排序可視化,可以更直觀地展示分化軌跡進程中的關鍵基因作用。同時,GeneSwitches嵌入了功能分析模塊,通過基因注釋(如表面蛋白,轉錄因子或其他功能類型)和富集分析(GO, KEGG和HALLMARK)的方法幫助解讀開關基因分析結果,更好地與生物過程和疾病發生發展聯系起來。除此之外,該軟件不僅可以對單條軌跡進行開關基因分析,還能比較兩條不同分化分支之間開關基因的異同,可用于尋找分支特異性的開關基因或比較相同開關基因在不同分支中開關時間的先后差異。
看一下基因的排序和可視化
Ordering and visualization of gene classes and functional groups
1、Switching genes can be used to investigate the functional nature of the pseudo-time trajectory.例如,it might be desirable to know for a set of known surface proteins at what point they are activated or deactivated during a transition in order to facilitate the identification of suitable markers on which to sort cells that are transitioning.(開關基因什么時候轉變的)。
2、GeneSwitches can also identify the order in which functional ontologies are acquired or lost during a transition.(功能的獲得和丟失)。
3、軟件提供了可視化的功能,To visualize these changes, we provide the functionality to plot the density of switching genes from each ontological class with respect to pseudo-time in order to study when and how different functional classes are important。
4、具體案例(單軌跡),首先monocle2計算矩陣的軌跡值,然后識別開關基因GeneSwitches identified that TIMP1 and VIM were early surface proteins to be activated, indicating that they might represent good candidate markers to identify cells progressing along the differentiation process more rapidly。We also observe that POU5F1 is deactivated early, whilst MYH7 is activated late(下圖d)。 Functional ontology analysis showed that the cell cycle-related ontologies were down-regulated at an early time and cells acquired cardiac-related functions later in the pseudotime(功能缺失和獲得的時間關系)。
這部分我們簡單總結一下,從人胚胎干細胞(hESC)分化為心肌細胞(CM)的分化軌跡開關基因分析。通過開關分析發現了擬時序較為早期表達激活的開關基因VIM和TIMP1,可以作為加速分化進程的候選基因。另外他們還找到了早期表達沉默的開關基因POU5F1與后期表達激活的開關基因MYH7。開關基因富集分析結果表明與細胞周期(Cell Cycle)相關的通路在擬時序早期下調,而與心臟功能相關的通路在擬時序后期上調。
雙軌跡開關基因分析案例,這個用于多個分化分支的情況
If GeneSwitches has been used to analyse two related pseudo-time trajectories, it is possible to compare the switching genes and their switching times.例如,在分化過程中,某些細胞群通常會分叉,每組都處于不同的細胞狀態。 GeneSwitches can be used to compare these trajectories, looking for similarities and differences in the switching genes, as well as their switching times.(比較和尋找差異、開關基因的時間)。下圖展示了用GeneSwitches分析兩條有聯系的分化分支的開關基因分析,兩條分化軌跡分別為從人胚胎干細胞(hESC)到心肌細胞(CM)的分化分支1(Definitive CM)和到非收縮細胞的分化分支2(Non-contractile)。分析發現分支1上調了一些與心臟功能相關的基因如CSRP3和NKX2-5,而分支2則上調了如DCN和COL1A2等成纖維細胞的marker基因。
In summary, GeneSwitches can help identify the timing of gene expression events within a pseudo-time trajectory, which in turn allows for a fuller understanding of the order of regulatory and functional events that occur during a cellular transition.
看一看代碼
GeneSwitches
GeneSwitches 的目標是以單細胞分辨率發現細胞狀態轉換期間基因表達和功能事件的順序。 它適用于任何單細胞軌跡或細胞的偽時間排序,以發現充當細胞狀態之間開/關開關的基因,重要的是這些開關發生的順序。
Installation
Check and install required packages
Users may use following codes to check and install all the required packages.
list.of.packages <- c("SingleCellExperiment", "Biobase", "fastglm", "ggplot2", "monocle",
"plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "devtools",
"mixtools")
## for package "fastglm", "ggplot2", "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "mixtools"
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
## for package "SingleCellExperiment", "Biobase"
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)
Install GeneSwitches
The source code of GeneSwitches can be installed from GitHub with:
devtools::install_github("SGDDNB/GeneSwitches")
Input datasets
GeneSwitches 需要兩個輸入,即基因表達矩陣和每個細胞的相應偽時間排序。 我們將這些輸入數據集轉換為一個 SingleCellExperiment
對象(Lun and Risso 2017),在下面你會發現一個完整的“從頭到尾”的工作流程來實現這個分析的潛力。
## load libraries
library(GeneSwitches)
library(SingleCellExperiment)
這里的實例代碼,我們將使用已發布的單細胞 RNA-seq 數據集,這些數據從人類胚胎干細胞 (hESC) 到心肌細胞 (CM) 的分化(Friedman 等人,2018 年)。 運用monocle2進行軌跡分析。 選擇這個數據集的部分原因是它顯示了心臟 hESC 分化的分叉細胞命運,它產生了明確的心肌細胞 (Path1) 或非收縮性心臟衍生物 (Path2),允許應用 GeneSwitches 的所有方面。
## Download example files to current directory
get_example_inputData()
## Load input data log-normalized gene expression
load("./logexpdata.RData")
## Load Monocle2 object with pseudo-time and dimensionality reduction
load("./cardiac_monocle2.RData")
Direct input (NOT run)
Users can input the gene expression (logexpdata
; recommend for log-normalized expression), pseudo-time (cell_pseudotime
) and dimensionality reductions (rd_PCA
; optional and only for gene expression plots) into SingleCellExperiment object as follows.
### create SingleCellExperiment object with log-normalized single cell data
#sce <- SingleCellExperiment(assays = List(expdata = logexpdata))
### add pseudo-time information
#colData(sce)$Pseudotime <- cell_pseudotime
### add dimensionality reductions, e.g. PCA, UMAP, tSNE
#pca <- prcomp(t(assays(sce)$expdata), scale. = FALSE)
#rd_PCA <- pca$x[,1:2]
#reducedDims(sce) <- SimpleList(PCA = rd_PCA)
Convert from trajectory results
Alternatively, GeneSwitches provides functions to convert Monocle2 or Slingshot results into SingleCellExperiment object directly. For Monocle2 trajectory, users need to indicate the states of the desired path, which can be checked by plotting the trajectory using Monocle2 function plot_cell_trajectory
or the following function.
## plot Monocle2 trajectory colored by State
# monocle:::plot_cell_trajectory(cardiac_monocle2, color_by = "State")
plot_monocle_State(cardiac_monocle2)
Based on the marker genes, the pseudo-time trajectory starts from State 3, which are hESC cells. Definitive CM cells are in State 1 and non-contractile cardiac derivatives are in State 5. Therefore, we focus on Path1 with cells in states 3, 2, 1 and Path2 with cells in states 3, 2, 5, and extract these two paths from Monocle2 object.
## Input log-normalized gene expression, Monocle2 pseudo-time and dimensionality reduction
## Path1 containing cells in states 3,2,1
sce_p1 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
states = c(3,2,1), expdata = logexpdata)
## Path2 containing cells in states 3,2,5
sce_p2 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
states = c(3,2,5), expdata = logexpdata)
If we are only interested in the trajectory within a certain range of pseudotime, function subset_pseudotime
can be used to subset the SingleCellExperiment object accordingly, followed by filtering out lowly expressed genes.
### Subset cells to pseudotime range from 10 to 25
#sce_p1_subset <- subset_pseudotime(sce_p1, min_time = 10, max_time = 25, minexp = 0, mincells = 10)
In Part I, we will apply GeneSwitches on a single trajectory, Path1, to demonstrate the general workflow and functions. Comparison of GeneSwitches results from two trajectories (Path1 & 2) will be shown in Part II.
PART I. GeneSwitches on a single trajectory
I-1. Binarize gene expression
由于我們關注的是打開或關閉的基因,因此我們首先將基因表達數據二值化為 1(on) 或 0(off) 狀態。 為了實現這一點,對于每個基因,我們將兩個高斯分布的混合模型擬合到輸入基因表達中,以計算用于二值化的基因特異性閾值。 在擬合之前,我們在基因表達中添加了均值為零和標準差為 0.1 的高斯噪聲,這確保了基因表達擬合的數值穩定性。 然后去除不具有明顯雙峰“開-關”分布的基因。 對于使用 3 個內核的 2000 個細胞,此步驟可能需要 2 分鐘。
### binarize gene expression using gene-specific thresholds
sce_p1 <- binarize_exp(sce_p1, ncores = 3)
Alternatively, we can use a global threshold for fast binarization. We plot a histogram of expression of all the genes in all cells and look for a break between the zero and expressed distributions to identify the global threshold.
### check the threshold for binarization
#h <- hist(assays(sce_p1)$expdata, breaks = 200, plot = FALSE)
#{plot(h, freq = FALSE, xlim = c(0,2), ylim = c(0,1), main = "Histogram of gene expression",
#xlab = "Gene expression", col = "darkgoldenrod2", border = "grey")
#abline(v=0.2, col="blue")}
###In this example, we choose 0.2 (blue line, also set as default) as the threshold.
# bn_cutoff <- 0.2
# sce_p1 <- binarize_exp(sce_p1, fix_cutoff = TRUE, binarize_cutoff = bn_cutoff)
I-2. Fit logistic regression & estimate switching time
Logistic regression is applied to model the binary states (on or off) of gene expression. Then the switching pseudo-time point is determined by the time at which the fitted line crosses the probability threshold 0.5. We use random downsampling of zero expressions (downsample = TRUE) to rescue the prediction of switching time for genes with high zero inflation.
## fit logistic regression and find the switching pseudo-time point for each gene
## with downsampling. This step takes less than 1 mins
sce_p1 <- find_switch_logistic_fastglm(sce_p1, downsample = TRUE, show_warning = FALSE)
I-3. Visualize ordering of switching genes
First, we filter poorly fitted genes based on zero-expression percentage (>90%), FDR (>0.05) and McFadden’s Pseudo R^2 (<0.03). We can then the number of top best fitting (high McFadden’s Pseudo R^2) genes to plot. One can also extract specific gene type(s) to plot, with provided gene type lists containing surface proteins (downloaded from here) and transcription factors (TFs, downloaded from here). Users are allowed to pass their own gene type lists as a data frame to parameter genelists, with rows as genes (non-duplicated) and two columns with name genenames
and genetypes
.
## filter top 15 best fitting switching genes among all the genes
sg_allgenes <- filter_switchgenes(sce_p1, allgenes = TRUE, topnum = 15)
## filter top 15 best fitting switching genes among surface proteins and TFs only
sg_gtypes <- filter_switchgenes(sce_p1, allgenes = FALSE, topnum = 20,
genelists = gs_genelists, genetype = c("Surface proteins", "TFs"))
## combine switching genes and remove duplicated genes from sg_allgenes
sg_vis <- rbind(sg_gtypes, sg_allgenes[setdiff(rownames(sg_allgenes), rownames(sg_gtypes)),])
Finally, plot the selected genes along the pseudo-timeline. Genes that are switched on are plotted above the line, while those switching off are below the line.
plot_timeline_ggplot(sg_vis, timedata = sce_p1$Pseudotime, txtsize = 3)
It is possible to use the dimensionality reduction provided from the user to visualise the gene expression and logistic regression fitting plots if needed.
plot_gene_exp(sce_p1, gene = "VIM", reduction = "monocleRD", downsample = F)
I-4. Order pathways along the pseudo-timeline
GeneSwitches can be used to order pathways or genesets as well. We include the pathways provided by MSigDB hallmark (Liberzon,A. et al., 2015), C2 curated and C5 gene ontology geneset collections. A Hypergeometric test is first applied to extract the pathways that are significantly overrepresented amongst those that are changing along the trajectory. The Switching time of the pathway is then determined by the median switching time of genes in that pathway.
## filter genes for pathway analysis using r^2 cutoff 0.1
sg_pw <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.1)
## apply hypergeometric test and determine the switching time
switch_pw <- find_switch_pathway(rowData(sce_p1), sig_FDR = 0.05,
pathways = msigdb_h_c2_c5, sg_pw)
## remove redundant pathways
switch_pw_reduce <- reduce_pathways(switch_pw, pathways = msigdb_h_c2_c5,
redundant_pw_rate = 0.8)
To better visualise the functional changes ridge plots of pathways genes show the density of switching genes along the pseudo-time. Top 10 significantly changed pathways are plotted here, ordered by the switching time.
plot_pathway_density(switch_pw_reduce[1:10,], sg_pw, orderbytime = TRUE)
#> Picking joint bandwidth of 2.49
We can also select specific pathway(s) to plot the switching genes in it. Among top 10 significantly changed pathways, we plot genes related to myogenesis and cardiac muscle tissue development.
sg_vis <- filter_switchgenes(sce_p1, topnum = 50, pathway_name = c("HALLMARK_MYOGENESIS",
"GO_CARDIAC_MUSCLE_TISSUE_DEVELOPMENT"))
plot_timeline_ggplot(sg_vis, timedata=sce_p1$Pseudotime, txtsize=3)
“Multiple” lables the genes in more than one pathways.
PART II. Comparing switching genes from two trajectories
Before comparison, we need to apply same steps in I-1 and I-2 on the cells from Path2 to identify switching pseudo-time point for each gene.
sce_p2 <- binarize_exp(sce_p2)
sce_p2 <- find_switch_logistic_fastglm(sce_p2, downsample = TRUE, show_warnings = FALSE)
And we filter out poorly fitted genes for both paths using the same cutoff.
sg_p1 <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.03)
sg_p2 <- filter_switchgenes(sce_p2, allgenes = TRUE, r2cutoff = 0.03)
We then plot common switching genes between two paths to compare their ordering.
sg_com <- common_genes(sg_p1, sg_p2, r2cutoff = 0.4,
path1name = "Definitive CM", path2name = "non-contractile")
common_genes_plot(sg_com, timedata = sce_p1$Pseudotime)
More importantly, we can plot the distinct switching genes of the two paths.
sg_disgs <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
path1name = "Definitive CM", path2name = "non-contractile",
path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime)
plot_timeline_ggplot(sg_disgs, timedata = sce_p1$Pseudotime, color_by = "Paths",
iffulltml = FALSE, txtsize = 3)
We can also scale the timelines to be the same length (default number of bins is 100) so that differences are based on percentage of the trajectory covered rather than pseudo-time.
sg_disgs_scale <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
path1name = "Definitive CM", path2name = "non-contractile",
path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime,
scale_timeline = T, bin = 100)
# timedata need to be 1 to (number of bins)
plot_timeline_ggplot(sg_disgs_scale, timedata = 1:100, color_by = "Paths",
iffulltml = FALSE, txtsize = 3)
這兩個不同轉換基因的圖只顯示了一系列發生轉換事件的偽時間線。 這個范圍實際上是在軌跡的末端,而共同基因大多處于早期(共同基因圖)。
Similarly, we can check the gene expression plots for the two paths.
gn <- "DCN"
p <- plot_gene_exp(sce_p1, gene = gn, reduction = "monocleRD",
downsample = FALSE, fitting = TRUE)
#> Warning: glm.fit: algorithm did not converge
p <- plot_gene_exp(sce_p2, gene = gn, reduction = "monocleRD",
downsample = FALSE, fitting = TRUE)
生活很好,等你超越