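This post continues from the previous one and walks through the basic analysis code for CellRank. Let's get started.
CellRank uses RNA velocity and transcriptomic similarity to estimate cell-cell transition probabilities. It can also be applied without RNA velocity information, although combining RNA velocity with the similarity-based approach is recommended.
Imports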
import sys
if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank@dev
    !pip install python-igraph
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
import warnings
warnings.simplefilter("ignore", category=UserWarning)
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=DeprecationWarning)
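First, we need to get the data. The following command downloads the adata object and saves it under datasets/endocrinogenesis_day15.5.h5ad (this is example data that you can download and explore yourself). We also show the proportion of spliced/unspliced reads, which we need to estimate RNA velocity.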
adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
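To make the spliced/unspliced proportion concrete, here is a minimal sketch in plain Python. The counts are made up, and the helper `layer_proportions` is hypothetical; it pools counts over all cells, which is not necessarily how `scv.pl.proportions` aggregates them.

```python
# Toy counts per cell as (spliced, unspliced); the values are made up for illustration.
cells = [(80, 20), (60, 40), (90, 10)]


def layer_proportions(counts):
    """Return the overall fraction of spliced and unspliced counts (hypothetical helper)."""
    spliced = sum(s for s, _ in counts)
    unspliced = sum(u for _, u in counts)
    total = spliced + unspliced
    return {"spliced": spliced / total, "unspliced": unspliced / total}


proportions = layer_proportions(cells)
print(proportions)
```

A dataset with very few unspliced counts would give velocity estimation little to work with, which is why this check comes first.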
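Preprocess the data
Filter out genes that do not have enough spliced/unspliced counts, normalize and log-transform the data, and restrict it to highly variable genes. Then compute the principal components and moments needed for velocity estimation.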
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
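The normalization and log-transform step can be sketched in a few lines of plain Python. This is an illustration only: the function `normalize_and_log` and the choice of the median library size as the common target are assumptions of this sketch, not a description of what `scv.pp.filter_and_normalize` does internally.

```python
import math


def normalize_and_log(counts, target_sum=None):
    """Scale each cell to a common total count, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        # Assumption of this sketch: use the median library size as the target.
        ordered = sorted(totals)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            target_sum = ordered[mid]
        else:
            target_sum = (ordered[mid - 1] + ordered[mid]) / 2
    return [
        [math.log1p(value * target_sum / total) for value in row]
        for row, total in zip(counts, totals)
    ]


# Rows are cells, columns are genes; the values are made up.
raw = [[10, 0, 30], [1, 1, 2], [5, 5, 10]]
logged = normalize_and_log(raw)
print(logged)
```

The second and third cells have the same relative composition but different sequencing depth, so they end up with identical values after normalization.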
Run scVelo
We will use the dynamical model from scVelo to estimate the velocities.
scv.tl.recover_dynamics(adata, n_jobs=8)
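Once we have the parameters, we can use them to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely one cell is to transition into another, given their velocity vectors and relative positions.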
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(
    adata, basis="umap", legend_fontsize=12, title="", smooth=0.8, min_mass=4
)
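The idea behind the velocity graph can be illustrated with a small sketch: a neighbor gets a high score when the displacement towards it points in the same direction as the cell's velocity vector. The sketch assumes cosine similarity as the alignment measure; the function names and the toy coordinates are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def transition_scores(position, velocity, neighbors):
    """Score each neighbor by how well the displacement towards it aligns with the velocity."""
    scores = {}
    for name, neighbor_position in neighbors.items():
        displacement = [n - p for n, p in zip(neighbor_position, position)]
        scores[name] = cosine(velocity, displacement)
    return scores


scores = transition_scores(
    position=[0.0, 0.0],
    velocity=[1.0, 0.0],
    neighbors={"ahead": [2.0, 0.0], "sideways": [0.0, 1.0], "behind": [-1.0, 0.0]},
)
print(scores)
```

Scores like these become edge weights, which can then be turned into transition probabilities.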
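Run CellRank
CellRank offers several ways to infuse directionality into single-cell data. Here, the directional information comes from RNA velocity, and we use it to compute the initial and terminal states as well as the fate probabilities for the dynamical process of pancreatic development.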
Identify terminal states
cr.tl.terminal_states(adata, cluster_key="clusters", weight_connectivities=0.2)
The most important parameters in the above function are:
estimator: this determines what's going to happen behind the scenes to compute the terminal states. Options are cr.tl.estimators.CFLARE ("Clustering and Filtering of Left and Right Eigenvectors") or cr.tl.estimators.GPCCA ("Generalized Perron Cluster Cluster Analysis", [Reuter et al., 2018] and [Reuter et al., 2019], see also our pyGPCCA implementation). The latter is the default; it computes terminal states by coarse-graining the velocity-derived Markov chain into a set of macrostates that represent the slow-time scale dynamics of the process, i.e. it finds the states that you are unlikely to leave again once you have entered them.
cluster_key: takes a key from adata.obs to retrieve pre-computed cluster labels, e.g. 'clusters' or 'louvain'. These labels are then mapped onto the set of terminal states, to associate a name and a color with each state.
n_states: number of expected terminal states. This parameter is optional; if it is not provided, this number is estimated from the so-called 'eigengap heuristic' of the spectrum of the transition matrix.
method: this is only relevant for the estimator GPCCA. It determines the way in which we compute and sort the real Schur decomposition. The default, krylov, is an iterative procedure that works with sparse matrices, which allows the method to scale to very large cell numbers. It relies on the libraries SLEPc and PETSc, which you will have to install separately, see our installation instructions. If your dataset is small (<5k cells) and you don't want to install these at the moment, use method='brandts' [Brandts, 2002]. The results will be the same; the difference is that brandts works with dense matrices and won't scale to very large cell numbers.
weight_connectivities: weight given to cell-cell similarities to account for noise in velocity vectors.
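To see what weight_connectivities does, consider a minimal sketch that blends a velocity-based transition matrix with a similarity-based one. It assumes the weight acts as a simple convex combination followed by row normalization; the matrices are made up and the helper names are hypothetical.

```python
def row_normalize(matrix):
    """Rescale each row to sum to 1 so the matrix is a valid transition matrix."""
    return [[value / sum(row) for value in row] for row in matrix]


def combine_kernels(velocity_matrix, connectivity_matrix, weight_connectivities=0.2):
    """Weighted average of two row-stochastic matrices (assumed combination rule)."""
    w = weight_connectivities
    combined = [
        [(1 - w) * v + w * c for v, c in zip(v_row, c_row)]
        for v_row, c_row in zip(velocity_matrix, connectivity_matrix)
    ]
    return row_normalize(combined)


# Toy 3-cell example: velocities push every cell towards cell 2.
velocity_T = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
# Similarities are symmetric-looking and carry no direction.
connectivity_T = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]

T = combine_kernels(velocity_T, connectivity_T, weight_connectivities=0.2)
for row in T:
    print(row)
```

With a weight of 0.2, the direction still comes mostly from the velocities, while the similarity term smooths out noisy individual velocity vectors.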
cr.pl.terminal_states(adata)
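The eigengap heuristic mentioned for n_states can be sketched as follows. This version assumes the heuristic simply picks the largest gap between consecutive eigenvalue magnitudes; CellRank's actual implementation may differ, and the eigenvalues below are made up.

```python
def eigengap_n_states(eigenvalues):
    """Pick the number of states at the largest gap between sorted eigenvalue magnitudes."""
    magnitudes = sorted((abs(value) for value in eigenvalues), reverse=True)
    gaps = [magnitudes[i] - magnitudes[i + 1] for i in range(len(magnitudes) - 1)]
    # The largest gap after the k-th eigenvalue suggests k slowly mixing states.
    return gaps.index(max(gaps)) + 1


# Hypothetical leading eigenvalues of a transition matrix.
spectrum = [1.0, 0.98, 0.97, 0.95, 0.60, 0.55, 0.50]
n_states = eigengap_n_states(spectrum)
print(n_states)
```

Here the first four eigenvalues are close to 1 and clearly separated from the rest, so the sketch suggests four states.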
Identify initial states
cr.tl.initial_states(adata, cluster_key="clusters")
cr.pl.initial_states(adata, discrete=True)
Compute fate maps
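Once we know the terminal states, we can compute the associated fate maps: for each cell, we ask how likely it is to develop towards each of the identified terminal states.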
cr.tl.lineages(adata)
cr.pl.lineages(adata, same_plot=False)
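A fate map of this kind can be understood as absorption probabilities on a Markov chain. The sketch below is a toy illustration under that assumption, with a made-up 4-state transition matrix and a simple fixed-point iteration rather than whatever solver CellRank uses.

```python
def absorption_probabilities(T, terminal, n_iter=1000):
    """Iteratively estimate, for every state, the probability of ending in each terminal state."""
    n = len(T)
    # A terminal state is absorbed into itself with probability 1.
    probs = [[1.0 if i == t else 0.0 for t in terminal] for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if i in terminal:
                updated.append(probs[i])
            else:
                # One step of the chain, then reuse the current estimates.
                updated.append([
                    sum(T[i][j] * probs[j][k] for j in range(n))
                    for k in range(len(terminal))
                ])
        probs = updated
    return probs


# States 0 and 1 are transient; states 2 and 3 are terminal (absorbing).
T = [
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.5, 0.1, 0.4],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
fate = absorption_probabilities(T, terminal=[2, 3])
for row in fate:
    print(row)
```

Each row sums to 1, so every cell's fate is split across the terminal states, which is what the per-lineage plots display.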
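We can aggregate the above into a single global fate map, where each terminal state is associated with a color and the intensity of that color shows the fate of each individual cell: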
cr.pl.lineages(adata, same_plot=True)
Directed PAGA
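We can further aggregate the individual fate maps into a cluster-level fate map using an adapted version of PAGA [Wolf et al., 2019] with directed edges. We first compute scVelo's latent time using the root_key and end_key identified by CellRank, which hold the probabilities of being an initial or a terminal state, respectively.
scv.tl.recover_latent_time(
    adata, root_key="initial_states_probs", end_key="terminal_states_probs"
)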
Next, we can use the inferred pseudotime along with the initial and terminal states probabilities to compute the directed PAGA.
scv.tl.paga(
    adata,
    groups="clusters",
    root_key="initial_states_probs",
    end_key="terminal_states_probs",
    use_time_prior="velocity_pseudotime",
)
cr.pl.cluster_fates(
    adata,
    mode="paga_pie",
    cluster_key="clusters",
    basis="umap",
    legend_kwargs={"loc": "top right out"},
    legend_loc="top left out",
    node_size_scale=5,
    edge_width_scale=1,
    max_edge_width=4,
    title="directed PAGA",
)
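The pie charts in a plot like this summarize fate probabilities per cluster. A minimal sketch of that aggregation, assuming a plain average within each cluster, looks like this; the cluster labels "A" and "B" and the probabilities are hypothetical.

```python
from collections import defaultdict


def mean_fate_per_cluster(clusters, fate_probabilities):
    """Average per-cell fate probabilities within each cluster."""
    grouped = defaultdict(list)
    for label, probs in zip(clusters, fate_probabilities):
        grouped[label].append(probs)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


# One cluster label and one row of fate probabilities per cell.
clusters = ["A", "A", "B", "B"]
fate_probabilities = [[0.5, 0.5], [0.25, 0.75], [0.125, 0.875], [0.375, 0.625]]

pies = mean_fate_per_cluster(clusters, fate_probabilities)
print(pies)
```

Each cluster's averaged vector still sums to 1, so it can be drawn directly as a pie on the corresponding PAGA node.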
Compute lineage drivers
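We can compute the driver genes for all lineages or for a subset of them. We can also restrict the computation to certain clusters by specifying clusters=.... The resulting data frame also contains p-values, corrected p-values (q-values) and the 95% confidence interval for the correlation statistic.
cr.tl.lineage_drivers(adata)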
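Since driver genes are ranked by correlation, the core idea can be sketched as correlating each gene's expression with the fate probability of one lineage. The sketch below uses a plain Pearson correlation on made-up values; the gene names are hypothetical, and the p-values and confidence intervals are left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_drivers(expression, fate_probability):
    """Sort genes by correlation between expression and fate probability, highest first."""
    correlations = {
        gene: pearson(values, fate_probability) for gene, values in expression.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


# One value per cell: probability of reaching the Alpha terminal state.
alpha_fate = [0.1, 0.2, 0.6, 0.9]
expression = {
    "gene_up": [1.0, 2.0, 6.0, 9.0],    # rises with the Alpha fate probability
    "gene_down": [9.0, 8.0, 4.0, 1.0],  # falls with it
    "gene_flat": [3.0, 5.0, 3.0, 5.0],  # only weakly related
}
ranking = rank_drivers(expression, alpha_fate)
for gene, corr in ranking:
    print(gene, round(corr, 3))
```

Genes at the top of such a ranking are candidate drivers for the lineage; genes at the bottom are anti-correlated with it.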
Afterwards, we can plot the top 5 driver genes (based on the correlation), e.g. for the Alpha lineage
cr.pl.lineage_drivers(adata, lineage="Alpha", n_genes=5)
Gene expression trends
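The functions demonstrated above are the main functions of CellRank: computing initial and terminal states and probabilistic fate maps. We can now use the computed probabilities to, for example, smooth gene expression trends along lineages.
We start with a temporal ordering of the cells. To get this, we can compute scVelo's latent time, as shown earlier, or alternatively just use CellRank's initial states to compute a Diffusion Pseudotime (DPT).
# compute DPT, starting from the CellRank-defined root cell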
root_idx = np.where(adata.obs["initial_states"] == "Ngn3 low EP")[0][0]
adata.uns["iroot"] = root_idx
sc.tl.dpt(adata)
scv.pl.scatter(
    adata,
    color=["clusters", root_idx, "latent_time", "dpt_pseudotime"],
    fontsize=16,
    cmap="viridis",
    perc=[2, 98],
    colorbar=True,
    rescale_color=[0, 1],
    title=["clusters", "root cell", "latent time", "dpt pseudotime"],
)
We can plot dynamics of genes in pseudotime along individual trajectories, defined via the fate maps we computed above.
model = cr.ul.models.GAM(adata)
cr.pl.gene_trends(
    adata,
    model=model,
    data_key="X",
    genes=["Pak3", "Neurog3", "Ghrl"],
    ncols=3,
    time_key="latent_time",
    same_plot=True,
    hide_cells=True,
    figsize=(15, 4),
    n_test_points=200,
)
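To illustrate how fate probabilities define a lineage-specific trend, here is a small sketch that smooths expression along pseudotime while weighting each cell by its fate probability. Note that it swaps in a Gaussian-kernel weighted average as a stand-in for the GAM used above; the function name, bandwidth and data are all assumptions.

```python
import math


def weighted_trend(time, expression, weights, grid, bandwidth=0.2):
    """Gaussian-kernel weighted average of expression along pseudotime."""
    trend = []
    for t0 in grid:
        # Each cell contributes according to its closeness in time and its lineage weight.
        kernel = [
            w * math.exp(-0.5 * ((t - t0) / bandwidth) ** 2)
            for t, w in zip(time, weights)
        ]
        total = sum(kernel)
        trend.append(sum(k * e for k, e in zip(kernel, expression)) / total)
    return trend


time = [0.0, 0.1, 0.5, 0.5, 0.9, 1.0]
expression = [0.0, 0.0, 5.0, 1.0, 2.0, 2.0]
# The fourth cell has zero Alpha fate probability, so it does not affect the Alpha trend.
alpha_weights = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]

trend = weighted_trend(time, expression, alpha_weights, grid=[0.0, 0.5, 1.0])
print(trend)
```

Running the same smoother with the weights of another lineage would give that lineage's trend for the same gene, which is what plotting several lineages in one panel compares.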
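We can also visualize the lineage drivers computed above in a heatmap. Below, we do this for the Alpha lineage, i.e. we smooth the gene expression of the putative Alpha driver genes in pseudotime, using the Alpha fate probabilities as cell-level weights. We sort the genes by their peak in pseudotime, which reveals a cascade of gene expression events.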
cr.pl.heatmap(
    adata,
    model,
    genes=adata.varm['terminal_lineage_drivers']["Alpha_corr"].sort_values(ascending=False).index[:100],
    show_absorption_probabilities=True,
    lineages="Alpha",
    n_jobs=1,
    backend="loky",
)
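The row ordering of such a heatmap can be sketched as sorting genes by the position of their peak along pseudotime. The snippet below shows only that sorting step on made-up smoothed trends; the gene names and the helper `order_by_peak` are hypothetical.

```python
def order_by_peak(trends):
    """Order genes by the pseudotime index at which their smoothed trend peaks."""
    return sorted(trends, key=lambda gene: trends[gene].index(max(trends[gene])))


# Smoothed expression per gene, evaluated at four increasing pseudotime points.
smoothed = {
    "late_gene": [0.1, 0.2, 0.4, 0.9],
    "early_gene": [0.9, 0.5, 0.2, 0.1],
    "mid_gene": [0.2, 0.8, 0.6, 0.1],
}
ordered = order_by_peak(smoothed)
print(ordered)
```

Plotting the rows in this order produces the diagonal pattern that makes the cascade of expression events visible.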