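It's Friday again and New Year's Day is around the corner. How is everyone planning to see in the new year? Today is still a working day, though, so let's keep studying and make the most of our time.
This post takes a closer look at cell2location, covering both the algorithm and the way its code is written; there is a lot to learn from it.
It demonstrates how to use the cell2location model to map single-cell reference cell types onto a spatial transcriptomics dataset. Here we use 10X single-nucleus RNA sequencing (snRNA-seq) data and Visium spatial transcriptomics data generated from adjacent tissue sections of the mouse brain (single nucleus + spatial).
Cell2location is a Bayesian model that integrates single-cell RNA-seq (scRNA-seq) with multi-cell spatial transcriptomics to efficiently map large, comprehensive cell type references.
Part 1: estimating expression signatures of the single-cell cell types
The first step is to estimate reference cell type signatures from scRNA-seq profiles, for example by using conventional clustering to identify cell types and subpopulations and then estimating the average gene expression profile of each cluster. Cell2location implements this estimation step with a Negative Binomial regression, which allows data to be combined robustly across technologies and batches (multiple samples also involve batch effects).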
Loading packages
import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os
data_type = 'float32'
import cell2location
import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns
# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')
Loading single cell reference data
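We use a paired Visium and snRNA-seq reference dataset from the mouse brain (i.e. generated from adjacent tissue sections). The dataset consists of cells from 3 sections from 2 mice. 59 neuronal and glial cell populations have been annotated across multiple brain regions, including 10 regional astrocyte subtypes (the cell type annotation of the single-cell data must be done beforehand).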
sc_data_folder = './data/'
results_folder = './results/mouse_brain_snrna/'
# create the data folder and download the reference data if it is missing
if not os.path.exists(sc_data_folder):
    os.makedirs(sc_data_folder)
    os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/all_cells_20200625.h5ad')
    os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/snRNA_annotation_astro_subtypes_refined59_20200823.csv')
# create the results folder (including ./results) if it is missing
if not os.path.exists(results_folder):
    os.makedirs(results_folder)
Read the single-cell data and the cell annotation results
## snRNA reference (raw counts)
adata_snrna_raw = anndata.read_h5ad(sc_data_folder + "all_cells_20200625.h5ad")
## Cell type annotations
labels = pd.read_csv(sc_data_folder + 'snRNA_annotation_astro_subtypes_refined59_20200823.csv', index_col=0)
Add cell type labels as columns in adata.obs
#### reindex is the first function worth noting: it aligns the label rows to the cells in adata
labels = labels.reindex(index=adata_snrna_raw.obs_names)
adata_snrna_raw.obs[labels.columns] = labels
adata_snrna_raw = adata_snrna_raw[~adata_snrna_raw.obs['annotation_1'].isna(), :]
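To see what the reindex call above does, here is a minimal sketch on toy data. The barcodes and labels are made up for illustration and are not taken from the tutorial dataset; only the pandas calls mirror the code above.

```python
import pandas as pd

# Toy stand-ins (hypothetical barcodes and labels, not the tutorial data)
obs_names = pd.Index(["cell_a", "cell_b", "cell_c"])
labels = pd.DataFrame(
    {"annotation_1": ["Astro_1", "Ext_L23"]},
    index=["cell_c", "cell_a"],   # different order, and cell_b is missing
)

# reindex reorders the rows to match obs_names and fills cell_b with NaN
aligned = labels.reindex(index=obs_names)

# boolean mask of cells that received an annotation
keep = ~aligned["annotation_1"].isna()
print(aligned[keep])
```

This is why the tutorial can assign the labels column-wise and then drop unannotated cells with a single isna() mask.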
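Reduce the number of genes by discarding lowly expressed genes
This is done with 2 thresholds, to remove as many lowly expressed genes as possible while avoiding highly variable gene (HVG) selection, which is prone to deleting markers of rare populations:
- include all genes expressed by at least 3% of cells (cell_count_cutoff2)
- include genes expressed by at least 0.05% of cells (cell_count_cutoff) when they have high counts in non-zero cells (nonz_mean_cutoff)
We prefer this way of selecting genes because the second criterion retains genes that are expressed by rare cell populations but at high levels, whereas standard HVG selection can filter such genes out because of their low global mean and variance.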
# remove cells and genes with 0 counts everywhere
sc.pp.filter_cells(adata_snrna_raw, min_genes=1)
sc.pp.filter_genes(adata_snrna_raw, min_cells=1)
# calculate the mean of each gene across non-zero cells
adata_snrna_raw.var['n_cells'] = (adata_snrna_raw.X.toarray() > 0).sum(0)
adata_snrna_raw.var['nonz_mean'] = adata_snrna_raw.X.toarray().sum(0) / adata_snrna_raw.var['n_cells']
plt.hist2d(np.log10(adata_snrna_raw.var['nonz_mean']),
np.log10(adata_snrna_raw.var['n_cells']), bins=100,
norm=mpl.colors.LogNorm(),
range=[[0,0.5], [1,4.5]]);
nonz_mean_cutoff = np.log10(1.12) # cut off for expression in non-zero cells
cell_count_cutoff = np.log10(adata_snrna_raw.shape[0] * 0.0005) # cut off percentage for cells with higher expression
cell_count_cutoff2 = np.log10(adata_snrna_raw.shape[0] * 0.03)# cut off percentage for cells with small expression
plt.vlines(nonz_mean_cutoff, cell_count_cutoff, cell_count_cutoff2, color = 'orange');
plt.hlines(cell_count_cutoff, nonz_mean_cutoff, 1, color = 'orange');
plt.hlines(cell_count_cutoff2, 0, nonz_mean_cutoff, color = 'orange');
plt.xlabel('Mean count in cells with mRNA count > 0 (log10)');
plt.ylabel('Count of cells with mRNA count > 0 (log10)');
Show the number of selected cells and genes:
adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
| np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
& np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)].shape
###(40532, 12844)
Filter the object
# select genes based on mean expression in non-zero cells
adata_snrna_raw = adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
| np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
& np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)
& np.array(~adata_snrna_raw.var['SYMBOL'].isna())]
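The two-threshold rule can be checked on a toy matrix. This is a sketch with made-up counts; only the three cutoffs are copied from the code above, and the SYMBOL filter is left out.

```python
import numpy as np

# Toy count matrix: 1000 cells x 3 genes (hypothetical values)
counts = np.zeros((1000, 3))
counts[:500, 0] = 1    # gene 0: broad but low expression (50% of cells)
counts[:5, 1] = 10     # gene 1: rare but high expression (0.5% of cells)
counts[:5, 2] = 1      # gene 2: rare and low expression (0.5% of cells)

n_cells = (counts > 0).sum(0)
nonz_mean = counts.sum(0) / n_cells

nonz_mean_cutoff = np.log10(1.12)
cell_count_cutoff = np.log10(counts.shape[0] * 0.0005)   # 0.05% of cells
cell_count_cutoff2 = np.log10(counts.shape[0] * 0.03)    # 3% of cells

# (high non-zero mean OR broadly expressed) AND above the minimal cell count
keep = ((np.log10(nonz_mean) > nonz_mean_cutoff)
        | (np.log10(n_cells) > cell_count_cutoff2)) \
       & (np.log10(n_cells) > cell_count_cutoff)
print(keep)
```

Gene 0 passes through the 3% rule, gene 1 passes through the high non-zero mean rule, and gene 2 is dropped because it satisfies neither.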
Add counts matrix as adata.raw
adata_snrna_raw.raw = adata_snrna_raw
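Show UMAP of cells
The cellular composition of the data can be examined with the standard scanpy workflow, which produces a UMAP representation of the single-cell data.
One point to note here: the first PC, which explains the most variance, is removed, and batch effects are corrected with BBKNN.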
#########################
adata_snrna_raw.X = adata_snrna_raw.raw.X.copy()
sc.pp.log1p(adata_snrna_raw)
sc.pp.scale(adata_snrna_raw, max_value=10)
sc.tl.pca(adata_snrna_raw, svd_solver='arpack', n_comps=80, use_highly_variable=False)
# Plot total counts over PC to check whether PC is indeed associated with total counts
#sc.pl.pca_variance_ratio(adata_snrna_raw, log=True)
#sc.pl.pca(adata_snrna_raw, color=['total_counts'],
# components=['0,1', '2,3', '4,5', '6,7', '8,9', '10,11', '12,13'],
# color_map = 'RdPu', ncols = 3, legend_loc='on data',
# legend_fontsize=10, gene_symbols='SYMBOL')
# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw.obsm['X_pca'] = adata_snrna_raw.obsm['X_pca'][:, 1:]
adata_snrna_raw.varm['PCs'] = adata_snrna_raw.varm['PCs'][:, 1:]
#########################
# Here BBKNN (https://github.com/Teichlab/bbknn) is used to align batches (10X experiments)
import bbknn
bbknn.bbknn(adata_snrna_raw, neighbors_within_batch = 3, batch_key = 'sample', n_pcs = 79)
sc.tl.umap(adata_snrna_raw, min_dist = 0.8, spread = 1.5)
#########################
adata_snrna_raw = adata_snrna_raw[adata_snrna_raw.obs['annotation_1'].argsort(),:]
with mpl.rc_context({'figure.figsize': [10, 10],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
               color_map = 'RdPu', ncols = 1, legend_loc='on data',
               legend_fontsize=10)
Estimating expression signatures
A brief introduction to the model
Model-based estimation of reference expression signatures of cell types (g_{f,g}) using a regularised Negative Binomial regression. This model robustly derives reference expression signatures of cell types (g_{f,g}) using data composed of multiple batches (e={1..E}) and technologies (t={1..T}). Adapting the assumptions of a range of computational methods for scRNA-seq, we assume that the expression count matrix follows a Negative Binomial distribution with unobserved expression levels (rates) (\mu_{c,g}) and a gene-specific over-dispersion (\alpha_g). We model (\mu_{c,g}) as a linear function of reference cell type signatures and technical effects:
- (e_e) denotes a multiplicative global scaling parameter between experiments/batches (e) (e.g. differences in sequencing depth);
- (t_{t,g}) accounts for multiplicative gene-specific difference in sensitivity between technologies;
- (b_{e,g}) accounts for additive background shift of each gene in each experiment (e) (proxy for free-floating RNA).
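The description above can be turned into a small numerical sketch. The exact way the terms are combined is not spelled out in this post, so the formula in the code (signature times the two multiplicative effects, plus the additive background) is an assumption for illustration, and all values are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_types, n_genes = 3, 4

g_fg = rng.gamma(2.0, 1.0, size=(n_types, n_genes))   # cell type signatures g_{f,g}
e_e = np.array([0.8, 1.3])                            # per-batch scaling e_e
t_tg = np.ones((1, n_genes))                          # one technology, so t_{t,g} = 1
b_eg = rng.gamma(1.0, 0.05, size=(2, n_genes))        # per-batch background b_{e,g}

cell_type = np.array([0, 2, 1, 0])   # cell type index of each cell
batch = np.array([0, 0, 1, 1])       # batch index of each cell
tech = np.zeros(4, dtype=int)        # technology index of each cell

# assumed combination: signature * multiplicative effects + additive background
mu_cg = g_fg[cell_type] * e_e[batch][:, None] * t_tg[tech] + b_eg[batch]
print(mu_cg.shape)
```

Each row of mu_cg is the expected count profile of one cell; the regression works in the opposite direction and infers g_fg, e_e and b_eg from observed counts.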
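Training the model
Here we show how to train this model with a single pipeline function call, how to assess the quality of the model, and how to extract reference signatures of cell types for use with cell2location (these parameters are well worth exploring in depth):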
# Run the pipeline:
from cell2location import run_regression
r, adata_snrna_raw = run_regression(adata_snrna_raw, # input data object
verbose=True, return_all=True,
train_args={
'covariate_col_names': ['annotation_1'], # column listing cell type annotation
'sample_name_col': 'sample', # column listing sample ID for each cell
# column listing technology, e.g. 3' vs 5',
# when integrating multiple single cell technologies corresponding
# model is automatically selected
'tech_name_col': None,
'stratify_cv': 'annotation_1', # stratify cross-validation by cell type annotation
'n_epochs': 100, 'minibatch_size': 1024, 'learning_rate': 0.01,
'use_cuda': True, # use GPU?
'train_proportion': 0.9, # proportion of cells in the training set (for cross-validation)
'l2_weight': True, # uses defaults for the model
'readable_var_name_col': 'SYMBOL', 'use_raw': True},
model_kwargs={}, # keep defaults
posterior_args={}, # keep defaults
export_args={'path': results_folder + 'regression_model/', # where to save results
'save_model': True, # save pytorch model?
'run_name_suffix': ''})
reg_mod = r['mod']
Saved anndata object and the trained model object can be read later using
reg_mod_name = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{reg_mod_name}/'
## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
## model
import pickle
with open(f'{reg_path}model_.p', 'rb') as f:
    r = pickle.load(f)
reg_mod = r['mod']
Export reference expression signatures of cell types (the use of re here is also worth studying closely)
# Export cell type expression signatures:
covariate_col_names = 'annotation_1'
inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
# strip the 'means_cov_effect_<covariate>_' prefix so that columns are named by cell type
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_', '', i) for i in inf_aver.columns]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]
# scale up by average sample scaling factor
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()
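As a small illustration of the re.sub step, the sketch below strips a prefix from two made-up column names. Anchoring with ^ and escaping the prefix with re.escape are additions of this sketch, not something the tutorial code does.

```python
import re

covariate_col_names = "annotation_1"
prefix = f"means_cov_effect_{covariate_col_names}_"

# hypothetical column names following the pattern used above
columns = [prefix + "Astro_1", prefix + "Ext_L23"]

# remove the prefix only where it appears at the start of the name
clean = [re.sub("^" + re.escape(prefix), "", c) for c in columns]
print(clean)
```

The cleaned names are what end up as column labels of the signature table and later as cell type names in the spatial results.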
Compare the estimated signatures (y-axis) to the analytically computed mean expression (x-axis):
# compute mean expression of each gene in each cluster
aver = cell2location.cluster_averages.cluster_averages.get_cluster_averages(adata_snrna_raw, covariate_col_names)
aver = aver.loc[adata_snrna_raw.var_names, inf_aver.columns]
# compare estimated signatures (y-axis) to analytically computed mean expression (x-axis)
with mpl.rc_context({'figure.figsize': [5, 5]}):
    plt.hist2d(np.log10(aver.values.flatten()+1), np.log10(inf_aver.values.flatten()+1),
               bins=50, norm=mpl.colors.LogNorm());
    plt.xlabel('Mean expression in each cluster');
    plt.ylabel('Inferred expression in each cluster');
Assess whether the estimated signatures are less correlated because the confounding sample background has been removed:
# Look at how correlated are the signatures obtained by computing mean expression
with mpl.rc_context({'figure.figsize': [5, 5]}):
    reg_mod.align_plot_stability(aver, aver, 'cluster_average', 'cluster_average', align=False)
# Look at how correlated are the signatures inferred by regression model - they should be less correlated than above
with mpl.rc_context({'figure.figsize': [5, 5]}):
    reg_mod.align_plot_stability(inf_aver, inf_aver, 'inferred_signature', 'inferred_signature', align=False)
Compare the cell count of each experiment with the estimated background (soup, free-floating RNA):
# Examine how many mRNA per cell on average are background
sample_name_col = 'sample'
cell_count = adata_snrna_raw.obs[sample_name_col].value_counts()
cell_count.index = [f'means_sample_effect{sample_name_col}_{i}' for i in cell_count.index]
soup_amount = reg_mod.sample_effects.sum(0)
with mpl.rc_context({'figure.figsize': [5, 5]}):
    plt.scatter(cell_count[soup_amount.index].values.flatten(),
                soup_amount.values.flatten());
    plt.xlabel('Cell count per sample'); # fraction of reads in cells
    plt.ylabel('Inferred sum of sample effects');
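Additional quality control: removing technical effects and performing standard scanpy single cell analysis workflow (removing batch effects)
This lets us judge whether the model has successfully accounted for technical factors, by checking whether removing these factors from each individual cell merges the samples/batches while keeping cell types well separated in UMAP space. Pay attention to del here: del removes a name or a container entry rather than the underlying data, and below it drops the 'log1p' entry from .uns. The tutorial also has a detail worth noting: the first PC, which explains a large amount of variance in total UMI count, is removed.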
adata_snrna_raw_cor = adata_snrna_raw.copy()
del adata_snrna_raw_cor.uns['log1p']
adata_snrna_raw_cor.X = np.array(reg_mod.normalise(adata_snrna_raw_cor.raw.X.copy()))
sc.pp.log1p(adata_snrna_raw_cor)
sc.pp.scale(adata_snrna_raw_cor, max_value=10)
# when all RNA of a given gene are additive background this results in NaN after scaling
adata_snrna_raw_cor.X[np.isnan(adata_snrna_raw_cor.X)] = 0
sc.tl.pca(adata_snrna_raw_cor, svd_solver='arpack', n_comps=80, use_highly_variable=False)
adata_snrna_raw.obs['total_counts'] = np.array(adata_snrna_raw.raw.X.sum(1)).flatten()
adata_snrna_raw_cor.obs['total_counts'] = adata_snrna_raw.obs['total_counts'].values.copy()
sc.pl.pca(adata_snrna_raw_cor, color=['total_counts'],
components=['1,2'],
color_map = 'RdPu', ncols = 2, legend_loc='on data', vmax='p99.9',
legend_fontsize=10)
# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw_cor.obsm['X_pca'] = adata_snrna_raw_cor.obsm['X_pca'][:, 1:]
adata_snrna_raw_cor.varm['PCs'] = adata_snrna_raw_cor.varm['PCs'][:, 1:]
# here we use standard neighbors function rather than bbknn
# to show that the regression model can merge batches / experiments
sc.pp.neighbors(adata_snrna_raw_cor, n_neighbors = 15, n_pcs = 79, metric='cosine')
sc.tl.umap(adata_snrna_raw_cor, min_dist = 0.8, spread = 1)
with mpl.rc_context({'figure.figsize': [7, 7],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw_cor, color=['annotation_1', 'sample', 'total_counts'],
               color_map = 'RdPu', ncols = 1, size=13, #legend_loc='on data',
               legend_fontsize=10, palette=sc.pl.palettes.default_102)
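The remark about del in this section can be checked on a plain dictionary. This is a sketch that assumes .uns behaves like an ordinary dict for this purpose; the keys and values are made up.

```python
# a plain dict standing in for adata.uns (hypothetical contents)
uns = {"log1p": {"base": None}, "pca": {"variance": [0.5, 0.3]}}
alias = uns                 # second name bound to the same dict object

del uns["log1p"]            # removes one entry from the dict itself
print("log1p" in alias)     # the alias sees the change: it is the same object

snapshot = alias
del alias                   # removes only the name `alias`
print(sorted(snapshot))     # the dict is still reachable through other names
```

So del on a key changes the container, while del on a bare name only unbinds that name; the object is freed only when nothing refers to it any more.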
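Part 2: Spatial mapping of cell types across the mouse brain (2/3) - cell2location
Loading packages and setting up GPU
First, we need to load the relevant packages and tell cell2location to use the GPU. cell2location is written in pymc3, a framework for probabilistic modelling, which uses a deep learning library called theano for heavy computation. While the package works on both GPU and CPU, using the GPU significantly shortens the computation time for 10X Visium datasets. For smaller datasets with fewer spatial locations (e.g. Nanostring WTA technology), using the CPU is more feasible.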
import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os
import gc
# this line forces theano to use the GPU and should go before importing cell2location
os.environ["THEANO_FLAGS"] = 'device=cuda0,floatX=float32,force_device=True'
# if using the CPU uncomment this:
#os.environ["THEANO_FLAGS"] = 'device=cpu,floatX=float32,openmp=True,force_device=True'
import cell2location
import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns
# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')
Loading Visium data
First, we need to download and unzip the spatial data from the data portal, and download the file with the reference cell types:
# Set paths to data and results used through the document:
sp_data_folder = './data/mouse_brain_visium_wo_cloupe_data/'
results_folder = './results/mouse_brain_snrna/'
regression_model_output = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{regression_model_output}/'
# Download and unzip spatial data
if not os.path.exists('./data'):
    os.mkdir('./data')
    os.system('cd ./data && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_wo_cloupe_data.zip')
    os.system('cd ./data && unzip mouse_brain_visium_wo_cloupe_data.zip')
# Download and unzip snRNA-seq data with signatures of reference cell types
# (if the output folder was not created by tutorial 1/3)
if not os.path.exists(reg_path):
    # makedirs creates all missing parent folders and does not fail if ./results already exists
    os.makedirs(reg_path, exist_ok=True)
    os.system(f'cd {reg_path} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/regression_model/RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes/sc.h5ad')
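Now, let's read the spatial Visium data from the 10X Space Ranger output and examine several QC plots. Here, the Visium mouse brain experiments (i.e. sections) and the corresponding histology images are loaded into a single anndata object, adata.
It is worth studying how these functions are defined, so that your own definitions don't end up looking so ugly.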
def read_and_qc(sample_name, path=sp_data_folder + 'rawdata/'):
    r""" This function reads the data for one 10X spatial experiment into the anndata object.
    It also calculates QC metrics. Modify this function if required by your workflow.
    :param sample_name: Name of the sample
    :param path: path to data
    """
    adata = sc.read_visium(path + str(sample_name),
                           count_file='filtered_feature_bc_matrix.h5', load_images=True)
    adata.obs['sample'] = sample_name
    adata.var['SYMBOL'] = adata.var_names
    adata.var.rename(columns={'gene_ids': 'ENSEMBL'}, inplace=True)
    adata.var_names = adata.var['ENSEMBL']
    adata.var.drop(columns='ENSEMBL', inplace=True)
    # Calculate QC metrics
    from scipy.sparse import csr_matrix
    adata.X = adata.X.toarray()
    sc.pp.calculate_qc_metrics(adata, inplace=True)
    adata.X = csr_matrix(adata.X)
    adata.var['mt'] = [gene.startswith('mt-') for gene in adata.var['SYMBOL']]
    adata.obs['mt_frac'] = adata[:, adata.var['mt'].tolist()].X.sum(1).A.squeeze()/adata.obs['total_counts']
    # add sample name to obs names
    adata.obs["sample"] = [str(i) for i in adata.obs['sample']]
    adata.obs_names = adata.obs["sample"] \
                      + '_' + adata.obs_names
    adata.obs.index.name = 'spot_id'
    return adata

def select_slide(adata, s, s_col='sample'):
    r""" This function selects the data for one slide from the spatial anndata object.
    :param adata: Anndata object with multiple spatial experiments
    :param s: name of selected experiment
    :param s_col: column in adata.obs listing experiment name for each location
    """
    slide = adata[adata.obs[s_col].isin([s]), :]
    s_keys = list(slide.uns['spatial'].keys())
    s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]
    slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}
    return slide
#######################
# Read the list of spatial experiments
sample_data = pd.read_csv(sp_data_folder + 'Visium_mouse.csv')
# Read the data into anndata objects
slides = []
for i in sample_data['sample_name']:
    slides.append(read_and_qc(i, path=sp_data_folder + 'rawdata/'))
# Combine anndata objects together
adata = slides[0].concatenate(
slides[1:],
batch_key="sample",
uns_merge="unique",
batch_categories=sample_data['sample_name'],
index_unique=None
)
#######################
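Note! Mitochondria-encoded genes (gene names starting with the prefix mt- or MT-) are irrelevant for spatial mapping, because their expression represents technical artifacts in the single-cell and single-nucleus data rather than the biological abundance of mitochondria. Yet these genes make up 15-40% of the mRNA at each location. Therefore, to avoid mapping artifacts, we strongly recommend removing mitochondrial genes.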
# mitochondria-encoded (MT) genes should be removed for spatial mapping
adata.obsm['mt'] = adata[:, adata.var['mt'].values].X.toarray()
adata = adata[:, ~adata.var['mt'].values]  # inverting the boolean mask with ~ is a neat idiom
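The mask-and-invert idiom used to set the mitochondrial genes aside can be tried on a small array. This is a sketch with made-up counts; the gene symbols are only examples.

```python
import numpy as np

symbols = np.array(["mt-Nd1", "Rorb", "mt-Co1", "Vip"])
X = np.array([[5, 1, 7, 0],
              [3, 0, 2, 4]])           # 2 locations x 4 genes (hypothetical counts)

is_mt = np.array([s.startswith("mt-") for s in symbols])

mt_counts = X[:, is_mt]                # kept separately, like adata.obsm['mt']
X_nuclear = X[:, ~is_mt]               # ~ flips True/False, keeping non-mt genes
mt_frac = mt_counts.sum(1) / X.sum(1)  # fraction of counts from mt genes per location
print(X_nuclear, mt_frac)
```

Storing the removed columns before filtering means the mitochondrial counts remain available for QC even though they no longer enter the mapping.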
Look at QC metrics
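Now let's look at QC: the total number of counts and the total number of genes per location across the Visium experiments.
Python's enumerate function is also worth learning.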
# PLOT QC FOR EACH SAMPLE
fig, axs = plt.subplots(len(slides), 4, figsize=(15, 4*len(slides)-4))
for i, s in enumerate(adata.obs['sample'].unique()):
    #fig.suptitle('Covariates for filtering')
    slide = select_slide(adata, s)
    sns.distplot(slide.obs['total_counts'],
                 kde=False, ax = axs[i, 0])
    axs[i, 0].set_xlim(0, adata.obs['total_counts'].max())
    axs[i, 0].set_xlabel(f'total_counts | {s}')
    sns.distplot(slide.obs['total_counts']\
                 [slide.obs['total_counts']<20000],
                 kde=False, bins=40, ax = axs[i, 1])
    axs[i, 1].set_xlim(0, 20000)
    axs[i, 1].set_xlabel(f'total_counts | {s}')
    sns.distplot(slide.obs['n_genes_by_counts'],
                 kde=False, bins=60, ax = axs[i, 2])
    axs[i, 2].set_xlim(0, adata.obs['n_genes_by_counts'].max())
    axs[i, 2].set_xlabel(f'n_genes_by_counts | {s}')
    sns.distplot(slide.obs['n_genes_by_counts']\
                 [slide.obs['n_genes_by_counts']<6000],
                 kde=False, bins=60, ax = axs[i, 3])
    axs[i, 3].set_xlim(0, 6000)
    axs[i, 3].set_xlabel(f'n_genes_by_counts | {s}')
plt.tight_layout()
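The role of enumerate in the QC loop is to pair each sample with the row index of its subplots. A minimal sketch without any plotting, where the panel names are made up for illustration:

```python
samples = ["ST8059048", "ST8059052"]
panels = ["total_counts", "total_counts (<20000)",
          "n_genes_by_counts", "n_genes_by_counts (<6000)"]   # hypothetical panel names

# enumerate yields (index, item) pairs, so i and j can address axs[i, j]
grid = {}
for i, s in enumerate(samples):
    for j, p in enumerate(panels):
        grid[(i, j)] = f"{p} | {s}"

print(grid[(1, 0)])
```

Without enumerate one would need a separate counter variable, or a range(len(...)) loop followed by indexing back into the list.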
Visualise Visium data in spatial 2D and UMAP coordinates
Visualising data in spatial coordinates with scanpy
Next, we show how to plot these QC values over the histology image using standard scanpy tools
slide = select_slide(adata, 'ST8059048')
with mpl.rc_context({'figure.figsize': [6,7],
                     'axes.facecolor': 'white'}):
    sc.pl.spatial(slide, img_key = "hires", cmap='magma',
                  library_id=list(slide.uns['spatial'].keys())[0],
                  color=['total_counts', 'n_genes_by_counts'], size=1,
                  gene_symbols='SYMBOL', show=False, return_fig=True)
Here we show how to use scanpy to plot the expression of individual genes without the histology image.
with mpl.rc_context({'figure.figsize': [6,7],
                     'axes.facecolor': 'black'}):
    sc.pl.spatial(slide,
                  color=["Rorb", "Vip"], img_key=None, size=1,
                  vmin=0, cmap='magma', vmax='p99.0',
                  gene_symbols='SYMBOL'
                  )
Add counts matrix as adata.raw
adata_vis = adata.copy()
adata_vis.raw = adata_vis
######## Select two Visium sections to speed up the analysis
Select two Visium sections, also called experiments/batches, to speed up the analysis: one from each biological replicate.
s = ['ST8059048', 'ST8059052']
adata_vis = adata_vis[adata_vis.obs['sample'].isin(s),:]
Construct and examine UMAP of locations
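Now we apply the standard scanpy processing pipeline to the spatial Visium data to show the variability in the experimental data. Importantly, this workflow shows the extent of batch differences in the data.
In this mouse brain dataset, only a few regions should differ between sections, because we use 2 samples from biological replicates that were taken at slightly different positions along the anterior-posterior axis of the mouse brain. We see a general alignment of locations from the two experiments with some mismatches; most of the difference between the experiments here comes from batch effects, which cell2location can account for.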
adata_vis_plt = adata_vis.copy()
# Log-transform (log(data + 1))
sc.pp.log1p(adata_vis_plt)
# Find highly variable genes within each sample
adata_vis_plt.var['highly_variable'] = False
for s in adata_vis_plt.obs['sample'].unique():
    adata_vis_plt_1 = adata_vis_plt[adata_vis_plt.obs['sample'].isin([s]), :]
    sc.pp.highly_variable_genes(adata_vis_plt_1, min_mean=0.0125, max_mean=5, min_disp=0.5, n_top_genes=1000)
    hvg_list = list(adata_vis_plt_1.var_names[adata_vis_plt_1.var['highly_variable']])
    adata_vis_plt.var.loc[hvg_list, 'highly_variable'] = True
# Scale the data ( (data - mean) / sd )
sc.pp.scale(adata_vis_plt, max_value=10)
# PCA, KNN construction, UMAP
sc.tl.pca(adata_vis_plt, svd_solver='arpack', n_comps=40, use_highly_variable=True)
sc.pp.neighbors(adata_vis_plt, n_neighbors = 20, n_pcs = 40, metric='cosine')
sc.tl.umap(adata_vis_plt, min_dist = 0.3, spread = 1)
with mpl.rc_context({'figure.figsize': [8, 8],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_vis_plt, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 1, #legend_loc='on data',
               legend_fontsize=10)
Load reference cell type signature from snRNA-seq data and show UMAP of cells
Next, we load the pre-processed snRNA-seq reference anndata object, which contains the estimated expression signatures of the reference cell types.
## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
Export reference expression signatures of cell types
# Column name containing cell type annotations
covariate_col_names = 'annotation_1'
# Extract a pd.DataFrame with signatures from anndata object
inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
# strip the 'means_cov_effect_<covariate>_' prefix so that columns are named by cell type
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_', '', i) for i in inf_aver.columns]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]
# normalise by average experiment scaling factor (corrects for sequencing depth)
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()
Quick look at the cell type composition in our reference data in UMAP coordinates
with mpl.rc_context({'figure.figsize': [10, 10],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
               color_map = 'RdPu', ncols = 1, legend_loc='on data',
               legend_fontsize=10)
Cell2location model description and analysis pipeline
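Cell2location is implemented as an interpretable hierarchical Bayesian model, thereby (1) providing a principled way to account for model uncertainty; (2) accounting for linear dependencies in cell type abundances; (3) modelling differences in measurement sensitivity across technologies; and (4) accounting for unexplained/residual variation by employing a flexible count-based error model. Finally, (5) cell2location supports joint modelling of multiple experiments/batches.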
Brief description of the model
Briefly, cell2location is a Bayesian model, which estimates absolute cell density of cell types by decomposing mRNA counts (d_{s,g}) of each gene (g={1, .., G}) at locations (s={1, .., S}) into a set of predefined reference signatures of cell types (g_{f,g}). Joint modelling mode works across experiments (e={1,..,E}), such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads).
Cell2location models the elements of (d_{s,g}) as Negative Binomial distributed, given an unobserved rate (\mu_{s,g}) and a gene-specific over-dispersion parameter (\alpha_{eg}):
\[ d_{s,g} \sim \mathtt{NB}(\mu_{s,g}, \alpha_{eg}) \]
The expression level of genes (\mu_{s,g}) in the mRNA count space is modelled as a linear function of expression signatures of reference cell types:
\[ \mu_{s,g} = \underbrace{m_{g}}_{\text{technology sensitivity}} \cdot \underbrace{\left( \sum_{f} w_{s,f} \: g_{f,g} \right)}_{\text{cell type contributions}} + \underbrace{l_s + s_{eg}}_{\text{additive shift}} \]
where, (w_{s,f}) denotes regression weight of each reference signature (f) at location (s), which can be interpreted as the number of cells at location (s) that express reference signature (f); (m_{g}) a gene-specific scaling parameter, which adjusts for global differences in sensitivity between technologies; (l_s) and (s_{eg}) are additive variables that account for gene- and location-specific shift, such as due to contaminating or free-floating RNA.
To account for the similarity of location patterns across cell types, (w_{s,f}) is modelled using another layer of decomposition (factorization) using (r={1, .., R}) groups of cell types, that can be interpreted as cellular compartments or tissue zones (Suppl. Methods). Unless stated otherwise, (R) is set to 50.
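The mean of the model can be written out in a few lines of numpy. This is a sketch with simulated values for a single experiment; the way the over-dispersion alpha enters the sampler (numpy's n and p arguments) is an assumption of this sketch, not the package's exact parameterisation.

```python
import numpy as np

rng = np.random.default_rng(1)
S, F, G = 5, 3, 4                                # locations, cell types, genes

w_sf = rng.gamma(1.0, 2.0, size=(S, F))          # cell abundance w_{s,f}
g_fg = rng.gamma(2.0, 1.0, size=(F, G))          # reference signatures g_{f,g}
m_g = rng.gamma(4.0, 1 / 8, size=G)              # technology sensitivity m_g
l_s = rng.gamma(1.0, 0.1, size=(S, 1))           # location-specific shift l_s
s_eg = rng.gamma(1.0, 0.05, size=(1, G))         # gene-specific shift s_{eg}, one experiment

# mu_{s,g} = m_g * sum_f w_{s,f} g_{f,g} + l_s + s_{eg}
mu_sg = m_g * (w_sf @ g_fg) + l_s + s_eg

# Negative Binomial draw with mean mu_sg (assumed parameterisation)
alpha = 10.0
d_sg = rng.negative_binomial(alpha, alpha / (alpha + mu_sg))
print(mu_sg.shape, d_sg.shape)
```

With n = alpha and p = alpha / (alpha + mu), numpy's Negative Binomial has mean mu, so larger alpha means counts closer to Poisson.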
Selecting hyper-parameters
Note! While the scaling parameter (m_{g}) facilitates the integration across technologies, it leads to non-identifiability between (m_{g}) and (w_{s,f}) unless informative priors on both variables are used. To address this, we employ informative prior distributions on (w_{s,f}) and (m_{g}), which are controlled by 4 user-provided hyper-parameters. For guidance on selecting these hyper-parameters see below and Suppl. Methods (Section 1.3).
For the mouse brain we suggest using the following values for the 4 user-provided hyper-parameters:
1. (\hat{N} = 8), the expected number of cells per location, estimated based on comparison to histology image;
2. (\hat{A} = 9), the expected number of cell types per location, assuming that most cells in a given location belong to a different type and that many locations contain cell processes rather than complete cells;
3. (\hat{Y} = 5), the expected number of co-located cell type groups per location, assuming that very few cell types have linearly dependent abundance patterns, except for the regional astrocytes and corresponding neurons such that on average about 2 cell types per group are expected (\hat{A}/\hat{Y}=1.8);
4. mean and variance that define hyperprior on gene-specific scaling parameter (m_{g}), allowing the user to define prior beliefs on the sensitivity of spatial technology compared to the scRNA-seq reference.
Joint modelling of multiple experiments
Joint modelling of spatial data sets from multiple experiments provides several benefits due to sharing information between experiments (such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads)):
- Increasing accuracy by improving the ability of the model to distinguish low sensitivity (m_{g}) from zero cell abundance (w_{r,f}), which is achieved by sharing the change in sensitivity between technologies (m_{g}) across experiments. Similarly to common practice in single cell data analysis, this is equivalent to regressing out the effect of technology but not the effect of individual experiment.
- Increasing sensitivity by sharing information on cell types with co-varying abundances during decomposition of (w_{s,f}) into groups of cell types (r={1, .., R}).
Training cell2location: specifying data input and hyper-parameters
Here we show how to train the cell2location model to estimate cell abundances at each location. This workflow is wrapped into a single pipeline call:
sc.settings.set_figure_params(dpi = 100, color_map = 'viridis', dpi_save = 100,
vector_friendly = True, format = 'pdf',
facecolor='white')
r = cell2location.run_cell2location(
    # Single cell reference signatures as pd.DataFrame
    # (could also be data as anndata object for estimating signatures
    # as cluster average expression - `sc_data=adata_snrna_raw`)
    sc_data=inf_aver,
    # Spatial data as anndata object
    sp_data=adata_vis,
    # the column in sc_data.obs that gives cluster identity of each cell
    summ_sc_data_args={'cluster_col': "annotation_1",
                       },
    train_args={'use_raw': True, # By default uses raw slots in both of the input datasets.
                'n_iter': 40000, # Increase the number of iterations if needed (see QC below)
                # When analysing data that contains multiple experiments,
                # cell2location automatically enters the mode which pools information across experiments
                'sample_name_col': 'sample'}, # Column in sp_data.obs with experiment ID (see above)
    export_args={'path': results_folder, # path where to save results
                 'run_name_suffix': '' # optional suffix to modify the name of the run
                 },
    model_kwargs={ # Prior on the number of cells, cell types and co-located groups
        'cell_number_prior': {
            # - N - the expected number of cells per location:
            'cells_per_spot': 8, # < - change this
            # - A - the expected number of cell types per location (use default):
            'factors_per_spot': 7,
            # - Y - the expected number of co-located cell type groups per location (use default):
            'combs_per_spot': 7
        },
        # Prior beliefs on the sensitivity of spatial technology:
        'gene_level_prior':{
            # Prior on the mean
            'mean': 1/2,
            # Prior on standard deviation,
            # a good choice of this value should be at least 2 times lower than the mean
            'sd': 1/4
        }
    }
)
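To get a feeling for what a prior with mean 1/2 and standard deviation 1/4 looks like, the sketch below matches a Gamma distribution to these two moments. That the prior has a Gamma form is an assumption made here for illustration; check the package source for the distribution it actually uses.

```python
import numpy as np

mean, sd = 1 / 2, 1 / 4        # the values passed as 'gene_level_prior'

# Assumed for illustration: Gamma distribution with this mean and sd
shape = (mean / sd) ** 2       # mean^2 / sd^2
rate = mean / sd ** 2          # mean / sd^2

rng = np.random.default_rng(0)
draws = rng.gamma(shape, 1 / rate, size=100_000)   # numpy takes scale = 1 / rate
print(shape, rate, draws.mean(), draws.std())
```

Under this assumption, keeping sd at half the mean or lower gives a shape of at least 4, which concentrates the prior away from zero sensitivity.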
####### Cell2location model output
The results are saved to:
results_folder + r['run_name']
The absolute abundances of cell types are added to sp_data as columns of sp_data.obs. The estimates of all parameters in the model are exported to sp_data.uns['mod']
List of output files:
- sp.h5ad - Anndata object with all results and spatial data.
- W_cell_density.csv - absolute abundances of cell types, mean of the posterior distribution.
- (default) - W_cell_density_q05.csv - absolute abundances of cell types, 5% quantile of the posterior distribution representing confident cell abundance level.
- W_mRNA_count.csv - absolute mRNA abundance for each cell type, mean of the posterior distribution.
- (useful for QC, selecting mapped cell types) - W_mRNA_count_q05.csv - absolute mRNA abundance for each cell type, 5% quantile of the posterior distribution representing confident cell abundance level.
Evaluating training
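We need to check whether our model trained successfully by examining a few diagnostic plots.
First, we look at the ELBO loss/cost function over training iterations. This plot omits the first 20% of training iterations, during which the loss changes by many orders of magnitude. Here we see that the model has converged by the end of training; some noise in the ELBO loss function is acceptable. If there are large changes during the last few thousand iterations, we recommend increasing the 'n_iter' parameter. (This part requires quite a bit of mathematical background.)
Differences in the ELBO loss between training iterations indicate training problems, possibly caused by an incomplete or insufficiently detailed cell type reference.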
from IPython.display import Image
Image(filename=results_folder +r['run_name']+'/plots/training_history_without_first_20perc.png',
width=400)
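The visual check of the loss curve can be complemented with a simple numerical one. This is a sketch on a simulated loss trace: the 2000-iteration windows and the 1e-3 threshold are arbitrary choices made here, not values from cell2location.

```python
import numpy as np

rng = np.random.default_rng(2)
n_iter = 40000
it = np.arange(n_iter)

# hypothetical loss trace: fast initial decay, then a noisy plateau
loss = 1e6 * np.exp(-it / 3000) + 5e4 + rng.normal(0, 50, n_iter)

tail = loss[int(0.2 * n_iter):]     # drop the first 20%, as the diagnostic plot does
last = tail[-2000:]                 # most recent 2000 iterations
prev = tail[-4000:-2000]            # the 2000 iterations before that

# relative change of the mean loss between the two windows
rel_change = abs(last.mean() - prev.mean()) / abs(prev.mean())
converged = rel_change < 1e-3       # arbitrary illustrative threshold
print(rel_change, converged)
```

If a check of this kind still shows a clear drift over the last few thousand iterations, that is the situation in which raising 'n_iter' is recommended.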
We also need to evaluate the reconstruction accuracy: how well reference cell type signatures explain spatial data by comparing expected value of the model (\mu_{s,g}) (Negative Binomial mean) to observed count of each gene across locations. The ideal case is a perfect diagonal 2D histogram plot (across genes and locations).
A very fuzzy diagonal or large deviations of some genes and locations from the diagonal plot indicate that the reference signatures are incomplete. The reference could be missing certain cell types entirely (e.g. FACS-sorting one cell lineage) or clustering could be not sufficiently granular (e.g. mapping 5-10 broad cell types to a complex tissue). Below is an example of good performance:
Image(filename=results_folder +r['run_name']+'/plots/data_vs_posterior_mean.png',
width=400)
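Finally, we need to evaluate the robustness of the identified locations by comparing the consistency of estimated cell abundances between two independent training restarts (X-axis and Y-axis). The plot below shows the correlation (colour) between cell abundance profiles from the 2 training restarts. Some cell types may be correlated, but excessive deviation from the diagonal would indicate instability of the solution.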
Image(filename=results_folder +r['run_name']+'/plots/evaluate_stability.png',
width=400)
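What the stability plot measures can be reproduced with a cross-correlation matrix. This is a sketch on simulated abundances, where both "restarts" are the same ground truth plus independent noise; it is not the package's own plotting function.

```python
import numpy as np

rng = np.random.default_rng(3)
n_types = 4

truth = rng.gamma(1.0, 2.0, size=(200, n_types))     # locations x cell types
run1 = truth + rng.normal(0, 0.1, truth.shape)       # hypothetical restart 1
run2 = truth + rng.normal(0, 0.1, truth.shape)       # hypothetical restart 2

# correlate every cell type of run1 with every cell type of run2
corr = np.corrcoef(run1.T, run2.T)[:n_types, n_types:]
diag = np.diag(corr)                                 # same cell type in both runs
off_diag = corr[~np.eye(n_types, dtype=bool)]        # different cell types
print(diag.round(3), np.abs(off_diag).max().round(3))
```

A stable solution shows correlations close to 1 on the diagonal; strong off-diagonal values mean that two cell types have similar abundance profiles.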
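Part 3: joint analysis of single-cell and spatial data
Loading cell2location model output
First, let's load the cell2location results. In the export step of the cell2location pipeline, cell type abundances across locations are added to sp_data as columns of sp_data.obs, and all parameters of the model are exported to sp_data.uns['mod']. This anndata object and a csv file with spot locations, W.csv / W_q05.csv, are saved to the results directory.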
Normally, you would have the output on your system (e.g. by running tutorial 2/3), however, you could also start with the output from our data portal:
results_folder = './results/mouse_brain_snrna/'
r = {'run_name': 'LocationModelLinearDependentWMultiExperiment_2experiments_59clusters_5563locations_12809genes'}
# defining useful function
def select_slide(adata, s, s_col='sample'):
    r""" Select data for one slide from the spatial anndata object.
    :param adata: Anndata object with multiple spatial samples
    :param s: name of selected sample
    :param s_col: column in adata.obs listing sample name for each location
    """
    slide = adata[adata.obs[s_col].isin([s]), :]
    s_keys = list(slide.uns['spatial'].keys())
    s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]
    slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}
    return slide
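The least obvious line in select_slide is the boolean-mask lookup of the matching key in uns['spatial']. A small sketch of that idiom on its own follows; the key names are hypothetical, since the real keys depend on how the Visium data were loaded.

```python
import numpy as np

# Hypothetical keys of adata.uns['spatial'], one per Visium section
s_keys = ['spaceranger_ST8059048', 'spaceranger_ST8059052']
s = 'ST8059052'

# Boolean mask: True where the sample name occurs inside the key
mask = [s in k for k in s_keys]

# Indexing a NumPy array with the mask keeps the matching keys; [0] takes the first
s_spatial = np.array(s_keys)[mask][0]
print(s_spatial)  # spaceranger_ST8059052
```

Because the match is a substring test, a sample name that is a prefix of another sample name would match both keys, and only the first would be used.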
if not os.path.exists(f'{results_folder}{r["run_name"]}'):
    # create ./results/ and the results folder if they are missing
    os.makedirs(results_folder, exist_ok=True)
    # download and unpack the pre-computed results
    os.system(f'cd {results_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_results/{r["run_name"]}.zip')
    os.system(f'cd {results_folder} && unzip {r["run_name"]}.zip')
We load the results of the model saved into the adata_vis Anndata object:
sp_data_file = results_folder +r['run_name']+'/sp.h5ad'
adata_vis = anndata.read_h5ad(sp_data_file)
Visualisation of cell type locations
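First, we learn how to visualise cell type locations using the standard scanpy plotting tool sc.pl.spatial and our custom tool, cell2location.plt.mapping_video.plot_spatial, which uses colour interpolation to show several cell types in one plot.
Cell2location estimates the absolute cell abundance and the mRNA abundance of the reference cell types. For both measures, the 5% quantile of the posterior distribution is used to display the results, representing a confident (conservative) level of cell abundance and mRNA counts.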
For completeness, for each visium section, sc.pl.spatial was used to produce 4 figure panels showing the locations of all cell types (cell and mRNA abundance, 5% and the mean of the posterior distribution), saved to r['run_name']/plots/spatial/.
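To make the difference between the posterior mean and the 5% quantile concrete, here is a minimal sketch on simulated posterior samples. The gamma-distributed samples are purely illustrative and are not drawn from the cell2location model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical posterior samples of cell abundance at one location:
# 1000 samples x 3 cell types (high, moderate and very low abundance)
samples = rng.gamma(shape=[8.0, 2.0, 0.5], scale=1.0, size=(1000, 3))

post_mean = samples.mean(axis=0)
post_q05 = np.quantile(samples, 0.05, axis=0)  # value exceeded by 95% of the samples

print("mean:", np.round(post_mean, 2))
print("q05: ", np.round(post_q05, 2))
```

The 5% quantile stays close to zero for the uncertain, low-abundance cell type, which is why it works as a conservative summary for plotting.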
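Here, we visualise the locations of multiple cell types in one plot using absolute cell density (5% quantile); the model parameter representing this is called q05_spot_factors. Shown are 6 neuronal and glial cell types that map to 6 different regions of the mouse brain.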
from cell2location.plt.mapping_video import plot_spatial
# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
slide = select_slide(adata_vis, 'ST8059048')
with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                       coords=slide.obsm['spatial'] \
                       * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                       show_img=True, img_alpha=0.8,
                       style='fast', # fast or dark_background
                       img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                       circle_diameter=6, colorbar_position='right')
We can produce this visualisation on a dark background by setting style='dark_background' and hiding the image with img_alpha=0.
with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                       coords=slide.obsm['spatial'] \
                       * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                       show_img=True, img_alpha=0,
                       style='dark_background', # fast or dark_background
                       img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                       circle_diameter=6, colorbar_position='right')
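Now we compare the cell abundance estimates (plot above) with the estimated mRNA abundance of each cell type. This is often useful for identifying which cell types did not map to a particular tissue (mRNA count < 50; note the maximum value on the colour bar). The model parameter representing this is called q05_nUMI_factors.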
# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_nUMI_factors' + str(i) for i in sel_clust]
slide = select_slide(adata_vis, 'ST8059048')
with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                       coords=slide.obsm['spatial'] \
                       * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                       show_img=True, img_alpha=0.8, max_color_quantile=0.98,
                       img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                       circle_diameter=6, colorbar_position='right')
#sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
#sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')
# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
    # add slide.obs with nice names
    slide.obs[sel_clust] = (slide.obs[sel_clust_col])
    sc.pl.spatial(slide, cmap='magma',
                  color=sel_clust[0:6], # limit size in this notebook
                  ncols=3,
                  size=0.8, img_key='hires',
                  alpha_img=0.9,
                  vmin=0, vmax='p98'
                  )
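The rule of thumb mentioned above (mRNA count < 50) can also be checked programmatically rather than by reading the colour bar. The sketch below uses invented cell type names and invented maxima, only to show the shape of such a check.

```python
# Hypothetical per-cell-type maxima of the q05 mRNA abundance on one section
# (names and numbers are made up for illustration)
max_numi = {'type_A': 812.4, 'type_B': 37.9, 'type_C': 455.0, 'type_D': 12.3}

threshold = 50  # rule of thumb quoted in the text above
not_mapped = sorted(ct for ct, v in max_numi.items() if v < threshold)
print(not_mapped)  # ['type_B', 'type_D']
```

With the objects used in this post, the maxima could presumably be taken from the q05_nUMI_factors columns of slide.obs, for example with the pandas max() method.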
Next, we show how to use the standard scanpy pipeline to plot cell locations over histology images (for more extensive information refer to scanpy):
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')
# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
    # add slide.obs with nice names
    slide.obs[sel_clust] = (slide.obs[sel_clust_col])
    sc.pl.spatial(slide, cmap='magma',
                  color=sel_clust[0:6], # limit size in this notebook
                  ncols=3,
                  size=0.8, img_key='hires',
                  alpha_img=0.9,
                  vmin=0, vmax='p99.2'
                  )
Identifying tissue regions by clustering
We identify tissue regions that differ in their cell composition by clustering locations using cell abundance estimated by cell2location.
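We find tissue regions by clustering Visium spots using the estimated cell abundance of each cell type. We construct a K-nearest neighbour (KNN) graph representing the similarity of locations in estimated cell abundance and then apply Leiden clustering. The number of KNN neighbours should be adapted to the size of the dataset and the size of the anatomically defined regions (e.g. the hippocampal regions are rather small and could be masked by a large n_neighbors). This can be repeated for a range of KNN neighbours and Leiden clustering resolutions until the clusters match the anatomical structure of the tissue.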
The clustering is done jointly across all Visium sections / batches, hence the region identities are directly comparable. When there are strong technical effects between multiple batches (not the case here) sc.external.pp.bbknn can be used to account for those effects during the KNN construction.
The resulting clusters are saved in adata_vis.obs['region_cluster'].
sample_type = 'q05_nUMI_factors'
col_ind = [sample_type in i for i in adata_vis.obs.columns.tolist()]
adata_vis.obsm[sample_type] = adata_vis.obs.loc[:,col_ind].values
# compute KNN using the cell2location output
sc.pp.neighbors(adata_vis, use_rep=sample_type,
n_neighbors = 20)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"]
adata_vis.obs["region_cluster"] = adata_vis.obs["region_cluster"].astype("category")
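To show what the KNN step does with the abundance matrix, here is a brute-force sketch in NumPy. It is not how scanpy builds its graph (scanpy uses its own neighbour search and connectivity weighting); the function name knn_indices and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical abundance matrix: 6 locations x 2 cell types, forming two obvious groups
X = np.vstack([rng.normal(0.0, 0.1, size=(3, 2)),
               rng.normal(5.0, 0.1, size=(3, 2))])

def knn_indices(X, n_neighbors):
    """Indices of the n_neighbors closest rows (Euclidean), excluding each row itself."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a location is not its own neighbour
    return np.argsort(dist, axis=1)[:, :n_neighbors]

nn = knn_indices(X, n_neighbors=2)
print(nn)
```

With n_neighbors=2 every location is linked only to its own group; asking for more neighbours than a group has members forces links across groups, which is the masking effect of a large n_neighbors on small regions described above.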
Visualise the regions in UMAP based on cell abundances and in 2D
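Here, we use the same KNN graph, which represents the similarity of locations in terms of cell abundance, to perform a UMAP projection of all locations. We can see that cell2location successfully integrated the 2 sections. Regions with similar positions in the cortex (2D plots below) are composed of spots from both samples (e.g. region clusters 14, 16 and 0, corresponding to cortical layers L4, L5 and L6).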
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
with mpl.rc_context({'figure.figsize': (8, 8)}):
    sc.pl.umap(adata_vis, color=['sample', 'region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=10)
# Plot the region identity of each location in 2D space
# Plotting UMAP of integrated datasets before 2D plots of separate sections ensures
# consistency of colour scheme via `adata_vis.uns['region_cluster_colors']`.
with mpl.rc_context({'figure.figsize': (5, 6)}):
    sc.pl.spatial(select_slide(adata_vis, 'ST8059048'),
                  color=["region_cluster"], img_key=None
                  )
    sc.pl.spatial(select_slide(adata_vis, 'ST8059052'),
                  color=["region_cluster"], img_key=None
                  )
Export regions for import to 10X Loupe Browser
Our region maps can be visualised over the histology image and explored interactively using the 10X Loupe Browser (see the 10X website for instructions).
# save maps for each sample separately
sam = np.array(adata_vis.obs['sample'])
for i in np.unique(sam):
    s1 = adata_vis.obs[['region_cluster']]
    s1 = s1.loc[sam == i]
    s1.index = [x[10:] for x in s1.index]
    s1.index.name = 'Barcode'
    s1.to_csv(results_folder +r['run_name']+'/region_cluster29_' + i + '.csv')
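The slice x[10:] in the loop above is easy to misread. Assuming the observation names have the form '<sample>_<barcode>' (an assumption; the barcodes below are invented), it strips the 10-character sample prefix so that only the barcode is left in the exported CSV:

```python
# Assumed form of adata_vis.obs_names: '<sample>_<barcode>' (barcodes are made up)
obs_names = ['ST8059048_AAACAAGTATCTCCCA-1', 'ST8059048_AAACAGAGCGACTCCT-1']

# 'ST8059048_' is 10 characters long, so x[10:] leaves the bare barcode
barcodes = [x[10:] for x in obs_names]
print(barcodes)  # ['AAACAAGTATCTCCCA-1', 'AAACAGAGCGACTCCT-1']

# A variant that does not depend on the length of the sample name
barcodes_split = [x.split('_', 1)[1] for x in obs_names]
```

The split-based variant is safer when sample names differ in length, as long as the sample name itself contains no underscore.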
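Identify groups of co-located cell types using matrix factorisation
Here, we use the estimated cell abundances as input to non-negative matrix factorisation (NMF) to identify groups of co-located cell types (R), which can be interpreted as cellular compartments or tissue zones. Intuitively, we assume that cell interactions can drive linear dependencies in cell type abundance. In addition, we observe that tissues with highly spatially interlaced cell types, such as human lymph node, are described better by NMF than by discrete clustering.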
Tip: If you want to find a few most distinct cellular compartments, use a small number of factors. If you want to find very strong co-location signal and assume that most cell types don't co-locate, use a lot of factors (> 30 - used here). In practice, it is better to train NMF for a range of factors (R={5, .., 30}) and select R as a balance between capturing fine tissue zones and splitting known compartments.
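For readers who have not used NMF before, the sketch below factorises a synthetic abundance matrix with the classic multiplicative-update rule and compares two choices of R. This is a stand-in technique chosen for brevity, not the scikit-learn solver that cell2location wraps; the function name nmf_multiplicative and the data are hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, n_fact, n_iter=500, seed=0, eps=1e-9):
    """Factorise non-negative X (locations x cell types) as W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_fact)) + eps   # abundance of each group per location
    H = rng.random((n_fact, X.shape[1])) + eps   # contribution of each group to each cell type
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic abundances generated from 2 hidden groups of co-located cell types
rng = np.random.default_rng(4)
W_true = rng.random((100, 2))
H_true = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
X = W_true @ H_true

for R in (1, 2):
    W, H = nmf_multiplicative(X, n_fact=R)
    err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
    print(f"R={R}: relative reconstruction error {err:.3f}")
```

Scanning R this way and watching where the error stops improving is one simple heuristic; the tip above suggests additionally checking whether known compartments start to be split.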
# number of cell type combinations - educated guess assuming that most cell types don't co-locate
n_fact = int(30)
# extract cell abundance from cell2location
X_data = adata_vis.uns['mod']['post_sample_q05']['spot_factors']
import cell2location.models as c2l
# create model class
mod_sk = c2l.CoLocatedGroupsSklearnNMF(n_fact, X_data,
                                       n_iter = 10000,
                                       verbose = True,
                                       var_names=adata_vis.uns['mod']['fact_names'],
                                       obs_names=adata_vis.obs_names,
                                       fact_names=['fact_' + str(i) for i in range(n_fact)],
                                       sample_id=adata_vis.obs['sample'],
                                       init='random', random_state=0,
                                       nmf_kwd_args={'tol':0.0001})
# train 5 times to evaluate stability
mod_sk.fit(n=5, n_type='restart')
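Now, let's examine some diagnostic plots. First, you can see that most cell type combinations are consistent between training restarts of this model (a diagonal with high correlation). The first restart (Y axis) is the one used downstream, so we can note that factors 21, 23 and 25 (0-based) are not very robust.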
## Do some diagnostics
# evaluate stability by comparing training restarts
with mpl.rc_context({'figure.figsize': (10, 8)}):
    mod_sk.evaluate_stability('cell_type_factors', align=True)
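Because NMF factors come out in an arbitrary order, restarts have to be matched before they can be compared. The sketch below shows one simple way to do that, by picking the best-correlated factor; whether align=True works exactly like this inside cell2location is an assumption, and the loadings are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical cell type loadings (cell types x factors) from two restarts:
# the second restart holds the same factors in a different order, plus a little noise
restart_1 = rng.random((20, 4))
order = [2, 0, 3, 1]
restart_2 = restart_1[:, order] + rng.normal(scale=0.01, size=(20, 4))

n_fact = restart_1.shape[1]
# Cross-correlation block: rows are factors of restart 1, columns factors of restart 2
corr = np.corrcoef(restart_1.T, restart_2.T)[:n_fact, n_fact:]

# For every factor of restart 1, the best-correlated factor of restart 2
best_match = corr.argmax(axis=1)
print(best_match)
```

A factor without a clearly best-correlated partner in the other restart is the kind of non-robust factor discussed above.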
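Next, we evaluate how accurately the NMF cell type groups explain the abundance of individual cell types. You should see a diagonal 2D histogram comparing the input cell density data (X axis) with the model's estimates (Y axis). Here, we see some small deviations for low-abundance cell types.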
# evaluate accuracy of the model
mod_sk.compute_expected()
mod_sk.plot_posterior_mu_vs_data()
Finally, let’s investigate the composition of each NMF cell type group. We use our model to compute the relative contribution of NMF groups to each cell type ('cell_type_fractions' e.g. 45% of cell abundance of Astro_THAL_hab can be explained by fact_10). Note: factors are exchangeable so while you find consistent factors, each model training restart will output those factors in a different order.
Here we export these parameters from the model into adata_vis.uns['mod_sklearn'] in the spatial anndata object, and print the cell types most specific to each NMF group:
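One plausible reading of 'cell_type_fractions' is that the loadings of each cell type are normalised across factors so that they sum to 1. The sketch below illustrates that reading on made-up numbers; the exact computation inside cell2location may differ.

```python
import numpy as np

# Hypothetical NMF loadings: 3 cell types (rows) x 2 factors (columns)
cell_type_factors = np.array([[9.0, 1.0],
                              [2.0, 2.0],
                              [0.0, 5.0]])

# Normalise each row so that the factor contributions of a cell type sum to 1
cell_type_fractions = cell_type_factors / cell_type_factors.sum(axis=1, keepdims=True)
print(cell_type_fractions)
```

Under this reading, the first cell type has 90% of its abundance explained by the first factor, in the same sense as the 45% example in the text.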
# extract parameters into DataFrames
mod_sk.sample2df(node_name='nUMI_factors', ct_node_name = 'cell_type_factors')
# export results to scanpy object
adata_vis = mod_sk.annotate_adata(adata_vis) # as columns to .obs
adata_vis = mod_sk.export2adata(adata_vis, slot_name='mod_sklearn') # as a slot in .uns
# print the fraction of cells of each type located to each combination
mod_sk.print_gene_loadings(loadings_attr='cell_type_fractions',
gene_fact_name='cell_type_fractions')
# make nice names
from re import sub
mod_sk.cell_type_fractions.columns = [sub('mean_cell_type_factors', '', i)
                                      for i in mod_sk.cell_type_fractions.columns]
# plot co-occurring cell type combinations
mod_sk.plot_gene_loadings(mod_sk.var_names_read, mod_sk.var_names_read,
                          fact_filt=mod_sk.fact_filt,
                          loadings_attr='cell_type_fractions',
                          gene_fact_name='cell_type_fractions',
                          cmap='RdPu', figsize=[10, 15])
Finally, we need to examine the abundance of each cell type group across locations:
# plot cell density in each combination
with mpl.rc_context({'figure.figsize': (5, 6), 'axes.facecolor': 'black'}):
    # select one section correctly subsetting histology image data
    slide = select_slide(adata_vis, 'ST8059048')
    sc.pl.spatial(slide,
                  cmap='magma',
                  color=mod_sk.location_factors_df.columns,
                  ncols=6,
                  size=1, img_key='hires',
                  alpha_img=0,
                  vmin=0, vmax='p99.2'
                  )
Now we save the NMF model object to work with later (remember, every time you train the model, factors with the same composition will have a different order):
# save co-location models object
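import pickle
def pickle_model(mod, path, file_suffix=''):
    # file name encodes the model class, the number of factors and an optional suffix
    file = path + 'model_' + str(mod.__class__.__name__) + '_' + str(mod.n_fact) + '_' + file_suffix + ".p"
    # the context manager closes the file handle after writing
    with open(file, "wb") as f:
        pickle.dump({'mod': mod, 'fact_names': mod.fact_names}, f)
    print(file)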
pickle_model(mod_sk, results_folder +r['run_name'] + '/', file_suffix='')
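Reading the model back later is the mirror image of the call above. The sketch below round-trips a small stand-in dictionary through a temporary file; the file name and the contents are hypothetical, and the real path is the one printed by pickle_model.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary saved by pickle_model (the real 'mod' is the NMF model object)
saved = {'mod': {'n_fact': 30}, 'fact_names': ['fact_0', 'fact_1']}

# Hypothetical file name in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'model_example.p')
with open(path, 'wb') as f:
    pickle.dump(saved, f)

# Reload in a later session
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded['fact_names'])  # ['fact_0', 'fact_1']
```

Unpickling the real model object requires that the class it was created from can be imported, so cell2location needs to be installed in the session that loads the file.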
To aid this analysis, we wrapped the analysis shown above into a pipeline that automates training the NMF model with varying number of factors (including export of the same plots and data as shown above).
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
    adata_vis, model_name='CoLocatedGroupsSklearnNMF',
    verbose=False, return_all=True,
    train_args={
        'n_fact': np.arange(10, 40), # IMPORTANT: range of number of factors (10 to 39 here; np.arange excludes the end value)
        'n_iter': 20000, # maximum number of training iterations
        'sample_name_col': 'sample', # column in adata_vis.obs that identifies the sample
        'mode': 'normal',
        'n_type': 'restart', 'n_restarts': 5 # number of training restarts
    },
    model_kwargs={'init': 'random', 'random_state': 0, 'nmf_kwd_args': {'tol': 0.00001}},
    posterior_args={},
    export_args={'path': results_folder + 'std_model/'+r['run_name']+'/CoLocatedComb/',
                 'run_name_suffix': ''})
Life is good, and even better with you.