2020-04-13
Reference: A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia
Goal: use the tools and methods described in the paper to reproduce its main scRNA-seq analysis figures
Preface
The most effective way to learn is by doing. The paper I start reproducing today is A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia, published in Cell in 2019. The analysis code has not been released, so I can only try to reproduce the results as closely as possible from the description in the paper.
Abstract
Stroma is a poorly defined non-parenchymal component of virtually every organ with key roles in organ development, homeostasis, and repair. Studies of the bone marrow stroma have defined individual populations in the stem cell niche regulating hematopoietic regeneration and capable of initiating leukemia. Here, we use single-cell RNA sequencing (scRNAseq) to define a cellular taxonomy of the mouse bone marrow stroma and its perturbation by malignancy. We identified seventeen stromal subsets expressing distinct hematopoietic regulatory genes spanning new fibroblastic and osteoblastic subpopulations including distinct osteoblast differentiation trajectories. Emerging acute myeloid leukemia impaired mesenchymal osteogenic differentiation and reduced regulatory molecules necessary for normal hematopoiesis. These data suggest that tissue stroma responds to malignant cells by disadvantaging normal parenchymal cells. Our taxonomy of the stromal compartment provides a comprehensive bone marrow cell census and experimental support for cancer cell crosstalk with specific stromal elements to impair normal tissue function and thereby enable emergent cancer.
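Outline of the paper
- Single-cell sequencing of normal mouse bone marrow stroma identified 17 cell subsets and their gene expression signatures, as well as the stromal cells that express key niche factors during steady-state hematopoiesis.
- The differentiation relationships among the cell subsets were then inferred.
- The overall impact of acute myeloid leukemia (AML) on the bone marrow microenvironment was characterized, along with the associated cellular and molecular abnormalities.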
scRNA-seq
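- Library preparation: Chromium Single Cell 3′ v2 Reagent Kit (10x Genomics)
- Upstream analysis: Cellranger toolkit (version 2.0.1, 10X Genomics)
- Expression matrix download: GSE128423
- Downstream analysis: Seurat v2.3.4, destiny, etc.
With this many samples, rerunning the Cellranger pipeline would take a long time, so we start directly from the downloaded expression matrices. Organize the files in the Cellranger output layout: each dataset goes in its own folder containing three files, matrix.mtx.gz, barcodes.tsv.gz and genes.tsv.gz (renamed features.tsv.gz from CellRanger 3.0 onward). Here I manually switched to the 3.0 naming so that Seurat can read the files while they stay .gz-compressed.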
lyc@lyc-VirtualBox:~/lyc-1995/BM_Mouse/data$ tree
.
├── b1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_10May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_16May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_26May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_7Jun
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_8May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_10May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_26May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_31May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_8May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std5
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
└── std6
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
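If a sample folder still uses the older genes.tsv.gz name, the manual renaming described above could be scripted roughly as follows. This is only a sketch: to_v3_layout is a hypothetical helper (not part of Seurat or Cell Ranger), and it assumes each sample folder already contains the three gzipped files.

```r
# Hypothetical helper: switch one sample folder to the Cell Ranger 3.0 file names
# and check that the three files Read10X() looks for are present.
to_v3_layout <- function(sample.dir) {
  old <- file.path(sample.dir, "genes.tsv.gz")
  new <- file.path(sample.dir, "features.tsv.gz")
  # Rename only when the old name exists and the new one does not.
  if (file.exists(old) && !file.exists(new)) {
    file.rename(old, new)
  }
  required <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
  missing <- required[!file.exists(file.path(sample.dir, required))]
  if (length(missing) > 0) {
    stop("Missing in ", sample.dir, ": ", paste(missing, collapse = ", "))
  }
  invisible(sample.dir)
}

# Possible usage, assuming the sample folders sit under 'data' as in the tree above:
# for (s in list.dirs("data", recursive = FALSE)) to_v3_layout(s)
```

Stopping on a missing file is a deliberate choice here: an incomplete folder is easier to fix before the matrices are read than after.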
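Installing Seurat
Seurat is currently at version 3.1.4, but the basic workflow differs little from 2.3.4. For details, see the post on the Seurat upgrade for single-cell analysis ("a change of form, not of substance").
We install the latest version of Seurat directly: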
> install.packages('Seurat')
> library(Seurat)
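Analyzing mouse bone marrow at steady state (n = 6)
The main goal is to reproduce the two t-SNE cell-clustering plots in Figure S1:
The std7 and std8 labels in that figure are probably typos...
Reading the expression matrices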
> getwd()
[1] "/home/lyc/lyc-1995/BM_Mouse"
Read the expression matrices in batch and prefix each cell barcode with its sample label:
sample.id <- paste0('std', 1:6)
raw.data <- lapply(sample.id, function(x) {
counts <- Read10X(data.dir = file.path('data', x))
colnames(counts) <- paste0(x, '_', colnames(counts))
return(counts)
})
names(raw.data) <- sample.id
Quality control
The paper describes its quality control as follows:
Pre-processing of scRNA-seq data
ScRNA-Seq data were demultiplexed, aligned to the mouse genome, version mm10, and UMI-collapsed with the Cellranger toolkit (version 2.0.1, 10X Genomics). We excluded cells with fewer than 500 detected genes (where each gene had to have at least one UMI aligned). Gene expression was represented as the fraction of its UMI count with respect to total UMI in the cell and then multiplied by 10,000. We denoted it by TP10K – transcripts per 10K transcripts.
Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.
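Following the paper, we first filter out cells with fewer than 500 detected genes, and later remove putative doublets based on the downstream clustering. We start by building the Seurat objects accordingly: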
seu <- lapply(raw.data, CreateSeuratObject, min.features = 500)
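For reference, the min.features = 500 filter amounts to counting, for every cell, the genes with at least one UMI and dropping the cells below the cutoff. Below is a minimal base-R sketch on a made-up toy matrix; the cutoff is lowered to 2 so that the toy data show the effect.

```r
library(Matrix)

# Toy genes x cells UMI count matrix (made-up numbers, for illustration only).
counts <- Matrix(c(3, 0, 1, 0,
                   0, 0, 2, 0,
                   5, 1, 0, 0),
                 nrow = 3, byrow = TRUE, sparse = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# A gene counts as "detected" in a cell if it has at least one UMI there.
n.detected <- Matrix::colSums(counts > 0)

# Keep cells with at least `min.genes` detected genes
# (500 in the paper; 2 here because the toy matrix has only 3 genes).
min.genes <- 2
filtered <- counts[, n.detected >= min.genes, drop = FALSE]

print(n.detected)
print(colnames(filtered))
```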
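After this first QC step, more than 36,000 cells remain, noticeably more than the 30,543 reported by the authors. Without the source code, we can only guess that the authors left some methodological details out. For now, let's run through the standard workflow and see: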
seu.merge <- merge(seu[[1]], seu[2:length(seu)])
> dim(seu.merge)
[1] 27998 36181
Next, check the UMI counts, the gene counts, and the percentage of mitochondrial UMIs:
seu.merge[["percent.mt"]] <- PercentageFeatureSet(seu.merge, pattern = "^mt-")
VlnPlot(seu.merge, features = c('nCount_RNA', 'nFeature_RNA', 'percent.mt'))
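Quite a few cells have a high mitochondrial fraction, which usually indicates low-quality cells. Some cells also have abnormally high UMI and gene counts, which may point to potential doublets.
Next, check how the UMI count relates to the gene count and to the mitochondrial percentage: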
plot1 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1 + plot2
Based on these QC metrics, apply further filtering:
seu.clean <- subset(seu.merge, nCount_RNA < 50000 & nFeature_RNA < 6000 & percent.mt < 7.5)
rm(seu, seu.merge);gc(reset = TRUE)
> dim(seu.clean)
[1] 27998 34062
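Hmm... that still leaves many more cells than the paper reports. The paper gives no further details on this part of the QC, so we set the issue aside for now.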
Normalization
We use the classic LogNormalize method. In short, each gene's raw UMI count in a cell is divided by that cell's total UMI count, multiplied by a constant scale factor (10,000 by default), and then log-transformed with log1p, which evens out differences in sequencing depth between cells.
> seu.clean <- NormalizeData(seu.clean)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
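The arithmetic behind LogNormalize, as described above, can be sketched in a few lines of base R. The function name log_normalize and the toy counts are made up for illustration; this is not Seurat's implementation, just the same formula applied to a dense matrix.

```r
# Sketch of log-normalization: per-cell fraction * scale factor, then log1p.
log_normalize <- function(counts, scale.factor = 1e4) {
  totals <- colSums(counts)                        # total UMIs per cell
  tp10k <- sweep(counts, 2, totals, "/") * scale.factor
  log1p(tp10k)                                     # log(1 + x)
}

# Two toy cells with the same composition but a 10-fold difference in depth.
counts <- matrix(c(1, 3, 0,
                   10, 30, 0),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

norm <- log_normalize(counts)
print(norm)
```

Both toy cells end up with identical normalized values, which is the point of the depth correction. The sketch assumes every cell has a non-zero total, which holds here after filtering on detected genes.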
Feature selection and Dimensionality reduction
Dimensionality reduction
We performed dimensionality reduction using gene expression data for a subset of variable genes. The variable genes were selected based on dispersion of binned variance to mean expression ratios using FindVariableGenes function of Seurat package (Satija et al., 2015) followed by filtering of cell-cycle, ribosomal protein, and mitochondrial genes. Next, we performed principal component analysis (PCA) and reduced the data to the top 50 PCA components (number of components was chosen based on standard deviations of the principal components – in a plateau region of an ‘‘elbow plot’’).
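In Seurat 2.3.4, the FindVariableGenes function selects highly variable genes for downstream dimensionality reduction and clustering, based on each gene's mean expression and dispersion. The Seurat 3 equivalent is FindVariableFeatures; setting its selection.method argument to 'mvp' matches the method used in 2.3.4.
The paper does not give specific parameters, so we use the defaults here: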
> seu.clean <- FindVariableFeatures(seu.clean, selection.method = 'mvp')
Calculating gene means
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating gene variance to mean ratios
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> length(VariableFeatures(seu.clean))
[1] 1127
top10 <- head(VariableFeatures(seu.clean), 10)
plot1 <- VariableFeaturePlot(seu.clean)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
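This selects 1,127 highly variable genes. The authors also removed mitochondrial, ribosomal protein, and cell-cycle genes from this set, but they did not state the rules used to pick those genes, and the supplementary material provides no gene lists. Again, we can only follow common practice: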
mt.genes <- grep(pattern = '^mt-', rownames(seu.clean), value = TRUE)
ribo.genes <- grep(pattern = '^(Rpl[0-9]|Rps[0-9])', rownames(seu.clean), value = TRUE)
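# Cell-cycle genes (GO:0007049, "cell cycle")
# org.Mm.eg.db is a Bioconductor package, so install it through BiocManager rather than install.packages()
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install('org.Mm.eg.db')
library(org.Mm.eg.db)
cellcycle <- AnnotationDbi::select(org.Mm.eg.db, keys = "GO:0007049", columns = "SYMBOL", keytype = "GOALL")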
cellcycle <- unique(cellcycle$SYMBOL)
rm.genes <- c(mt.genes, ribo.genes, cellcycle)
VariableFeatures(seu.clean) <- setdiff(VariableFeatures(seu.clean), rm.genes)
> length(VariableFeatures(seu.clean))
[1] 1043
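Principal component analysis (PCA) projects the 1,000-odd highly variable genes into a low-dimensional space through a linear transformation; 50 principal components are computed by default. Before PCA, the expression matrix must be centered and standardized so that each gene has mean 0 and variance 1: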
> seu.clean <- ScaleData(seu.clean)
Centering and scaling data matrix
|============================================================================================================================================| 100%
> seu.clean <- RunPCA(seu.clean)
PC_ 1
Positive: Col1a2, Cst3, Mgp, Igfbp5, Abi3bp, Comp, Dcn, Cxcl14, Clu, Serping1
1500015O10Rik, Fam46a, Htra1, Fmod, Spp1, Ibsp, Mt2, Fibin, Chad, Pam
Gsn, Scara3, Itm2a, Olfml3, Col1a1, Col3a1, Ogn, Meg3, Pth1r, Col11a1
Negative: Fabp4, Cldn5, Cdh5, Lrg1, Tfpi, Kdr, Stab2, Gpihbp1, Mmrn2, Emcn
Fam167b, Esam, Ctla2a, Gpm6a, Stab1, Cd93, Apold1, Flt1, Gm1673, Cd36
Kcnj8, Ptprb, Mrc1, Flt4, Cyp4b1, Pecam1, Ecscr, Dnase1l3, Ushbp1, Abcc9
PC_ 2
Positive: Tmem176b, Cxcl12, Vcam1, Gas6, Lpl, Gdpd2, Fbln5, Esm1, Nrp1, Kitl
Adipoq, Serping1, Hp, Dpep1, Pappa, Rarres2, Cxcl14, Epas1, Cdh11, Cyp1b1
Ebf3, 1500009L16Rik, Sfrp4, Tnc, Lepr, Angptl4, Trf, Arrdc4, Agt, Chrdl1
Negative: Chchd10, Rac2, Vpreb3, Cd79a, Cd79b, Ptprcap, Coro1a, Cd37, Mzb1, Lrmp
Pafah1b3, Blnk, Cnp, Arl5c, Rhoh, Laptm5, Cd72, Pou2af1, Gmfg, Cd53
Siglecg, Atp1b1, Fcrla, Xrcc6, Dusp2, Cytip, Tifa, Spib, Dnajc7, Bcl7a
PC_ 3
Positive: Comp, Chad, Fmod, Pcolce2, 1500015O10Rik, Col11a1, Meg3, Anxa8, Ndufa4l2, Cilp2
Mfge8, Dcn, Hapln1, Acan, Scrg1, Cilp, Fibin, Ucma, 3110079O15Rik, Mgp
Col2a1, Igfbp6, Nbl1, Prg4, Tnfrsf11b, Dhx58os, Crispld1, Col11a2, S100a4, Tppp3
Negative: Ebf1, Vpreb3, Cd79a, Cd79b, Zeb2, Hp, Tifa, Ptprcap, Rac2, Gdpd2
Coro1a, Chchd10, Esm1, Adipoq, Cxcl12, Mzb1, Dpep1, Cd37, Blnk, Lpl
Lrmp, Pappa, Arl5c, Cd72, Kitl, Pou2af1, Atp1b1, Siglecg, Cyp1b1, Rhoh
PC_ 4
Positive: Igfbp6, Nbl1, Crip1, Col3a1, Tppp3, Dcn, S100a4, Col1a1, Abi3bp, Cilp2
Mustn1, Cdh13, Ly6c1, Cilp, Tnxb, Ebf1, Angptl7, Lgals1, Col1a2, Ly6a
Vpreb3, Thbs4, Anxa8, Medag, Cd79a, Slurp1, Cd79b, Htra1, Clec3b, Cav1
Negative: Alox5ap, Lyz2, Slpi, Pglyrp1, Rgs18, Plek, Bin2, Wfdc21, Tmem40, Tyrobp
Hcst, Lcn2, S100a8, Prkar2b, Ppbp, Fcer1g, Ncf1, S100a9, Mcemp1, Ifitm6
Gp1bb, Nfe2, Ngp, Pf4, Gp9, Chil3, Camp, Fermt3, Clec1b, Ly6c2
PC_ 5
Positive: Col9a2, Col9a1, Col9a3, Mia, 3110079O15Rik, Matn3, Fxyd2, Col11a2, Lect1, Col27a1
Hapln1, Col2a1, Acan, Scrg1, Ucma, Epyc, Col11a1, Prkg2, Il17b, Pth1r
Ppa1, Panx3, Serpina1a, Dhx58os, C1qtnf3, Serpina1d, Bhlhe41, Calml3, Pla2g5, Tnni2
Negative: Igfbp6, S100a4, Alox5ap, Tppp3, Lyz2, Nbl1, Slpi, Col3a1, Rgs18, Plek
Bin2, Pglyrp1, Tmem40, Ppbp, Col1a1, Mustn1, Gp9, Tnxb, Stx11, Dcn
Wfdc21, Hcst, Clec1b, Tyrobp, Itga2b, Rgs10, Fcer1g, Abi3bp, Pf4, Fermt3
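To make the centering, scaling and PCA steps above concrete, here is a small base-R sketch on random toy data. It is not what Seurat runs internally: as far as I know, ScaleData additionally clips extreme values (scale.max) and RunPCA computes only the leading components with a truncated SVD, whereas prcomp below computes all of them.

```r
set.seed(1)

# Toy normalized expression: 20 genes x 50 cells (random values, illustration only).
expr <- matrix(rnorm(20 * 50), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("cell", 1:50)))

# scale() standardizes columns, so work on the transposed (cells x genes) matrix.
# Afterwards every gene has mean 0 and standard deviation 1 across cells.
scaled <- scale(t(expr), center = TRUE, scale = TRUE)

# PCA on the already centered and scaled data.
pca <- prcomp(scaled, center = FALSE, scale. = FALSE)

# Standard deviation of each principal component: the quantity an elbow plot shows.
pc.sdev <- pca$sdev
print(round(pc.sdev, 2))
```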
Inspect the elbow (scree) plot:
ElbowPlot(seu.clean, ndims = 50)
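The Methods quoted above report reducing the data to the top 50 principal components, chosen from the plateau of an elbow plot. Judging from the elbow in our own plot, we use roughly the first 25 PCs here and run t-SNE on them: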
seu.clean <- RunTSNE(seu.clean, dims = 1:25)
The t-SNE plots in the paper show almost no batch effect between datasets. Let's check:
DimPlot(seu.clean, reduction = 'tsne')
DimPlot(seu.clean, reduction = 'tsne', split.by = 'orig.ident', ncol = 3)
The six datasets indeed show no obvious batch effect.
Unsupervised clustering
The paper describes its unsupervised clustering strategy as follows:
Clustering and sub-clustering
We used graph-based clustering of the PCA reduced data with the Louvain Method (Blondel et al., 2008) after computing a shared nearest neighbor graph (Satija et al., 2015). We visualized the clusters on a 2D map produced with t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). For sub-clustering, we applied the same procedure of finding variable genes, dimensionality reduction, and clustering to the restricted set of data (usually restricted to one initial cluster).
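The authors computed a shared nearest neighbor (SNN) graph in the PCA-reduced space and then applied the Louvain algorithm for graph-based clustering to obtain the cell clusters. For sub-clustering, they repeated the same "feature selection, dimensionality reduction, clustering" procedure on subsets of the initial clustering. The paper does not give the resolution parameter, but Figure S1C suggests that about 10 initial clusters were obtained first and that sub-clustering then yielded 33 subclusters.
Seurat 3 splits the Seurat 2 FindClusters function into FindNeighbors and FindClusters, which compute the SNN graph and perform the clustering, respectively; otherwise little has changed.
As with t-SNE, we build the SNN graph from the first 25 principal components. FindClusters defaults to algorithm = 1 (Louvain) and resolution = 0.8: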
> seu.clean <- FindNeighbors(seu.clean, dims = 1:25)
Computing nearest neighbor graph
Computing SNN
> seu.clean <- FindClusters(seu.clean)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 34062
Number of edges: 1270143
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9290
Number of communities: 33
Elapsed time: 8 seconds
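The idea behind "SNN graph plus Louvain" can be illustrated on toy data with the igraph package. This is a rough sketch under my own assumptions, not Seurat's code: the two point clouds, k = 10, and the brute-force neighbour search are made up for illustration, and the 1/15 pruning cutoff mirrors the default of FindNeighbors (prune.SNN).

```r
library(igraph)

set.seed(42)
# Two well-separated blobs of 30 "cells" each in a 2-D stand-in for PCA space.
pcs <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))
n <- nrow(pcs)
k <- 10

# k-nearest-neighbour indicator matrix (each cell counts as its own neighbour).
d <- as.matrix(dist(pcs))
knn <- matrix(0, n, n)
for (i in seq_len(n)) knn[i, order(d[i, ])[seq_len(k)]] <- 1

# SNN edge weight = Jaccard overlap of the two neighbour sets.
shared <- knn %*% t(knn)            # size of the intersection
snn <- shared / (2 * k - shared)    # intersection / union (each set has k members)
snn[snn < 1 / 15] <- 0              # prune weak edges
diag(snn) <- 0                      # no self-loops

# Louvain community detection on the weighted, undirected SNN graph.
g <- graph_from_adjacency_matrix(snn, mode = "undirected", weighted = TRUE)
clusters <- as.integer(membership(cluster_louvain(g)))
print(table(clusters, blob = rep(c("A", "B"), each = 30)))
```

Louvain may split a blob into more than one community, but because no neighbour set crosses between the blobs, no community can contain cells from both.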
Whoa, we directly get the same number of clusters as Figure S1C in the paper (33)? Let's check visually:
DimPlot(seu.clean, reduction = 'tsne', label = TRUE, repel = TRUE) + NoLegend()
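Of course, we cannot rule out that the authors obtained 33 clusters and then annotated them manually in Figure S1 using known marker genes.
Filtering hematopoietic cells and doublets
Because the authors focus on bone marrow stromal cells, they filtered out the clusters related to hematopoietic stem and progenitor cells, and removed as doublets the clusters that co-expressed markers of two or more cell types: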
Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.
1. Marker genes for each cluster from differential expression analysis
Differential expression of gene signatures
For each cluster, we used the Wilcoxon Rank-Sum Test to find genes that had significantly different RNA-seq TP10K expression when compared to the remaining clusters (paired tests when indicated) (after multiple hypothesis testing correction). As a support measure for ranking differentially expressed genes we also used the area under receiver operating characteristic (ROC) curve.
Working from the LogNormalize-d expression values, the authors used the Wilcoxon rank-sum test for differential expression. In Seurat this is done with the FindAllMarkers function, whose default is test.use = "wilcox".
Use the plan function from the future package to enable parallel computation (choose the number of workers to suit your system); see Parallelization in Seurat with future.
Afterwards, remember to switch back to sequential execution and free memory.
library(future)
plan('multisession', workers = 8)  # 'multiprocess' is deprecated in recent versions of future
all.markers <- FindAllMarkers(seu.clean)
plan('sequential');gc(reset = TRUE)
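Even with parallelization, the differential expression step took more than an hour. Next, filter the differential genes by p-value (either p_val or p_val_adj works). Although the paper does not explicitly say whether this step was done, single-cell sequencing data are inherently noisy, and selecting genes by statistical significance reduces false positives: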
all.markers <- subset(all.markers, p_val_adj < 0.05)
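The two columns are not interchangeable in strictness: according to the Seurat documentation, p_val_adj is a Bonferroni correction of p_val using all features in the dataset. A toy base-R sketch of that relationship follows; the expression values and the gene total are made up.

```r
set.seed(1)

# Hypothetical total number of genes in the dataset (the Bonferroni denominator).
n.genes.total <- 2000

# Toy expression of one gene in the cluster of interest vs. all other cells.
in.cluster  <- rnorm(50,  mean = 2)
other.cells <- rnorm(200, mean = 0)

# Raw p-value from the Wilcoxon rank-sum test.
p.val <- wilcox.test(in.cluster, other.cells)$p.value

# Bonferroni adjustment: multiply by the number of tests, capped at 1.
p.val.adj <- p.adjust(p.val, method = "bonferroni", n = n.genes.total)

print(c(p_val = p.val, p_val_adj = p.val.adj))
```

Filtering on the adjusted value is therefore the more conservative of the two choices.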
The authors also used the area under the receiver operating characteristic (ROC) curve as a supporting measure. We set test.use = "roc":
all.markers.roc <- FindAllMarkers(seu.clean, test.use = 'roc')
Note that the ROC method does not support parallel computation through future, so this takes forever...
The Seurat help documentation explains the ROC method fairly clearly:
"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.
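A ROC curve can be used to assess the quality of a binary classifier. Put simply, for each gene we evaluate how accurately that gene alone identifies the cluster of interest. The area under the ROC curve (AUC) reflects the classifier's accuracy, and values both above and below 0.5 are informative. An AUC below 0.5 means that a classifier based on the gene consistently predicts the wrong class, so inverting its output gives the correct prediction; this suggests the gene may be downregulated in the cluster of interest. An AUC near 0.5 means the classifier is close to random, so the gene is probably not differentially expressed between the two groups of cells. Seurat computes a classification power from each gene's AUC in place of a p-value.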
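The link between AUC and the predictive power quoted above can be checked with a few lines of base R. The sketch below computes the AUC from ranks (the Mann-Whitney identity) rather than the way Seurat computes it internally; auc_from_ranks and predictive_power are hypothetical helper names.

```r
# AUC for "x tends to be larger than y", computed from ranks:
# AUC = U / (n_x * n_y), where U is the Mann-Whitney statistic of x.
auc_from_ranks <- function(x, y) {
  r <- rank(c(x, y))                               # ties get average ranks
  u <- sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2
  u / (length(x) * length(y))
}

# Predictive power as defined in the Seurat help text: abs(AUC - 0.5) * 2.
predictive_power <- function(auc) abs(auc - 0.5) * 2

# Toy expression values for one gene: cluster of interest vs. all other cells.
up   <- auc_from_ranks(c(5, 6, 7), c(1, 2, 3))     # gene always higher in the cluster
down <- auc_from_ranks(c(1, 2, 3), c(5, 6, 7))     # gene always lower in the cluster
flat <- auc_from_ranks(c(1, 1), c(1, 1))           # no difference at all

print(c(up = up, down = down, flat = flat))
print(predictive_power(c(up, down, flat)))
```

Both perfectly upregulated and perfectly downregulated genes receive the maximum power, which is why the sign of the fold change still has to be checked when picking positive markers.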
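2. Using known cluster marker genes
The authors also compared the expression distribution of known marker genes for different cell populations (Figure S1D):
Seurat can display gene expression on the dimensionality-reduction plot: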
FeaturePlot(seu.clean, features = c('Cd79a', 'Cd79b'))
Likewise, AddModuleScore can be used to score the expression of a whole gene set. We pick the gene sets based on Figure S1D:
cd <- list(
C1 = c('Cd79a', 'Cd79b'),
C2 = c('Gypa', 'Hbb-bt', 'Rhag', 'Rhd', 'Tfrc'),
C3 = c('Cd52', 'Cd177', 'Plaur', 'Clec4a2'),
C4 = c('Cd52', 'Selplg', 'Ms4a6c', 'Cd53'),
C5 = c('Gp9', 'Itga2b', 'Cd9', 'Gp1bb'),
C6 = c('Ms4a3', 'Clec12a', 'Fcgr3')
)
seu.clean <- AddModuleScore(
object = seu.clean,
features = cd,
ctrl = 5,
name = 'Known_markers'
)
FeaturePlot(seu.clean, features = paste0('Known_markers', 1:6), ncol = 3)
This largely agrees with the paper.
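As I understand the Seurat documentation, a module score is the average expression of the gene set minus the average expression of control genes drawn from the same expression bins. The following is a heavily simplified base-R sketch of that idea on random toy data. simple_module_score is a hypothetical function, not the AddModuleScore implementation; the binning and sampling details differ (for instance, it excludes the set's own genes from the control pool).

```r
set.seed(7)

# Toy normalized matrix: 40 genes x 6 cells (random values, illustration only).
expr <- matrix(runif(40 * 6), nrow = 40,
               dimnames = list(paste0("g", 1:40), paste0("cell", 1:6)))
gene.set <- c("g1", "g2", "g3")
expr[gene.set, 1:3] <- expr[gene.set, 1:3] + 2     # switch the set "on" in cells 1-3

simple_module_score <- function(expr, gene.set, n.bin = 4, n.ctrl = 5) {
  # Bin all genes by their average expression across cells.
  avg <- rowMeans(expr)
  bins <- cut(rank(avg, ties.method = "first"), breaks = n.bin, labels = FALSE)
  names(bins) <- rownames(expr)
  # For each gene in the set, draw control genes from the same expression bin.
  ctrl <- unique(unlist(lapply(gene.set, function(g) {
    pool <- setdiff(names(bins)[bins == bins[[g]]], gene.set)
    sample(pool, size = min(n.ctrl, length(pool)))
  })))
  # Score per cell: mean of the set minus mean of the controls.
  colMeans(expr[gene.set, , drop = FALSE]) - colMeans(expr[ctrl, , drop = FALSE])
}

scores <- simple_module_score(expr, gene.set)
print(round(scores, 2))
```

Subtracting expression-matched controls is what keeps a set of generally highly expressed genes from scoring high in every cell.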
Let's save the important data first:
save(seu.clean, all.markers, all.markers.roc, file = file.path('output', 'tmp1.RData'))