Reproducing an scRNA-seq paper: A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia (1)

2020-04-13
Reference: A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia
Goal: use the tools and methods described in the paper to reproduce its main scRNA-seq figures

Preface

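The most effective way to learn is by doing. The paper I am going to try to reproduce, starting today, is A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia, published in Cell in 2019. The analysis code was not released, so all we can do is follow the description in the paper and reproduce the results as closely as possible.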

Abstract

Stroma is a poorly defined non-parenchymal component of virtually every organ with key roles in organ development, homeostasis, and repair. Studies of the bone marrow stroma have defined individual populations in the stem cell niche regulating hematopoietic regeneration and capable of initiating leukemia. Here, we use single-cell RNA sequencing (scRNAseq) to define a cellular taxonomy of the mouse bone marrow stroma and its perturbation by malignancy. We identified seventeen stromal subsets expressing distinct hematopoietic regulatory genes spanning new fibroblastic and osteoblastic subpopulations including distinct osteoblast differentiation trajectories. Emerging acute myeloid leukemia impaired mesenchymal osteogenic differentiation and reduced regulatory molecules necessary for normal hematopoiesis. These data suggest that tissue stroma responds to malignant cells by disadvantaging normal parenchymal cells. Our taxonomy of the stromal compartment provides a comprehensive bone marrow cell census and experimental support for cancer cell crosstalk with specific stromal elements to impair normal tissue function and thereby enable emergent cancer.

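Outline of the paper

Single-cell sequencing of normal mouse bone marrow stroma identified 17 cell subsets and their gene expression signatures, along with the stromal cells that express key niche factors during steady-state hematopoiesis.
The differentiation relationships among the subsets were then inferred.
The overall impact of acute myeloid leukemia (AML) on the bone marrow microenvironment was characterized, together with the associated cellular and molecular abnormalities.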

scRNA-seq

Library preparation: Chromium Single Cell 3' v2 Reagent Kit (10x Genomics)
Upstream analysis: Cellranger toolkit (version 2.0.1, 10X Genomics)
Expression matrices: GSE128423
Downstream analysis: Seurat v2.3.4, destiny, etc.

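With this many samples, rerunning the Cellranger pipeline would take a while, so we start directly from the downloaded expression matrices. The files are organized following the Cellranger output layout: each dataset sits in its own folder containing matrix.mtx.gz, barcodes.tsv.gz, and genes.tsv.gz (renamed to features.tsv.gz from CellRanger 3.0 onward). Here I manually switched to the 3.0 naming so that the files can be read into Seurat while staying .gz-compressed.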

lyc@lyc-VirtualBox:~/lyc-1995/BM_Mouse/data$ tree
.
├── b1
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── b2
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── b3
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── b4
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── bm1
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── bm2
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── bm3
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── bm4
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── ctrl_10May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── ctrl_16May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── ctrl_26May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── ctrl_7Jun
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── ctrl_8May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── MLL_10May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── MLL_26May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── MLL_31May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── MLL_8May
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── std1
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── std2
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── std3
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── std4
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── std5
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
└── std6
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz
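
The renaming was done by hand in this post. As a rough sketch of how it could be scripted instead, the base-R snippet below renames genes.tsv.gz to features.tsv.gz in every sample folder. The helper name rename_to_v3 and the throwaway demo directory are assumptions made for illustration, not part of the original workflow.

```r
# Hypothetical helper (not from the original post): rename Cell Ranger 2.x
# style gene files to the 3.x file name inside every sample folder.
rename_to_v3 <- function(data.dir) {
  sample.dirs <- list.dirs(data.dir, recursive = FALSE)
  old <- file.path(sample.dirs, "genes.tsv.gz")
  new <- file.path(sample.dirs, "features.tsv.gz")
  # Only touch folders that still have the old name and no new file yet
  todo <- file.exists(old) & !file.exists(new)
  file.rename(old[todo], new[todo])
  invisible(new[todo])
}

# Demo on a throwaway directory with two empty placeholder files
demo.dir <- file.path(tempdir(), "demo_data")
for (s in c("std1", "std2")) {
  dir.create(file.path(demo.dir, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(demo.dir, s, "genes.tsv.gz"))
}
rename_to_v3(demo.dir)
list.files(demo.dir, recursive = TRUE)
```

On the real data the call would simply be rename_to_v3('data'), assuming the folder layout shown above.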

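Installing Seurat

Seurat is currently at version 3.1.4, but the basic workflow does not differ much from 2.3.4. See the post "Single-cell Seurat package upgrade: old wine in a new bottle".
We simply install the latest Seurat: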

> install.packages('Seurat')
> library(Seurat)

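Analyzing steady-state mouse bone marrow (n = 6)

The main goal is to obtain the two tSNE cluster plots in Figure S1:

The std7 and std8 labels in that figure are probably typos...

Reading the expression matrices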

> getwd()
[1] "/home/lyc/lyc-1995/BM_Mouse"

Read the expression matrices in a batch and prefix each cell barcode with its sample label:

sample.id <- paste0('std', 1:6)
raw.data <- lapply(sample.id, function(x) {
  counts <- Read10X(data.dir = file.path('data', x))
  colnames(counts) <- paste0(x, '_', colnames(counts))
  return(counts)
})
names(raw.data) <- sample.id

Quality control

The paper's description of quality control:

Pre-processing of scRNA-seq data
ScRNA-Seq data were demultiplexed, aligned to the mouse genome, version mm10, and UMI-collapsed with the Cellranger toolkit (version 2.0.1, 10X Genomics). We excluded cells with fewer than 500 detected genes (where each gene had to have at least one UMI aligned). Gene expression was represented as the fraction of its UMI count with respect to total UMI in the cell and then multiplied by 10,000. We denoted it by TP10K – transcripts per 10K transcripts.

Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.

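Following the paper, we first drop cells with fewer than 500 detected genes, and later remove putative doublets based on the downstream clustering. We start by building the Seurat objects accordingly: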

seu <- lapply(raw.data, CreateSeuratObject, min.features = 500)

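After this first QC step, 36,000+ cells remain, quite a bit more than the 30,543 reported by the authors. Without the source code, we can only guess that some methodological details were left out of the paper. Let's run through the usual workflow first and see: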

seu.merge <- merge(seu[[1]], seu[2:length(seu)])
> dim(seu.merge)
[1] 27998 36181
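
To make the min.features = 500 filter concrete, here is a minimal base-R sketch of the same idea on a made-up count matrix: a gene counts as detected in a cell when it has at least one UMI, and cells below the threshold are dropped. The toy values and the threshold of 2 are assumptions for illustration only.

```r
# Toy count matrix (genes x cells); the values are made up for illustration
counts <- matrix(
  c(0, 2, 1,
    3, 0, 0,
    0, 0, 5,
    1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Number of detected genes per cell (at least one UMI)
n.detected <- colSums(counts > 0)

# Keep cells with at least `min.features` detected genes
min.features <- 2   # the post uses 500 on the real data
kept <- counts[, n.detected >= min.features, drop = FALSE]

n.detected
colnames(kept)
```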

Next, check the UMI counts, the gene counts, and the fraction of mitochondrial UMIs:

seu.merge[["percent.mt"]] <- PercentageFeatureSet(seu.merge, pattern = "^mt-")
VlnPlot(seu.merge, features = c('nCount_RNA', 'nFeature_RNA', 'percent.mt'))

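Quite a few cells have a high fraction of mitochondrial genes, which usually indicates low-quality cells. There are also cells with unusually high UMI and gene counts, which may point to doublets.
Now check how the UMI count, the gene count, and the mitochondrial fraction correlate: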

plot1 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1 + plot2
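
For reference, the percent.mt metric is just the share of a cell's UMIs that fall on genes matching the "^mt-" pattern. The base-R sketch below reproduces that calculation on a made-up matrix; the gene names and counts are illustrative assumptions.

```r
# Toy count matrix (genes x cells); matrix() fills column by column
counts <- matrix(
  c(5, 0, 45, 50,     # cellA
    10, 10, 40, 40),  # cellB
  ncol = 2,
  dimnames = list(c("mt-Co1", "mt-Nd1", "Actb", "Col1a1"), c("cellA", "cellB"))
)

# Mouse mitochondrial genes start with a lowercase "mt-"
mt.rows <- grepl("^mt-", rownames(counts))

# Percentage of UMIs per cell that come from mitochondrial genes
percent.mt <- colSums(counts[mt.rows, , drop = FALSE]) / colSums(counts) * 100

# Cells that would pass a 7.5% cutoff
keep <- percent.mt < 7.5
```

Here cellA sits at 5% and passes, while cellB sits at 20% and would be filtered out.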

Based on these QC metrics, apply a further filter:

seu.clean <- subset(seu.merge, nCount_RNA < 50000 & nFeature_RNA < 6000 & percent.mt < 7.5)
rm(seu, seu.merge);gc(reset = TRUE)
> dim(seu.clean)
[1] 27998 34062

Hmm... still a lot of extra cells. The paper says nothing more about this part of the QC, so we set the issue aside for now.

Normalization

The classic LogNormalize is used. In short, the raw UMI count of each gene in a cell is divided by that cell's total UMI count, multiplied by a constant scale factor (10,000 by default), and finally log-transformed with log1p, which evens out sequencing depth across cells.

> seu.clean <- NormalizeData(seu.clean)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
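
The LogNormalize steps described above can be written out in a few lines of base R. This is a sketch on a made-up matrix, not Seurat's own implementation; the intermediate tp10k matrix corresponds to the TP10K values mentioned in the paper's Methods.

```r
# Toy count matrix (genes x cells); matrix() fills column by column
counts <- matrix(
  c(1, 3, 0, 6,    # cellA, 10 UMIs in total
    0, 5, 5, 0),   # cellB, 10 UMIs in total
  ncol = 2,
  dimnames = list(paste0("gene", 1:4), c("cellA", "cellB"))
)

scale.factor <- 1e4

# Step 1 and 2: divide by the per-cell total, then scale to 10,000
tp10k <- sweep(counts, 2, colSums(counts), "/") * scale.factor

# Step 3: natural log of (1 + x)
lognorm <- log1p(tp10k)
```

After the first two steps every cell sums to 10,000, which is what puts cells with different sequencing depths on the same footing.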

Feature selection and Dimensionality reduction

Dimensionality reduction
We performed dimensionality reduction using gene expression data for a subset of variable genes. The variable genes were selected based on dispersion of binned variance to mean expression ratios using FindVariableGenes function of Seurat package (Satija et al., 2015) followed by filtering of cell-cycle, ribosomal protein, and mitochondrial genes. Next, we performed principal component analysis (PCA) and reduced the data to the top 50 PCA components (number of components was chosen based on standard deviations of the principal components – in a plateau region of an ‘‘elbow plot’’).

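Seurat 2.3.4's FindVariableGenes function selects highly variable genes for downstream dimensionality reduction and clustering based on each gene's mean expression and dispersion. In Seurat 3 the equivalent is FindVariableFeatures; setting selection.method to 'mvp' matches the 2.3.4 method.
The paper gives no specific parameters, so we use the defaults: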

> seu.clean <- FindVariableFeatures(seu.clean, selection.method = 'mvp')
Calculating gene means
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating gene variance to mean ratios
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> length(VariableFeatures(seu.clean))
[1] 1127

top10 <- head(VariableFeatures(seu.clean), 10)

plot1 <- VariableFeaturePlot(seu.clean)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
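
For intuition, the 'mvp' selection can be approximated in base R: compute each gene's mean and variance-to-mean ratio on the linear scale, bin genes by mean expression, z-score the dispersion within each bin, and keep genes above the cutoffs. This is a simplified sketch on simulated data; the binning details differ slightly from Seurat's code, and the cutoffs (mean between 0.1 and 8, scaled dispersion above 1) are what I understand the defaults to be.

```r
set.seed(42)
n.genes <- 500
n.cells <- 100

# Simulated log-normalized expression (made-up data, genes x cells)
lognorm <- matrix(
  log1p(rgamma(n.genes * n.cells, shape = 0.5, scale = 4)),
  nrow = n.genes,
  dimnames = list(paste0("gene", seq_len(n.genes)), NULL)
)

lin <- expm1(lognorm)                                   # back to the linear scale
gene.mean <- log1p(rowMeans(lin))                       # log of the mean expression
gene.disp <- log(apply(lin, 1, var) / rowMeans(lin))    # log variance-to-mean ratio

# Bin genes by mean expression and z-score the dispersion within each bin
bins <- cut(gene.mean, breaks = 20)
disp.scaled <- ave(gene.disp, bins, FUN = function(v) (v - mean(v)) / sd(v))
disp.scaled[is.na(disp.scaled)] <- 0                    # e.g. bins holding a single gene

# Assumed default cutoffs: 0.1 < mean < 8 and scaled dispersion > 1
hvg <- names(gene.mean)[gene.mean > 0.1 & gene.mean < 8 & disp.scaled > 1]
length(hvg)
```

Scaling within bins is the point of the method: a gene is called variable relative to other genes of similar abundance, not relative to the whole transcriptome.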

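This selects 1,127 highly variable genes. The authors also removed mitochondrial, ribosomal, and cell-cycle genes from the list, but they do not state the rules used to pick those genes, and the supplementary material provides no gene sets, so again we can only try the usual approach: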

mt.genes <- grep(pattern = '^mt-', rownames(seu.clean), value = TRUE)
ribo.genes <- grep(pattern = '^(Rpl[0-9]|Rps[0-9])', rownames(seu.clean), value = TRUE)
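# Cell-cycle genes (org.Mm.eg.db is a Bioconductor package, so install it via BiocManager)
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install('org.Mm.eg.db')
library(org.Mm.eg.db)
# Use the namespaced select() so it cannot be masked by dplyr::select
cellcycle <- AnnotationDbi::select(org.Mm.eg.db, keys = "GO:0007049", columns = "SYMBOL", keytype = "GOALL")
cellcycle <- unique(cellcycle$SYMBOL)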

rm.genes <- c(mt.genes, ribo.genes, cellcycle)
VariableFeatures(seu.clean) <- setdiff(VariableFeatures(seu.clean), rm.genes)
> length(VariableFeatures(seu.clean))
[1] 1043

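Principal component analysis (PCA) projects the 1,000+ highly variable genes into a low-dimensional space through a linear transformation; 50 principal components are computed by default. Before PCA, the expression matrix has to be centered and standardized so that each gene has mean 0 and variance 1: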

> seu.clean <- ScaleData(seu.clean)
Centering and scaling data matrix
  |============================================================================================================================================| 100%
> seu.clean <- RunPCA(seu.clean)
PC_ 1 
Positive:  Col1a2, Cst3, Mgp, Igfbp5, Abi3bp, Comp, Dcn, Cxcl14, Clu, Serping1 
       1500015O10Rik, Fam46a, Htra1, Fmod, Spp1, Ibsp, Mt2, Fibin, Chad, Pam 
       Gsn, Scara3, Itm2a, Olfml3, Col1a1, Col3a1, Ogn, Meg3, Pth1r, Col11a1 
Negative:  Fabp4, Cldn5, Cdh5, Lrg1, Tfpi, Kdr, Stab2, Gpihbp1, Mmrn2, Emcn 
       Fam167b, Esam, Ctla2a, Gpm6a, Stab1, Cd93, Apold1, Flt1, Gm1673, Cd36 
       Kcnj8, Ptprb, Mrc1, Flt4, Cyp4b1, Pecam1, Ecscr, Dnase1l3, Ushbp1, Abcc9 
PC_ 2 
Positive:  Tmem176b, Cxcl12, Vcam1, Gas6, Lpl, Gdpd2, Fbln5, Esm1, Nrp1, Kitl 
       Adipoq, Serping1, Hp, Dpep1, Pappa, Rarres2, Cxcl14, Epas1, Cdh11, Cyp1b1 
       Ebf3, 1500009L16Rik, Sfrp4, Tnc, Lepr, Angptl4, Trf, Arrdc4, Agt, Chrdl1 
Negative:  Chchd10, Rac2, Vpreb3, Cd79a, Cd79b, Ptprcap, Coro1a, Cd37, Mzb1, Lrmp 
       Pafah1b3, Blnk, Cnp, Arl5c, Rhoh, Laptm5, Cd72, Pou2af1, Gmfg, Cd53 
       Siglecg, Atp1b1, Fcrla, Xrcc6, Dusp2, Cytip, Tifa, Spib, Dnajc7, Bcl7a 
PC_ 3 
Positive:  Comp, Chad, Fmod, Pcolce2, 1500015O10Rik, Col11a1, Meg3, Anxa8, Ndufa4l2, Cilp2 
       Mfge8, Dcn, Hapln1, Acan, Scrg1, Cilp, Fibin, Ucma, 3110079O15Rik, Mgp 
       Col2a1, Igfbp6, Nbl1, Prg4, Tnfrsf11b, Dhx58os, Crispld1, Col11a2, S100a4, Tppp3 
Negative:  Ebf1, Vpreb3, Cd79a, Cd79b, Zeb2, Hp, Tifa, Ptprcap, Rac2, Gdpd2 
       Coro1a, Chchd10, Esm1, Adipoq, Cxcl12, Mzb1, Dpep1, Cd37, Blnk, Lpl 
       Lrmp, Pappa, Arl5c, Cd72, Kitl, Pou2af1, Atp1b1, Siglecg, Cyp1b1, Rhoh 
PC_ 4 
Positive:  Igfbp6, Nbl1, Crip1, Col3a1, Tppp3, Dcn, S100a4, Col1a1, Abi3bp, Cilp2 
       Mustn1, Cdh13, Ly6c1, Cilp, Tnxb, Ebf1, Angptl7, Lgals1, Col1a2, Ly6a 
       Vpreb3, Thbs4, Anxa8, Medag, Cd79a, Slurp1, Cd79b, Htra1, Clec3b, Cav1 
Negative:  Alox5ap, Lyz2, Slpi, Pglyrp1, Rgs18, Plek, Bin2, Wfdc21, Tmem40, Tyrobp 
       Hcst, Lcn2, S100a8, Prkar2b, Ppbp, Fcer1g, Ncf1, S100a9, Mcemp1, Ifitm6 
       Gp1bb, Nfe2, Ngp, Pf4, Gp9, Chil3, Camp, Fermt3, Clec1b, Ly6c2 
PC_ 5 
Positive:  Col9a2, Col9a1, Col9a3, Mia, 3110079O15Rik, Matn3, Fxyd2, Col11a2, Lect1, Col27a1 
       Hapln1, Col2a1, Acan, Scrg1, Ucma, Epyc, Col11a1, Prkg2, Il17b, Pth1r 
       Ppa1, Panx3, Serpina1a, Dhx58os, C1qtnf3, Serpina1d, Bhlhe41, Calml3, Pla2g5, Tnni2 
Negative:  Igfbp6, S100a4, Alox5ap, Tppp3, Lyz2, Nbl1, Slpi, Col3a1, Rgs18, Plek 
       Bin2, Pglyrp1, Tmem40, Ppbp, Col1a1, Mustn1, Gp9, Tnxb, Stx11, Dcn 
       Wfdc21, Hcst, Clec1b, Tyrobp, Itga2b, Rgs10, Fcer1g, Abi3bp, Pf4, Fermt3

Look at the elbow plot:

ElbowPlot(seu.clean, ndims = 50)

The Methods excerpt quoted above mentions reducing the data to the top 50 PCs. Judging from the elbow in our own plot, we settle on roughly 25 PCs and run tSNE on those:

seu.clean <- RunTSNE(seu.clean, dims = 1:25)

The tSNE plots in the paper show almost no batch effect between datasets. Let's check:

DimPlot(seu.clean, reduction = 'tsne')

DimPlot(seu.clean, reduction = 'tsne', split.by = 'orig.ident', ncol = 3)

The six datasets indeed show no obvious batch effect.

Unsupervised clustering

The paper's description of the unsupervised clustering strategy:

Clustering and sub-clustering
We used graph-based clustering of the PCA reduced data with the Louvain Method (Blondel et al., 2008) after computing a shared nearest neighbor graph (Satija et al., 2015). We visualized the clusters on a 2D map produced with t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). For sub-clustering, we applied the same procedure of finding variable genes, dimensionality reduction, and clustering to the restricted set of data (usually restricted to one initial cluster).

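The authors compute a shared nearest neighbor graph (SNN graph) in the PCA-reduced space and then run graph-based clustering with the Louvain algorithm to obtain cell clusters. For sub-clustering, they repeat the same feature selection, dimensionality reduction, and clustering workflow on subsets of the initial clustering result. The paper gives no specific resolution parameter, but Figure S1C suggests that about 10 initial clusters were obtained first and that sub-clustering then yielded 33 subclusters.
Seurat 3 splits the old Seurat 2 FindClusters function into FindNeighbors and FindClusters, which compute the SNN graph and the clustering respectively; not much has actually changed.
As with tSNE, we build the SNN graph from PCs 1-25. FindClusters defaults to algorithm = 1 (Louvain) and resolution = 0.8: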

> seu.clean <- FindNeighbors(seu.clean, dims = 1:25)
Computing nearest neighbor graph
Computing SNN
> seu.clean <- FindClusters(seu.clean)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 34062
Number of edges: 1270143

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9290
Number of communities: 33
Elapsed time: 8 seconds

Whoa, we directly get the same number of subclusters (33) as Figure S1C in the paper? Let's check visually:

DimPlot(seu.clean, reduction = 'tsne', label = TRUE, repel = TRUE) + NoLegend()

Of course, we cannot rule out that the authors obtained 33 clusters and then annotated them manually in Figure S1 based on known marker genes.
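
To show what the SNN graph in this section looks like, here is a small base-R sketch: find each cell's k nearest neighbors, weight every pair of cells by the Jaccard overlap of their neighbor sets, and prune weak edges. The two-group toy embedding and k = 5 are assumptions for illustration; as far as I know Seurat defaults to k = 20 and a pruning cutoff of 1/15. The Louvain step itself is not shown.

```r
set.seed(1)

# Made-up 2-D "PCA embedding": two well-separated groups of 10 cells each
emb <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
rownames(emb) <- paste0("cell", seq_len(nrow(emb)))

k <- 5
d <- as.matrix(dist(emb))

# k nearest neighbors of each cell (the cell itself is included, distance 0)
knn <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))

# Edge weight = Jaccard index of the two neighbor sets
n <- nrow(emb)
snn <- matrix(0, n, n, dimnames = list(rownames(emb), rownames(emb)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    shared <- length(intersect(knn[i, ], knn[j, ]))
    snn[i, j] <- shared / (2 * k - shared)
  }
}

# Prune weak edges
snn[snn < 1 / 15] <- 0
```

In this toy case no edge survives between the two groups, which is the structure a community detection method such as Louvain then picks up as two clusters.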

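Filtering hematopoietic clusters and doublets

Since the authors focus on bone marrow stromal cells, they filtered out the clusters related to hematopoietic stem and progenitor cells, and removed as doublets the clusters that co-express markers of two or more cell types: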

Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.

1. Marker genes for each cluster from differential expression analysis

Differential expression of gene signatures
For each cluster, we used the Wilcoxon Rank-Sum Test to find genes that had significantly different RNA-seq TP10K expression when compared to the remaining clusters (paired tests when indicated) (after multiple hypothesis testing correction). As a support measure for ranking differentially expressed genes we also used the area under receiver operating characteristic (ROC) curve.

Working on the LogNormalize-d expression values, the authors used the Wilcoxon Rank-Sum Test for differential expression. In Seurat this is done with FindAllMarkers, whose default is test.use = "wilcox".
The plan function from the future package turns on parallel execution (pick a number of workers that suits your machine); see Parallelization in Seurat with future.
Afterwards, remember to switch back to sequential execution and free memory.

library(future)
# 'multiprocess' is deprecated in recent releases of future; 'multisession' works on all platforms
plan('multisession', workers = 8)
all.markers <- FindAllMarkers(seu.clean)
plan('sequential');gc(reset = TRUE)

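Even in parallel, the differential expression step took more than an hour. Next, filter the differentially expressed genes by p-value (p_val or p_val_adj would both work; the code here uses p_val_adj). The paper does not spell out whether this step was done, but single-cell data are inherently noisy, and selecting genes by statistical significance reduces false positives: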

all.markers <- subset(all.markers, p_val_adj < 0.05)
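
The choice between p_val and p_val_adj matters more than it might seem. As far as I know, Seurat derives p_val_adj by Bonferroni correction over all genes in the dataset, so the sketch below mimics that with base R's p.adjust. The five p-values are made up; 27998 is the gene count reported by dim() earlier.

```r
# Made-up raw p-values for five genes tested in one cluster
p_val <- c(GeneA = 1e-8, GeneB = 1e-6, GeneC = 4e-4, GeneD = 0.01, GeneE = 0.3)

# Number of genes in the merged object (from the dim() output above)
n.genes.total <- 27998

# Bonferroni: multiply by the number of tests, capped at 1
p_val_adj <- p.adjust(p_val, method = "bonferroni", n = n.genes.total)

names(p_val)[p_val < 0.05]           # passes on the raw p-value
names(p_val_adj)[p_val_adj < 0.05]   # passes after correction
```

Four of the five genes pass on the raw p-value, but only GeneA and GeneB survive the correction.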

As a supporting measure, the authors also used the receiver operating characteristic (ROC) curve. We set test.use = "roc":

all.markers.roc <- FindAllMarkers(seu.clean, test.use = 'roc')

Note that the ROC method does not support parallelization through future, so this takes forever...
The Seurat help page already explains the ROC method fairly clearly:

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

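An ROC curve can be used to assess how good a given binary classifier is. Put simply, for each gene we evaluate how accurately it alone identifies the target cluster. The area under the ROC curve (AUC) reflects the classifier's accuracy, and values both above and below 0.5 are informative: below 0.5, the classifier built on that gene consistently predicts the wrong class, so flipping its output gives the right prediction, which means the gene is probably downregulated in the target cluster. An AUC near 0.5 means the classifier is close to random and the gene is probably not differentially expressed between the two groups. Instead of a p-value, Seurat reports a classification power computed from each gene's AUC.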
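
The AUC and the power value can be computed by hand for a single gene. The sketch below uses the rank-sum identity (the AUC equals the Wilcoxon rank-sum statistic of group 1, minus its minimum, divided by n1 * n2) and the power formula quoted from the help page. The function name auc_power and the toy expression values are assumptions for illustration; this is not Seurat's internal code.

```r
# Hypothetical helper: AUC and 'predictive power' for one gene,
# given its expression in the target cluster (x1) and in all other cells (x2)
auc_power <- function(x1, x2) {
  r <- rank(c(x1, x2))               # average ranks handle ties
  n1 <- length(x1)
  n2 <- length(x2)
  auc <- (sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n2)
  c(auc = auc, power = abs(auc - 0.5) * 2)
}

up   <- auc_power(c(5, 6, 7), c(1, 2, 3))         # always higher in the cluster
down <- auc_power(c(1, 2, 3), c(5, 6, 7))         # always lower in the cluster
flat <- auc_power(c(1, 2, 3, 4), c(1, 2, 3, 4))   # no difference
```

The first two cases give an AUC of 1 and 0 respectively, both with power 1; the third gives an AUC of 0.5 and power 0. The same identity is why the ROC ranking and the Wilcoxon test tend to agree on the strongest markers.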

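2. Known cluster marker genes

The authors also compared the expression distribution of known marker genes for different cell subsets (Figure S1D):

Seurat can display gene expression on the dimensionality-reduction plot: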

FeaturePlot(seu.clean, features = c('Cd79a', 'Cd79b'))

AddModuleScore can likewise be used to score the expression of a gene set. We pick the gene sets according to Figure S1D:

cd <- list(
  C1 = c('Cd79a', 'Cd79b'),
  C2 = c('Gypa', 'Hbb-bt', 'Rhag', 'Rhd', 'Tfrc'),
  C3 = c('Cd52', 'Cd177', 'Plaur', 'Clec4a2'),
  C4 = c('Cd52', 'Selplg', 'Ms4a6c', 'Cd53'),
  C5 = c('Gp9', 'Itga2b', 'Cd9', 'Gp1bb'),
  C6 = c('Ms4a3', 'Clec12a', 'Fcgr3')
)
seu.clean <- AddModuleScore(
  object = seu.clean,
  features = cd,
  ctrl = 5,
  name = 'Known_markers'
)
FeaturePlot(seu.clean, features = paste0('Known_markers', 1:6), ncol = 3)

This largely agrees with the paper.
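
For intuition about what a module score measures, here is a simplified base-R sketch: genes are binned by average expression, a handful of control genes is drawn from the same bin as each gene in the set, and the score is the mean expression of the set minus the mean expression of the controls. The simulated data are made up, and details such as excluding the gene set from the control pool are my own simplifications rather than Seurat's exact procedure.

```r
set.seed(7)
n.genes <- 200
n.cells <- 30

# Simulated expression matrix (made-up data, genes x cells)
expr <- matrix(
  rexp(n.genes * n.cells),
  nrow = n.genes,
  dimnames = list(paste0("gene", seq_len(n.genes)), paste0("cell", seq_len(n.cells)))
)

# Boost the gene set in the first 10 cells so the score has something to find
gene.set <- c("gene1", "gene2", "gene3")
expr[gene.set, 1:10] <- expr[gene.set, 1:10] + 5

# Bin genes into 24 groups of similar average expression
avg <- rowMeans(expr)
bins <- cut(rank(avg, ties.method = "first"), breaks = 24, labels = FALSE)
names(bins) <- rownames(expr)

# For each gene in the set, draw up to 5 control genes from the same bin
n.ctrl <- 5
ctrl.genes <- unique(unlist(lapply(gene.set, function(g) {
  pool <- setdiff(names(bins)[bins == bins[g]], gene.set)
  sample(pool, min(n.ctrl, length(pool)))
})))

# Score = mean expression of the set minus mean expression of the controls
score <- colMeans(expr[gene.set, , drop = FALSE]) -
  colMeans(expr[ctrl.genes, , drop = FALSE])
```

Subtracting expression-matched controls is what keeps the score from simply tracking how deeply a cell was sequenced.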

Let's save the important objects first:

dir.create('output', showWarnings = FALSE)  # make sure the target folder exists
save(seu.clean, all.markers, all.markers.roc, file = file.path('output', 'tmp1.RData'))

Last edited: 2020.04.30 13:35:50

