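As more and more scRNA-seq datasets become available, comparing them is key. A central application is to compare datasets of similar biological origin collected by different labs, to ensure that annotation and analysis are consistent. Moreover, as large reference datasets such as the Human Cell Atlas (HCA) become available, an important application will be to project cells from a new sample (e.g. from a diseased tissue) onto the reference, to characterise differences in composition or to detect new cell types.
scmap is a method for projecting cells from one scRNA-seq experiment onto the cell types or individual cells identified in a different experiment. The scmap manuscript is available on bioRxiv.
scmap is built on top of the Bioconductor SingleCellExperiment class. Please read up on how to create a SingleCellExperiment from your own data. Here we show a small example of how to do that, but note that it is not a comprehensive guide.
If you already have a SingleCellExperiment object, continue to the next section.
If you have an expression matrix, you first need to create a SingleCellExperiment object containing your data. For illustration we will use the example expression matrix provided with scmap. The dataset (yan) contains FPKM gene expression for 90 cells from human embryos. The authors (Yan et al.) defined the developmental stage of every cell in the original publication (the ann data frame). We will use these stages later in the projection.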
library(SingleCellExperiment)
library(scmap)
head(ann)
## cell_type1
## Oocyte..1.RPKM. zygote
## Oocyte..2.RPKM. zygote
## Oocyte..3.RPKM. zygote
## Zygote..1.RPKM. zygote
## Zygote..2.RPKM. zygote
## Zygote..3.RPKM. zygote
yan[1:3, 1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152 0.0 0.0 0.0
## RPS11 1219.9 1021.1 931.6
## ELMO2 7.0 12.2 9.3
Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
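# expression values go into the normcounts assay, cell annotations into colData
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# log-transform with a pseudocount of 1
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
# flag ERCC spike-ins; isSpike<- is deprecated in recent SingleCellExperiment
# releases in favour of altExp(), so this line may need adapting
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]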
sce
## class: SingleCellExperiment
## dim: 20214 90
## metadata(0):
## assays(2): normcounts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
## Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
Feature selection
Once we have a SingleCellExperiment object we can run scmap. First, we need to select the most informative features (genes) from our input dataset:
sce <- selectFeatures(sce, suppress_plot = FALSE)
## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...
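Features highlighted in red will be used in the downstream analysis (projection).
The selected features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects 500 features (this can be changed with the n_features parameter):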
table(rowData(sce)$scmap_features)
##
## FALSE TRUE
## 19714 500
scmap-cluster
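The scmap-cluster index of a reference dataset is created by computing the median gene expression of each cluster. By default scmap uses the cell_type1 column of the colData slot of the reference to identify clusters. Another column can be selected manually with the cluster_col parameter: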
sce <- indexCluster(sce)
The indexCluster function automatically writes the scmap_cluster_index item to the metadata slot of the reference dataset.
head(metadata(sce)$scmap_cluster_index)
## zygote 2cell 4cell 8cell 16cell blast
## ABCB4 5.788589 6.2258580 5.935134 0.6667119 0.000000 0.000000
## ABCC6P1 7.863625 7.7303559 8.322769 7.4303689 4.759867 0.000000
## ABT1 0.320773 0.1315172 0.000000 5.9787977 6.100671 4.627798
## ACCSL 7.922318 8.4274290 9.662611 4.5869260 1.768026 0.000000
## ACOT11 0.000000 0.0000000 0.000000 6.4677243 7.147798 4.057444
## ACOT9 4.877394 4.2196038 5.446969 4.0685468 3.827819 0.000000
heatmap(as.matrix(metadata(sce)$scmap_cluster_index))
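To make the idea of a median-expression index concrete, here is a minimal base R sketch on a made-up matrix. It is not the scmap implementation: the gene names, values and cluster labels are invented for illustration, and indexCluster additionally restricts the index to the selected features.

```r
# Toy log-expression matrix: 3 genes x 6 cells (made-up values)
expr <- matrix(
  c(5, 6, 7, 0, 1, 0,
    0, 0, 1, 4, 5, 6,
    2, 2, 2, 3, 3, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
# Cluster label of each cell (plays the role of colData(sce)$cell_type1)
clusters <- c("zygote", "zygote", "zygote", "blast", "blast", "blast")

# For each cluster, take the per-gene median over the cells in that cluster
index <- sapply(split(seq_len(ncol(expr)), clusters), function(cols) {
  apply(expr[, cols, drop = FALSE], 1, median)
})
index
```

The result has one row per gene and one column per cluster, the same layout as the scmap_cluster_index shown above.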
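Once the scmap-cluster index has been generated, we can use it to project our dataset onto itself (for illustration only). This can be done with one index at a time, but scmap also allows simultaneous projection onto multiple indexes if they are provided as a list:
scmapCluster_results <- scmapCluster(
  projection = sce,
  index_list = list(
    yan = metadata(sce)$scmap_cluster_index
  )
)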
scmap-cluster projects the query dataset onto every reference index defined in index_list. The resulting cell label assignments are merged into one matrix:
head(scmapCluster_results$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "2cell"
## [5,] "2cell"
## [6,] "2cell"
The corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCluster_results$scmap_cluster_siml)
## yan
## [1,] 0.9947609
## [2,] 0.9951257
## [3,] 0.9955916
## [4,] 0.9934012
## [5,] 0.9953694
## [6,] 0.9871041
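scmap also provides combined results across all reference datasets (choosing, for each cell, the label corresponding to the largest similarity across the references):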
head(scmapCluster_results$combined_labs)
## [1] "zygote" "zygote" "zygote" "2cell" "2cell" "2cell"
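The combination rule can be illustrated with a small base R sketch. This is one plausible reading of "largest similarity across references", not scmap's own code; the two references (refA, refB), their labels and their similarities are all made up.

```r
# Hypothetical per-reference labels and similarities for 3 query cells
labs <- cbind(refA = c("zygote", "2cell", "unassigned"),
              refB = c("zygote", "4cell", "blast"))
siml <- cbind(refA = c(0.99, 0.80, NA),
              refB = c(0.95, 0.90, 0.85))

# For each cell, find the reference with the largest similarity (NA is ignored)
best <- apply(siml, 1, function(s) {
  if (all(is.na(s))) NA_integer_ else which.max(s)
})

# Pick the label from the winning reference; cells without any similarity
# stay "unassigned"
combined <- ifelse(is.na(best), "unassigned",
                   labs[cbind(seq_len(nrow(labs)), best)])
combined
```

With a single reference, as in the yan example, the combined labels are simply that reference's labels.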
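The results of scmap-cluster can be visualised as a Sankey diagram showing how cell clusters are matched (getSankey() function). Note that the Sankey diagram is only informative if both the query and the reference datasets have been clustered, but the query does not need meaningful labels (cluster1, cluster2, etc. is sufficient):
plot(
  getSankey(
    colData(sce)$cell_type1,
    scmapCluster_results$scmap_cluster_labs[,'yan'],
    plot_height = 400
  )
)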
scmap-cell
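In contrast to scmap-cluster, scmap-cell projects the cells of the input dataset onto the individual cells of the reference rather than onto cell clusters.
scmap-cell contains a k-means step, which makes it stochastic: running it multiple times gives slightly different results. We therefore fix a random seed so that users can reproduce our results exactly:
set.seed(1)
In scmap-cell the index is created by a product quantiser algorithm, in which every cell in the reference is identified by a set of subcentroids found via k-means clustering on subsets of the features.
sce <- indexCell(sce)
Unlike the scmap-cluster index, the scmap-cell index contains information about each cell and therefore cannot easily be visualised. The scmap-cell index consists of two items:
names(metadata(sce)$scmap_cell_index)
## [1] "subcentroids" "subclusters"
subcentroids contains the coordinates of the subcentroids of the low-dimensional subspaces defined by the selected features and by the k and M parameters of the product quantiser algorithm (see ?indexCell).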
length(metadata(sce)$scmap_cell_index$subcentroids)
## [1] 50
dim(metadata(sce)$scmap_cell_index$subcentroids[[1]])
## [1] 10 9
metadata(sce)$scmap_cell_index$subcentroids[[1]][,1:5]
## 1 2 3 4 5
## ZAR1L 0.072987697 0.2848353 0.33713297 0.26694708 0.3051086
## SERPINF1 0.179135680 0.3784345 0.35886481 0.39453521 0.4326297
## GRB2 0.439712934 0.4246024 0.23308320 0.43238208 0.3247221
## GSTP1 0.801498298 0.1464230 0.14880665 0.19900079 0.0000000
## ABCC6P1 0.005544482 0.4358565 0.46276591 0.40280401 0.3989602
## ARGFX 0.341212258 0.4284664 0.07629512 0.47961460 0.1296112
## DCT 0.004323311 0.1943568 0.32117489 0.21259776 0.3836451
## C15orf60 0.006681366 0.1862540 0.28346531 0.01123282 0.1096438
## SVOPL 0.003004345 0.1548237 0.33551596 0.12691677 0.2525819
## NLRP9 0.101524942 0.3223963 0.40624639 0.30465156 0.4640308
In the case of our yan dataset:
The yan dataset contains N = 90 cells.
We selected f = 500 features (scmap default).
M was calculated as f / 10 = 50 (scmap default for f ≤ 1000). M is the number of low-dimensional subspaces.
The number of features in any low-dimensional subspace equals f / M = 10.
k was calculated as k = √N ≈ 9 (scmap default).
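The arithmetic above can be checked in a few lines of R. Rounding √N down is an assumption here; it is consistent with the 9 subcentroid columns in the output above. The rule for f > 1000 is not stated in this text and is left out.

```r
N <- 90    # number of cells in the yan dataset
f <- 500   # number of selected features (scmap default)

M <- f / 10                    # number of subspaces, for f <= 1000
features_per_subspace <- f / M # features in each subspace
k <- floor(sqrt(N))            # subcentroids per subspace (rounding down is assumed)

c(M = M, features_per_subspace = features_per_subspace, k = k)
```

These values match the index dimensions printed earlier: 50 subcentroid matrices, each with 10 rows (features) and 9 columns (subcentroids).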
subclusters contains, for every low-dimensional subspace, the index of the subcentroid to which each cell belongs:
dim(metadata(sce)$scmap_cell_index$subclusters)
## [1] 50 90
metadata(sce)$scmap_cell_index$subclusters[1:5,1:5]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM. Zygote..1.RPKM.
## [1,] 6 6 6 6
## [2,] 5 5 5 5
## [3,] 5 5 5 5
## [4,] 3 3 3 3
## [5,] 6 6 6 6
## Zygote..2.RPKM.
## [1,] 6
## [2,] 5
## [3,] 5
## [4,] 3
## [5,] 6
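The structure of this index can be reproduced on toy data with base R. The sketch below only illustrates the general product-quantiser idea (split the features into M chunks, run k-means on the cells within each chunk); it is not scmap's indexCell, which may preprocess the data differently. The matrix is random and all sizes are arbitrary.

```r
set.seed(1)
# Toy data: 20 features x 30 cells with random values (not real expression)
n_features <- 20
n_cells <- 30
expr <- matrix(runif(n_features * n_cells), nrow = n_features,
               dimnames = list(paste0("gene", seq_len(n_features)),
                               paste0("cell", seq_len(n_cells))))
M <- 4  # number of subspaces (chunks of features)
k <- 3  # number of subcentroids per subspace

# Assign consecutive features to M equally sized chunks
chunks <- split(seq_len(n_features), rep(seq_len(M), each = n_features / M))

subcentroids <- vector("list", M)
subclusters <- matrix(NA_integer_, nrow = M, ncol = n_cells,
                      dimnames = list(NULL, colnames(expr)))
for (m in seq_len(M)) {
  # kmeans() clusters rows, so cells are transposed into rows
  km <- kmeans(t(expr[chunks[[m]], , drop = FALSE]), centers = k)
  subcentroids[[m]] <- t(km$centers)  # features in chunk x k
  subclusters[m, ] <- km$cluster      # subcentroid index of every cell
}

dim(subcentroids[[1]])
dim(subclusters)
```

As in the real index, subcentroids is a list with one matrix per subspace, and subclusters has one row per subspace and one column per cell.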
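Once the scmap-cell index has been generated, we can use it to project our dataset onto itself (for illustration only). This can be done with one index at a time, but scmap also allows simultaneous projection onto multiple indexes if they are provided as a list:
scmapCell_results <- scmapCell(
  sce,
  list(
    yan = metadata(sce)$scmap_cell_index
  )
)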
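For each reference dataset the results contain two matrices. The cells matrix contains the IDs of the top 10 (scmap default) reference cells that are closest to a given cell of the projected dataset: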
scmapCell_results$yan$cells[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
## [4,] 11 11 11
## [5,] 5 5 5
## [6,] 6 6 6
## [7,] 7 7 7
## [8,] 12 8 12
## [9,] 9 9 9
## [10,] 10 10 10
The similarities matrix contains the corresponding cosine similarities:
scmapCell_results$yan$similarities[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 0.9742737 0.9736593 0.9748542
## [2,] 0.9742274 0.9737083 0.9748995
## [3,] 0.9742274 0.9737083 0.9748995
## [4,] 0.9693955 0.9684169 0.9697731
## [5,] 0.9698173 0.9688538 0.9701976
## [6,] 0.9695394 0.9685904 0.9699759
## [7,] 0.9694336 0.9686058 0.9699198
## [8,] 0.9694091 0.9684312 0.9697699
## [9,] 0.9692544 0.9684312 0.9697358
## [10,] 0.9694336 0.9686058 0.9699198
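For readers unfamiliar with cosine similarity, the following base R sketch computes it exactly between one query cell and a few reference cells and ranks the references. scmap-cell itself searches through the product-quantiser index rather than by exhaustive comparison, so this is only an illustration; cosine_sim is a hypothetical helper and the vectors are made up.

```r
# Cosine similarity between two expression vectors (hypothetical helper)
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy reference: 3 genes x 3 cells, plus one query cell
ref <- cbind(a = c(1, 0, 2), b = c(2, 1, 4), c = c(0, 3, 0))
query <- c(1, 0, 2)

# Similarity of the query to every reference cell, highest first
sims <- apply(ref, 2, cosine_sim, y = query)
ranked <- sort(sims, decreasing = TRUE)
names(ranked)[1:2]  # the two nearest reference cells
```

Because cosine similarity depends only on the direction of the vectors, a reference cell whose profile is a scaled copy of the query would score 1 regardless of its total expression.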
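If cell cluster annotation is available for the reference dataset, then in addition to finding the top 10 nearest neighbours, scmap-cell can also annotate the cells of the projected dataset using the labels of the reference. It does so by looking at the top 3 nearest neighbours (scmap default): if they all belong to the same cluster in the reference and their maximum similarity is higher than a threshold (0.5 is the scmap default), the projected cell is assigned to the corresponding reference cluster:
scmapCell_clusters <- scmapCell2Cluster(
  scmapCell_results,
  list(
    as.character(colData(sce)$cell_type1)
  )
)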
scmap-cell results are in the same format as the ones provided by scmap-cluster (see above):
head(scmapCell_clusters$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "unassigned"
## [5,] "unassigned"
## [6,] "unassigned"
The corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCell_clusters$scmap_cluster_siml)
## yan
## [1,] 0.9742737
## [2,] 0.9737083
## [3,] 0.9748995
## [4,] NA
## [5,] NA
## [6,] NA
head(scmapCell_clusters$combined_labs)
## [1] "zygote" "zygote" "zygote" "unassigned" "unassigned"
## [6] "unassigned"
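The assignment rule described above (top 3 neighbours must agree, and their maximum similarity must exceed 0.5) can be sketched in base R as follows. assign_label is a hypothetical helper written for this illustration, not a scmap function, and the neighbour labels and similarities are made up.

```r
# Label one query cell from the labels and similarities of its nearest neighbours.
# w is the number of neighbours inspected, threshold the minimum similarity.
assign_label <- function(nn_labels, nn_sims, w = 3, threshold = 0.5) {
  top <- order(nn_sims, decreasing = TRUE)[seq_len(w)]
  labs <- unique(nn_labels[top])
  if (length(labs) == 1 && max(nn_sims[top]) > threshold) {
    labs
  } else {
    "unassigned"
  }
}

# All three nearest neighbours agree and are similar enough
assign_label(c("zygote", "zygote", "zygote", "2cell"), c(0.97, 0.97, 0.96, 0.90))
# Neighbours disagree
assign_label(c("2cell", "4cell", "2cell"), c(0.95, 0.94, 0.93))
# Neighbours agree, but the similarity is below the threshold
assign_label(c("blast", "blast", "blast"), c(0.4, 0.3, 0.2))
```

Under this rule a cell whose nearest neighbours straddle two neighbouring stages ends up "unassigned" even when all similarities are high.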
plot(
  getSankey(
    colData(sce)$cell_type1,
    scmapCell_clusters$scmap_cluster_labs[,"yan"],
    plot_height = 400
  )
)
scmap: projection of single-cell RNA-seq data across data sets
http://bioconductor.org/packages/release/bioc/vignettes/scmap/inst/doc/scmap.html