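I have recently been working through the videos of a bioinformatics course taught at CSC (IT Center for Science, Finland) (https://www.csc.fi/web/training/-/scrnaseq). The course site also provides exercise material and detailed code, so today I practise integrating single-cell datasets by following the official code step by step: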
https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md
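The exercise uses two methods to integrate several single-cell sequencing datasets, removes the batch effects, and quantitatively assesses the quality of the integrated data. The datasets come from CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw matrix and the associated metadata are available for download. (Note that the matrix uploaded by the authors already combines all four datasets but has not been batch-corrected; the code below splits it into 4 datasets and then integrates them again.)
Before starting, load the R packages: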
> library("Seurat")
> library("ggplot2")
> library("cowplot")
> library("scater")
> library("scran")
> library("BiocParallel")
> library("BiocNeighbors")
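(1) Data integration and batch-effect correction with Seurat (anchors and CCA)
Load the expression matrix and the metadata. The metadata contains one column for the sequencing platform (tech) and one column for the cell-type annotation (celltype):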
> pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
> metadata <- readRDS(file = "pancreas_metadata.rds")
Take a look at the metadata:
Create the Seurat object:
> pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
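Before doing any batch correction, always look at the dataset first. We run the standard preprocessing (log-normalization), identify the variable features ("vst"), scale the combined data, run PCA, and visualize the result with UMAP to see how the combined cells group before correction:
# Normalize and find the variable features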
> pancreas <- NormalizeData(pancreas, verbose = FALSE)
> pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow (visualization and clustering)
> pancreas <- ScaleData(pancreas, verbose = FALSE)
> pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
> pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
> p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
> p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p1, p2)
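To make the log-normalization step less of a black box, here is a minimal base-R sketch of what NormalizeData does with its default settings, as far as I understand them (the "LogNormalize" method with a scale factor of 10000). The toy matrix and the helper name log_normalize are made up for illustration and are not part of the tutorial.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up)
counts <- matrix(c(10, 0, 90,
                   5,  5, 10),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

# Hypothetical helper: scale each cell to a common total, then take log(1 + x)
log_normalize <- function(mat, scale_factor = 10000) {
  totals <- colSums(mat)                               # total counts per cell
  scaled <- sweep(mat, 2, totals, "/") * scale_factor  # counts per 10,000
  log1p(scaled)
}

norm <- log_normalize(counts)
print(round(norm, 3))
```

After this transformation every cell has the same total on the unlogged scale, so differences in sequencing depth between cells no longer dominate the comparison.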
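Next, the authors split the combined object into a list of 4 datasets, one element per dataset. Each dataset then goes through the standard preprocessing (log-normalization), and its variable features are identified ("vst"):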
> pancreas.list <- SplitObject(pancreas, split.by = "tech")
> for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
verbose = FALSE)
}
Integrating the 4 pancreatic islet datasets
Use the FindIntegrationAnchors function to identify the anchors; it takes a list of Seurat objects as input:
> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3499 anchors
Filtering anchors
Retained 2821 anchors
Extracting within-dataset neighbors
|+++++++++ | 17% ~01m 01s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3515 anchors
Filtering anchors
Retained 2701 anchors
Extracting within-dataset neighbors
|+++++++++++++++++ | 33% ~49s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6173 anchors
Filtering anchors
Retained 4634 anchors
Extracting within-dataset neighbors
|+++++++++++++++++++++++++ | 50% ~50s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2176 anchors
Filtering anchors
Retained 1841 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++ | 67% ~27s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2774 anchors
Filtering anchors
Retained 2478 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++ | 83% ~12s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2723 anchors
Filtering anchors
Retained 2410 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 10s
These anchors are then passed to the IntegrateData function, which returns a Seurat object:
> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
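After IntegrateData has run, the Seurat object contains a new Assay that holds the integrated (or "batch-corrected") expression matrix. Note that the original, uncorrected values are still stored in the RNA Assay of the object, so you can switch back and forth between the two.
We can then use this integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the result with UMAP. The integrated datasets now cluster by cell type rather than by technology.
# Switch to the integrated assay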
> DefaultAssay(pancreas.integrated) <- "integrated"
Run the standard workflow (visualization and clustering):
> pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
> pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
> pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
> p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
> p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p3, p4)
(2) Data integration with the Mutual Nearest Neighbor (MNN) method
You can create a SingleCellExperiment (SCE) object from a count matrix, or convert a Seurat object into an SCE object:
> celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
> celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
> fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
> smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
Find the genes shared by all datasets and reduce each dataset to those shared genes:
> keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
+ rownames(fluidigmc1.data),rownames(smartseq2.data)))
> celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
> celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
> fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
> smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
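The Reduce(intersect, ...) plus match() pattern above can be tried out on a toy example. The gene vectors below are made up and simply stand in for the rownames of three datasets.

```r
# Made-up gene lists standing in for rownames() of three datasets
genes_a <- c("INS", "GCG", "SST", "PPY")
genes_b <- c("GCG", "INS", "KRT19", "SST")
genes_c <- c("SST", "INS", "GCG", "PRSS1")

# Reduce() applies intersect() pairwise: first a with b, then the result with c
common <- Reduce(intersect, list(genes_a, genes_b, genes_c))

# match() returns, for each common gene, its position in a given dataset,
# so subsetting with it puts every dataset into the same gene order
idx_b <- match(common, genes_b)

print(common)
print(genes_b[idx_b])
```

The point of match() is the ordering: after subsetting, row i refers to the same gene in every dataset, which is what makes the later cbind() of the count matrices safe.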
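Next, compute quality-control metrics with calculateQCMetrics() and identify low-quality cells as outliers with an unusually low total count or an unusually low number of detected genes. (calculateQCMetrics() belongs to the scater version used in the tutorial; newer scater releases replace it with perCellQCMetrics() / addPerCellQC().)
# Process celseq.data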
> celseq.data <- calculateQCMetrics(celseq.data)
> low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
# Process celseq2.data
> celseq2.data <- calculateQCMetrics(celseq2.data)
> low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
# Process fluidigmc1.data
> fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
> low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
> low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
> fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
# Process smartseq2.data
> smartseq2.data <- calculateQCMetrics(smartseq2.data)
> low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
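The outlier rule behind isOutlier(..., type="lower", nmad=3) can be sketched in base R: a cell is flagged when its value lies more than 3 median absolute deviations (MADs) below the median. The helper is_low_outlier and the numbers are made up for illustration, and the real function has more options than this sketch covers.

```r
# Hypothetical helper mirroring the idea behind isOutlier(type = "lower")
is_low_outlier <- function(x, nmad = 3) {
  med <- median(x)
  dev <- mad(x, center = med)   # stats::mad scales by 1.4826 by default
  x < med - nmad * dev          # TRUE for values far below the median
}

# Made-up log10 library sizes: nine ordinary cells and one very shallow cell
log10_total <- c(4.0, 4.1, 3.9, 4.2, 4.0, 3.8, 4.1, 4.0, 3.9, 2.0)

flag <- is_low_outlier(log10_total)
print(which(flag))
```

Because the median and the MAD are robust statistics, the single shallow cell barely shifts the threshold, which is why this rule is preferred over a mean and standard deviation cut-off.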
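Then compute size factors with computeSumFactors() from scran and apply them with normalize() from scater to obtain log-normalized expression values. (In newer scater releases normalize() has been replaced by logNormCounts().)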
# Compute sizefactors
> celseq.data <- computeSumFactors(celseq.data)
> celseq2.data <- computeSumFactors(celseq2.data)
> fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
> smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
> celseq.data <- normalize(celseq.data)
> celseq2.data <- normalize(celseq2.data)
> fluidigmc1.data <- normalize(fluidigmc1.data)
> smartseq2.data <- normalize(smartseq2.data)
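To show what a size factor does, here is a small base-R sketch. It uses plain library-size factors as a stand-in, which is an assumption on my part: computeSumFactors() estimates its factors by pooling cells and is more robust than this. The transformation itself (divide by the size factor, then log2 with a pseudocount of 1) is what I understand normalize() to do by default.

```r
# Toy counts: the three cells have the same composition but different depth
counts <- matrix(c(2, 4, 6,
                   10, 20, 30,
                   1, 2, 3),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

# Library-size factors, centred so that their mean is 1
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell by its size factor, then log2-transform with a pseudocount of 1
log_expr <- log2(sweep(counts, 2, size_factors, "/") + 1)
print(round(log_expr, 3))
```

All three columns of the result are identical, because the cells differ only in sequencing depth, and that is exactly the bias a size factor is meant to remove.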
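Feature (gene) selection: use trendVar() and decomposeVar() to compute the variance of each gene and split it into a technical and a biological component. (These two functions belong to the scran version used in the tutorial; newer scran releases provide modelGeneVar() for this step.)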
# celseq.data
> fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE)
> dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
> dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
> dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
# celseq2.data
> fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE)
> dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
> dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
> dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
# fluidigmc1.data
> fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE)
> dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
> dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
> dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
# smartseq2.data
> fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE)
> dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
> dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
> dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
# Select the most informative genes among those present in all datasets
> universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
> mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] +
dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
> hvg_genes <- universe[mean.bio > 0]
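The selection rule above (keep the genes whose biological variance, averaged over the datasets, is positive) is easy to check on a toy example. The two small tables below are made up and only imitate the "bio" column of a decomposeVar() result.

```r
# Made-up tables: one row per gene, "bio" = biological variance component
dec_a <- data.frame(bio = c(0.8, -0.1, 0.3),
                    row.names = c("INS", "ACTB", "GCG"))
dec_b <- data.frame(bio = c(0.2, 0.5, -0.4),
                    row.names = c("GCG", "INS", "ACTB"))

# Genes present in both tables
universe <- intersect(rownames(dec_a), rownames(dec_b))

# Index by gene name so the rows line up even though the tables
# are sorted differently
mean_bio <- (dec_a[universe, "bio"] + dec_b[universe, "bio"]) / 2

hvg <- universe[mean_bio > 0]
print(hvg)
```

Indexing by gene name matters here because each table in the tutorial was sorted by its own bio column, so the row order differs between datasets.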
Combine the datasets into a single SingleCellExperiment:
# Combined raw counts
> counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data),
counts(fluidigmc1.data), counts(smartseq2.data))
# Combined normalized counts (normalized per batch; the multi-batch rescaling is applied later)
> logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data),
logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# Build the SCE object for the combined data
> sce <- SingleCellExperiment(
assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),
rowData = rowData(celseq.data), # all datasets share the same genes after subsetting to keep_genes
colData = rbind(colData(celseq.data), colData(celseq2.data),
colData(fluidigmc1.data), colData(smartseq2.data))
)
# Store the hvg_genes from above in the metadata slot of the sce object
> metadata(sce)$hvg_genes <- hvg_genes
Before correcting the batch effect with MNN, take a look at the datasets:
> sce <- runPCA(sce,
ncomponents = 20,
feature_set = hvg_genes,
method = "irlba")
>
> names(reducedDims(sce)) <- "PCA_naive"
>
> p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") +
ggtitle("PCA Without batch correction")
> p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") +
ggtitle("PCA Without batch correction")
> plot_grid(p5, p6)
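Use fastMNN() to correct the batch effect. Before running fastMNN(), we need to rescale each batch to adjust for differences in sequencing depth between batches. multiBatchNorm() from scran adjusts the size factors and recomputes the log-normalized expression values to account for systematic differences between the SingleCellExperiment objects. The earlier size factors only remove bias between cells within a single batch; removing the technical differences between batches as well improves the quality of the correction. (In newer Bioconductor releases multiBatchNorm() and fastMNN() are provided by the batchelor package.)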
> rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data)
> celseq.data_rescaled <- rescaled[[1]]
> celseq2.data_rescaled <- rescaled[[2]]
> fluidigmc1.data_rescaled <- rescaled[[3]]
> smartseq2.data_rescaled <- rescaled[[4]]
Run fastMNN and store the reduced-dimension MNN representation in the reducedDims slot of the sce object:
> mnn_out <- fastMNN(celseq.data_rescaled,
celseq2.data_rescaled,
fluidigmc1.data_rescaled,
smartseq2.data_rescaled,
subset.row = metadata(sce)$hvg_genes,
k = 20, d = 50, approximate = TRUE,
# BPPARAM = BiocParallel::MulticoreParam(8),
BNPARAM = BiocNeighbors::AnnoyParam())
> reducedDim(sce, "MNN") <- mnn_out$corrected
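Note that fastMNN() does not produce a batch-corrected expression matrix. Its result should therefore only be used as a reduced-dimension representation, which is suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis.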
Plot the batch-corrected result:
> p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Output Reduced Dimensions")
> p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Output Reduced Dimensions")
> plot_grid(p7, p8)
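For intuition about what "mutual nearest neighbours" means in the fastMNN() step, here is a toy base-R sketch. The helper find_mnn_pairs and the data are made up, and this is only the pairing idea: the real fastMNN() works in PCA space and, as far as I understand, applies locally weighted correction vectors rather than one global shift.

```r
# Hypothetical helper: find mutual nearest-neighbour pairs between two batches
find_mnn_pairs <- function(batch1, batch2, k = 1) {
  n1 <- nrow(batch1)
  n2 <- nrow(batch2)
  # Euclidean distances between every batch1 cell and every batch2 cell
  d <- as.matrix(dist(rbind(batch1, batch2)))[seq_len(n1), n1 + seq_len(n2),
                                              drop = FALSE]
  # k nearest batch2 cells for each batch1 cell, and the other way round
  nn12 <- lapply(seq_len(n1), function(i) order(d[i, ])[seq_len(k)])
  nn21 <- lapply(seq_len(n2), function(j) order(d[, j])[seq_len(k)])
  found <- list()
  for (i in seq_len(n1)) {
    for (j in nn12[[i]]) {
      # keep the pair only if the relationship holds in both directions
      if (i %in% nn21[[j]]) found[[length(found) + 1]] <- c(first = i, second = j)
    }
  }
  do.call(rbind, found)
}

# Two toy batches in 2-D: batch2 is batch1 shifted by a constant "batch effect"
batch1 <- rbind(c(0, 0), c(5, 5), c(10, 0))
batch2 <- batch1 + 1

mnn <- find_mnn_pairs(batch1, batch2, k = 1)

# The average difference across the pairs estimates the batch effect
shift <- colMeans(batch2[mnn[, "second"], , drop = FALSE] -
                  batch1[mnn[, "first"], , drop = FALSE])
print(mnn)
print(shift)
```

In this toy case each cell pairs with its shifted copy, and the estimated shift recovers the offset that was added to batch2. Real data are messier, which is why the mutual requirement is useful: a cell type present in only one batch tends not to form pairs and therefore does not distort the correction.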