Single-cell sequencing data integration exercise (detailed code)

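I have recently been working through the bioinformatics course videos from CSC - IT Center for Science in Finland (https://www.csc.fi/web/training/-/scrnaseq). The course site also provides exercise material and detailed code, so today I am practicing the single-cell data integration workflow by following the official code step by step: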

https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md

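The exercise uses two methods to integrate several single-cell sequencing datasets, then removes batch effects and quantitatively evaluates the quality of the integrated data. The datasets come from CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw matrix and the associated metadata are provided for download with the exercise material. (Note that the matrix uploaded by the authors is already merged but has not been batch-corrected; the code below splits it into 4 datasets and then integrates them again.)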

Before starting, load the R packages:

> library("Seurat")
> library("ggplot2")
> library("cowplot")
> library("scater")
> library("scran")
> library("BiocParallel")
> library("BiocNeighbors")

(1) Data integration and batch-effect correction with Seurat (anchors and CCA)

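Load the expression matrix and the metadata. The metadata contains one column for the sequencing platform (tech) and one column for the cell-type annotation (celltype).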

> pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
> metadata <- readRDS(file = "pancreas_metadata.rds")

Take a look at the metadata:
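A minimal sketch of how such a table can be inspected. The toy data frame below is made up for illustration; only the column names tech and celltype are taken from the code later in this exercise. With the real object, the same calls would be applied to metadata.

```r
# Toy stand-in for the real metadata (hypothetical values, for illustration only)
toy_metadata <- data.frame(
  tech     = c("celseq", "celseq", "celseq2", "smartseq2", "fluidigmc1", "smartseq2"),
  celltype = c("alpha", "beta", "alpha", "beta", "delta", "alpha"),
  row.names = paste0("cell", 1:6),
  stringsAsFactors = FALSE
)

# First rows: one row per cell, one column per annotation
print(head(toy_metadata))

# Number of cells per sequencing platform
tech_counts <- table(toy_metadata$tech)
print(tech_counts)

# Cross-tabulate platform against cell type to see how the annotations are distributed
print(table(toy_metadata$tech, toy_metadata$celltype))
```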

Create the Seurat object:

> pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)

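Before doing any batch-effect correction, always look at the dataset first. We run the standard preprocessing (log-normalization), identify variable features ("vst"), scale the merged data, run PCA, and visualize the result with UMAP to see how the merged cells group: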

# Normalize and find variable features
> pancreas <- NormalizeData(pancreas, verbose = FALSE)
> pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow (visualization and clustering)
> pancreas <- ScaleData(pancreas, verbose = FALSE)
> pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
> pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
> p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
> p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
  NoLegend()
> plot_grid(p1, p2)
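This is the merged data without batch-effect removal. In the left panel, the cells from the 4 sequencing platforms barely overlap; in the right panel, the cells do not cluster well by cell type either.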
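For reference, the log-normalization step can be sketched in base R. This assumes Seurat's default "LogNormalize" method with a scale factor of 10,000: each cell's counts are divided by that cell's total, multiplied by the scale factor, and transformed with log1p. The toy matrix is made up.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are made up
counts <- matrix(c(10, 0, 90,
                   5, 5, 40),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))

scale_factor <- 1e4  # assumed default scale factor

# Divide each column by its total, multiply by the scale factor, take log(1 + x)
lib_size <- colSums(counts)
log_norm <- log1p(sweep(counts, 2, lib_size, "/") * scale_factor)

print(round(log_norm, 3))
```

Because each cell is divided by its own total, geneA ends up with the same normalized value in both cells even though the raw counts differ.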

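Next, the authors split the merged object into a list of 4 datasets, one element per dataset. Each dataset gets the standard preprocessing (log-normalization), and variable features ("vst") are identified for each dataset separately: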

> pancreas.list <- SplitObject(pancreas, split.by = "tech")

> for (i in 1:length(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, 
                                             verbose = FALSE)
}

Integrating the 4 pancreatic islet datasets
Identify anchors with the FindIntegrationAnchors function, which takes a list of Seurat objects as input:

> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)

Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3499 anchors
Filtering anchors
    Retained 2821 anchors
Extracting within-dataset neighbors
  |+++++++++                                         | 17% ~01m 01s      Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3515 anchors
Filtering anchors
    Retained 2701 anchors
Extracting within-dataset neighbors
  |+++++++++++++++++                                 | 33% ~49s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 6173 anchors
Filtering anchors
    Retained 4634 anchors
Extracting within-dataset neighbors
  |+++++++++++++++++++++++++                         | 50% ~50s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 2176 anchors
Filtering anchors
    Retained 1841 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++                | 67% ~27s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 2774 anchors
Filtering anchors
    Retained 2478 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++++++++++        | 83% ~12s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 2723 anchors
Filtering anchors
    Retained 2410 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 10s

Then pass these anchors to the IntegrateData function, which returns a Seurat object:

> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

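After running IntegrateData, the Seurat object contains a new Assay holding the integrated (or "batch-corrected") expression matrix. Note that the original, uncorrected values are still stored in the RNA Assay of the Seurat object, so you can switch back and forth.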

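We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the result with UMAP. The integrated dataset clusters by cell type rather than by technology.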

# Switch to the integrated assay
> DefaultAssay(pancreas.integrated) <- "integrated"

Run the standard workflow (visualization and clustering):

> pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
> pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
> pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
> p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
> p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
   NoLegend()
> plot_grid(p3, p4)
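This plot shows the data after splitting, re-integrating, and removing batch effects. In the left panel the cells from the 4 platforms now overlap heavily, and in the right panel the cells cluster well by cell type.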

(2) Data integration with the Mutual Nearest Neighbor (MNN) method

You can create a SingleCellExperiment (SCE) object from a count matrix, or convert a Seurat object to an SCE object:

> celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
> celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
> fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
> smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)

Find the genes shared by all datasets and subset each dataset to those shared genes:

> keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
+                                      rownames(fluidigmc1.data),rownames(smartseq2.data)))
> celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
> celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
> fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
> smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
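The same pattern on toy data, as a small sketch: Reduce(intersect, ...) finds the shared genes, and match() puts every batch into the same row order. The helper make_toy and all gene and cell names are made up for illustration.

```r
# Toy gene-by-cell matrices with partly overlapping gene sets (made-up names)
make_toy <- function(genes, n_cells) {
  matrix(seq_len(length(genes) * n_cells), nrow = length(genes),
         dimnames = list(genes, paste0("c", seq_len(n_cells))))
}
batch_a <- make_toy(c("INS", "GCG", "SST", "PPY"), 2)
batch_b <- make_toy(c("GCG", "INS", "KRT19"), 3)
batch_c <- make_toy(c("SST", "INS", "GCG"), 2)

# Reduce() applies intersect() pairwise across the list of gene-name vectors
shared <- Reduce(intersect, list(rownames(batch_a), rownames(batch_b), rownames(batch_c)))

# match() returns row positions in the order of `shared`, so every batch
# ends up with identical row order, which cbind() relies on later
batch_a <- batch_a[match(shared, rownames(batch_a)), , drop = FALSE]
batch_b <- batch_b[match(shared, rownames(batch_b)), , drop = FALSE]
batch_c <- batch_c[match(shared, rownames(batch_c)), , drop = FALSE]

print(shared)
```

Identical row order matters because the batches are later combined column-wise, which assumes that row i is the same gene in every batch.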

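Next, use calculateQCMetrics() to compute quality-control metrics, and flag low-quality cells as outliers with an unusually low total count or an unusually low number of detected genes. (calculateQCMetrics() belongs to the scater version used in the course; newer scater releases replace it with perCellQCMetrics() and addPerCellQC().)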

# Process celseq.data
> celseq.data <- calculateQCMetrics(celseq.data)
> low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
# Process celseq2.data
> celseq2.data <- calculateQCMetrics(celseq2.data)
> low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
# Process fluidigmc1.data
> fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
> low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
> low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
> fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
# Process smartseq2.data
> smartseq2.data <- calculateQCMetrics(smartseq2.data)
> low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
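As a rough sketch of what the lower-tail outlier test does, assuming the rule is "more than nmad median absolute deviations below the median" (isOutlier() itself has more options than shown here). The library sizes are made up.

```r
# Made-up log10 library sizes for 8 cells; the last one is suspiciously small
log10_total <- c(4.1, 4.0, 4.2, 3.9, 4.05, 4.15, 3.95, 2.0)

nmad <- 3
med <- median(log10_total)
# stats::mad() scales by 1.4826 by default so it is comparable to a standard deviation
dev <- mad(log10_total, center = med)

# Cells below this threshold are flagged as low-quality
lower_limit <- med - nmad * dev
is_low <- log10_total < lower_limit

print(lower_limit)
print(which(is_low))
```

Using the median and MAD rather than the mean and standard deviation keeps the threshold from being dragged down by the very outliers it is meant to detect.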

Then compute size factors with scran's computeSumFactors() and normalize the data with the normalize() function from scater:

# Compute sizefactors
> celseq.data <- computeSumFactors(celseq.data)
> celseq2.data <- computeSumFactors(celseq2.data)
> fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
> smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
> celseq.data <- normalize(celseq.data)
> celseq2.data <- normalize(celseq2.data)
> fluidigmc1.data <- normalize(fluidigmc1.data)
> smartseq2.data <- normalize(smartseq2.data)

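Feature (gene) selection: use trendVar() and decomposeVar() to compute the variance of each gene and decompose it into a technical and a biological component. (In newer scran releases these two functions are replaced by modelGeneVar().)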

# celseq.data
> fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE) 
> dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
> dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
> dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
# celseq2.data
> fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE) 
> dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
> dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
> dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
# fluidigmc1.data
> fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE) 
> dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
> dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
> dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
# smartseq2.data
> fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE) 
> dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
> dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
> dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
# Select the most informative genes that are expressed in all datasets
> universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
                                   rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
> mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] + 
                dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
> hvg_genes <- universe[mean.bio > 0]
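A toy sketch of this selection rule with two batches instead of four. The variance tables are made up; the point is that indexing by gene name aligns the rows even though each table has been sorted differently.

```r
# Made-up biological variance components ("bio") for 4 genes in two batches
dec_a <- data.frame(bio = c(0.8, -0.1, 0.3, 0.5), row.names = c("g1", "g2", "g3", "g4"))
dec_b <- data.frame(bio = c(0.6, -0.3, -0.5, 0.2), row.names = c("g4", "g2", "g3", "g1"))

# Genes present in both tables
universe <- intersect(rownames(dec_a), rownames(dec_b))

# Indexing by gene name aligns the rows even though the tables are ordered differently
mean_bio <- (dec_a[universe, "bio"] + dec_b[universe, "bio"]) / 2

# Keep genes whose average biological component is positive
hvg <- universe[mean_bio > 0]
print(setNames(mean_bio, universe))
print(hvg)
```

Here g3 has a positive biological component in one batch but a negative average, so it is dropped; only genes that look variable on average across batches are kept.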

Combine these datasets into a single SingleCellExperiment:

# Combine the raw counts
> counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data), 
                          counts(fluidigmc1.data), counts(smartseq2.data))
# Combine the log-normalized counts (normalized within each batch; multiBatchNorm is applied later)
> logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data), 
                             logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# Build the SCE object for the combined data
> sce <- SingleCellExperiment( 
   assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),  
   rowData = rowData(celseq.data), # all four datasets contain the same genes in the same order
   colData = rbind(colData(celseq.data), colData(celseq2.data), 
                   colData(fluidigmc1.data), colData(smartseq2.data)) 
 )
# Store the hvg_genes from above in the metadata slot of the sce object
> metadata(sce)$hvg_genes <- hvg_genes

Before correcting batch effects with MNN, take a look at these datasets:

> sce <- runPCA(sce,
               ncomponents = 20,
               feature_set = hvg_genes,
               method = "irlba")
> 
> names(reducedDims(sce)) <- "PCA_naive" 
> 
> p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") + 
   ggtitle("PCA Without batch correction")
> p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") + 
   ggtitle("PCA Without batch correction")
> plot_grid(p5, p6)
Before batch-effect removal

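Use the fastMNN() function to correct batch effects. Before running fastMNN(), each batch needs to be rescaled to adjust for differences in sequencing depth between batches. The multiBatchNorm() function in scran adjusts the size factors and then recomputes the log-normalized expression values, to account for systematic differences between the SingleCellExperiment objects. The size factors computed earlier only remove bias between cells within a single batch. Removing the technical differences between batches as well improves the quality of the correction. (In newer Bioconductor releases, multiBatchNorm() and fastMNN() live in the batchelor package rather than scran.)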

> rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data) 
> celseq.data_rescaled <- rescaled[[1]]
> celseq2.data_rescaled <- rescaled[[2]]
> fluidigmc1.data_rescaled <- rescaled[[3]]
> smartseq2.data_rescaled <- rescaled[[4]]

Run fastMNN and store the reduced-dimension MNN representation in the reducedDims slot of the sce object:

> mnn_out <- fastMNN(celseq.data_rescaled, 
                   celseq2.data_rescaled,
                   fluidigmc1.data_rescaled,
                   smartseq2.data_rescaled,
                   subset.row = metadata(sce)$hvg_genes,
                   k = 20, d = 50, approximate = TRUE,
                   # BPPARAM = BiocParallel::MulticoreParam(8),
                   BNPARAM = BiocNeighbors::AnnoyParam())

> reducedDim(sce, "MNN") <- mnn_out$correct

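Note that fastMNN() does not produce a batch-corrected expression matrix. Its result should therefore only be used as a reduced-dimension representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis.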
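To make the idea of mutual nearest neighbors concrete, here is a small base-R sketch that finds MNN pairs between two toy batches. It only illustrates the pairing step, not fastMNN() itself, which works in a PCA space and goes on to compute correction vectors from the pairs. All names and coordinates are made up.

```r
# Two toy batches in 2 dimensions; batch 2 is batch 1 shifted by a made-up batch effect
batch1 <- rbind(a1 = c(0, 0), a2 = c(0, 1), a3 = c(5, 5))
batch2 <- rbind(b1 = c(1, 0), b2 = c(1, 1), b3 = c(6, 5))

k <- 1  # number of nearest neighbours to consider in each direction

# Euclidean distances between every cell in batch 1 (rows) and batch 2 (columns)
cross_dist <- as.matrix(dist(rbind(batch1, batch2)))[rownames(batch1), rownames(batch2)]

# For each cell, the names of its k nearest cells in the other batch
nn_1to2 <- lapply(rownames(cross_dist), function(i) names(sort(cross_dist[i, ]))[seq_len(k)])
names(nn_1to2) <- rownames(cross_dist)
nn_2to1 <- lapply(colnames(cross_dist), function(j) names(sort(cross_dist[, j]))[seq_len(k)])
names(nn_2to1) <- colnames(cross_dist)

# A pair is "mutual" when each cell is among the other's nearest neighbours
pairs <- expand.grid(cell1 = rownames(batch1), cell2 = rownames(batch2),
                     stringsAsFactors = FALSE)
is_mutual <- mapply(function(i, j) j %in% nn_1to2[[i]] && i %in% nn_2to1[[j]],
                    pairs$cell1, pairs$cell2)
mnn_pairs <- pairs[is_mutual, ]
print(mnn_pairs)
```

In this toy case every cell pairs with its shifted counterpart, and the difference between paired cells is exactly the simulated batch shift.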
Plot the data after batch correction:

> p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Output Reduced Dimensions")
> p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Output Reduced Dimensions")
> plot_grid(p7, p8)
After batch-effect removal