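- Seurat Weekly NO.0 || Opening Remarks
- Seurat Weekly NO.1 || How many clusters is the right number?!
- Seurat Weekly NO.2 || How should I take a subset
- Seurat Weekly NO.3 || Drawing fig2 directly with Seurat
- Seurat Weekly NO.4 || Efficient data management
- Seurat Weekly NO.5 How to compute pseudocells || or, on extending Seurat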
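In fact, back in 2019 we covered this topic in "Single-cell transcriptome data analysis || Seurat3.1 tutorial: Interoperability between single-cell object formats", which discussed converting between single-cell transcriptome data objects. Within R, converting among Seurat, CellDataSet, SingleCellExperiment and loom is fairly convenient, but conversion to and from anndata, which lives in foreign territory, has never been very friendly. So I took the opportunity to learn Python (in the sixty seconds I spent waiting for an SMS verification code). anndata data gets analyzed in Python, and that's that.
But the need for this conversion never goes away, so the Seurat team developed the SeuratDisk package, hoping to let data move quickly between Seurat and Scanpy. With well-formed data the move is a single call, but any small deviation brings all kinds of Errors. We'll take this as an opportunity to see how to deal with an Error when we hit one. This is advanced material: when Google can no longer solve the problem, what do we do?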
library(Seurat)
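I downloaded a single-cell dataset from the web that had gone through some unknown processing; all I knew was that it was in h5ad format. When we tried to read it with ReadH5AD: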
cellxgene1 <- ReadH5AD(file = "some.processed.h5ad")
it produced this error:
# The full error message is too long, so we show only the most useful part.
In addition: Warning message:
Functionality for reading and writing H5AD files is being moved to SeuratDisk
For more details, please see https://github.com/mojaveazure/seurat-disk
and https://mojaveazure.github.io/seurat-disk/index.html
Clearly the authors are telling us to install the new R package, seurat-disk, so we obediently went and installed it.
remotes::install_github("mojaveazure/seurat-disk")
library(SeuratDisk)
Convert("some.processed.h5ad", dest = "h5seurat", overwrite = TRUE)
Warning: Unknown file type: h5ad
Warning: 'assay' not set, setting to 'RNA'
Creating h5Seurat file for version 3.1.2
Adding X as data
Adding X as counts
Error: Cannot find feature names in this H5AD file
Once again an Error was waiting around the corner, so we Googled the error message directly and found the answer given by the Seurat team:
For this specific H5AD file, it's compressed using the LZF filter. This filter is Python-specific, and cannot easily be used in R. To use this file with Seurat and SeuratDisk, you'll need to read it in Python and save it out using the gzip compression
import anndata
adata = anndata.read("some.processed.h5ad")
adata.write("some.processed.gzip.h5ad", compression="gzip")
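This is obviously Python syntax. How do we do it from R? By calling Python from within R, of course. So stop asking whether someone uses R or Python; two languages that can run each other are really one language.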
library(reticulate)
reticulate::py_config()
Sys.which('python') # anndata must be installed for this python
# usethis::edit_r_environ()
infile <- 'some.processed.h5ad'        # original file, LZF-compressed
outfile <- 'some.processed.gzip.h5ad'  # gzip-compressed copy to be written
ad <- import("anndata", convert = FALSE)
some_ad <- ad$read_h5ad(infile)
some_ad$write(outfile, compression = "gzip")
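Before running the block above, it can help to confirm that the Python reticulate picked up can actually import anndata. The following is a minimal sketch, not part of the original workflow: `has_anndata` is a hypothetical helper name, and it returns FALSE instead of failing when reticulate or Python is unavailable.

```r
# Hypothetical helper (not from the original post): TRUE only when reticulate
# is installed and the Python it is bound to can import anndata.
has_anndata <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(FALSE)
  }
  # Treat any failure to locate or start Python as "not available".
  tryCatch(
    isTRUE(reticulate::py_module_available("anndata")),
    error = function(e) FALSE
  )
}

if (!has_anndata()) {
  message("anndata not found; install it into the Python reported by reticulate::py_config()")
}
```

If this reports that anndata is missing, installing it into the Python that `reticulate::py_config()` reports is the usual fix.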
After that, we can finally convert.
Convert("some.processed.gzip.h5ad", dest = "h5seurat", overwrite = TRUE)
Warning: Unknown file type: h5ad
Warning: 'assay' not set, setting to 'RNA' # Be especially careful here: check whether the data really is RNA
Creating h5Seurat file for version 3.1.5.9900
Adding X as data
Adding raw/X as counts
Adding meta.features from raw/var
Adding X_umap as cell embeddings for umap
After Convert succeeds, a new file appears in the directory: some.processed.gzip.h5seurat. All that is missing now is a Seurat object.
cellxgene <- LoadH5Seurat('some.processed.gzip.h5seurat',
assays ="RNA")
Validating h5Seurat file
Initializing RNA with data
Error in dimnamesGets(x, value) :
invalid dimnames given for “dgCMatrix” object
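Once again an Error was waiting around the corner, so we Googled it again. After some browsing, we found we had hit a problem that Google could not solve.
So we resolved to debug the function.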
debug(LoadH5Seurat)
cellxgene <- LoadH5Seurat('some.processed.gzip.h5seurat',
assays ="RNA")
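# press Enter repeatedly to step through
debug(as.Seurat) # LoadH5Seurat calls this function, so debug as.Seurat from inside LoadH5Seurat's debug environment
# press Enter repeatedly to step through
debug(AssembleAssay) # as.Seurat calls this function, so debug AssembleAssay from inside as.Seurat's debug environment
# press Enter repeatedly to step through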
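The nested debugging above relies only on base R's `debug()`, `isdebugged()` and `undebug()`. As a minimal sketch, with two toy functions (`load_fn` and `assemble_fn`, hypothetical stand-ins for the LoadH5Seurat and AssembleAssay chain), flagging and unflagging the inner function looks like this:

```r
# Toy stand-ins (hypothetical) for an outer loader that calls an inner helper.
assemble_fn <- function(x) x + 1
load_fn <- function(x) assemble_fn(x) * 2

debug(assemble_fn)        # the next call to assemble_fn would open the browser
isdebugged(assemble_fn)   # check the flag without calling the function

undebug(assemble_fn)      # clear the flag once the bug is located
load_fn(1)                # runs normally again, no browser
```

In an interactive session, calling `load_fn(1)` while the flag is set drops into the browser at the first line of `assemble_fn`, which is the same effect the post gets by stepping down one function at a time. Base R also provides `debugonce()` for flagging a function for a single call.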
We found the problem:
slots.assay <- names(x = Filter(f = isTRUE, x = index[[assay]]$slots))
slots <- slots %||% slots.assay
slots <- match.arg(arg = slots, choices = slots.assay, several.ok = TRUE)
if (!any(c("counts", "data") %in% slots)) {
  stop("At least one of 'counts' or 'data' must be loaded",
       call. = FALSE)
}
assay.group <- file[["assays"]][[assay]]
features <- assay.group[["features"]][]
if ("counts" %in% slots && !"data" %in% slots) {
  if (verbose) {
    message("Initializing ", assay, " with counts")
  }
  counts <- as.matrix(x = assay.group[["counts"]])
  rownames(x = counts) <- features # the same features read above by features <- assay.group[["features"]][]
  colnames(x = counts) <- Cells(x = file)
  obj <- CreateAssayObject(counts = counts, min.cells = -1,
                           min.features = -1)
} else {
  if (verbose) {
    message("Initializing ", assay, " with data")
  }
  data <- as.matrix(x = assay.group[["data"]])
  rownames(x = data) <- features # again the same features read above by features <- assay.group[["features"]][]
  colnames(x = data) <- Cells(x = file)
  obj <- CreateAssayObject(data = data)
}
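That is, in this part of the code the author assumes that the counts and data slots have the same row names. But we know that each part of the data stored in a Seurat object can be manipulated independently, so they may not all be the same.
A suspicion should be checked: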
Browse[8]> counts <- as.matrix(x = assay.group[["counts"]])
Browse[8]> dim(counts)
[1] 33567 10550
Browse[8]> data <- as.matrix(x = assay.group[["data"]])
Browse[8]> dim(data)
[1] 33421 10550
Browse[8]> length(features )
[1] 33567
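Sure enough, they differ: features has 33567 entries, matching the rows of counts, but data has only 33421 rows, so assigning features as its row names fails.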
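The failure can be reproduced in miniature. The sketch below uses made-up toy matrices (5 features for counts, 3 for data) rather than the real file, and relies on the Matrix package that ships with standard R installations. The exact wording of the error depends on the Matrix version, so it is captured rather than printed.

```r
library(Matrix)

# Toy data, invented for illustration: counts keeps 5 features, data only 3.
features <- paste0("gene", 1:5)
counts <- Matrix(0, nrow = 5, ncol = 4, sparse = TRUE)
data <- Matrix(0, nrow = 3, ncol = 4, sparse = TRUE)

rownames(counts) <- features          # lengths agree, so this works

# Same assignment on data: 5 names for 3 rows, so it raises an error.
res <- tryCatch(
  {
    rownames(data) <- features
    "ok"
  },
  error = function(e) conditionMessage(e)
)

# A cheap guard that would have exposed the mismatch before assigning.
length(features) == nrow(data)
```

Here `res` holds the error message instead of "ok", and the final comparison is FALSE, which is the same mismatch the browser session shows for the real data.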
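Now that we have found the cause of the invalid dimnames given for “dgCMatrix” object Error, we can convert just the counts and handle the data part within Seurat. Why not find the row names of data and assign them to data? We'll leave that as an exercise. Cheeky, I know.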
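We begin modifying the original function so that it accepts a restriction such as slots='counts'. To do that, we locate the part of the as.Seurat source code that calls AssembleAssay: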
for (assay in names(x = assays)) {
  assay.objects[[assay]] <- AssembleAssay(
    assay = assay,
    file = x,
    slots = assays[[assay]], # this forces conversion of every slot; we want it to accept an argument instead
    verbose = verbose
  )
}
Besides adding slots = NULL to the function's arguments, this part is changed to:
for (assay in names(x = assays)) {
  assay.objects[[assay]] <- AssembleAssay(
    assay = assay,
    file = x,
    slots = slots %||% assays[[assay]],
    verbose = verbose
  )
}
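The patch hinges on the `%||%` operator, which returns its left-hand side unless that is NULL. As a minimal sketch of the pattern, with a local definition of `%||%` and a hypothetical `pick_slots` function standing in for the patched loop (neither is SeuratDisk code):

```r
# Local definition so the sketch also runs where %||% is not already available.
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

# Hypothetical stand-in for the patched loop: `assays` maps each assay name to
# the slots found in the file; `slots = NULL` means "use whatever is there".
pick_slots <- function(assays, slots = NULL) {
  lapply(assays, function(available) slots %||% available)
}

assays <- list(RNA = c("counts", "data"))
pick_slots(assays)                    # falls back to the slots in the file
pick_slots(assays, slots = "counts")  # the caller's choice takes precedence
```

With the default `slots = NULL` the behavior is unchanged from the original code, so existing calls keep working and only callers that pass `slots` get the restriction.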
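Since this means changing most of the functions in https://github.com/mojaveazure/seurat-disk/blob/master/R/LoadH5Seurat.R, we forked seurat-disk and made the changes in https://github.com/tuqiang2014/seurat-disk/blob/master/R/LoadH5Seurat.R
As for exactly what was changed, we'll skip that here. If you run into the same problem, you can install this modified version. After making the changes, we uninstall the official seurat-disk and install our own.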
detach("package:SeuratDisk", unload = TRUE)
remove.packages('SeuratDisk')
remotes::install_github("tuqiang2014/seurat-disk") # tuqiang2014 is me
library(SeuratDisk)
undebug(LoadH5Seurat)
undebug(as.Seurat)
undebug(AssembleAssay) # ignore the message here; it is not the same environment.
cellxgene <- LoadH5Seurat('some.processed.gzip.h5seurat',
assays ="RNA",slots='counts')
# Messages:
Validating h5Seurat file
Initializing RNA with counts
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Adding feature-level metadata for RNA
Adding reduction umap
Adding cell embeddings for umap
Adding miscellaneous information for umap
Adding command information
Adding cell-level metadata
And so a handsome Seurat object appears before us:
cellxgene
An object of class Seurat
XXXX features across XXXXX samples within 1 assay
Active assay: RNA (XXXX features, 0 variable features)
1 dimensional reduction calculated: umap
Now you can analyze it in Seurat to your heart's content.
library(ggplot2)
ggplot(DimPlot(cellxgene)$data, aes(umap_1, umap_2, fill = ident)) +
  geom_point(shape = 21, colour = "black", stroke = 0.25, alpha = 0.8) +
  DimPlot(cellxgene, label = TRUE)$theme +
  theme_bw() + NoLegend()
Single-cell transcriptome data analysis || Seurat3.1 tutorial: Interoperability between single-cell object formats
Convert AnnData to Seurat fails with raw h5ad
conversion_vignette
convert-anndata RMD
Error: Cannot find feature names in this H5AD file #7