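Written by 劉小澤 (Liu Xiaoze) on 2019.9.2 - Unit 3, Lecture 7: Learning Monocle2 with the scRNAseq package
Purpose of these notes: to explore analysis techniques for smart-seq2 data, following the single-cell transcriptomics course from 生信技能樹
Course link: http://jm.grazy.cn/index/mulitcourse/detail.html?cid=53
Preface
About monocle2
monocle2 can currently be installed directly from Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/monocle.html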
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("monocle")
# The version installed here is 2.12.0
About the scRNAseq package
Details are at: https://github.com/jmzeng1314/scRNA_smart_seq2/blob/master/scRNA/study_scRNAseq.html
Before using the scRNAseq R package, it helps to understand what it contains. It ships with the dataset from Pollen et al. 2014 (https://www.nature.com/articles/nbt.2967), which had been cited 446 times as of August 2019. The complete dataset in the paper has 23730 genes and 301 samples, whereas the package holds only 130 sample libraries (65 high-coverage and 65 low-coverage, sequenced at different depths). The package covers only 4 cell types: neural progenitor cells (NPC) differentiated from pluripotent stem cells, plus GW16 (radial glia), GW21 (newborn neurons) and GW21+3 (maturing neurons). Their relationships are shown in the figure below (NPC differs considerably from the other three):
The data is 50.6 MB. To see how it was processed, see: https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/
Load the data from the scRNAseq package
library(scRNAseq)
data(fluidigm)
> fluidigm
class: SummarizedExperiment
dim: 26255 130
metadata(3): sample_info clusters which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(26255): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
colData names(28): NREADS NALIGNED ... Cluster1 Cluster2
Create the CellDataSet object
=> newCellDataSet()
library(monocle)  # provides newCellDataSet()
# The object needs three components
# First: the RSEM expression matrix (ct = count)
assay(fluidigm) <- assays(fluidigm)$rsem_counts
ct <- floor(assays(fluidigm)$rsem_counts)
ct[1:4, 1:4]
# Second: the sample (phenotype) information
sample_ann <- as.data.frame(colData(fluidigm))
# Third: the gene annotation (must contain a column named gene_short_name)
gene_ann <- data.frame(
  gene_short_name = row.names(ct),
  row.names = row.names(ct)
)
# Then convert both to AnnotatedDataFrame objects
pd <- new("AnnotatedDataFrame",
          data = sample_ann)
fd <- new("AnnotatedDataFrame",
          data = gene_ann)
# Finally build the CDS object
sc_cds <- newCellDataSet(
  ct,
  phenoData = pd,
  featureData = fd,
  expressionFamily = negbinomial.size(),
  lowerDetectionLimit = 1)
sc_cds
# CellDataSet (storageMode: environment)
# assayData: 26255 features, 130 samples
# element names: exprs
# protocolData: none
# phenoData
# sampleNames: SRR1275356 SRR1274090 ... SRR1275261 (130 total)
# varLabels: NREADS NALIGNED ... Size_Factor (29 total)
# varMetadata: labelDescription
# featureData
# featureNames: A1BG A1BG-AS1 ... ZZZ3 (26255 total)
# fvarLabels: gene_short_name
# fvarMetadata: labelDescription
# experimentData: use 'experimentData(object)'
# Annotation:
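Note that building the CDS object takes an expressionFamily argument, which selects the distribution used to model the data. For example, FPKM/TPM values are log-normally distributed, while UMIs and raw counts are modelled better by a negative binomial distribution. There are two negative binomial options. negbinomial.size is used here. The other one, negbinomial, is slightly more accurate but much slower, and is mainly intended for very small datasets.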
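To see why raw counts call for a negative binomial rather than a simpler model, here is a minimal base-R sketch on simulated data (not the Pollen dataset). The values of `mu` and `size` are hypothetical and chosen only for illustration. It shows that negative binomial counts have a variance well above their mean, which a Poisson model cannot represent.

```r
# Toy illustration on simulated data: raw counts are usually overdispersed,
# i.e. variance > mean, while a Poisson model forces variance == mean.
set.seed(1)
mu   <- 10   # hypothetical mean expression
size <- 2    # hypothetical NB size parameter (smaller = more dispersed)

nb_counts   <- rnbinom(10000, size = size, mu = mu)
pois_counts <- rpois(10000, lambda = mu)

# Theoretical negative binomial variance: mu + mu^2 / size
nb_theoretical_var <- mu + mu^2 / size

# Compare empirical mean and variance for both simulations
c(nb_mean   = mean(nb_counts),   nb_var   = var(nb_counts),
  pois_mean = mean(pois_counts), pois_var = var(pois_counts))
```

The negative binomial variance sits far above the mean, while the Poisson variance stays close to it.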
Quality control and filtering
=> detectGenes()
cds <- sc_cds
cds ## The raw data has: 26255 features, 130 samples
# Set an expression threshold for genes. This adds a num_cells_expressed column
# to cds@featureData@data, recording in how many cells each gene is expressed
cds <- detectGenes(cds, min_expr = 0.1)
# The result is stored in cds@featureData@data
print(head(cds@featureData@data))
# gene_short_name num_cells_expressed
# A1BG A1BG 10
# A1BG-AS1 A1BG-AS1 2
# A1CF A1CF 1
# A2M A2M 21
# A2M-AS1 A2M-AS1 3
# A2ML1 A2ML1 9
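The author found that the fData function could not be used in monocle 2.12.0 (it still worked in 2.10), although it is available again in monocle3. fData() and pData() are generics defined in Biobase, so if fData is unavailable in your session, loading Biobase may help. Otherwise use the fallback: cds@featureData@data
Then filter the genes => subset()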
expressed_genes <- row.names(subset(cds@featureData@data,
num_cells_expressed >= 5))
length(expressed_genes)
## [1] 13385
cds <- cds[expressed_genes,]
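The filtering logic above can be sketched in base R on a toy matrix, without monocle. The gene and cell names are hypothetical, and counting cells with `toy > min_expr` is an assumption about what detectGenes() does, not a copy of its internals.

```r
# Toy count matrix: 4 hypothetical genes x 6 hypothetical cells
toy <- matrix(
  c(0, 0, 3, 0, 0, 0,
    5, 2, 0, 1, 7, 4,
    0, 1, 0, 0, 2, 0,
    9, 8, 6, 7, 5, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

min_expr <- 0.1

# Count, per gene, the cells whose expression exceeds the threshold
gene_ann <- data.frame(
  gene_short_name     = rownames(toy),
  num_cells_expressed = rowSums(toy > min_expr),
  row.names           = rownames(toy),
  stringsAsFactors    = FALSE
)

# Keep genes expressed in at least 3 cells (the notes use 5 on the real data)
expressed    <- row.names(subset(gene_ann, num_cells_expressed >= 3))
toy_filtered <- toy[expressed, , drop = FALSE]
toy_filtered
```

Only gene2 and gene4 pass the filter here, because gene1 and gene3 are detected in fewer than 3 cells.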
Cells can be filtered as well (optional)
# As before: if pData() is not available, use cds@phenoData@data to get the cell annotations
print(head(cds@phenoData@data))
# For example, look at the first cell annotation column, NREADS
tmp <- pData(cds)
fivenum(tmp[, 1])
## [1] 91616 232899 892209 8130850 14477100
# Filtering cells would also use subset(), but no cells are removed here
valid_cells <- row.names(cds@phenoData@data)
cds <- cds[, valid_cells]
cds
## CellDataSet (storageMode: environment)
## assayData: 13385 features, 130 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: SRR1275356 SRR1274090 ... SRR1275261 (130 total)
## varLabels: NREADS NALIGNED ... num_genes_expressed (30 total)
## varMetadata: labelDescription
## featureData
## featureNames: A1BG A2M ... ZZZ3 (13385 total)
## fvarLabels: gene_short_name num_cells_expressed
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
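Clustering
In monocle3, clustering uses cluster_cells(), an unsupervised method based on Louvain community detection. The result is stored in cds@clusters$UMAP$clusters
Clustering without marker genes
Use the function clusterCells(), which groups cells by their overall expression profile. For example, if a cell expresses many myoblast-related genes but lacks the myoblast marker MYF5, we can still conclude that it is a myoblast.
step1: dispersionTable()
First decide which genes to use for clustering. All genes could be used, but that would mix in many lowly expressed genes that are not reliably detected, which adds noise. So pick genes that vary and that are not expressed too weakly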
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
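disp_table <- dispersionTable(cds) # per-gene mean expression and dispersion (to find variable genes)
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) # keep genes that are not expressed too weakly
cds <- setOrderingFilter(cds, unsup_clustering_genes$gene_id) # set the gene list used for clustering
plot_ordering_genes(cds)
# The black points in the plot are the genes marked for clustering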
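The idea behind the table of mean expression and dispersion can be sketched in base R on simulated counts. This is a rough method-of-moments estimate under the assumption var = mean + dispersion * mean^2. monocle fits its own model, so the numbers would not match dispersionTable(). The gene names and parameters are hypothetical.

```r
set.seed(42)
n_cells <- 200

# Three hypothetical genes: barely expressed, highly variable, stable
toy <- rbind(
  low_gene      = rnbinom(n_cells, size = 1,   mu = 0.02),
  variable_gene = rnbinom(n_cells, size = 0.5, mu = 20),
  stable_gene   = rnbinom(n_cells, size = 50,  mu = 20)
)

gene_mean <- rowMeans(toy)
gene_var  <- apply(toy, 1, var)

# Method-of-moments dispersion, floored at 0; NA when the gene is never detected
disp_est <- ifelse(gene_mean > 0,
                   pmax((gene_var - gene_mean) / gene_mean^2, 0),
                   NA_real_)

toy_table <- data.frame(
  gene_id              = rownames(toy),
  mean_expression      = gene_mean,
  dispersion_empirical = disp_est,
  stringsAsFactors     = FALSE
)

# Same kind of filter as in the notes: drop genes with very low mean expression
kept <- as.character(subset(toy_table, mean_expression >= 0.1)$gene_id)
toy_table
kept
```

The barely expressed gene is dropped by the mean filter, and the variable gene shows a much larger dispersion than the stable one, which is the kind of signal clustering relies on.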
step2: plot_pc_variance_explained()
Then choose the principal components
plot_pc_variance_explained(cds, return_all = F) # norm_method='log'
step3:
Based on the plot above, choose a suitable number of principal components (this is subjective, so try a few values). Here the first 6 components are used (roughly at the first elbow)
# Dimensionality reduction
cds <- reduceDimension(cds, max_components = 2, num_dim = 6,
                       reduction_method = 'tSNE', verbose = T)
# Clustering
cds <- clusterCells(cds, num_clusters = 4)
# Distance cutoff calculated to 0.5225779
plot_cell_clusters(cds, 1, 2, color = "Biological_Condition")
> table(cds@phenoData@data$Biological_Condition)
GW16 GW21 GW21+3 NPC
52 16 32 30
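The choice of how many principal components to keep can be illustrated with base R's prcomp() on simulated data. This is only a sketch of the "variance explained per component" idea. The matrix, group labels and shift size are hypothetical, and it does not reproduce plot_pc_variance_explained().

```r
set.seed(7)
n_cells <- 60
n_genes <- 30

# Simulated log-expression: two hypothetical groups that differ in the first 10 genes
group <- rep(c("A", "B"), each = n_cells / 2)
mat <- matrix(rnorm(n_cells * n_genes), nrow = n_cells)
mat[group == "B", 1:10] <- mat[group == "B", 1:10] + 3

# PCA on centred and scaled data (cells in rows, genes in columns)
pca <- prcomp(mat, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(head(var_explained, 6), 3)
```

Calling `plot(var_explained, type = "b")` on this vector gives a scree plot. The elbow is where later components stop adding much, which is the point the notes pick by eye.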
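Note that the number of principal components used affects the result
Six components were used above and the separation was fairly good. Now suppose the first 16 are used instead: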
cds <- reduceDimension(cds, max_components = 2, num_dim = 16,
reduction_method = 'tSNE', verbose = T)
cds <- clusterCells(cds, num_clusters = 4)
plot_cell_clusters(cds, 1, 2, color = "Biological_Condition")
The following is test code!
In addition, if there are confounders such as batch effects, they can be removed during dimensionality reduction (reduceDimension()):
if(F){
cds <- reduceDimension(cds, max_components = 2, num_dim = 6,
reduction_method = 'tSNE',
residualModelFormulaStr = "~Biological_Condition + num_genes_expressed",
verbose = T)
cds <- clusterCells(cds, num_clusters = 4)
plot_cell_clusters(cds, 1, 2, color = "Biological_Condition")
}
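# Once the real biological signal is removed, the cells end up scattered. So the purpose of residualModelFormulaStr is to flatten the differences described by the terms in its formula
However, if other effects are removed instead:
# Remove effects other than the biological signal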
cds <- reduceDimension(cds, max_components = 2, num_dim = 6,
reduction_method = 'tSNE',
residualModelFormulaStr = "~NREADS + num_genes_expressed",
verbose = T)
cds <- clusterCells(cds, num_clusters = 4)
plot_cell_clusters(cds, 1, 2, color = "Biological_Condition")
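On handling batch effects: for microarray data, for example, the ComBat function from the sva package is often used.
Flattening a batch effect can be thought of as removing the first few principal components of each group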
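The two residualModelFormulaStr runs above can be understood through a small base-R sketch: regress expression on the unwanted covariate and keep the residuals. This shows the general idea on simulated data, not monocle's internal implementation. The variables `batch` and `condition` are hypothetical.

```r
set.seed(3)
n <- 100
batch     <- factor(rep(c("b1", "b2"), each = n / 2))        # hypothetical technical factor
condition <- factor(rep(c("ctrl", "treat"), times = n / 2))  # hypothetical biological factor

# Simulated expression of one gene: batch shift of 2, biological shift of 1, plus noise
expr <- 2 * (batch == "b2") + 1 * (condition == "treat") + rnorm(n, sd = 0.3)

# Regress out the batch term and keep the residuals
fit      <- lm(expr ~ batch)
expr_adj <- residuals(fit)

# Batch means before and after adjustment
tapply(expr, batch, mean)
tapply(expr_adj, batch, mean)

# The biological difference survives because it was not in the formula
tapply(expr_adj, condition, mean)
```

If `condition` had been put in the formula instead, the treatment difference would be flattened in the same way, which matches what happens when Biological_Condition is included above.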
Differential expression analysis
=> differentialGeneTest()
start=Sys.time()
diff_test_res <- differentialGeneTest(cds,
fullModelFormulaStr = "~Biological_Condition")
end=Sys.time()
end-start
# Runtime ranges from a few minutes to more than ten minutes
Then extract the differentially expressed genes
sig_genes <- subset(diff_test_res, qval < 0.1)
> head(sig_genes[,c("gene_short_name", "pval", "qval")] )
gene_short_name pval qval
A1BG A1BG 4.112065e-04 1.460722e-03
A2M A2M 4.251744e-08 4.266086e-07
AACS AACS 2.881832e-03 8.275761e-03
AADAT AADAT 1.069794e-02 2.621123e-02
AAGAB AAGAB 1.156771e-07 1.021331e-06
AAMP AAMP 7.626789e-05 3.243869e-04
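The qval cutoff can be illustrated on toy p-values. This sketch assumes the qval column is a Benjamini-Hochberg adjustment of pval, which is my understanding of differentialGeneTest() and should be checked against its documentation. The gene names and p-values are made up.

```r
# Hypothetical p-values for four made-up genes
pvals <- c(geneA = 0.001, geneB = 0.01, geneC = 0.02, geneD = 0.5)

toy_res <- data.frame(
  gene_short_name  = names(pvals),
  pval             = pvals,
  qval             = p.adjust(pvals, method = "BH"),  # Benjamini-Hochberg adjustment
  stringsAsFactors = FALSE
)

# Same style of filter as in the notes
toy_sig <- subset(toy_res, qval < 0.1)
toy_sig[, c("gene_short_name", "pval", "qval")]
```

Three of the four genes pass the 0.1 cutoff. The adjusted values are never smaller than the raw p-values, which is why the qval column in the real output is always at least as large as pval.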
Plotting (note that the gene names must be converted to character)
cg <- as.character(head(sig_genes$gene_short_name))
# Plain plot
plot_genes_jitter(cds[cg,],
                  grouping = "Biological_Condition", ncol = 2)
# With colour
plot_genes_jitter(cds[cg,],
                  grouping = "Biological_Condition",
                  color_by = "Biological_Condition",
                  nrow = 3,
                  ncol = NULL)
We can also draw our own plot of a single gene's expression against the grouping (using A1BG as an example):
# Using A1BG as an example
boxplot(log10(cds@assayData$exprs["A1BG",]+1) ~ cds@phenoData@data$Biological_Condition)
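Inferring the developmental trajectory
Three steps: select suitable genes from the differential analysis results => reduce dimensions => order the cells
step1: select suitable genes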
ordering_genes <- row.names (subset(diff_test_res, qval < 0.01))
cds <- setOrderingFilter(cds, ordering_genes)
plot_ordering_genes(cds)
step2: reduce dimensions
# DDRTree is the default method
cds <- reduceDimension(cds, max_components = 2,
                       reduction_method = 'DDRTree')
step3: order the cells
cds <- orderCells(cds)
Finally, visualize
plot_cell_trajectory(cds, color_by = "Biological_Condition")
This plot shows how the cells progress through development
In addition, plot_genes_in_pseudotime can plot how the expression of genes changes across cells along pseudotime
plot_genes_in_pseudotime(cds[cg,],
color_by = "Biological_Condition")