Nov 21, 2019更新 => 修正部分function代碼,實現了導出gene2module
目的
最近有個項目又準備跑WGCNA
,之前跑通過一個,然后很久沒碰,都快忘了,這一年多對R又多了一丟丟的理解,干脆給他切分了幾個步驟,分別寫成了function。這里保存記錄下,不過好像只適用于我自己的項目,如果大家有需求的話有些東西還是要改一下的,WGCNA的流程可不是復制一套別人的代碼,然后點點點就出來結果的,還是需要可以讀懂大部分的代碼,腦子清楚自己要干啥。
參考內容
如有侵權,聯系我刪除!!
這里參考了簡書多為大佬的WGCNA教程,之前只是懵懵懂懂ctrl c+v然后套數據,然后跑出個結果,這次折騰了這么久,也看懂了一部分,當然,總體還是ctrl c+v,但是多了一些理解在里面,這里做個記錄,以后再有直接拿來用。
- 輸入數據清洗,樣品PCA,熱圖; 參考:「WGCNA-FAQ」「 生信技能樹-這個WGCNA作業終于有學徒完成了!」
- WGCNA常規流程包括
sft
,module
,module-trait
,hubgene
: 「WGCNA分析,簡單全面的最新教程」「 STEP6:WGCNA相關性分析」
代碼記錄
###########################################
# Assignment: WGCNA
# Date: Nov 17,2019
# Author: Shawn Wang
###########################################
##=====01.準備========
setwd("/Volumes/FileManage/06.360Drive/02.heterosis_project/08.4133BtGroup/03.WGCNA/")
options(stringsAsFactors = F)
library(edgeR)
library(WGCNA)
library(dplyr)
library(reshape2)
library(stringr)
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(pheatmap)
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.01.PCAandHeatmap.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.02.WGCNA.SFT.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.03.WGCNA.module.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.04.WGCNA.moduleTrait.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.05.WGCNA.HubGene.R")
# 原始count數據
rawCount <- read.table("/Volumes/FileManage/06.360Drive/02.heterosis_project/08.4133BtGroup/03.WGCNA/02.Rawdata/mRNA_readcount.xls",
header = T,
sep = "\t")
# head(rawCount)
rownames(rawCount) <- rawCount$transcript_id
rawCount <- rawCount[,-1]
bt.raw <- data.frame(row.names = rownames(rawCount),
select(rawCount, contains("Bt_")),
select(rawCount, contains("J_")),
select(rawCount, contains("SY_")))
bt.raw <- select(bt.raw, -contains("Z12"))
# 按照組織分成3份
leaf <- select(bt.raw, contains("_L"))
ovule <- select(bt.raw, contains("_O"))
fiber <- select(bt.raw, contains("_F"))
##======02.方法======================
# 參數設置
type = "unsigned"
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)
# 組織
tissue = leaf
# 名字
Title = "leaf"
##======02.1 樣品PCA和Heatmap============
WGCNA.PCAandHeatmap(tissue = tissue,
Title = Title)
##======02.2 軟閾值SFT計算===============
# 表達
datExpr = y
WGCNA.SFT(Title = Title,
datExpr = datExpr)
##======02.3 Module計算===============
# 無向網絡在power小于15或有向網絡power小于30內,沒有一個power值可以使
# 無標度網絡圖譜結構R^2達到0.8,平均連接度較高如在100以上,可能是由于
# 部分樣品與其他樣品差別太大。這可能由批次效應、樣品異質性或實驗條件對
# 表達影響太大等造成。可以通過繪制樣品聚類查看分組信息和有無異常樣品。
# 如果這確實是由有意義的生物變化引起的,也可以使用下面的經驗power值。
if (is.na(power)){
power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
ifelse(type == "unsigned", 6, 12))
)
)
}
WGCNA.oneStepNetWork(Title = Title)
##=======表型鑒定========
rownames(datExpr)
phenotype = data.frame(row.names = rownames(datExpr),
yield = rep(c(1,0),times = c(3,12)),
quality = rep(c(0,1,0,1),times = c(6,3,3,3)),
Yield_Quality = rep(c(0,1,0,1,0), times =c(3,3,3,3,3)))
WGCNA.ModuleTrait(Title = Title,
phenotype = phenotype)
##=======選擇感興趣的模塊分析=========
## design設計
datTraits <- data.frame(row.names = rownames(phenotype),
subtype = factor(rep(c("Y","Y_Q","Y","Y_Q","Q"),each = 3),
levels = c("Y","Q","Y_Q")))
design = model.matrix(~0 + datTraits$subtype)
colnames(design) = levels(datTraits$subtype)
## 參數設置
moduleName = "blue"
phenoName = "Y_Q"
cor = 0.6
con = 0.90
##
WGCNA.HubGene(Title = Title,
cor = cor,
con = con,
moduleName = moduleName,
phenoName = phenoName)
Function
這里一個寫了5個:
- PCA和heatmap:
WGCNA.PCAandHeatmap
- SFT :
WGCNA.SFT
- One step network:
WGCNA.oneStepNetWork
- 模塊-表型:
WGCNA.ModuleTrait
, - hubgene:
WGCNA.HubGene
PCA和heatmap
WGCNA.PCAandHeatmap <- function(tissue,Title){
x <- tissue[apply(tissue,1,function(x) sum(x > 10) > (0.9*ncol(tissue))),]
# 01.1.將readcount轉換為logcpm
y <- log10(edgeR::cpm(x)+1)
y[1:4,1:4]
# 01.2.將樣品名稱去掉生物學重復標記起到分組作用
z <- gsub("_.*","",colnames(y))
test <- as.data.frame(t(y))
# 01.3.將分組標記放到表達矩陣最后一列
dat <- cbind(test,z)
# 02.1.準備PCA數據
dat.pca <- PCA(dat[,-ncol(dat)], graph = F)
# 02.2.繪制PCA圖并保存
fviz_pca_ind(dat.pca,
geom.ind = "point",
col.ind = dat$z,
palette = c("#9370DB", "#FF82AB", "#87CEFF", "#2E8B57", "#0000FF"),
addEllipses = T,
legend.title = "Groups")
ggsave(paste(Title,"SamplsPCAplot.pdf",sep = "_"), width = 8, height = 8)
# 將每行表達量最大的前5000個基因拿出來做熱圖
cg = names(tail(sort(apply(y, 1, function(x){sum(x)})),5000))
# pheatmap(pheatmap(y[cg,],show_rownames = F,show_colnames = F),scale = "row")
n=t(scale(t(y[cg,]))) # 'scale'可以對log-ratio數值進行歸一化
n[n>3]=3
n[n< -3]= -3
n[1:4,1:4]
ac=data.frame(g=z)
rownames(ac)=colnames(n) #把ac的行名給到n的列名,即對每一個探針標記上分組信息(是'noTNBC'還是'TNBC')
pheatmap(n,show_colnames =T,show_rownames = F,
annotation_names_col = F,
annotation_col=ac,
filename = paste(Title,'heatmap_top5000.png',sep = "_"),
clustering_distance_rows = "euclidean")
assign("y",value = y, envir = globalenv())
}
SFT
WGCNA.SFT <- function(datExpr, Title){
datExpr <- datExpr
type = "unsigned"
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)
m.mad <- apply(datExpr,1,mad)
datExprVar <- datExpr[which(m.mad >
max(quantile(m.mad, probs=seq(0, 1, 0.8))[2],0.01)),]
dim(datExprVar)
datExpr <- as.data.frame(t(datExprVar))
## 檢測缺失值
gsg = goodSamplesGenes(datExpr, verbose = 3)
if (!gsg$allOK){
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:",
paste(names(dataExpr)[!gsg$goodGenes], collapse = ",")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:",
paste(rownames(dataExpr)[!gsg$goodSamples], collapse = ",")));
# Remove the offending genes and samples from the data:
dataExpr = dataExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
assign("nGenes",value = nGenes, envir = globalenv())
assign("nSamples",value = nSamples, envir = globalenv())
dim(datExpr)
head(datExpr)[,1:8]
sampleTree = hclust(dist(datExpr), method = "average")
pdf(file = paste(Title,"Sample_clustering.pdf",sep = "."),width = 8,height = 5)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="")
dev.off();
powers = c(c(1:10), seq(from = 12, to=30, by=2))
sft = pickSoftThreshold(datExpr, powerVector=powers,
networkType=type, verbose=5)
pdf(file = paste(Title,"SFTPlot.pdf",sep = "."),width = 10,height = 7)
par(mfrow = c(1,2))
cex1 = 0.9
# 橫軸是Soft threshold (power),縱軸是無標度網絡的評估參數,數值越高,
# 網絡越符合無標度特征 (non-scale)
{plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",
ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")
# 篩選標準。R-square=0.85
abline(h=0.90,col="red")
abline(h=0.85,col="green")
# Soft threshold與平均連通性
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,
cex=cex1, col="red")
}
dev.off();
power = sft$powerEstimate
power
assign("datExpr",value = datExpr, envir = globalenv())
assign("power",value = power, envir = globalenv())
}
One step network
WGCNA.oneStepNetWork <- function(Title){
#######============一步法網絡構建===========
# power: 上一步計算的軟閾值
# maxBlockSize: 計算機能處理的最大模塊的基因數量 (默認5000);
# 4G內存電腦可處理8000-10000個,16G內存電腦可以處理2萬個,32G內存電腦可
# 以處理3萬個
# 計算資源允許的情況下最好放在一個block里面。
# corType: pearson or bicor
# numericLabels: 返回數字而不是顏色作為模塊的名字,后面可以再轉換為顏色
# saveTOMs:最耗費時間的計算,存儲起來,供后續使用
# mergeCutHeight: 合并模塊的閾值,越大模塊越少
net = blockwiseModules(datExpr, power = power, maxBlockSize = nGenes,
TOMType = type, minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs=TRUE, corType = corType,
maxPOutliers=maxPOutliers, loadTOMs=TRUE,
#saveTOMFileBase = paste(Title,".tom",sep = ""),
verbose = 3)
table(net$colors)
assign("net",value = net, envir = globalenv())
####============modular construction=================
## 灰色的為**未分類**到模塊的基因。
# Convert labels to colors for plotting
moduleLabels = net$colors
assign("moduleLabels",value = moduleLabels, envir = globalenv())
moduleColors = labels2colors(moduleLabels)
assign("moduleColors",value = moduleColors, envir = globalenv())
# Plot the dendrogram and the module colors underneath
# 如果對結果不滿意,還可以recutBlockwiseTrees,節省計算時間
pdf(file = paste(Title,"Module.pdf",sep = "."),width = 6,height = 6)
plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
# module eigengene, 可以繪制線圖,作為每個模塊的基因表達趨勢的展示
MEs = net$MEs
### 不需要重新計算,改下列名字就好
### 官方教程是重新計算的,起始可以不用這么麻煩
MEs_col = MEs
colnames(MEs_col) = paste0("ME", labels2colors(
as.numeric(str_replace_all(colnames(MEs),"ME",""))))
MEs_col = orderMEs(MEs_col)
assign("MEs",value = MEs, envir = globalenv())
assign("MEs_col",value = MEs_col, envir = globalenv())
pdf(file = paste(Title,"Eigeng_adja_heatmap.pdf",sep = "."),width = 7,height = 10)
# 根據基因間表達量進行聚類所得到的各模塊間的相關性圖
# marDendro/marHeatmap 設置下、左、上、右的邊距
plotEigengeneNetworks(MEs_col, "Eigengene adjacency heatmap",
marDendro = c(3,3,2,4),
marHeatmap = c(3,4,2,2), plotDendrograms = T,
xLabelsAngle = 90)
dev.off()
Gene2module <- data.frame(GID = colnames(datExpr),
Module = moduleColors)
write.table(Gene2module,file = paste(Title,"Gene2module.xls",sep = "_"),
row.names = F,
quote = F,
sep = "\t")
}
模塊-表型
WGCNA.ModuleTrait <- function(Title,phenotype){
traitData <- phenotype
dim(traitData)
### 模塊與表型數據關聯
if (corType=="pearson") {
modTraitCor = cor(MEs_col, traitData, use = "p")
modTraitP = corPvalueStudent(modTraitCor, nSamples)
} else {
modTraitCorP = bicorAndPvalue(MEs_col, traitData, robustY=robustY)
modTraitCor = modTraitCorP$bicor
modTraitP = modTraitCorP$p
}
## Warning in bicor(x, y, use = use, ...): bicor: zero MAD in variable 'y'.
## Pearson correlation was used for individual columns with zero (or missing)
## MAD.
# signif表示保留幾位小數
textMatrix = paste(signif(modTraitCor, 2), "\n(", signif(modTraitP, 1), ")", sep = "")
dim(textMatrix) = dim(modTraitCor)
pdf(file = paste(Title,"Module_trait.pdf",sep = "."),width = 8,height = 10)
labeledHeatmap(Matrix = modTraitCor, xLabels = colnames(traitData),
yLabels = colnames(MEs_col),
cex.lab = 0.7, xLabelsAngle = 0, xLabelsAdj = 0.5,
ySymbols = substr(colnames(MEs_col),3,1000), colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix, setStdMargins = FALSE,
cex.text = 0.6, zlim = c(-1,1),
main = paste("Module-trait relationships"))
dev.off()
}
Hubgene 提取
WGCNA.HubGene <- function(Title,cor,con,moduleName,phenoName){
## 聯通性計算
# (1) Intramodular connectivity
connet=abs(cor(datExpr,use="p"))^6
Alldegrees1=intramodularConnectivity(connet, moduleColors)
###(3) Generalizing intramodular connectivity for all genes on the array
datKME=signedKME(datExpr, MEs_col, outputColumnName="MM.")
write.table(datKME, paste(Title,"Conectivity_of_each_modular.xls",sep = "."),
sep = "\t",
row.names = T,
quote = F)
# Display the first few rows of the data frame
##Finding genes with high gene significance and high intramodular connectivity in interesting modules
PheName <- as.data.frame(design[,grep(phenoName,colnames(design))])
names(PheName) = phenoName
GS1 = as.numeric(cor(PheName,datExpr, use = "p"))
# abs(GS1)模塊和基因的關聯性
# abs(datKME$MM.green) 基因的連通性
num <- grep(moduleName,colnames(datKME))
FilterGenes= abs(GS1)> cor & abs(datKME[,num])>con
hubGenes_raw = data.frame(ID = rownames(datKME),
TORF = FilterGenes)
hubGenes = filter(hubGenes_raw, TORF == "TRUE")
table(hubGenes)
write.table(hubGenes,file = paste(moduleName,phenoName,"hubGene.xls",
sep = "_"),
sep = "\t",
row.names = F,
quote = F)
}
結果
-
做完后在working dir下面應該會有以下幾個文件,關于以下圖片及文件的意義,這個看我參考的那幾篇文章理解就好;
jianshu.jpg 還會發現我基本沒有做基因和模塊的相關,這部分如果有需求,也可以看參考的文章實現。
一些問題總結
-
正確的輸入數據格式
基本教程,FAQ中提到的我都嘗試了一下,不外乎以下幾種:- FPKM或者RPKM。
- log(fpkm+1)
- z-scored fpkm
- log(tpm +1)
使用不同類型的輸入數據會發現分出的module數量也不一樣,甚至sft都不盡相同,但總體來看,差異不大,用MAD篩選過后留下的基因中,除了直接用fpkm之外,其他幾種方法差異不是很大,我最終選擇了用log(tpm+1)
, 而且在轉換前先用90%樣本中readcount > 10
這個閾值去一下背景。個人習慣罷了。
無法找到合適的Power值:
這些在「WGCNA-FAQ」中作者已經解釋,這里著重說下我遇到的問題。
首先,有一次我嘗試用大量不同品種的不同組織的轉錄組(17個品種,3個組織,3個重復共153個sample)做WGCNA,發現一個特別有趣的結果:在sft
階段發現R2從1開始為明顯的負數值,隨著power的增大R2逐漸變為正數。剛開始百思不得其解,后來找到了答案:
Data heterogeneity may affect any statistical analysis, and even more so an unsupervised one such as WGCNA. What, if any, modifications should be made to the analysis depends crucially on whether the heterogeneity (or its underlying driver) is considered "interesting" for the question the analyst is trying to answer, or not. If one is lucky, the main driver of sample differences is the treatment/condition one studies, in which case WGCNA can be applied to the data as is. Unfortunately, often the heterogeneity drivers are uninteresting and should be adjusted for. Such factors can be technical (batch effects, technical variables such as post-mortem interval etc.) or biological (e.g., sex, tissue, or species differences).
原因就是樣品異質性(heterogeneity)! 由于樣本中總體分成了3個組織,由于基因表達的時空特異性,組織間的差異特別大,所以會導致組織內連通性超級高,而組織間連通性特別低。具體解釋見:Question about WGCNA soft thresholding value
The negative "signed R^2" is negative when your network has more genes with high connectivity than ones with low connectivity (i.e., the regression line for the fit log(n(k))~log(k) has a positive slope). It means your network shows a topology in some ways opposite (more high connectivity than low connectivity genes) to what is normally expected (a lot of low-connetivity gens and fewer high connectivity genes).Usually, a lot of high-connectivity genes means there is a strong global driver (e.g., you have samples from different tissues or a strong batch effect). Make sure your sample tree doesn't show very strong branches. Also, see WGCNA FAQ (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html) for some comments about heterogeneous data sets and lack of SFT.
這會導致一個結果,就是無法構成一個無尺度網絡;
這也不代表WGCNA往下走沒有意義了! 不要輕易的懷疑自己的結果,不是為了做WGCNA而做WGCNA,這樣就失去了分析的意義,如果有一定的生物學問題,那么他就是有意義的!
上面說了雖然無法構成一個無尺度的網絡,我們后期可能無法取到合適的Hubgene,但是這樣想,如果我想從這個結果找到與組織特異性強的模塊,這個后續分析就是有意義的,我們用經驗的power值繼續構建module,然后用tissue對sample進行分類,最后我們可以得到一個module-tissue的相關性結果,后續通過對相應module的基因進行富集分析,證實了我的想法,確實葉片的結果大多富集到生物節律,光合作用,葉綠體,細胞循環等,胚珠中富集到花發育,細胞壁,細胞膜,激素等通路...grey模塊中富集的較少,但不難看出有許多看家基因。雖然說生物學意義不是很大,但是說明了這條路應該沒有錯,下一步可以通過對ncRNA的WGCNA找到組織特異性表達的lncRNA,circRNA等等。這些module就會很有意義。
應該是未完待續~~