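2020-05-09 Reviewing WGCNA

A few days ago I listened to an outstanding undergraduate give a talk on learning WGCNA. The next generation really is formidable, so I had better keep studying too! I reran the whole workflow as a review.
References:
First, Mr. Zeng's tutorial: https://github.com/jmzeng1314/my_WGCNA
Then the WGCNA tutorials: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html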

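WGCNA basic concepts

  • Weighted gene co-expression network analysis (WGCNA) is a systems biology method for describing the correlation patterns among genes across samples. WGCNA can be used to find clusters (modules) of highly correlated genes, to summarize those clusters using the module eigengene or an intramodular hub gene, to relate modules to one another and to external sample traits (using eigengene network methodology), and to calculate module membership measures. Hub genes picked from within a module can serve as candidate biomarkers or therapeutic targets.
  • Compared with selecting differentially expressed genes only, WGCNA can pick clusters (modules) of highly correlated genes out of thousands of genes, relate the modules to external sample traits, and identify the modules most strongly correlated with a trait. Intramodular analysis can then follow.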

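Co-expression network

A co-expression network is defined as an undirected, weighted gene network. The nodes of such a network correspond to genes, and the edges between genes represent the correlation of their expression levels. The weighting raises the absolute value of the correlation to a power β ≥ 1 (soft thresholding), so a weighted gene co-expression network emphasizes high correlations at the expense of low ones. Specifically, $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$ is the adjacency of an unsigned network.

Module

Modules are clusters of highly interconnected genes. In an unsigned co-expression network, $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$, and modules correspond to clusters of genes with high absolute correlations. In a signed network, $a_{ij} = (0.5 \cdot (1 + \mathrm{cor}(x_i, x_j)))^{\beta}$, and modules correspond to positively correlated genes. The weighted network here is simply the adjacency matrix. The power transformation strengthens the relationships between highly correlated genes and weakens those between weakly correlated genes.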
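As a minimal sketch of the two adjacency definitions above, the following base R snippet builds an unsigned and a signed adjacency matrix from a small simulated expression matrix. The data, the variable names, and the power β = 6 are illustrative assumptions, not values from this analysis; in a real run WGCNA's `adjacency()` does this work.

```r
# Toy expression matrix: 20 samples (rows) x 5 genes (columns); simulated data
set.seed(1)
expr <- matrix(rnorm(100), nrow = 20, ncol = 5,
               dimnames = list(NULL, paste0("gene", 1:5)))
expr[, 2] <- expr[, 1] + rnorm(20, sd = 0.3)   # gene2 positively tracks gene1
expr[, 3] <- -expr[, 1] + rnorm(20, sd = 0.3)  # gene3 negatively tracks gene1

beta <- 6        # illustrative soft-thresholding power (assumption)
r <- cor(expr)   # Pearson correlation between genes

adj_unsigned <- abs(r)^beta            # a_ij = |cor(x_i, x_j)|^beta
adj_signed   <- (0.5 * (1 + r))^beta   # a_ij = (0.5 * (1 + cor(x_i, x_j)))^beta

# Compare how gene1 connects to gene2 (positive) and gene3 (negative)
round(adj_unsigned["gene1", c("gene2", "gene3")], 3)
round(adj_signed["gene1", c("gene2", "gene3")], 3)
```

In the unsigned network the strongly negative gene1/gene3 pair keeps a high adjacency, while in the signed network it is pushed toward 0, which is why signed modules contain only positively correlated genes.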

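Connectivity

For each gene, the connectivity (also called degree) is defined as the sum of its connection strengths with the other genes of the network. In a co-expression network, connectivity measures how strongly a gene is correlated with all other network genes.

Intramodular connectivity

Intramodular connectivity measures how connected, or co-expressed, a given gene is with the genes of a particular module. It can be interpreted as a measure of module membership.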
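The two connectivity measures above can be sketched in base R as row sums of an adjacency matrix. The adjacency values and the module assignment below are made up for illustration; in the workflow later in this post the same quantities come from WGCNA's `intramodularConnectivity()`.

```r
# Toy symmetric adjacency matrix for 6 genes (made-up values), diagonal set to 0
set.seed(2)
n <- 6
adj <- matrix(runif(n * n), n, n)
adj <- (adj + t(adj)) / 2
diag(adj) <- 0
rownames(adj) <- colnames(adj) <- paste0("gene", 1:n)

# Hypothetical module assignment
module <- c("blue", "blue", "blue", "brown", "brown", "brown")

# Whole-network connectivity: sum of connection strengths to all other genes
k_total <- rowSums(adj)

# Intramodular connectivity: sum over genes in the same module only
same_module <- outer(module, module, "==")
k_within <- rowSums(adj * same_module)

data.frame(module, k_total, k_within, k_out = k_total - k_within)
```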

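Module eigengene E

The first principal component of a given module; it represents the gene expression profile of the whole module.

Module Membership (MM)

For each gene, module membership is defined as the correlation between its expression profile and the module eigengene of a given module. For example, $MM^{blue}(i) = \mathrm{cor}(x_i, E^{blue})$ measures how correlated gene i is with the eigengene of the blue module. If $MM^{blue}(i)$ is close to 0, the i-th gene is not part of the blue module. If it is close to 1 or -1, the gene is highly connected to the blue module genes. The sign of MM encodes whether the gene is positively or negatively related to the blue module eigengene.

Hub gene

Short for "highly connected gene". By definition, it is a gene with high connectivity inside a co-expression network module.

Gene significance (GS)

$GS.ClinicalTrait(i) = |\mathrm{cor}(x_i, ClinicalTrait)|$, where $x_i$ is the expression profile of the i-th gene. The higher the absolute value of GS.ClinicalTrait(i), the more biologically significant the i-th gene is with respect to that trait.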
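To make the definitions of module eigengene, MM, and GS concrete, here is a small base R sketch on simulated data. The "blue" module, the trait, and all variable names are assumptions for illustration; the first principal component from `prcomp()` stands in for what WGCNA's `moduleEigengenes()` computes in the real workflow.

```r
set.seed(3)
n_samples <- 30
trait <- rnorm(n_samples)                       # made-up clinical trait
# Five genes of a hypothetical "blue" module driven by a shared signal
signal <- trait + rnorm(n_samples, sd = 0.5)
expr <- sapply(1:5, function(i) signal + rnorm(n_samples, sd = 0.5))
colnames(expr) <- paste0("gene", 1:5)

# Module eigengene: first principal component of the scaled module expression
pc <- prcomp(expr, center = TRUE, scale. = TRUE)
ME_blue <- pc$x[, 1]
# The sign of a principal component is arbitrary; align it with mean expression
if (cor(ME_blue, rowMeans(scale(expr))) < 0) ME_blue <- -ME_blue

MM_blue  <- cor(expr, ME_blue)[, 1]       # module membership of each gene
GS_trait <- abs(cor(expr, trait))[, 1]    # gene significance for the trait

round(cbind(MM_blue, GS_trait), 2)
```

Genes with both high MM and high GS are the natural hub gene candidates for the trait.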

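Basic analysis workflow

  • Data input and cleaning
  • Network construction and module detection
  • Quantifying the relationships between modules and sample traits
  • Picking out the genes inside the modules of interest
  • Visualizing the TOM matrix
  • Exporting the network to external software for visualization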

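1. Frequently asked questions about data analysis

1. How many samples are needed?

Running WGCNA on a data set with fewer than 15 samples is not recommended. As with other analysis methods, more samples generally give more robust and more refined results.

2. How should probes be filtered?

Probesets or genes can be filtered by mean expression, median absolute deviation (MAD), or variance, because low-expressed or non-varying genes usually represent noise. Whether filtering by mean expression or by variance is better is debated; both have advantages and disadvantages. Filtering genes by differential expression is not recommended.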
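A minimal sketch of MAD-based filtering, assuming a genes-by-samples matrix on the log scale. The simulated matrix and the cutoff of 50 genes are illustrative only; the workflow below keeps the top 5,000 genes of the real data set in the same way.

```r
# Simulated log-scale matrix: 200 genes (rows) x 12 samples (columns)
set.seed(4)
dat <- matrix(rnorm(200 * 12, mean = 5), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
dat[1:20, ] <- dat[1:20, ] * 3   # make the first 20 genes more variable

n_keep <- 50                                   # illustrative cutoff (assumption)
gene_mad <- apply(dat, 1, mad)                 # MAD of each gene across samples
keep <- order(gene_mad, decreasing = TRUE)[seq_len(n_keep)]

# WGCNA expects samples in rows and genes in columns, hence the transpose
datExpr_toy <- t(dat[keep, ])
dim(datExpr_toy)
```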

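3. Besides microarray data, can RNA-seq data be used for WGCNA?

  • Working with (properly normalized) RNA-seq data is not really different from working with (properly normalized) microarray data.
  • As long as all samples are processed the same way, whether one uses RPKM, FPKM, or simply normalized counts does not make much difference for WGCNA.
  • If the data come from different batches, batch effects need to be removed.

4. Problems choosing the soft-thresholding power?

If no reasonable power (less than 15 for unsigned or signed hybrid networks, less than 30 for signed networks) brings the scale-free topology fit index R^2 above 0.8 or the mean connectivity below 100, the cause may be a batch effect, biological heterogeneity (for example, a data set made up of samples from 2 different tissues), or strong changes between conditions (for example, a time series). One should investigate carefully whether there is sample heterogeneity, what drives it, and whether the data should be adjusted. If the poor fit turns out to be caused by an interesting biological variable that one does not want to remove (that is, adjust the data for), an appropriate soft-thresholding power can be chosen from the number of samples, as in the table below.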

| Number of samples | Unsigned and signed hybrid networks | Signed networks |
|---|---|---|
| Less than 20 | 9 | 18 |
| 20-30 | 8 | 16 |
| 30-40 | 7 | 14 |
| More than 40 | 6 | 12 |
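The table can be turned into a small lookup function. `empirical_power` is a hypothetical helper written for this post, not part of WGCNA, and the handling of the boundary values (exactly 20, 30, or 40 samples) is my own assumption because the table leaves it open.

```r
# Look up the empirical soft-thresholding power from the table above.
# Hypothetical helper; boundary sample sizes go to the larger-sample row.
empirical_power <- function(n_samples,
                            type = c("unsigned", "signed hybrid", "signed")) {
  type <- match.arg(type)
  base <- if (n_samples < 20) {
    9
  } else if (n_samples < 30) {
    8
  } else if (n_samples < 40) {
    7
  } else {
    6
  }
  # Signed networks use twice the power of unsigned / signed hybrid networks
  if (type == "signed") 2 * base else base
}

empirical_power(31)             # 31 samples, unsigned network
empirical_power(31, "signed")   # 31 samples, signed network
```

With the 31 IRI samples used below and an unsigned network, this lookup gives the power 7 that the rest of the post uses.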

2. Analysis workflow

Install the required packages

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("WGCNA")
BiocManager::install(c("clusterProfiler","org.Mm.eg.db"))

# CRAN packages loaded by the scripts below
install.packages(c("ggplot2", "gplots", "dplyr", "readxl"))

2.1 Data input, cleaning, and preprocessing: produce an expression matrix with samples in rows and genes in columns, plus a matrix of the clinical traits for those samples

# Step1.  Data input and cleaning:
#========================================================
#  Code chunk Step1.1
#========================================================
# Display the current working directory
rm(list = ls())
getwd();
library(dplyr)
# Load the WGCNA package
library(WGCNA);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the fpkm data set
library(readxl)
a <- read_excel("GSE98622_mouse-iri-master.xlsx")
class(a)
a=as.data.frame(a)
head(a[,1:6])
rownames(a)=a[,1]
table(a$type)
dat <- a%>%filter(type=="protein_coding")# keep protein-coding probes only
probe2sym=dat[,1:3]# extract the probe, gene symbol, and type columns
colnames(probe2sym)=c('probe', "symbol","type"  )
rownames(dat) <- dat[,1]
dat=dat[,4:ncol(dat)]
dat <- dat%>%select(starts_with("IRI"))# keep only the samples of interest, here the ischemia-reperfusion injury (IRI) samples
dim(dat)
write.csv(probe2sym,file = 'anno_probe2sym.csv')
# Take a quick look at what is in the data set:
names(dat);

boxplot(dat[,1:10],las=2)# boxplot of the first 10 samples to inspect the expression values

dat <- log2(dat+1)# log2-transform the FPKM values
boxplot(dat[,1:10],las=2)


#===========================================================
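#  Code chunk Step1.2  Match the data and filter with goodSamplesGenes
#===========================================================
## WGCNA clusters genes, whereas ordinary hclust clustering is usually run on samples, so the matrix must be transposed (samples in rows, genes in columns)
datExpr0 <-  t(dat[order(apply(dat,1,mad), decreasing = T)[1:5000],])# top 5000 mad genes

datExpr <- datExpr0 

# First check for genes and samples with too many missing values
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

# If the last statement returns TRUE, all genes and samples passed. If not, remove the offending genes and samples from the data: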
if (!gsg$allOK){
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
    printFlush(paste("Removing genes:", paste(colnames(datExpr0)[!gsg$goodGenes], collapse = ", ")));  # datExpr0 is a matrix, so use colnames()
  if (sum(!gsg$goodSamples)>0) 
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}

#====================================================
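#  Code chunk Step1.3 Plot the hierarchical clustering tree to check for outlier samples
#====================================================
# Next, cluster the samples (in contrast to clustering genes later) to see whether there are obvious outliers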
sampleTree = hclust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-sampleClustering.png",width = 800,height = 600)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)
abline(h = 75, col = "red")  # base graphics calls are issued separately, not chained with "+"
dev.off()
#====================================================
#  Code chunk Step1.4 If there are outlier samples, set h in abline accordingly
#====================================================
# Plot a line to show the cut
#abline(h = 70000, col = "red");# the clustering looks fine here and no sample needs removing, so adjust "h =" to your data
# Determine cluster under the line
clust = cutreeStatic(sampleTree, cutHeight = 80, minSize = 10)# no sample needs removing here, so adjust "cutHeight =" accordingly
table(clust)
# clust == 1 contains the samples we want to keep
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# The variable datExpr now contains the expression data ready for network analysis.
#====================================================
#  Code chunk Step1.5 Plot the sample clustering tree with a trait heatmap
#====================================================

# Read the clinical trait file
datTraits=read.table("datTraits.txt",sep = "\t",header = T,check.names = F)
datTraits[1:4,1:4]

## Reorder the traits so that the clinical phenotypes line up with the sample names
datTraits <- datTraits[match(rownames(datExpr),rownames(datTraits)),]

identical(rownames(datTraits),rownames(datExpr))

# Re-cluster samples
sampleTree2 = hclust(dist(datExpr), method = "average")
# Convert traits to colors: white means low, red means high, grey means a missing entry
# Continuous variables give a gradient; 0/1 variables show up as alternating white and red
traitColors = numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram and the colors underneath.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-Sample dendrogram and trait heatmap.png",width = 800,height = 600)
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits), 
                    main = "Sample dendrogram and trait heatmap")
dev.off()
datExpr=as.data.frame(datExpr)

save(datExpr, datTraits, file = "WGCNA-01-dataInput.RData")

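The 5,000 genes with the highest MAD are selected.

2.2 GSE98622 contains 49 samples in total; the expression matrix of the 31 IRI samples is extracted.

2.2.1 The clinical traits are:

  • datTraits: all IRI groups at the different time points
  • days group: IRI-group samples at the 48 h and 72 h time points
  • hours group: IRI-group samples at the 2 h, 4 h, and 24 h time points
  • months group: IRI-group samples at the 6-month and 12-month time points
  • weeks group: IRI-group samples at the 7 d, 14 d, and 28 d time points

2.3 Obtain the sample clustering tree and the clinical trait heatmap

2.4 One-step network construction and soft-threshold (power) selection: no suitable power was found, so the empirical power 7 provided by the package authors is used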

#====================================================
#  Code chunk Step2.1.1 Automatic one-step network construction and module detection
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. This helps speed up certain calculations.
# At present this call is necessary for the code to work.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments. 
# See note above.
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 2)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
# Load the data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames

#====================================================
#  Code chunk Step2.1.2  Choose a suitable soft-thresholding power
#====================================================
# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=1))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
png("step2-beta-value.png",width = 800,height = 600)
par(mfrow = c(1,2));
cex1 = 0.9;# font size of the labels in the plots
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red")
abline(h=0.70,col="red")# this line corresponds to using an R^2 cut-off of 0.70
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
abline(h=100,col="red")
dev.off()
sft$powerEstimate  # automatically estimated power; the default R^2 cut-off is 0.85

#====================================================
#  Code chunk Step2.1.3 One-step network construction
#====================================================

net = blockwiseModules(datExpr, power = 7,# set power to suit your own data
                       TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = FALSE,
                       saveTOMFileBase = "WGCNA TOM", 
                       verbose = 3)

# mergeCutHeight is the threshold for merging modules. net$colors contains the module assignment and net$MEs contains the module eigengenes.

table(net$colors)
# Here the labels run from 0 to 15 (16 labels): modules are numbered in decreasing order of size, and label 0 is reserved for genes outside all modules
# The hierarchical clustering dendrogram used for module identification is returned in net$dendrograms[[1]].

#====================================================
#  Code chunk Step2.1.4 Plot the gene dendrogram with module colors
#====================================================
# open a graphics window
sizeGrWindow(12, 9)
# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
png("Step2-genes-modules.png",width = 800,height = 500)
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
dev.off()

#====================================================
#  Code chunk Step2.1.5 Save MEs, moduleLabels, moduleColors, geneTree
#====================================================
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)# module color of each gene
MEs = net$MEs;
geneTree = net$dendrograms[[1]];# the gene dendrogram used for module identification
save(MEs, moduleLabels, moduleColors, geneTree, 
     file = "WGCNA-02-networkConstruction-auto.RData")

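2.5 Relating modules to clinical traits

2.6 Selecting modules associated with the clinical traits of interest

2.7 GO enrichment analysis of the genes within a module

2.8 Visualizing the selected genes in Cytoscape with the cytoHubba plugin

3. Results

3.1 Sample clustering tree and clinical trait heatmap

Step1-Sample dendrogram and trait heatmap.png

Interpretation: a clustering dendrogram of the samples based on Euclidean distance, with the traits shown as color bands underneath (0 = white, 1 = red; for continuous variables the color runs from white to red as the value increases, and grey marks missing values). The bands follow the order of the samples in the tree. The trait colors do not line up with the branches, which suggests that samples with trait = 0 are not "globally different" from samples with trait = 1.

3.2 Choosing the soft-thresholding power: the power is chosen so that the network approximates scale-free topology while the mean connectivity of the nodes drops toward 0. No suitable power was found here, so the empirical value 7 is used.

image.png

Interpretation: network topology analysis for a range of soft-thresholding powers. The left panel shows the scale-free fit index (y-axis) as a function of the power (x-axis). The right panel shows the mean connectivity (degree, y-axis) as a function of the power (x-axis).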
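The choice read off the left panel can be sketched as "take the smallest power whose signed R^2 reaches the cut-off, otherwise fall back to an empirical value". The fit table below is made up and only mimics the `Power`, `SFT.R.sq`, `slope`, and `mean.k.` columns of `sft$fitIndices`; `pick_power` is a hypothetical helper, not a WGCNA function.

```r
# Made-up fit table mimicking some columns of sft$fitIndices
fit <- data.frame(
  Power    = c(1, 2, 3, 4, 5, 6, 7, 8),
  SFT.R.sq = c(0.10, 0.05, 0.30, 0.55, 0.70, 0.78, 0.83, 0.86),
  slope    = c(0.8, -0.2, -0.6, -0.9, -1.1, -1.2, -1.3, -1.4),
  mean.k.  = c(900, 500, 300, 190, 125, 85, 60, 43)
)

# Signed fit index, as plotted on the y-axis of the left panel
signed_r2 <- -sign(fit$slope) * fit$SFT.R.sq

# Hypothetical helper: smallest power reaching the cut-off, else the fallback
pick_power <- function(fit, signed_r2, r2_cut = 0.8, fallback = NA) {
  ok <- which(signed_r2 >= r2_cut)
  if (length(ok) == 0) return(fallback)  # nothing reaches the cut-off
  fit$Power[min(ok)]
}

pick_power(fit, signed_r2)                              # cut-off 0.8
pick_power(fit, signed_r2, r2_cut = 0.9, fallback = 7)  # falls back to 7
```

The second call mirrors the situation in this post: no power reaches the cut-off, so the empirical value is used instead.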

3.3 Module construction

# Step3. Relating modules to external clinical traits and identifying important genes
#====================================================
#  Code chunk Step3.1 
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames

#====================================================
#  Code chunk Step3.2 Compute the MEs
#====================================================
# Define numbers of genes and samples
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)# matrix of ME values for each module color (samples x modules)
moduleTraitCor = cor(MEs, datTraits, use = "p");# module-trait correlations: rows are modules, columns are clinical traits
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

#====================================================
#  Code chunk Step3.3 Plot the module-trait relationships
#====================================================

sizeGrWindow(15,20)
# Will display correlations and their p-values
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
png("step3-Module-trait-relationships.png",width = 1500,height = 1200,res = 130)
par(mar= c(5,8, 2, 2));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50),# WGCNA warns that greenWhiteRed is unsuitable for red-green color-blind readers and suggests blueWhiteRed
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))
dev.off()

# The analysis identifies several significant module-trait associations. The code below focuses on months as the trait of interest
#====================================================
#  Code chunk Step3.4 Relating genes to the trait and to important modules: GS and MM
#====================================================
#GS: (the absolute value of) the correlation between the gene and the trait
#MM: the correlation of the module eigengene and the gene expression profile. This allows us to quantify the similarity of all genes on the array to every module.

# Define variable months containing the months column of datTraits
months = as.data.frame(datTraits$months);
names(months) = "months"
# names (colors) of the modules
modNames = substring(names(MEs), 3)

geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
# Prefix the column names with MM and p.MM
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

geneTraitSignificance = as.data.frame(cor(datExpr, months, use = "p"));# change the clinical trait here as needed
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));

names(geneTraitSignificance) = paste("GS.", names(months), sep="");# change the trait name as needed
names(GSPvalue) = paste("p.GS.", names(months), sep="");# change the trait name as needed

#====================================================
#  Code chunk Step3.5  Intramodular analysis: scatter plot of module membership vs. gene significance
#====================================================

module = "purple"
column = match(module, modNames);
moduleGenes = moduleColors==module;

sizeGrWindow(7, 7);
png("step3-Module_membership-gene_significance.png",width = 800,height = 600)
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for months",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
dev.off()
# In the resulting plot GS and MM are clearly highly correlated,
# which indicates that genes highly significantly associated with a trait are often also the most important (central) elements of the modules associated with that trait.

#====================================================
#  Code chunk Step3.5.1   Compute intramodular connectivity
#====================================================

adjacency = adjacency(datExpr, power =7);# compute the adjacency matrix
Alldegrees=intramodularConnectivity(adjacency, moduleColors)

# Returns a data frame with 4 columns giving, for all genes, the total connectivity (kTotal),
# intramodular connectivity (kWithin), extra-modular connectivity (kOut),
# and the difference of the intra- and extra-modular connectivities (kDiff).
# If the whole-network connectivity is not requested, a vector of intramodular connectivities is returned instead.

class(Alldegrees)
module = "purple"; # Select module
probes = names(datExpr) # Select module probes
inModule = (moduleColors==module);
modProbes = probes[inModule];
length(modProbes)
KIM_module=Alldegrees[modProbes,]

#====================================================
#  Code chunk Step3.5.2   Show the module heatmap and the eigengene
#====================================================

# Now create a figure that relates the module (heatmap) to its module eigengene (barplot):
sizeGrWindow(8,7);
which.module="purple"
ME=MEs[, paste("ME",which.module, sep="")]
par(mfrow=c(2,1), mar=c(0.3, 5.5, 3, 2))
plotMat(t(scale(datExpr[,moduleColors==which.module ]) ),
        nrgcols=30,rlabels=F,rcols=which.module,
        main=which.module, cex.main=2)
par(mar=c(5, 4.2, 0, 0.7))
barplot(ME, col=which.module, main="", cex.main=2,
        ylab="eigengene expression",xlab="array sample")

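# The top panel shows the expression of the module genes (rows) across the samples (columns). The bottom panel shows the module eigengene expression value (y-axis) for the same samples. Note that the ME takes low values in samples where many module genes are under-expressed (green in the heatmap) and high values in samples where many module genes are over-expressed (red in the heatmap). The ME can be regarded as the most representative gene expression profile of the module.


#====================================================
#  Code chunk Step3.6 Convert probe names to gene names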
#====================================================
names(datExpr)[1:10]
tail(names(datExpr)[moduleColors=="purple"])

annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
dim(annot)
names(annot)
probes = names(datExpr)
probes2annot = match(probes, annot$probe)
# The following is the number or probes without annotation:
sum(is.na(probes2annot))
# Should return 0.

#====================================================
#  Code chunk Step3.7  Create a data frame holding the following information for all probes:
# probe ID, gene symbol, module color, gene significance for months, and module membership and p-values in all modules.
# Modules are ordered by their significance for months, with the most significant ones to the left.
#====================================================
# Create the starting data frame
geneInfo0 = data.frame(probe = probes,# adapt to your own data
                       geneSymbol = annot$symbol[probes2annot],
                       type = annot$type[probes2annot],
                       moduleColor = moduleColors,
                       geneTraitSignificance,
                       GSPvalue)
# Order modules by their significance for months
modOrder = order(-abs(cor(MEs,months, use = "p"))); # change the trait here as needed
# Add module membership information in the chosen order
for (mod in 1:ncol(geneModuleMembership)) 
{
  oldNames = names(geneInfo0)
  geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]], 
                         MMPvalue[, modOrder[mod]]);
  names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
                       paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
# Order the genes in the geneInfo variable first by module color, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.months)); # change the trait here as needed
geneInfo = geneInfo0[geneOrder, ]

write.csv(geneInfo, file = "geneInfo.csv")
Step2-genes-modules.png

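Interpretation: clustering dendrogram of the selected genes, with dissimilarity based on topological overlap (distance = 1 - TOM), together with the assigned module colors.

3.4 Quantifying module-trait associations

step3-Module-trait-relationships.png
  • Module-trait relationship plot. Each row corresponds to a module eigengene and each column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend.
  • By defining gene significance (GS) as (the absolute value of) the correlation between a gene and a trait, we can quantify the association of individual genes with the trait of interest. For each module we also define a quantitative measure of module membership (MM) as the correlation between the module eigengene and the gene expression profile. This lets us quantify the similarity of all genes on the array to every module.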
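Each cell of the module-trait plot holds a correlation and its p-value. The sketch below reproduces that calculation in base R on made-up eigengenes and a made-up trait, using the Student t-test for a Pearson correlation; to my understanding this is the same test that `corPvalueStudent()` applies, but treat that equivalence as an assumption.

```r
set.seed(5)
n_samples <- 31
trait <- rnorm(n_samples)                                   # made-up trait
MEs_toy <- data.frame(MEblue  = trait + rnorm(n_samples, sd = 0.5),
                      MEbrown = rnorm(n_samples))           # made-up eigengenes

# Module-trait correlations (one value per module)
r <- cor(MEs_toy, trait, use = "p")[, 1]

# Student t-test p-value for a correlation: t = r * sqrt((n - 2) / (1 - r^2))
t_stat <- r * sqrt((n_samples - 2) / (1 - r^2))
p <- 2 * pt(abs(t_stat), df = n_samples - 2, lower.tail = FALSE)

round(cbind(cor = r, p = p), 4)
```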

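3.5 Adding the trait data to the MEs and plotting a combined ME correlation heatmap: choose the clinical trait you are interested in

step5-Eigengene-dendrogram.png

Interpretation: the eigengene network visualizes the relationships among the modules and the clinical trait. In the heatmap of eigengene adjacencies, red indicates high adjacency (positive correlation) and blue indicates low adjacency (negative correlation).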
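The two quantities behind this figure are the eigengene dissimilarity 1 - cor(Ei, Ej), used for the dendrogram, and the eigengene adjacency (1 + cor(Ei, Ej)) / 2, used for the heatmap. A minimal base R sketch on made-up eigengenes (all names and values are illustrative assumptions):

```r
set.seed(8)
# Made-up eigengenes plus a trait column, 20 samples
MET_toy <- data.frame(MEblue  = rnorm(20),
                      MEbrown = rnorm(20),
                      months  = rnorm(20))
MET_toy$MEgreen <- MET_toy$MEblue + rnorm(20, sd = 0.3)  # close to MEblue

eig_cor  <- cor(MET_toy, use = "p")
eig_adj  <- (1 + eig_cor) / 2   # adjacency shown in the heatmap, in [0, 1]
eig_diss <- 1 - eig_cor         # dissimilarity used for the dendrogram

eig_tree <- hclust(as.dist(eig_diss), method = "average")
round(eig_adj, 2)
```

An adjacency near 1 means two eigengenes are strongly positively correlated, 0.5 means uncorrelated, and near 0 means strongly negatively correlated.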

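3.6 Intramodular analysis: choose a module and plot Module membership against Gene significance

step3-Module_membership-gene_significance.png

Interpretation: scatter plot of Gene significance (GS) for the trait against Module membership (MM) in the module of interest. GS and MM are highly significantly correlated in this module, which indicates that genes highly significantly associated with a trait are often also the most important (central) elements of the modules associated with that trait.

3.7 GO analysis of the genes within a module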

#Step4.Interfacing network analysis with other data such as GO and KEGG
#====================================================
#  Code chunk Step4.1 Focus on the genes inside one specific module
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 4)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
# Load the expression and trait data saved in the first part
lnames = load(file ="WGCNA-01-dataInput.RData");# if the expression matrix loads as a matrix, convert it to a data.frame to avoid errors
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames

# Select module
module = "purple";
# Select module probes
probes = colnames(datExpr) # in this example the probes are the genes
inModule = (moduleColors==module);
table(inModule)
modProbes = probes[inModule]; # the module genes are now selected and can be passed to clusterProfiler for plotting
head(modProbes)
annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
modGenes = annot$symbol[match(modProbes, annot$probe)];

#====================================================
#  Code chunk Step4.2   GO and KEGG analysis
#====================================================
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
modGenes=as.data.frame(modGenes)
mod_entrez= mapIds(x = org.Mm.eg.db,
                   keys = modGenes$modGenes,
                   keytype = "SYMBOL",
                   column = "ENTREZID")
length(mod_entrez)
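mod_entrez =na.omit(mod_entrez)# drop genes without an ENTREZ ID
length(mod_entrez)

# Enrichment analysis of the gene set: given a vector of genes, enrichGO returns the enriched GO categories after FDR control.
go <- enrichGO(gene = mod_entrez,   #a vector of entrez gene id
               keyType = "ENTREZID",# ID type of the input genes
               OrgDb = "org.Mm.eg.db", # organism annotation database; Bioconductor provides human, mouse, and others
               ont="all",
               pvalueCutoff = 0.5,
               qvalueCutoff = 0.5,
               readable = TRUE)#whether mapping gene ID to gene Name

par(mar= c(3,4, 2, 2));
png("GO_all.png",width = 1500,height = 1200,res = 130)# any file name will do
# print() makes sure the ggplot object is drawn when the script is sourced
print(dotplot(go, split="ONTOLOGY") + facet_grid(ONTOLOGY~., scales="free"))
dev.off()
####  Inspect the results; there may be an argument that returns them directly, to be checked in the package documentation later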
go_result=go@result
write.csv(go_result,file = 'go_all_result.csv')
image.png
# Step5.Network visualization using WGCNA functions

#====================================================
#  Code chunk Step5.1 Visualize the TOM matrix, the standard WGCNA figure
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
library(gplots)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 4)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)


# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
# Load a cached TOM if one exists; otherwise compute it with the power used earlier (7) and cache it
if (file.exists("TOMsimilarityFromExpr.Rdata")) {
  load(file="TOMsimilarityFromExpr.Rdata")
} else {
  TOM = TOMsimilarityFromExpr(datExpr, power = 7);
  save(TOM,file="TOMsimilarityFromExpr.Rdata")
}
dissTOM = 1-TOM;# only TOM is cached, so dissTOM is always derived here

# Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
plotTOM = dissTOM^7;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function

if (T) {
  
sizeGrWindow(9,9)
png("step5-Network-heatmap.png",width = 800,height = 600)
myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes", col = myheatcol)
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
dev.off()
}
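# The heatmap visualizes the gene network. It depicts the topological overlap matrix (TOM) among all genes in the analysis.
# Light colors represent low overlap and progressively darker red represents higher overlap. Dark blocks along the diagonal are modules.
# The gene dendrogram and module assignment are also shown along the left side and the top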


  
  #====================================================
  #  Code chunk 3
  #====================================================
  nSelect = 400
  # For reproducibility, we set the random seed
  set.seed(10);
  select = sample(nGenes, size = nSelect);
  selectTOM = dissTOM[select, select];
  # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
  selectTree = hclust(as.dist(selectTOM), method = "average")
  selectColors = moduleColors[select];
  # Open a graphical window
  sizeGrWindow(9,9)
  # Taking the dissimilarity to a power (7 here) makes the plot more informative by effectively changing 
  # the color palette; setting the diagonal to NA also improves the clarity of the plot
  plotDiss = selectTOM^7;
  diag(plotDiss) = NA;
  myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
  TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes",col = myheatcol)
  
#====================================================
#  Code chunk Step5.3 Add the trait data to the MEs and plot a combined ME correlation heatmap: choose the clinical trait you are interested in
#====================================================
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate months from the clinical traits
months = as.data.frame(datTraits$months);
names(months) = "months"
# Add months to the existing module eigengenes to form MET
MET = orderMEs(cbind(MEs, months))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,10);
par(cex = 0.9)
png("step5-Eigengene-dendrogram.png",width = 800,height = 600)
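par(mar = c(3,6,2,2))  # base graphics margins; margin() is a ggplot2 helper and does not work with par()
plotEigengeneNetworks(MET, "Eigengene dendrogram and adjacency heatmap" , 
                      marDendro = c(0,4,1,2), 
                      marHeatmap = c(5,5,1,2) ,
                      cex.lab = 0.8, xLabelsAngle= 90)
dev.off()

# This function produces a dendrogram of the MEs and the trait plus a heatmap of their relationships. The eigengene dendrogram is built from the dissimilarity of eigengenes Ei and Ej, given by 1 - cor(Ei, Ej);
# the heatmap underneath shows the eigengene adjacencies, Aij = (1 + cor(Ei, Ej)) / 2.

# To split the dendrogram and the heatmap, use the following code
#====================================================
#  Code chunk Step5.5 Split the dendrogram and the heatmap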
#====================================================
# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
# Dendrogram of the modules
#png("step5-Eigengene-dendrogram-hclust.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
dev.off()
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
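par(mar = c(3,6,2,2))  # base graphics margins; margin() is a ggplot2 helper and does not work with par()
# Trait and module heatmap
#png("step5-Eigengene-adjacency-heatmap.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(5,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
# The eigengene network visualizes the relationships between the modules and the clinical trait.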


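Genes are generally described at three levels:

  • Cellular component describes where the gene product is located: in the cytoplasm or in the nucleus? If in the cytoplasm, in which organelle? If in the mitochondrion, on the mitochondrial membrane or in the matrix? All of this information is Cellular component.
  • Biological process describes which biological processes the gene takes part in, for example rRNA processing or DNA replication.
  • Molecular function describes what the gene does at the molecular level, for example which reaction it catalyzes. Together these three aspects give the annotation of a gene.

3.8 Heatmap of the topological overlap matrix

step5-Network-heatmap.png
  • The heatmap visualizes the gene network. It depicts the topological overlap matrix (TOM) among all genes in the analysis. Light colors represent low overlap and progressively darker red represents higher overlap. Dark blocks along the diagonal are modules. The gene dendrogram and module assignment are also shown along the left side and the top.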
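For intuition about what the heatmap shows, here is a base R sketch of the commonly cited unsigned topological overlap, TOM_ij = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), computed on simulated data. The data and the power 6 are illustrative assumptions, and this is a simplified stand-in for `TOMsimilarityFromExpr()`, which offers more options.

```r
set.seed(6)
n <- 8
expr <- matrix(rnorm(25 * n), ncol = n)   # simulated: 25 samples x 8 genes
adj <- abs(cor(expr))^6                   # unsigned adjacency, illustrative power
diag(adj) <- 0                            # ignore self-connections

shared <- adj %*% adj                     # sum over u of a_iu * a_uj
k <- rowSums(adj)                         # connectivity of each gene
tom <- (shared + adj) / (outer(k, k, pmin) + 1 - adj)
diag(tom) <- 1

diss_tom <- 1 - tom                       # dissimilarity used for clustering
gene_tree <- hclust(as.dist(diss_tom), method = "average")
```

Two genes get a high overlap when they are directly connected and also share many strong neighbors, which is why modules appear as dark blocks along the diagonal.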
# Step6.Export of networks to external software 
#====================================================
#  Code chunk Step6.1
#====================================================
# Display the current working directory
getwd();
rm(list = ls())
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
enableWGCNAThreads()
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames

#====================================================
#  Code chunk Step6.2
#====================================================

# Recalculate topological overlap if needed
load(file="TOMsimilarityFromExpr.Rdata")
#TOM = TOMsimilarityFromExpr(datExpr, power = 7);# same power (7) as used earlier

# Read in the annotation file
annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
# Select modules


module = c("purple");### earlier note: pink corresponds to hour
# Select module probes
datExpr=as.data.frame(datExpr)
probes = names(datExpr)
inModule = (moduleColors==module);
table(moduleColors)
table(inModule)
modProbes = probes[inModule];
modGenes = annot$symbol[match(modProbes, annot$probe)];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
                               edgeFile = paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
                               nodeFile = paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
                               weighted = TRUE,
                               threshold = 0.02,
                               nodeNames = modProbes,
                               altNodeNames = modGenes,
                               nodeAttr = moduleColors[inModule])

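Summary:

  • Running WGCNA on the data set yielded a set of candidate hub genes, which can then be measured and validated experimentally. For tumor data, one could follow up with prognostic analysis in the TCGA database, for example.

  • Hub genes can also be selected by high intramodular connectivity.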
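A minimal sketch of such a hub gene screen, assuming a per-gene table with MM, GS, and intramodular connectivity for one module. The table is made up, the column names are hypothetical, and the cutoffs 0.8 and 0.2 are illustrative choices of mine rather than values used in this analysis.

```r
# Made-up per-gene table for one module; column names are hypothetical
set.seed(7)
genes <- data.frame(
  gene    = paste0("gene", 1:10),
  MM      = round(runif(10, 0.3, 0.99), 2),
  GS      = round(runif(10, 0.0, 0.9), 2),
  kWithin = round(runif(10, 1, 20), 1)
)

mm_cut <- 0.8   # illustrative cutoff on |MM| (assumption)
gs_cut <- 0.2   # illustrative cutoff on |GS| (assumption)

# Keep genes passing both cutoffs, then rank by intramodular connectivity
candidates <- subset(genes, abs(MM) > mm_cut & abs(GS) > gs_cut)
candidates <- candidates[order(-candidates$kWithin), ]
candidates
```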
