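How to pick hub genes from modules after running WGCNA
Original, by 生信技能樹
I have shared hands-on WGCNA details in several tutorials on 生信技能樹.
Those tutorials usually stop at this point: the top 5000 genes by MAD are partitioned into modules by the WGCNA algorithm, and each module can then be tested for correlation with clinical traits.
First, the relationship between sample traits and modules
To read the figure below, you need to understand three concepts: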
gene significance (GS) was defined as mediated p-value of each gene (GS = lgP) in the linear regression between gene expression and the clinical traits.
module eigengenes (MEs) were defined as the first principal component of each gene module and the expression of MEs was considered as a representative of all genes in a given module.
module significance (MS) were defined as the average GS of all the genes involved in the module
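To make the GS and MS definitions concrete, here is a minimal sketch in plain Python (not the WGCNA R package). It assumes GS is taken as the absolute Pearson correlation between a gene's expression and the trait, which is the variant the paper quoted later in this post uses, rather than the log p-value form given above. The gene names, expression values and trait are made-up toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def gene_significance(expr, trait):
    """GS per gene, taken here as |cor(expression, trait)| (an assumption)."""
    return {gene: abs(pearson(values, trait)) for gene, values in expr.items()}

def module_significance(gs, module_genes):
    """MS: average GS over the genes assigned to one module."""
    return sum(gs[g] for g in module_genes) / len(module_genes)

# Toy data: three hypothetical genes measured in four samples, one binary trait.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
trait = [0, 0, 1, 1]

gs = gene_significance(expr, trait)
ms = module_significance(gs, ["geneA", "geneB"])  # pretend A and B form one module
```

Because GS is an absolute value here, geneA and geneB (perfectly anti-correlated with each other) get the same GS, and the MS of the toy module is simply their mean.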
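First, every module has a module eigengene (ME). The ME stands in for the module itself when computing the correlation with a trait (across samples), and that correlation value is what the module-trait heatmap shows.
It is easy to see that each of the three disease-progression phases has modules that are very significantly correlated with it. As an example, suppose we care about phase1. We can then drill down and look at the correlation coefficient between the phase1 trait and every gene in every module.
As you can see, this is essentially equivalent to the module-trait correlation heatmap above. It just takes one trait, phase1, and examines it more closely.
Take the genes in the black module: their GS values for the phase1 trait are all fairly high, which means the black module's eigengene will also correlate strongly with phase1, matching the module-trait correlation heatmap above.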
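The module eigengene used for the module-trait correlation can be sketched as follows. This is a plain-Python illustration, not the WGCNA implementation: it standardizes each gene, finds the first principal component across samples by power iteration, and then correlates it with the trait. The function names, the iteration count and the toy data are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def module_eigengene(rows, iterations=200):
    """First principal component across samples, found by power iteration.

    rows: one list of expression values per gene (genes x samples).
    """
    z = [standardize(r) for r in rows]
    n_samples = len(z[0])
    # Start from the average standardized profile so that the sign of the
    # result follows the module's average expression.
    v = [sum(row[j] for row in z) / len(z) for j in range(n_samples)]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in z]  # Z v
        v = [sum(s * row[j] for s, row in zip(scores, z))           # Z^T (Z v)
             for j in range(n_samples)]
        norm = sqrt(sum(x * x for x in v))
        if norm == 0:
            raise ValueError("average profile is flat; choose another start vector")
        v = [x / norm for x in v]
    return v

# Toy module: three hypothetical co-expressed genes in four samples.
module_expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, 5.0],
]
trait = [0, 0, 1, 1]

me = module_eigengene(module_expr)
me_trait_cor = pearson(me, trait)  # one cell of the module-trait heatmap
```

Each cell of the heatmap corresponds to one such correlation, computed for one module and one trait.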
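Next, the relationship between genes and modules
Since the phase1 trait has three well-associated modules (black, blue and turquoise in this example), the downstream analysis should focus on the gene sets in these three modules. But each module still contains too many genes, as shown below: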
> as.data.frame(table(mergedColors))
   mergedColors Freq
1         black  140
2          blue  572
3         brown  401
4         green  237
5   greenyellow   74
6          grey  203
7       magenta   85
8          pink  103
9        purple   76
10          red  190
11          tan   62
12    turquoise 2591
13       yellow  266
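So we need to explore how the genes inside each module relate to the trait, and how to keep narrowing a module down to the genes of interest.
Draw a Module membership vs. gene significance plot, then pick the genes represented by the points in the upper-right corner.
Many papers use this strategy, for example the one published in Front. Oncol., 11 September 2018 | https://doi.org/10.3389/fonc.2018.00374: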
Based the cut-off criteria (|MM| > 0.8 and |GS| > 0.2), 42 genes with high connectivity in the clinical significant module were identified as hub genes.
Note that this paper sets a rather low threshold for GS. In more detail:
The connectivity of genes was measured by absolute value of the Pearson's correlation.
Genes with high within-module connectivity were considered as hub genes of the modules (cor.geneModuleMembership > 0.8).
Hub genes inside a given module tended to have a strong correlation with certain clinical trait, which was measured by absolute value of the Pearson's correlation (cor.geneTraitSignificance > 0.2).
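The cut-off step quoted above (|MM| > 0.8 and |GS| > 0.2) amounts to a simple filter. A minimal sketch, assuming MM and GS have already been computed as signed correlations per gene; the gene identifiers and values below are placeholders, not data from the paper.

```python
def select_hub_genes(mm, gs, mm_cutoff=0.8, gs_cutoff=0.2):
    """Return genes whose |MM| and |GS| both exceed the cut-offs, sorted by |MM|."""
    hubs = [g for g in mm if abs(mm[g]) > mm_cutoff and abs(gs[g]) > gs_cutoff]
    return sorted(hubs, key=lambda g: abs(mm[g]), reverse=True)

# Hypothetical values for five placeholder genes in one module.
mm = {"g1": 0.93, "g2": 0.85, "g3": 0.79, "g4": -0.88, "g5": 0.95}
gs = {"g1": 0.45, "g2": 0.15, "g3": 0.60, "g4": -0.30, "g5": 0.21}

hubs = select_hub_genes(mm, gs)
print(hubs)  # ['g5', 'g1', 'g4']
```

Here g2 is dropped for its low GS and g3 for falling just under the MM cut-off, while g4 is kept because both criteria use absolute values.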
Adding a survival analysis on top narrows the gene list even further:
Among them, CCNB2, FBXO5, KIF4A, MCM10, and TPX2 were negatively associated with the overall survival and relapse free survival
Why did the paper do it this way? It is in fact what the WGCNA website recommends, because Module membership (MM) is a measure of intra-modular connectivity.
So what exactly is connectivity?
If Module membership (MM) is a measure of intra-modular connectivity, then filtering on MM and GS should be enough. Why is there also a dedicated connectivity measure?
To answer that we need to understand how connectivity is defined. Here is one explanation I found: https://www.researchgate.net/post/How_should_I_interpret_the_connectivity_measures_kTotal_kWithin_kOut_kDiff_in_WGCNA
- kTotal - connectivity of each gene based on its r-values to all other genes in the whole network
- kWithin - connectivity of each gene within a single module based on its r-values to all other genes within the same module
- kOut and kDiff - mathematical derivatives of kTotal and kWithin
The WGCNA website's description is brief: The function intramodularConnectivity computes the whole network connectivity kTotal, the within module connectivity kWithin, kOut=kTotal-kWithin, and kDiff=kIn-kOut=2*kIN-kTotal (here kIn and kIN both mean kWithin).
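The four measures follow directly from an adjacency matrix and the module labels. The sketch below is a plain-Python illustration of that arithmetic, not the WGCNA function itself. It assumes a symmetric adjacency matrix and excludes each gene's self-connection; the function name, matrix values and module colors are made up.

```python
def connectivity_measures(adjacency, colors):
    """Per-gene kTotal, kWithin, kOut and kDiff from a symmetric adjacency matrix.

    adjacency: square list of lists; colors: module label per gene.
    Self-connections (the diagonal) are ignored in this sketch.
    """
    n = len(adjacency)
    result = []
    for i in range(n):
        k_total = sum(adjacency[i][j] for j in range(n) if j != i)
        k_within = sum(adjacency[i][j] for j in range(n)
                       if j != i and colors[j] == colors[i])
        k_out = k_total - k_within
        result.append({
            "kTotal": k_total,
            "kWithin": k_within,
            "kOut": k_out,
            "kDiff": k_within - k_out,  # equals 2 * kWithin - kTotal
        })
    return result

# Toy network: three genes in a "blue" module and one "grey" gene.
colors = ["blue", "blue", "blue", "grey"]
adjacency = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.4, 0.2],
    [0.6, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

k = connectivity_measures(adjacency, colors)
# Rank the blue genes by kWithin to find the most connected gene in the module.
blue_ranked = sorted((i for i, c in enumerate(colors) if c == "blue"),
                     key=lambda i: k[i]["kWithin"], reverse=True)
```

In this toy network the first gene has the highest kWithin in the blue module, so it would be the top hub candidate by this measure.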
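Few people know about this concept, so after partitioning genes into modules with WGCNA most people never compute this measure. Yet the WGCNA website recommends using exactly this measure to pick the most important genes inside a module!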
Finding genes with high gene significance and high intramodular connectivity in interesting modules