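Preface
This post again draws on several papers and my own accumulated notes. It mainly discusses key genes and hub genes. Many people assume that hub genes are key genes, and some even treat differentially expressed genes as key genes. Before the main text, I will briefly explain how I understand the relationship among the three. If you see it differently, please leave a comment.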
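- Differentially expressed genes (DEGs) are genes that differ with statistical significance between two groups. Taking microarrays as an example, out of tens of thousands of probes perhaps only about 1,000 are differentially expressed (this varies a lot with the chosen thresholds).
- Hub genes are genes with high degree, that is, high connectivity in the gene expression network. Betweenness and similar measures are not involved. Hub selection is also largely subjective: there is no fixed rule on whether to take the top 5% or the top 10%, although 5% is the usual suggestion. In other words, this is a loose definition.
- Key genes: some people pick the top-ranked hubs, others pick the differentially expressed genes with the smallest p-values. So what actually counts as a key gene? Broadly speaking, if knocking a gene down makes the phenotype clearly disappear, it is certainly a key gene. But how do you choose from bioinformatics analysis alone? No single method can settle this directly. Here I look only at the expression-network angle, and I will write a later post on screening key genes from several angles. The set of key genes is smaller than the set of hubs. A hub is not necessarily key, and a key gene is not necessarily a hub.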
In short, in terms of number or scope:
DEGs > hubs > key genes (candidate genes)
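The degree-based hub definition in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `degree_from_edges` and `top_fraction`, the toy edge list, and the alphabetical tie-breaking are all made up for this example, and the 5% cutoff is the loose convention mentioned above rather than a statistical criterion.

```python
import math
from collections import defaultdict


def degree_from_edges(edges):
    """Count the distinct neighbours of every node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


def top_fraction(scores, fraction=0.05):
    """Return the nodes ranked in the top `fraction` by score (at least one node).

    Ties at the cutoff are broken alphabetically, which is an arbitrary choice.
    """
    if not scores:
        return []
    n_keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return ranked[:n_keep]


# Hypothetical toy network: gene "A" is connected to every other gene.
edges = [("A", g) for g in "BCDEFGHIJ"] + [("B", "C"), ("D", "E")]
degree = degree_from_edges(edges)
hubs = top_fraction(degree, fraction=0.05)
print(hubs)
```

Changing `fraction` from 0.05 to 0.10 or 0.20 changes the hub list, which is exactly why the cutoff is a subjective choice.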
------------------------------------------------
Now, on to the main text.
Hub genes
The WGCNA approach typically deals with the identification of gene modules by using the gene expression levels that are highly correlated across samples. This technique has been successfully utilized to detect gene modules in Arabidopsis, rice, maize and poplar for various biotic and abiotic stresses. Further, this approach also leads to construction of Gene Co-expression Network (GCN), a scale free network, where genes are represented as nodes and edges depict associations among genes. In such network, highly connected genes are called hub genes, which are expected to play an important role in understanding the biological mechanism of response under stresses/conditions. Identification of hub genes will also help in mitigating the stress in plants through genetic engineering. The existing approaches have mainly focused on hub gene identification, based only on gene connection degrees in the GCN. Moreover, these techniques select such genes empirically without any statistical criteria. Besides, few approaches can be found in the literature for the identification of hub nodes in a scale free network.
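From this we can see that hub genes live in a scale-free co-expression network (the GCN) and are defined by degree. Existing methods focus on identifying hub genes from connectivity in the GCN alone, and they select genes empirically, without any statistical criterion. In addition, the literature offers few methods for identifying hub nodes in a scale-free network.
So the authors proposed an algorithm and wrote a package that assigns a p-value to each hub gene, so that the number of hub genes can be reduced by applying a p-value criterion.
The package is here
Article link 1
Article link 2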
It has been a long-standing goal in systems biology to find relations between the topological properties and functional features of protein networks. However, most of the focus in network studies has been on highly connected proteins (“hubs”). As a complementary notion, it is possible to define bottlenecks as proteins with a high betweenness centrality (i.e., network nodes that have many “shortest paths” going through them, analogous to major bridges and tunnels on a highway map). Bottlenecks are, in fact, key connector proteins with surprising functional and dynamic properties. In particular, they are more likely to be essential proteins. In fact, in regulatory and other directed networks, betweenness (i.e., “bottleneck-ness”) is a much more significant indicator of essentiality than degree (i.e., “hub-ness”). Furthermore, bottlenecks correspond to the dynamic components of the interaction network; they are significantly less well coexpressed with their neighbors than nonbottlenecks, implying that expression dynamics is wired into the network topology.
A network is a graph consisting of a number of nodes with edges connecting them. Recently, network models have been widely applied to biological systems. Here, we are mainly interested in two types of biological networks: the interaction network, where nodes are proteins and edges connect interacting partners; and the regulatory network, where nodes are proteins and edges connect transcription factors and their targets. Betweenness is one of the most important topological properties of a network. It measures the number of shortest paths going through a certain node. Therefore, nodes with the highest betweenness control most of the information flow in the network, representing the critical points of the network. We thus call these nodes the “bottlenecks” of the network. Here, we focus on bottlenecks in protein networks. We find that, in the regulatory network, where there is a clear concept of information flow, protein bottlenecks indeed have a much higher tendency to be essential genes. In this type of network, betweenness is a good predictor of essentiality. Biological researchers can therefore use the betweenness as one more feature to choose potential targets for detailed analysis.
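To make "the number of shortest paths going through a node" concrete, here is a minimal sketch of unnormalised betweenness for a small unweighted, undirected graph, using Brandes' breadth-first accumulation. The function name `betweenness` and the toy graph (two triangles joined through node `D`) are assumptions for illustration, not taken from the paper. Every neighbour must also appear as a key of `adjacency`.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised shortest-path betweenness (unweighted, undirected graph).

    `adjacency` maps each node to an iterable of its neighbours.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search: count shortest paths (sigma), record predecessors.
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # In an undirected graph every path was counted from both of its ends.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical toy graph: triangles A-B-C and E-F-G, joined only through D.
adjacency = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D", "F", "G"], "F": ["E", "G"], "G": ["E", "F"],
}
bc = betweenness(adjacency)
print(bc)
```

In this toy graph `D` has only two neighbours, yet every path between the two triangles must pass through it, so it has the highest betweenness: low degree, but a clear bottleneck.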
Figure1.png
Below is an explanation of the difference between hubs and bottlenecks.
Central complex members have a low betweenness and are hub–nonbottlenecks.
Because of the high connectivity inside these complexes, paths can go through them and all their neighbors. On the other hand, hub–bottlenecks tend to correspond to highly central proteins that connect several complexes or are peripheral members of central complexes.
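In other words, hub–bottlenecks tend to be highly central proteins that connect several complexes or sit at the periphery of a central complex. Their high betweenness shows that they are not simply members of a large protein complex (which is what characterizes nonbottleneck–hubs) but are the members that tie the complex to the rest of the network. In that sense they are the real connectivity bottlenecks.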
The fact that they have a high betweenness suggests that these proteins are not, however, simply members of large protein complexes (which is true for nonbottleneck–hubs), but are those members that connect the complex to the rest of the graph; in a sense, real connectivity bottlenecks. While hub–nonbottlenecks mainly consist of structural proteins, hub–bottlenecks are more likely to be part of signal transduction pathways.
Hub–nonbottlenecks are mainly structural proteins;
hub–bottlenecks are more likely to be part of signal transduction pathways.
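The four categories used in this discussion (hub–bottleneck, hub–nonbottleneck, nonhub–bottleneck, nonhub–nonbottleneck) can be assigned once degree and betweenness are known for every node. The sketch below assumes both are already available as dictionaries over the same nodes. The 20% cutoff, the function names, and the numbers are hypothetical values chosen for illustration, not values taken from the text above.

```python
import math


def top_set(scores, fraction):
    """Nodes in the top `fraction` by score (at least one); ties broken by name."""
    n_keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:n_keep])


def classify_nodes(degree, betweenness, fraction=0.2):
    """Label each node by hub status (degree) and bottleneck status (betweenness)."""
    hubs = top_set(degree, fraction)
    bottlenecks = top_set(betweenness, fraction)
    labels = {}
    for node in degree:
        hub_part = "hub" if node in hubs else "nonhub"
        bottleneck_part = "bottleneck" if node in bottlenecks else "nonbottleneck"
        labels[node] = hub_part + "-" + bottleneck_part
    return labels


# Hypothetical values for ten proteins (not real data).
degree = {"P1": 12, "P2": 11, "P3": 3, "P4": 2, "P5": 2,
          "P6": 1, "P7": 1, "P8": 1, "P9": 1, "P10": 1}
betweenness = {"P1": 40.0, "P2": 0.5, "P3": 35.0, "P4": 2.0, "P5": 1.0,
               "P6": 0.0, "P7": 0.0, "P8": 0.0, "P9": 0.0, "P10": 0.0}
labels = classify_nodes(degree, betweenness)
for node, label in labels.items():
    print(node, label)
```

Here `P2` has many partners but almost no shortest paths running through it, the pattern described above for members of a densely connected complex, while `P3` has few partners but sits on many paths.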
Furthermore, hub–bottlenecks are (by construction) the most efficient in disrupting the network upon hub removal. This relates nicely to the date/party-hub concept by Han et al.: hub–bottlenecks tend to be date-hubs, whereas hub–nonbottlenecks tend to be party-hubs.
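In addition, when hubs are removed, hub–bottlenecks are the nodes whose removal disrupts the network most effectively. This ties in closely with Han et al.'s hub concept: hub–bottlenecks tend to be date hubs, and hub–nonbottlenecks tend to be party hubs. (Han's paper makes this clear: date hubs are more likely to be the organizers and maintainers of the large-scale architecture, the big bosses.) Han's view was published in Nature and is summarized below.
The Nature paper by Han et al. mentioned above: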
https://www.nature.com/articles/nature02555
In apparently scale-free protein–protein interaction networks, or ‘interactome’ networks, most proteins interact with few partners, whereas a small but significant proportion of proteins, the ‘hubs’, interact with many partners.
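That is, in scale-free protein interaction networks (interactome networks), most proteins interact with only a few partners, and only a small fraction, the hubs, interact with many.
Nonhub–bottlenecks are usually less well coexpressed with their neighbors than nonhub–nonbottlenecks. This fits the observation that betweenness is an indicator of a protein's average correlation with its neighbors. Nonhub–bottlenecks are rarely members of complexes, and most of them are regulatory proteins and signal transduction machinery.
Whether biological or not, any scale-free network is robust to the random removal of nodes but very sensitive to the removal of hubs.
Roughly speaking, experiments in yeast show that deleting genes that encode hub proteins is about three times as likely to be lethal as deleting non-hub genes. Han et al. found two types of hubs: party hubs, which interact with most of their partners at the same time, and date hubs, which bind different partners at different times or locations.
Thus, hubs in the yeast interaction network can be divided into two classes, date hubs and party hubs, based on their partners' expression profiles. This distinction suggests a model of modular organization for the yeast proteome in which modules are connected through regulators, mediators, or adaptors: these are the date hubs. Party hubs are essential components inside individual modules. They are important for the functions those modules mediate (and therefore tend to be essential proteins), and they tend to work at a lower level of proteome organization. (Roughly, date hubs are the big bosses who handle communication and coordination, while party hubs are the small bosses inside each module.) Han et al. propose that date hubs are essential for the global organization of biological modules in the whole proteome network and take part in large-scale integrative connections (although some date hubs may simply be shared and regulate local functions within or across modules). Key features of such interaction networks, such as genetic robustness and resilience to the external environment, are easier to understand with this modular organization as a framework.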
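Since the date/party split is based on how a hub's expression profile relates to those of its partners, one simple way to sketch it is to average the Pearson correlation between the hub and each partner: a high average suggests a party hub, a low average a date hub. The code below is an illustration only. The threshold of 0.5, the function names, and the expression values are hypothetical, and a real analysis would use many more conditions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_partner_correlation(hub, partners, expression):
    """Mean Pearson correlation between a hub and each of its interaction partners."""
    values = [pearson(expression[hub], expression[p]) for p in partners]
    return sum(values) / len(values)


def hub_type(avg_pcc, threshold=0.5):
    """Label a hub as 'party' or 'date'; the threshold is a hypothetical cutoff."""
    return "party" if avg_pcc >= threshold else "date"


# Hypothetical expression profiles over four conditions (not real data).
expression = {
    "hubA": [1, 2, 3, 4], "p1": [2, 4, 6, 8], "p2": [1.5, 2.5, 3.5, 4.5],
    "hubB": [1, 2, 3, 4], "q1": [1, 2, 3, 4], "q2": [4, 3, 2, 1],
    "q3": [1, -1, -1, 1],
}
avg_a = average_partner_correlation("hubA", ["p1", "p2"], expression)
avg_b = average_partner_correlation("hubB", ["q1", "q2", "q3"], expression)
print("hubA", hub_type(avg_a))
print("hubB", hub_type(avg_b))
```

`hubA` rises and falls together with all of its partners, while `hubB` tracks one partner, opposes another, and is unrelated to the third, so its average correlation is near zero.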
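So the so-called date hubs are the hubs with high betweenness (hub–bottlenecks),
while party hubs are more likely to be hubs with low betweenness (hub–nonbottlenecks).
This finding may point to a link between the dynamic and topological properties of interaction networks, one that was previously unknown.
The authors believe that, although it is still hard to apply in places, betweenness will prove to be a very useful tool for many protein networks, especially those with directed edges (regulatory networks).
In summary, the authors provide an integrated analysis of two complementary topological network properties, which applies to different network types. This integrated approach reveals previously unknown connections among network topology, protein essentiality, and expression dynamics. They believe that integrated approaches like the one presented here will be crucial for future predictive models.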