Abstract
Graph embedding methods produce unsupervised node features from graphs that can then be used for a variety of machine learning tasks. Modern graphs, particularly in industrial applications, contain billions of nodes and trillions of edges, which exceeds the capability of existing embedding systems. We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to traditional multi-relation embedding systems that allow it to scale to graphs with billions of nodes and trillions of edges. PBG uses graph partitioning to train arbitrarily large embeddings on either a single machine or in a distributed environment. We demonstrate comparable performance with existing embedding systems on common benchmarks, while allowing for scaling to arbitrarily large graphs and parallelization on multiple machines. We train and evaluate embeddings on several large social network graphs as well as the full Freebase dataset, which contains over 100 million nodes and 2 billion edges.
1 Introduction
Graph structured data is a common input to a variety of machine learning tasks (Wu et al., 2017; Cook & Holder, 2006; Nickel et al., 2016a; Hamilton et al., 2017b). Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph. Graph embeddings have been shown to serve as useful features for downstream tasks such as recommender systems in e-commerce (Wang et al., 2018), link prediction in social media (Perozzi et al., 2014), and predicting drug interactions and characterizing protein-protein networks (Zitnik & Leskovec, 2017).
Graph data is common at modern web companies and poses an extra challenge to standard embedding methods: scale. For example, the Facebook graph includes over two billion user nodes and over a trillion edges representing friendships, likes, posts and other connections (Ching et al., 2015). The graph of users and products at Alibaba also consists of more than one billion users and two billion items (Wang et al., 2018). At Pinterest, the user to item graph includes over 2 billion entities and over 17 billion edges (Ying et al., 2018). There are two main challenges for embedding graphs of this size. First, an embedding system must be fast enough to embed graphs with 10^11 to 10^12 edges in a reasonable time. Second, a model with two billion nodes and 100 embedding parameters per node (expressed as floats) would require 800GB of memory just to store its parameters, so many standard methods exceed the memory capacity of typical commodity servers.
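The 800GB figure above can be checked with a quick back-of-the-envelope calculation (a minimal sketch; the node count, embedding dimension, and float size are taken directly from the text):

```python
# Memory needed just to store node embeddings as float32 parameters.
num_nodes = 2_000_000_000   # two billion nodes
dim = 100                   # embedding parameters per node
bytes_per_float = 4         # float32

total_bytes = num_nodes * dim * bytes_per_float
total_gb = total_bytes / 1e9  # decimal gigabytes
print(total_gb)  # 800.0
```

This counts only the parameter storage itself; optimizer state and gradients would push the requirement higher still.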
We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to standard models. The contribution of PBG is to scale to graphs with billions of nodes and trillions of edges.
Important components of PBG are:
A block decomposition of the adjacency matrix into N buckets, training on the edges from one bucket at a time. PBG then either swaps embeddings from each partition to disk to reduce memory usage, or performs distributed execution across multiple machines.
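As a rough sketch of the block decomposition (helper names are illustrative, not PBG's API; assuming nodes are hashed into P partitions, giving up to P x P buckets of edges, so training on one bucket only touches two partitions' embeddings at a time):

```python
import random

def partition(node_id, num_partitions):
    # Assign each node to a partition by a simple hash (illustrative only).
    return hash(node_id) % num_partitions

def bucketize(edges, num_partitions):
    # Group edges into (source partition, destination partition) buckets.
    # Training on one bucket requires only the two corresponding
    # partitions' embeddings to be held in memory.
    buckets = {}
    for src, dst in edges:
        key = (partition(src, num_partitions), partition(dst, num_partitions))
        buckets.setdefault(key, []).append((src, dst))
    return buckets

edges = [(random.randrange(1000), random.randrange(1000)) for _ in range(10_000)]
buckets = bucketize(edges, num_partitions=4)  # at most 4 x 4 = 16 buckets
```

Swapping the inactive partitions' embeddings to disk between buckets is what bounds peak memory regardless of the total graph size.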
A distributed execution model that leverages the block decomposition for the large parameter matrices, as well as a parameter server architecture for global parameters and feature embeddings for featurized nodes.
Efficient negative sampling for nodes that samples negative nodes both uniformly and from the data, and reuses negatives within a batch to reduce memory bandwidth.
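A minimal sketch of this negative sampling scheme, under these assumptions (function and parameter names are hypothetical, and the exact uniform/data split is illustrative): negatives are drawn partly uniformly over all nodes and partly from nodes appearing in the current batch, and one shared set of negatives is reused against every positive edge in the batch.

```python
import random

def sample_shared_negatives(batch_edges, num_nodes, num_negs):
    # Half the negatives are sampled uniformly over all nodes; the other
    # half are drawn from nodes observed in the batch, which approximates
    # sampling proportionally to the data distribution.
    batch_nodes = [n for edge in batch_edges for n in edge]
    uniform = [random.randrange(num_nodes) for _ in range(num_negs // 2)]
    from_data = [random.choice(batch_nodes) for _ in range(num_negs - num_negs // 2)]
    negatives = uniform + from_data
    # The same negative set is reused for every positive edge in the batch,
    # reducing the memory bandwidth spent fetching negative embeddings.
    return [(src, neg) for src, _ in batch_edges for neg in negatives]

batch = [(0, 1), (2, 3), (4, 5)]
corrupted = sample_shared_negatives(batch, num_nodes=1000, num_negs=10)
print(len(corrupted))  # 3 positives x 10 shared negatives = 30 pairs
```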
Support for multi-entity, multi-relation graphs with per-relation configuration options such as edge weight and choice of relation operator.
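To illustrate what a relation operator is, here is a sketch of one common choice, a per-relation translation vector, with edges scored by the negative squared distance between the transformed source and the destination (the function names are illustrative; other operator choices include per-relation diagonal scaling):

```python
def translate(src_vec, rel_vec):
    # Relation operator: translate the source embedding by a learned
    # per-relation vector.
    return [s + r for s, r in zip(src_vec, rel_vec)]

def score(src_vec, rel_vec, dst_vec):
    # Score an edge as the negative squared distance between the
    # transformed source embedding and the destination embedding;
    # higher scores mean the edge is more plausible.
    t = translate(src_vec, rel_vec)
    return -sum((a - b) ** 2 for a, b in zip(t, dst_vec))

# A destination that exactly matches the translated source scores highest.
print(score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]) == 0.0)  # True
```

Because the operator is configured per relation, different edge types in a multi-relation graph can transform the same node embedding differently.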
We evaluate PBG on the Freebase, LiveJournal and YouTube graphs and show that it matches the performance of existing embedding systems.
We also report results on larger graphs. We construct an embedding of the full Freebase knowledge graph (121 million entities, 2.4 billion edges), which we release publicly with this paper. Partitioning of the Freebase graph reduces memory consumption by 88% with only a small degradation in the embedding quality, and distributed execution on 8 machines decreases training time by a factor of 4. We also perform experiments on a large Twitter graph showing similar results with near-linear scaling.
PBG is released as an open-source project at https://github.com/facebookresearch/PyTorch-BigGraph. PBG is implemented in PyTorch with no external dependencies or custom operators.