Facebook: BigGraph Chinese Documentation - Loss Calculation (PyTorch)


Graph embedding is a method for generating unsupervised node features from a graph; the resulting features can then be used in a variety of machine learning tasks. Modern graphs, especially in industrial applications, often contain billions of nodes and trillions of edges, which is beyond the capability of existing embedding systems. Facebook has open-sourced an embedding system, PyTorch-BigGraph (PBG), which makes several modifications to traditional multi-relation embedding systems so that it can scale to graphs with billions of nodes and trillions of edges.

This series is a translation of the official PyTorch-BigGraph manual and is intended to help readers get started quickly with GNNs and their use. There are fifteen parts in total; if you spot any mistakes, please feel free to contact me.

(1) Facebook open-sources the graph neural network system PyTorch-BigGraph

(2) Facebook: BigGraph Chinese Documentation - Data Model (PyTorch)

(3) Facebook: BigGraph Chinese Documentation - From Entity Embeddings to Edge Scores (PyTorch)

(4) Facebook: BigGraph Chinese Documentation - I/O Format (PyTorch)

(5) Facebook: BigGraph Chinese Documentation - Batch Preparation

(6) Facebook: BigGraph Chinese Documentation - Distributed Mode (PyTorch)

(7) Facebook: BigGraph Chinese Documentation - Loss Calculation (PyTorch)


Loss calculation

The training process aims at finding the embeddings for the entities so that the scores of the positive edges are higher than the scores of the negative edges. When unpacking what this means, three different aspects come into play:

One must first determine which edges are to be considered as positive and negative samples.

Then, once the scores of all the samples have been determined, one must decide how to aggregate them in a single loss value.

Finally, one must decide how to go about optimizing that loss.

This chapter will dive into each of these issues.


Negative sampling

The edges provided in the input data are known to be positives but, as PBG operates under the open-world assumption, the edges that are not in the input are not necessarily negatives. However, as PBG is designed to perform on large sparse graphs, it relies on the approximation that any random edge is a negative with very high probability.

The goal of sampling negatives is to produce a set of negative edges for each positive edge of a batch. Usual downstream applications (ranking, classification, …) are interested in comparing the score of an edge (x,r,y1) with the score of an edge (x,r,y2). Therefore, PBG produces negative samples for a given positive edge by corrupting the entity on one of its sides, keeping the other side and the relation type intact. This makes the sampling more suited to the task.

For performance reasons, the set of entities used to corrupt the positive edges in order to produce the negative samples may be shared across several positive edges. The way this usually happens is that positive edges are split into “chunks”, a single set of entities is sampled for each chunk, and all edges in that chunk are corrupted using that set of entities.

PBG supports several ways of sampling negatives:

All negatives

The most straightforward way is to use, for each positive edge, all its possible negatives. What this means is that for a positive (x,r,y) (where x and y are the left- and right-hand side entities respectively and r is the relation type), its negatives will be (x′,r,y) for all x′ of the same entity type as x, and (x,r,y′) for all y′ of the same entity type as y. (Due to technical reasons this is in fact restricted to only the x′ in the same partition as x, and similarly for y′, as negative sampling always operates within the current bucket.)

As one can imagine, this method generates a lot of negatives and thus doesn’t scale to graphs of any significant size. It should not be used in practice, and is provided in PBG mainly for “academic” reasons. It is mainly useful to get more accurate results during evaluation on small graphs.

This method is activated on a per-relation basis, by turning on the all_negs config flag. When it’s enabled, this mode takes precedence and overrides any other mode.
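As a rough illustration, the flag could be switched on for a single relation along the following lines. This is only a sketch: the relation and entity-type names are made up for the example, and the surrounding fields follow the dict-style relation entries described in the data-model part of this series.

    # Hypothetical fragment of a PBG-style config; the names are invented for
    # illustration, only the all_negs flag is the point here.
    relations = [
        {
            "name": "follows",   # made-up relation name
            "lhs": "user",       # made-up entity types
            "rhs": "user",
            "operator": "none",
            "all_negs": True,    # use all possible negatives for this relation
        },
    ]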

Same-batch negatives

This negative sampling method produces negatives for a given positive edge of a batch by sampling from the other edges of the same batch. This is done by first splitting the batch into so-called chunks (beware that the name “chunks” is overloaded, and these chunks are different from the edge chunks explained in Batch preparation). Then the set of negatives for a positive edge (x,r,y) contains the edges (x′,r,y) for all entities x′ that are on the left-hand side of another edge in the chunk, and the edges (x,r,y′) with y′ satisfying the same condition for the right-hand side.

For a single positive edge, this means that the entities used to construct its negatives are sampled from the current partition proportionally to their degree, i.e., according to the data distribution. This helps counteract the effects of a very skewed distribution of degrees, which might cause the embeddings to just capture that distribution.

The size of the chunks is controlled by the global num_batch_negs parameter. To disable this sampling mode, set that parameter to zero.

Uniformly-sampled negatives

This last method is perhaps the most natural approximation of the “all negatives” method that scales to arbitrarily large graphs. Instead of using all the entities on either side to produce negative edges (thus having the number of negatives scale linearly with the size of the graph), a fixed given number of these entities is sampled uniformly with replacement. Thus the set of negatives remains of constant size no matter how large the graph is. As with the “all negatives” method, the sampling here is restricted to the entities that have the same type and that belong to the same partition as the entity of the positive edge.

This method interacts with the same-batch method, as all the edges in a chunk receive the same set of uniformly sampled negatives. This caveat means that the uniform negatives of two different positives are independent and uncorrelated only if they belong to different chunks.

This method is controlled by the num_uniform_negs parameter, which controls how many negatives are sampled for each chunk. If num_batch_negs is zero, the batches will be split into chunks of size num_uniform_negs.
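For illustration, the two parameters might be set together in the configuration roughly as follows. The values are arbitrary examples, not recommendations.

    # Hypothetical fragment of a PBG-style config; the values are arbitrary.
    num_batch_negs = 50     # same-batch negatives per positive; 0 disables this mode
    num_uniform_negs = 50   # uniformly sampled negatives shared by each chunk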


Loss functions

Once positive and negative samples have been determined and their scores have been computed by the model, the scores’ suitability for a certain application must be assessed, which is done by aggregating them into a single real value, the loss. What loss function is most appropriate depends on what operations the embeddings will be used for.

In all cases, the loss function’s input will be a series of scores for positive samples and, for each of them, a set of scores for corresponding negative samples. For simplicity, suppose all these sets are of the same size (if they are not, they can be padded with “negative infinity” values, as these are the “ideal” scores for negative edges and should thus induce no loss).

Ranking loss

The ranking loss compares each positive score with each of its corresponding negatives. For each such pair, no loss is introduced if the positive score is greater than the negative one by at least a given margin. Otherwise the incurred loss is the amount by which that inequality is violated. This is the hinge loss on the difference between the positive and negative score. Formally, for a margin m, a positive score s_i and a negative score t_{i,j}, the loss is max(0, m − s_i + t_{i,j}). The total loss is the sum of the losses over all pairs of positive and negative scores, i.e., over all i and j.
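As a sketch of the formula above (not PBG’s internal implementation), the ranking loss for a batch of positive scores and their corresponding negative scores could be computed in PyTorch as follows:

    import torch

    def ranking_loss(pos_scores, neg_scores, margin=0.1):
        """Hinge loss max(0, m - s_i + t_ij), summed over all pairs (i, j).

        pos_scores: shape (P,), one score per positive edge.
        neg_scores: shape (P, N), the scores of the N negatives of each positive.
        """
        violations = margin - pos_scores.unsqueeze(1) + neg_scores  # shape (P, N)
        return torch.clamp(violations, min=0).sum()

    # Example: 2 positives with 3 negatives each.
    pos = torch.tensor([1.2, 0.3])
    neg = torch.tensor([[0.5, 1.4, -0.2], [0.1, 0.0, 0.6]])
    print(ranking_loss(pos, neg, margin=0.1))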

This loss function is chosen by setting the loss_fn parameter to ranking, and the target margin is specified by the margin parameter.
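In the configuration this might look like the following fragment; the margin value is an arbitrary example.

    # Hypothetical config fragment for the ranking loss.
    loss_fn = "ranking"
    margin = 0.1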

This loss function is suitable when the setting requires ranking some entities by how likely they are to be related to another given entity.

Logistic loss

The logistic loss instead interprets the scores as the probabilities that the edges exist. It does so by first passing each score (whose domain is the entire real line) through the logistic function (x ↦ 1/(1 + e^(−x)), which maps it to a value between 0 and 1). This value is taken as the probability p and the loss will be its binary cross entropy with the “target” probability, i.e., 1 for positive edges and 0 for negative ones. In formulas, the loss for positives is −log p whereas for negatives it’s −log(1 − p). The total loss due to the negatives is renormalized so it compares with the one of the positives.
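A minimal sketch of this loss follows. The division of the negatives’ total loss by the number of negatives per positive is one plausible reading of the renormalization mentioned above, not necessarily PBG’s exact choice.

    import torch
    import torch.nn.functional as F

    def logistic_loss(pos_scores, neg_scores):
        """Binary cross entropy on sigmoid(score), with target 1 for positives
        and 0 for negatives. pos_scores: shape (P,); neg_scores: shape (P, N).
        Dividing the negatives' loss by N is an assumed renormalization."""
        pos_loss = F.binary_cross_entropy_with_logits(
            pos_scores, torch.ones_like(pos_scores), reduction="sum")
        neg_loss = F.binary_cross_entropy_with_logits(
            neg_scores, torch.zeros_like(neg_scores), reduction="sum")
        return pos_loss + neg_loss / neg_scores.shape[1]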

One can see this as the cross entropy between two distributions on the values “edge exists” and “edge doesn’t exist”. One is given by the score (passed through the logistic function), the other has all the mass on “exists” for positives or all the mass on “doesn’t exist” for negatives.

This loss function is parameterless and is enabled by setting loss_fn to logistic.

Softmax loss

The last loss function is designed for when one wants a distribution on the probabilities of some entities being related to a given entity (contrary to just wanting a ranking, as with the ranking loss). For a certain positive i, its score s_i and the scores t_{i,j} of all the corresponding negatives j are first converted to probabilities by performing a softmax: p_i ∝ e^(s_i) and q_{i,j} ∝ e^(t_{i,j}), normalized so that they sum up to 1. Then the loss is the cross entropy between this distribution and the “target” one, i.e., the one that puts all the mass on the positive sample. So, in full, the loss for a single i is −log p_i, i.e., −s_i + log Σ_j e^(t_{i,j}).
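A sketch of this loss, normalizing each positive together with its own negatives as in the description of p_i and q_{i,j} above:

    import torch
    import torch.nn.functional as F

    def softmax_loss(pos_scores, neg_scores):
        """Cross entropy against the target distribution that puts all mass on
        the positive. pos_scores: shape (P,); neg_scores: shape (P, N)."""
        scores = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)  # (P, N+1)
        targets = torch.zeros(scores.shape[0], dtype=torch.long)  # positive is column 0
        return F.cross_entropy(scores, targets, reduction="sum")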

This loss is activated by setting loss_fn to softmax.


Optimizers

The Adagrad optimization method is used to update all model parameters. Adagrad performs stochastic gradient descent with an adaptive learning rate for each parameter that is inversely proportional to the square root of the accumulated squared magnitudes of all its previous gradient updates. In practice, Adagrad updates lead to an order of magnitude faster convergence for typical PBG models.

The initial learning rate for Adagrad is specified by the lr config parameter. A separate learning rate can also be set for non-embeddings using the relation_lr parameter.
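For illustration, the two parameters might appear in the configuration as follows (the values are arbitrary examples):

    # Hypothetical config fragment; the values are arbitrary.
    lr = 0.1            # Adagrad learning rate for the embeddings
    relation_lr = 0.01  # separate learning rate for non-embedding (relation) parameters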

Standard Adagrad requires an equal amount of memory for optimizer state as the size of the model, which is prohibitive for the large models targeted by PBG. To reduce optimizer memory usage, a modified version of Adagrad is used that uses a common learning rate for each entity embedding. The learning rate is proportional to the inverse sum of the squared gradients from each element of the embedding, divided by the dimension. Non-embedding parameters (e.g. relation operator parameters) use standard Adagrad.
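The following is a minimal sketch of the row-wise idea described above, written for illustration only. It applies the standard square-root form of Adagrad with a single state value per embedding row, which may differ in detail from PBG’s actual optimizer.

    import torch

    def rowwise_adagrad_step(embeddings, grads, state, lr=0.1, eps=1e-10):
        """One update with a single adaptive learning rate per embedding row.

        embeddings, grads: shape (num_rows, dim).
        state: shape (num_rows,), accumulating each row's sum of squared
               gradients divided by the dimension."""
        dim = grads.shape[1]
        state += grads.pow(2).sum(dim=1) / dim      # one scalar per row
        row_lr = lr / (state.sqrt() + eps)          # shared by the whole row
        embeddings -= row_lr.unsqueeze(1) * grads
        return embeddings, state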

Adagrad parameters are updated asynchronously across worker threads with no explicit synchronization. Asynchronous updates to the Adagrad state (the total squared gradient) appear stable, likely because each element of the state tensor only accumulates positive updates. Optimization is further stabilized by performing a short period of training with a single thread before beginning Hogwild! training, which is tuned by the hogwild_delay parameter.

In distributed training, the Adagrad state for shared parameters (e.g. relation operator parameters) is shared via the parameter server using the same asynchronous gradient update as the parameters themselves. Similar to inter-thread synchronization, these asynchronous updates are stable after an initial burn-in period because the total squared gradient strictly accumulates positive values.
