KL Divergence (KL Distance)

"KL distance" is a common shorthand for the Kullback-Leibler divergence, also known as relative entropy. It measures how much two probability distributions defined over the same event space differ.

Written out, for two distributions P(x) and Q(x) over the same event space, the Kullback-Leibler divergence (relative entropy) is:

D_KL(P||Q) = \sum_x P(x) log( P(x) / Q(x) )

Intuitively, the KL divergence expresses how different the two probability distributions P(x) and Q(x) are over the same event space.
In physical terms: for events drawn from the distribution P(x), it is the average number of extra bits needed to encode each basic event (symbol) if the code is designed for the distribution Q(x) instead of P(x).
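As a concrete illustration of this "extra bits" reading, here is a minimal Python sketch (the distributions P and Q are made-up examples, not from the original text) that computes the KL divergence in bits and checks that it equals the cross-entropy H(P, Q) minus the entropy H(P):

```python
import numpy as np

# Made-up example distributions over the same three-symbol event space.
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])

def entropy_bits(p):
    """H(P): average bits per symbol with a code optimized for P."""
    return float(-np.sum(p * np.log2(p)))

def cross_entropy_bits(p, q):
    """H(P, Q): average bits per symbol when symbols follow P but the code is built for Q."""
    return float(-np.sum(p * np.log2(q)))

def kl_bits(p, q):
    """D_KL(P||Q) in bits: the average number of *extra* bits per symbol."""
    return float(np.sum(p * np.log2(p / q)))

print(kl_bits(P, Q))                               # extra bits per symbol
print(cross_entropy_bits(P, Q) - entropy_bits(P))  # same value
```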


Information-theoretic interpretation

Expanding the formula above,

D_KL(P||Q) = \sum_x P(x) log P(x) - \sum_x P(x) log Q(x) = -H(P) + H(P, Q)

the first term is the negative of the entropy under the distribution P(x), and entropy is exactly the average number of bits needed to encode each event under that distribution; the second term is the average code length when the code is instead built for Q(x). With this expansion, the coding interpretation of the physical meaning above is easy to see.
However, the KL divergence is not a distance in the traditional sense. A traditional distance (metric) must satisfy three conditions: 1) non-negativity; 2) symmetry; 3) the triangle inequality. The KL divergence satisfies non-negativity, but it is neither symmetric nor does it obey the triangle inequality, so it is not a metric. Counterexamples can be found in the references.
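A quick numerical check of the missing symmetry, reusing the same kind of made-up discrete distributions (a sketch, not part of the original article):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])

def kl_bits(p, q):
    """D_KL(p||q) in bits for discrete distributions with full support."""
    return float(np.sum(p * np.log2(p / q)))

# The two directions give different values, so the KL divergence is not
# symmetric and therefore cannot be a metric.
print(kl_bits(P, Q))   # KL(P||Q)
print(kl_bits(Q, P))   # KL(Q||P), a different number
```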

+++++++++++++++++++++++++++++++++++++++++++++++++++
Author: 肖天睿. Link: https://www.zhihu.com/question/29980971/answer/93489660. Source: Zhihu. Copyright belongs to the author; please contact the author for permission to repost.

Interesting question, KL divergence is something I'm working with right now.

KL divergence KL(p||q), in the context of information theory, measures the amount of extra bits (nats) that is necessary to describe samples from the distribution p with coding based on q instead of p itself. From the Kraft-McMillan theorem, we know that the coding scheme for one value out of a set X can be represented as q(x_i) = 2^(-l_i), where l_i is the length of the code for x_i in bits.

We know that KL divergence is also the relative entropy between two distributions, and that gives some intuition as to why it's used in variational methods. Variational methods use functionals as measures in the objective function (e.g. the entropy of a distribution takes in a distribution and returns a scalar quantity). KL divergence is interpreted as the "loss of information" when using one distribution to approximate another, and is desirable in machine learning because in models where dimensionality reduction is used, we would like to preserve as much information of the original input as possible. This is most obvious when looking at VAEs, which use the KL divergence between the posterior q and the prior p over the latent variable z. Likewise, you can refer to EM, where we decompose

ln p(X) = L(q) + KL(q||p)

Here we maximize the lower bound L(q) by minimizing the KL divergence, which becomes 0 when p(Z|X) = q(Z). However, in many cases we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so we can optimize w.r.t. w.

Note that KL(p||q) = - \sum p(Z) ln (q(Z) / p(Z)), so KL(p||q) is different from KL(q||p). This asymmetry, however, can be exploited: in cases where we wish to learn the parameters of a distribution q that over-compensates for p, we can minimize KL(p||q). Conversely, when we wish to capture just the main components of p with the distribution q, we can minimize KL(q||p). The example from the Bishop book illustrates this well.
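To make the over-compensating vs. mode-seeking behaviour concrete, here is a hedged sketch (my own illustrative example, not from the quoted answer; the bimodal target p and the single-Gaussian family q are assumptions) that fits q by numerically minimizing each direction of the KL divergence:

```python
# Sketch: fit a single Gaussian q to a bimodal target p by minimizing
# forward KL(p||q) vs reverse KL(q||p) on a discretized grid.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)           # integration grid
dx = x[1] - x[0]

# Bimodal target p: mixture of two well-separated Gaussians.
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)
p /= p.sum() * dx                         # renormalize on the grid

def kl(a, b):
    """Numerical KL(a||b) on the grid, in nats."""
    mask = a > 1e-12
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx)

def fit(direction):
    """Fit q = N(mu, sigma^2) by minimizing the chosen KL direction."""
    def objective(theta):
        mu, log_sigma = theta
        q = np.maximum(norm.pdf(x, mu, np.exp(log_sigma)), 1e-300)
        return kl(p, q) if direction == "forward" else kl(q, p)
    res = minimize(objective, x0=[0.5, 0.0], method="Nelder-Mead")
    return res.x[0], float(np.exp(res.x[1]))

print("forward KL(p||q):", fit("forward"))  # broad q that covers both modes
print("reverse KL(q||p):", fit("reverse"))  # narrow q that locks onto one mode
```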


KL divergence belongs to an alpha family of divergences, where the parameter alpha takes on separate limits for the forward and backward KL. When alpha = 0, the divergence becomes symmetric and is linearly related to the Hellinger distance. There are other symmetric measures such as the Cauchy-Schwarz divergence, but in machine learning settings where the goal is to learn simpler, tractable parameterizations of distributions that approximate a target, they might not be as useful as KL.
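For reference, one common parameterization of this alpha family (the form I recall from Bishop's PRML; treat the exact constants as an assumption) is

D_alpha(p||q) = (4 / (1 - alpha^2)) * ( 1 - \int p(x)^{(1+alpha)/2} q(x)^{(1-alpha)/2} dx )

where KL(p||q) is recovered in the limit alpha -> 1 and KL(q||p) in the limit alpha -> -1. At alpha = 0 the divergence is symmetric in p and q and proportional to \int ( \sqrt{p(x)} - \sqrt{q(x)} )^2 dx, i.e. the squared Hellinger distance.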
