Improved Training of Wasserstein GANs (Translation, Part 1)

Improved Training of Wasserstein GANs (Translation, Part 2)

Code

Improved Training of Wasserstein GANs

Paper: http://arxiv.org/pdf/1704.00028v3.pdf

Abstract

Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of the gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.

1 Introduction

Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator's output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.

In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) must lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.

Our contributions are as follows:

1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.

2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems (a minimal code sketch of the idea follows this list).

3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.
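To make the idea in contribution 2 concrete, here is a minimal PyTorch sketch of a gradient-norm penalty for a WGAN critic. It only illustrates the idea named above (penalizing the norm of the critic's gradient with respect to its input); the choice of evaluating the penalty at random interpolates between real and generated samples, the target norm of 1 (motivated by Corollary 1 below), and the coefficient `lambda_gp` are illustrative assumptions, not details given in this section.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Penalize the norm of the critic's gradient w.r.t. its input (sketch)."""
    # Random points between real and generated samples (an assumed choice
    # of where to evaluate the penalty).
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interpolates)
    # Gradient of the critic's output with respect to its input.
    grads = torch.autograd.grad(
        outputs=scores, inputs=interpolates,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Push the gradient norm towards 1 (Corollary 1 motivates this target).
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

Under these assumptions, the critic loss would then be something like `critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)`, minimized with respect to the critic's parameters.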

*Now at Google Brain

†Code for our models is available at https://github.com/igul222/improved_wgan_training.

2 Background

2.1 Generative adversarial networks

The GAN training strategy is to de?ne a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.

Formally, the game between the generator G and the discriminator D is the minimax objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))]$$

where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution $p$, such as the uniform distribution or a spherical Gaussian distribution).
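As an illustration only, the following is a minimal PyTorch sketch of how this minimax objective is typically evaluated for one batch. The network definitions, the noise dimension `z_dim`, and the assumption that D outputs a probability in (0, 1) are choices made for the example.

```python
import torch

def gan_losses(D, G, real, z_dim=128):
    """Discriminator and generator losses for one batch (illustrative sketch)."""
    z = torch.randn(real.size(0), z_dim)   # z sampled from a spherical Gaussian
    fake = G(z)                            # x~ = G(z) implicitly defines P_g
    # Discriminator maximizes E[log D(x)] + E[log(1 - D(x~))].
    d_loss = -(torch.log(D(real)).mean() + torch.log(1 - D(fake.detach())).mean())
    # Minimax generator loss minimizes E[log(1 - D(x~))]; in practice [9]
    # advocates maximizing E[log D(x~)] instead, as discussed below.
    g_loss = -torch.log(D(fake)).mean()
    return d_loss, g_loss
```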

If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between $\mathbb{P}_r$ and $\mathbb{P}_g$ [9], but doing so often leads to vanishing gradients as the discriminator saturates. In practice, [9] advocates that the generator be instead trained to maximize $\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log D(\tilde{x})]$, which goes some way to circumvent this difficulty. However, even this modified loss function can misbehave in the presence of a good discriminator [1].

2.2 Wasserstein GANs

[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator's parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance $W(q, p)$, which is informally defined as the minimum cost of transporting mass in order to transform the distribution $q$ into the distribution $p$ (where the cost is mass times transport distance). Under mild assumptions, $W(q, p)$ is continuous everywhere and differentiable almost everywhere.

The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions and $\mathbb{P}_g$ is once again the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$. In that case, under an optimal discriminator (called a critic in the paper, since it's not trained to classify), minimizing the value function with respect to the generator parameters minimizes $W(\mathbb{P}_r, \mathbb{P}_g)$.

The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].

To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space $[-c, c]$. The set of functions satisfying this constraint is a subset of the $k$-Lipschitz functions for some $k$ which depends on $c$ and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.
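For comparison with the gradient-penalty sketch earlier, here is a minimal sketch of one WGAN critic update with the weight clipping described above. The optimizer handling, the `detach()` on generated samples, and the default clipping threshold `c` are illustrative assumptions rather than prescriptions from the text.

```python
import torch

def critic_step(critic, critic_opt, real, fake, c=0.01):
    """One WGAN critic update with weight clipping (illustrative sketch)."""
    critic_opt.zero_grad()
    # The critic maximizes E[D(x)] - E[D(x~)], so we minimize the negation.
    loss = critic(fake.detach()).mean() - critic(real).mean()
    loss.backward()
    critic_opt.step()
    # Enforce the weight constraint by clipping every weight into [-c, c].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
    return loss.item()
```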

2.3 Properties of the optimal WGAN critic

In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.

Proposition 1. Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions in $\mathcal{X}$, a compact metric space. Then, there is a 1-Lipschitz function $f^*$ which is the optimal solution of $\max_{\|f\|_L \le 1} \mathbb{E}_{y \sim \mathbb{P}_r}[f(y)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)]$. Let $\pi$ be the optimal coupling between $\mathbb{P}_r$ and $\mathbb{P}_g$, defined as the minimizer of

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\pi \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \pi}\big[\|x - y\|\big]$$

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ is the set of joint distributions $\pi(x, y)$ whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$, respectively. Then, if $f^*$ is differentiable†, $\pi(x = y) = 0$§, and $x_t = t x + (1 - t) y$ with $0 \le t \le 1$, it holds that

$$\mathbb{P}_{(x, y) \sim \pi}\!\left[\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|}\right] = 1.$$

Corollary 1. $f^*$ has gradient norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$.

3 Difficulties with weight constraints

We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.

Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.

To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However even with batch normalization, we observe that very deep WGAN critics often fail to converge.

[Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping. (a) Value surfaces of WGAN critics trained to optimality on toy datasets (8 Gaussians, 25 Gaussians, Swiss Roll) using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution; the 'generator' is held fixed at the real data plus Gaussian noise. (b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).]

3.1 Capacity underuse

Implementing a $k$-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm $k$ end up learning extremely simple functions.

To demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions, holding the generator distribution $\mathbb{P}_g$ fixed at the real distribution plus unit-variance Gaussian noise. We plot value surfaces of the critics in Figure 1a. We omit batch normalization in the critic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.

†We can actually assume much less, and talk only about directional derivatives on the direction of the line; which we show in the proof always exist. This would imply that in every point where $f^*$ is differentiable (and thus we can take gradients in a neural network setting) the statement holds.

§This assumption is in order to exclude the case when the matching point of sample $x$ is $x$ itself. It is satisfied in the case that $\mathbb{P}_r$ and $\mathbb{P}_g$ have supports that intersect in a set of measure 0, such as when they are supported by two low dimensional manifolds that don't perfectly align [1].

3.2 Exploding and vanishing gradients

We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold $c$.

To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold $c$ in $[10^{-1}, 10^{-2}, 10^{-3}]$, and plot the norm of the gradient of the critic loss with respect to successive layers of activations. Both generator and critic are 12-layer ReLU MLPs without batch normalization. Figure 1b shows that for each of these values, the gradient either grows or decays exponentially as we move farther back in the network. We find our method results in more stable gradients that neither vanish nor explode, allowing training of more complicated networks.
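A rough sketch of how such a per-layer gradient-norm measurement might look is given below. The paper does not specify its implementation here; the assumption that the critic is an `nn.Sequential` of layers and the bookkeeping via `retain_grad` are choices made purely for illustration.

```python
import torch
import torch.nn as nn

def layer_gradient_norms(critic: nn.Sequential, real, fake):
    """Norm of the critic-loss gradient w.r.t. each layer's activations (sketch)."""
    activations = []
    h = torch.cat([real, fake], dim=0).requires_grad_(True)
    # Forward pass, retaining the gradient of every intermediate activation.
    for layer in critic:
        h = layer(h)
        h.retain_grad()
        activations.append(h)
    n = real.size(0)
    loss = h[n:].mean() - h[:n].mean()   # WGAN critic loss on this batch
    loss.backward()
    # One gradient norm per layer, from input side to output side.
    return [a.grad.norm().item() for a in activations]
```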

Article source: http://tongtianta.site/paper/3418
Edited by Lornatang
Proofread by Lornatang
