Improved Training of Wasserstein GANs
Paper: http://arxiv.org/pdf/1704.00028v3.pdf
Abstract
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of the gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
1 Introduction
Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator's output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.
In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) must lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.
Our contributions are as follows:
1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.
2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems (a minimal sketch of the penalty term appears after this list).
3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.
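To make contribution 2 concrete, the following is a minimal PyTorch-style sketch of a gradient-penalty term of the kind described in the abstract: the norm of the critic's gradient with respect to its input is penalized at points interpolated between real and generated samples. The function name `gradient_penalty`, the coefficient `lambda_gp`, and the assumption of image-shaped (N, C, H, W) tensors are illustrative choices, not the authors' reference implementation (which is linked in the footnote below).

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Sketch: penalize (||grad_xhat D(xhat)||_2 - 1)^2 at random interpolates
    xhat between real and generated samples (assumed image tensors)."""
    batch_size = real.size(0)
    # per-sample interpolation weight; shape (N, 1, 1, 1) broadcasts over (N, C, H, W)
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    x_hat = eps * real + (1.0 - eps) * fake.detach()
    x_hat.requires_grad_(True)
    d_hat = critic(x_hat)
    # gradient of the critic output with respect to its input
    grads, = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```

In a full WGAN-GP critic update, a term like this would be added to the usual Wasserstein critic loss E[D(fake)] − E[D(real)].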
* Now at Google Brain
† Code for our models is available at https://github.com/igul222/improved_wgan_training.
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.
Formally, the game between the generator G and the discriminator D is the minimax objective:

min_G max_D E_{x∼P_r}[log D(x)] + E_{x̃∼P_g}[log(1 − D(x̃))]

where P_r is the data distribution and P_g is the model distribution implicitly defined by x̃ = G(z) (the input z to the generator is sampled from some simple noise distribution p, such as the uniform distribution or a spherical Gaussian distribution).
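For later contrast with the WGAN objectives, here is a minimal PyTorch-style sketch of losses derived from this minimax game. It uses the common non-saturating generator loss (maximize log D(G(z))) rather than the literal minimax form; the helper name `gan_losses` and the assumption that D returns raw logits are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, real, z):
    """Standard GAN losses (non-saturating generator variant).
    D maps inputs to raw logits; G maps noise z to the input space."""
    fake = G(z)
    real_logits = D(real)
    fake_logits = D(fake.detach())  # detach: this term only updates the discriminator
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) + \
             F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    # generator maximizes log D(G(z)), i.e. is trained to fool the discriminator
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones_like(fake_logits))
    return d_loss, g_loss
```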
If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between P_r and P_g.

2.2 Wasserstein GANs
[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator's parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance

W(P_r, P_g) = inf_{γ∈Π(P_r,P_g)} E_{(x,y)∼γ}[||x − y||]

where Π(P_r, P_g) denotes the set of joint distributions whose marginals are P_r and P_g. The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain

min_G max_{||D||_L ≤ 1} E_{x∼P_r}[D(x)] − E_{x̃∼P_g}[D(x̃)]

where the maximum is taken over the set of 1-Lipschitz functions D.
The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].
To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space [−c, c]. The set of functions satisfying this constraint is a subset of the k-Lipschitz functions for some k which depends on c and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.
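For contrast with the gradient-penalty sketch above, a weight-clipped WGAN critic update might look like the following sketch. The clipping constant c = 0.01 is the default reported in [2]; the function name and optimizer handling are illustrative assumptions, not the authors' code.

```python
import torch

def critic_step(critic, critic_opt, real, fake, c=0.01):
    """One WGAN critic update with weight clipping: maximize E[D(real)] - E[D(fake)]
    by minimizing its negative, then clamp every critic weight to [-c, c]."""
    critic_opt.zero_grad()
    loss = critic(fake.detach()).mean() - critic(real).mean()
    loss.backward()
    critic_opt.step()
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)  # the hard weight clipping this section discusses
    return loss.item()
```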
2.3 Properties of the optimal WGAN critic
In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.
Proposition 1. Let P_r and P_g be two distributions in X, a compact metric space. Then, there is a 1-Lipschitz function f* which is the optimal solution of max_{||f||_L ≤ 1} E_{y∼P_r}[f(y)] − E_{x∼P_g}[f(x)]. Let π be the optimal coupling between P_r and P_g, defined as the minimizer of W(P_r, P_g). Then, if f* is differentiable, π(x = y) = 0§, and x_t = t x + (1 − t) y with 0 ≤ t ≤ 1, it holds that P_{(x,y)∼π}[∇f*(x_t) = (y − x_t) / ||y − x_t||] = 1.

Corollary 1. f* has gradient norm 1 almost everywhere under P_r and P_g.

§ This assumption is in order to exclude the case when the matching point of sample x is x itself. It is satisfied in the case that P_r and P_g have supports which intersect in a set of measure zero, such as when they are supported by two low-dimensional manifolds that don't perfectly align [1].

3 Difficulties with weight constraints
We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however, we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.
Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay), and found that they exhibit similar problems.
To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However even with batch normalization, we observe that very deep WGAN critics often fail to converge.
Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping. (a) Value surfaces of WGAN critics trained to optimality on toy datasets (8 Gaussians, 25 Gaussians, Swiss Roll) using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The 'generator' is held fixed at the real data plus Gaussian noise. (b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
3.1 Capacity underuse
Implementing a k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under P_r and P_g; under a weight-clipping constraint, we observe that critic networks which try to attain their maximum gradient norm k end up learning extremely simple functions.

To demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions, holding the generator distribution fixed at the real distribution plus Gaussian noise, and plot the resulting value surfaces (Figure 1a) for each critic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.
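As an aside on these toy setups, below is a sketch of how one of the toy distributions pictured in Figure 1a (the 8 Gaussians arranged on a ring) might be sampled; the ring radius and noise scale are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_8gaussians(batch_size, radius=2.0, std=0.02):
    """Sample 2-D points from a mixture of 8 Gaussians equally spaced on a circle."""
    angles = np.random.randint(0, 8, size=batch_size) * (2 * np.pi / 8)
    centers = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    return centers + std * np.random.randn(batch_size, 2)
```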
3.2 Exploding and vanishing gradients
We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold c.

To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold c in [10^-1, 10^-2, 10^-3], and plot the norm of the gradient of the critic loss with respect to successive layers of activations (Figure 1b, left).
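A rough sketch of how such a per-layer gradient-norm diagnostic could be collected for a feed-forward critic is shown below; the helper name `layer_gradient_norms` and the assumption of an `nn.Sequential` critic are illustrative, not the paper's instrumentation.

```python
import torch
import torch.nn as nn

def layer_gradient_norms(critic: nn.Sequential, x: torch.Tensor):
    """Run a forward/backward pass through a sequential critic and return the
    L2 norm of the gradient of the critic output with respect to each layer's
    activations, to check whether gradients explode or vanish with depth."""
    activations = []
    h = x
    for layer in critic:
        h = layer(h)
        h.retain_grad()        # keep gradients on intermediate (non-leaf) activations
        activations.append(h)
    h.sum().backward()         # scalar critic output summed over the batch
    return [a.grad.norm().item() for a in activations]
```

Plotting these norms layer by layer over the course of training would give curves of the kind shown in Figure 1b (left).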
Edited by: Lornatang
Proofread by: Lornatang