Improved Training of Wasserstein GANs
4 Gradient penalty
We now propose an alternative way to enforce the Lipschitz constraint. A differentiable function is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly constraining the gradient norm of the critic's output with respect to its input. To circumvent tractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm for random samples. Our new objective is

L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big],

where \mathbb{P}_{\hat{x}} is defined by sampling uniformly along straight lines between pairs of points sampled from the data distribution \mathbb{P}_r and the generator distribution \mathbb{P}_g.
Penalty coefficient All experiments in this paper use \lambda = 10, which we found to work well across a variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.
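For concreteness, here is a minimal sketch of the penalty term in PyTorch; the helper name `gradient_penalty`, the critic interface, and the assumption of NCHW image batches are illustrative choices, not the authors' reference implementation.

```python
import torch

LAMBDA = 10  # penalty coefficient used for all experiments in the paper


def gradient_penalty(critic, real, fake):
    """Two-sided penalty E[(||grad_xhat D(xhat)||_2 - 1)^2] on interpolated points."""
    batch_size = real.size(0)
    # Sample x_hat uniformly along straight lines between real and generated points.
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)  # assumes NCHW batches
    x_hat = (eps * real + (1.0 - eps) * fake).detach()
    x_hat.requires_grad_(True)

    d_hat = critic(x_hat)
    grads = torch.autograd.grad(
        outputs=d_hat,
        inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True,  # keep the graph so the penalty is itself differentiable
    )[0]

    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    # A one-sided variant would use torch.relu(grad_norm - 1) ** 2 instead.
    return ((grad_norm - 1.0) ** 2).mean()


# Critic loss to minimize: E[D(fake)] - E[D(real)] + LAMBDA * gradient_penalty(critic, real, fake)
```

The `create_graph=True` flag keeps the gradient computation in the autograd graph so the penalty itself can be backpropagated through when updating the critic.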
No critic batch normalization Most prior GAN implementations [22, 23, 2] use batch normalization in both the generator and the discriminator to help stabilize training, but batch normalization changes the form of the discriminator's problem from mapping a single input to a single output to mapping from an entire batch of inputs to a batch of outputs [23]. Our penalized training objective is no longer valid in this setting, since we penalize the norm of the critic's gradient with respect to each input independently, and not the entire batch. To resolve this, we simply omit batch normalization in the critic in our models, finding that they perform well without it. Our method works with normalization schemes which don't introduce correlations between examples. In particular, we recommend layer normalization [3] as a drop-in replacement for batch normalization.
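As a sketch of that drop-in replacement (the channel counts, kernel size, and LeakyReLU slope are illustrative assumptions, not the paper's exact critic), a convolutional critic block might look like:

```python
import torch.nn as nn


def critic_block(in_ch, out_ch, out_hw):
    """One downsampling critic block; `out_hw` is the spatial size of its output."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
        # LayerNorm over (C, H, W) normalizes each example on its own, so no
        # statistics are shared across the batch, unlike batch normalization.
        nn.LayerNorm([out_ch, out_hw, out_hw]),
        nn.LeakyReLU(0.2),
    )
```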
Two-sided penalty We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost everywhere under \mathbb{P}_r and \mathbb{P}_g.

5 Experiments
5.1 Training random architectures within a set
We experimentally demonstrate our model's ability to train a large number of architectures which we think are useful to be able to train. Starting from the DCGAN architecture, we define a set of architecture variants by changing model settings to random corresponding values in Table 1. We believe that reliable training of many of the architectures in this set is a useful goal, but we do not claim that our set is an unbiased or representative sample of the whole space of useful architectures: it is designed to demonstrate a successful regime of our method, and readers should evaluate whether it contains architectures similar to their intended application.
Table 1: We evaluate WGAN-GP’s ability to train the architectures in this set.
Table 2: Outcomes of training 200 random architectures, for different success thresholds. For comparison, our standard DCGAN scored 7.24.
5.2 Training varied architectures on LSUN bedrooms

To demonstrate our model's ability to train many architectures with its default settings, we train six different GAN architectures on the LSUN bedrooms dataset [31]. In addition to the baseline DCGAN architecture from [22], we choose six architectures whose successful training we demonstrate: (1) no BN and a constant number of filters in the generator, as in [2], (2) 4-layer 512-dim ReLU MLP generator, as in [2], (3) no normalization in either the discriminator or generator, (4) gated multiplicative nonlinearities, as in [24], (5) tanh nonlinearities, and (6) 101-layer ResNet generator and discriminator.
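As one illustration from this list, architecture (2), the 4-layer 512-dim ReLU MLP generator, can be sketched as follows; the latent size, 64x64 RGB output shape, and tanh output layer are assumptions made for the sketch rather than details stated here.

```python
import torch.nn as nn


class MLPGenerator(nn.Module):
    """Sketch of architecture (2): a 4-layer, 512-unit ReLU MLP generator."""

    def __init__(self, z_dim=128, out_shape=(3, 64, 64)):
        super().__init__()
        self.out_shape = out_shape
        out_dim = out_shape[0] * out_shape[1] * out_shape[2]
        self.net = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),  # tanh output is an assumption
        )

    def forward(self, z):
        return self.net(z).view(z.size(0), *self.out_shape)
```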
Figure 2: Different GAN architectures trained with different methods. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP.
Although we do not claim it is impossible without our method, to the best of our knowledge this is the first time very deep residual networks were successfully trained in a GAN setting. For each architecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight clipping, DCGAN [22], and Least-Squares GAN [18]. For each objective, we used the default set of optimizer hyperparameters recommended in that work (except LSGAN, where we searched over learning rates).
For WGAN-GP, we replace any batch normalization in the discriminator with layer normalization (see section 4). We train each model for 200K iterations and present samples in Figure 2. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP. For every other training method, some of these architectures were unstable or suffered from mode collapse.
5.3 Improved performance over weight clipping
One advantage of our method over weight clipping is improved training speed and sample quality. To demonstrate this, we train WGANs with weight clipping and our gradient penalty on CIFAR-10 [13] and plot Inception scores [23] over the course of training in Figure 3. For WGAN-GP, we train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight clipping, and another model with Adam and a higher learning rate. Even with the same optimizer, our method converges faster and to a better score than weight clipping. Using Adam further improves performance. We also plot the performance of DCGAN [22] and find that our method converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.
Figure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for four models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for the optimizer), and DCGAN. WGAN-GP signi?cantly outperforms weight clipping and performs comparably to DCGAN.
5.4 Sample quality on CIFAR-10 and LSUN bedrooms
For equivalent architectures, our method achieves comparable sample quality to the standard GAN objective. However the increased stability allows us to improve sample quality by exploring a wider range of architectures. To demonstrate this, we find an architecture which establishes a new state-of-the-art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using the method in [20]), the same architecture outperforms all other published models except for SGAN.
Table 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance, and our conditional model outperforms all others except SGAN.
We also train a deep ResNet on LSUN bedrooms and show samples in Figure 4. We believe these samples are at least competitive with the best reported so far on any resolution for this dataset.

Figure 4: Samples of LSUN bedrooms. We believe these samples are at least comparable to the best published results so far.
5.5 Modeling discrete data with a continuous generator
To demonstrate our method's ability to model degenerate distributions, we consider the problem of modeling a complex discrete distribution with a GAN whose generator is defined over a continuous space. As an instance of this problem, we train a character-level GAN language model on the Google Billion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a latent vector into a sequence of 32 one-hot character vectors through 1D convolutions. We apply a softmax nonlinearity at the output, but use no sampling step: during training, the softmax output is passed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we just take the argmax of each output vector.
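A minimal sketch of this setup follows; the latent size, hidden width, number of convolutions, and vocabulary size are illustrative assumptions, and the actual architecture may differ.

```python
import torch
import torch.nn as nn

SEQ_LEN = 32   # output length in characters (as in the text)
VOCAB = 96     # vocabulary size is an assumption


class CharGenerator(nn.Module):
    """Deterministically maps a latent vector to SEQ_LEN softmax character vectors."""

    def __init__(self, z_dim=128, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.fc = nn.Linear(z_dim, hidden * SEQ_LEN)
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, VOCAB, kernel_size=5, padding=2),
        )

    def forward(self, z):
        h = self.fc(z).view(z.size(0), self.hidden, SEQ_LEN)  # (B, hidden, 32)
        logits = self.convs(h)                                 # (B, VOCAB, 32)
        # Softmax over the character dimension; during training this "soft"
        # one-hot output is fed straight to the critic -- no sampling step.
        return torch.softmax(logits, dim=1)


def decode(soft_onehot, alphabet):
    """At sampling time, just take the argmax character at each position."""
    idx = soft_onehot.argmax(dim=1)  # (B, SEQ_LEN)
    return ["".join(alphabet[i] for i in row.tolist()) for row in idx]
```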
We present samples from the model in Table 4. Our model makes frequent spelling errors (likely because it has to output each character independently) but nonetheless manages to learn quite a lot about the statistics of language. We were unable to produce comparable results with the standard GAN objective, though we do not claim that doing so is impossible.
Table 4: Samples from a WGAN-GP character-level language model trained on sentences from the Billion Word dataset, truncated to 32 characters. The model learns to directly output one-hot character embeddings from a latent vector without any discrete sampling step. We were unable to achieve comparable results with the standard GAN objective and a continuous generator.
Figure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum as the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of MNIST show overfitting when using either our method (left) or weight clipping (right). In particular, with our method, the critic overfits faster than the generator, causing the training loss to increase gradually over time even as the validation loss drops.
Other attempts at language modeling with GANs [32, 14, 30, 5, 15, 10] typically use discrete models and gradient estimators [28, 12, 17]. Our approach is simpler to implement, though whether it scales beyond a toy language model is unclear.
5.6 Meaningful loss curves and detecting overfitting
An important bene?t of weight-clipped WGANs is that their loss correlates with sample quality and converges toward a minimum. To show that our method preserves this property, we train a WGAN-GP on the LSUN bedrooms dataset [31] and plot the negative of the critic’s loss in Figure 5a. We see that the loss converges as the generator minimizes.
Given enough capacity and too little training data, GANs will overfit. To explore the loss curve's behavior when the network overfits, we train large unregularized WGANs on a random 1000-image subset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b. In both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits and provides an inaccurate estimate of W(\mathbb{P}_r, \mathbb{P}_g), at which point all bets are off regarding correlation with sample quality. However, in WGAN-GP, the training loss gradually increases even while the validation loss drops.
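A small sketch of this diagnostic, assuming a PyTorch critic and generator and a held-out validation batch; the function name is hypothetical, and whether to include the penalty term in the tracked value is a choice left open here.

```python
import torch


@torch.no_grad()
def negative_critic_loss(critic, generator, real_batch, z_dim=128):
    """E[D(real)] - E[D(fake)] on one batch (penalty term omitted here).

    Logging this quantity on training data and on a held-out set exposes
    critic overfitting: the two curves diverge once the critic starts to
    memorize the training examples.
    """
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)
    fake = generator(z)
    return (critic(real_batch).mean() - critic(fake).mean()).item()
```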
[29] also measure overfitting in GANs by estimating the generator's log-likelihood. Compared to that work, our method detects overfitting in the critic (rather than the generator) and measures overfitting against the same loss that the network minimizes.
6 Conclusion
In this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative in the form of a penalty term in the critic loss which does not exhibit the same problems. Using our method, we demonstrated strong modeling performance and stability across a variety of architectures. Now that we have a more stable algorithm for training GANs, we hope our work opens the path for stronger modeling performance on large-scale image datasets and language. Another interesting direction is adapting our penalty term to the standard GAN objective function, where it might stabilize training by encouraging the discriminator to learn smoother decision boundaries.
Acknowledgements
We would like to thank Mohamed Ishmael Belghazi, Léon Bottou, Zihang Dai, Stefan Doerr, Ian Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Colin Raffel, Sai Rajeshwar, Aditya Ramesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.