Large Scale Adversarial Representation Learning

Jeff Donahue (DeepMind)    Karen Simonyan (DeepMind)

Abstract

Adversarially trained generative models (GANs) have recently achieved compelling image synthesis results. But despite early successes in using GANs for unsupervised representation learning, they have since been superseded by approaches based on self-supervision. In this work we show that progress in image generation quality translates to substantially improved representation learning performance. Our approach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation. Pretrained BigBiGAN models, including image generators and encoders, are available on TensorFlow Hub (https://tfhub.dev/s?publisher=deepmind).

1 Introduction

In recent years we have seen rapid progress in generative models of visual data. While these models were previously confined to domains with single or few modes, simple structure, and low resolution, with advances in both modeling and hardware they have since gained the ability to convincingly generate complex, multimodal, high resolution image distributions [1, 17, 18]. Intuitively, the ability to generate data in a particular domain necessitates a high-level understanding of the semantics of said domain. This idea has long-standing appeal, as raw data is both cheap (readily available in virtually infinite supply from sources like the Internet) and rich, with images comprising far more information than the class labels that typical discriminative machine learning models are trained to predict from them. Yet, while the progress in generative models has been undeniable, nagging questions persist: what semantics have these models learned, and how can they be leveraged for representation learning? The dream of generation as a means of true understanding from raw data alone has hardly been realized. Instead, the most successful approaches for unsupervised learning leverage techniques adopted from the field of supervised learning, a class of methods known as self-supervised learning [4, 35, 32, 9]. These approaches typically involve changing or holding back certain aspects of the data in some way, and training a model to predict or generate aspects of the missing information. For example, [34, 35] proposed colorization as a means of unsupervised learning, where a model is given a subset of the color channels in an input image, and trained to predict the missing channels.

Generative models as a means of unsupervised learning offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any modification of the original data. One class of generative models that has been applied to representation learning is generative adversarial networks (GANs) [11]. The generator in the GAN is trained to map sampled latents to generated data; the BiGAN [5] and ALI [8] extensions reviewed in Section 2 add an encoder mapping data back to latents, and with an optimal discriminator their objective effectively imposes a reconstruction cost on the data. However, the shape of that reconstruction error surface is dictated by a parametric discriminator, as opposed to simple pixel-level measures like the ℓ2 error. Since the discriminator is usually a powerful neural network, the hope is that it will induce an error surface which emphasizes "semantic" errors in reconstructions, rather than low-level details. In [5] it was demonstrated that the encoder learned via the BiGAN or ALI framework is an effective means of visual representation learning on ImageNet for downstream tasks. However, it used a DCGAN [26] style generator, incapable of producing high-quality images on this dataset, so the semantics the encoder could model were in turn quite limited.

In this work we revisit this approach using BigGAN [1] as the generator, a modern model that appears capable of capturing many of the modes and much of the structure present in ImageNet images. Our contributions are as follows:

  • We show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.
  • We propose a more stable version of the joint discriminator for BigBiGAN.
  • We perform a thorough empirical analysis and ablation study of model design choices.
  • We show that the representation learning objective also improves unconditional image generation, and demonstrate state-of-the-art results in unconditional ImageNet generation.
  • We open source pretrained BigBiGAN models on TensorFlow Hub (https://tfhub.dev/s?publisher=deepmind).

2 BigBiGAN

The BiGAN [5] or ALI [8] approaches were proposed as extensions of the GAN [11] framework which enable the learning of an encoder that can be employed as an inference model [8] or feature representation [5]. Given a distribution P_x of data x (e.g., images) and a distribution P_z of latents z (usually a simple continuous distribution like an isotropic Gaussian N(0, I)), the generator G models a conditional distribution P(x|z) of data x given latent inputs z sampled from the latent prior P_z, as in the standard GAN generator [11]. The encoder E models the inverse conditional distribution P(z|x), predicting latents z given data x sampled from the data distribution P_x.

Besides the addition of E, the other modification to the GAN in the BiGAN framework is a joint discriminator D, which takes as input data-latent pairs (x, z) (rather than just the data x as in a standard GAN), and learns to discriminate between pairs from the data distribution and encoder versus the generator and latent distribution. Concretely, its inputs are pairs (x ~ P_x, ẑ ~ E(x)) and (x̂ ~ G(z), z ~ P_z), and the goal of G and E is to "fool" the discriminator by making the two joint distributions P_{xE} and P_{Gz} from which these pairs are sampled indistinguishable. The adversarial minimax objective in [5, 8], analogous to that of the GAN framework [11], was defined as follows:

\[
\min_{G, E} \max_{D} \; \mathbb{E}_{x \sim P_x,\, z \sim E(x)} \big[ \log D(x, z) \big] + \mathbb{E}_{z \sim P_z,\, x \sim G(z)} \big[ \log \big( 1 - D(x, z) \big) \big]
\]

Under this objective, [5, 8] showed that with an optimal D, G and E minimize the Jensen-Shannon divergence between the joint distributions P_{xE} and P_{Gz}, and therefore at the global optimum the two joint distributions match, P_{xE} = P_{Gz}, analogous to the results for standard GANs [11]. Furthermore, [5] showed that in the case where E and G are deterministic functions (i.e., the learned conditional distributions P_G(x|z) and P_E(z|x) are Dirac δ functions), these two functions are inverses at the global optimum: e.g., for all x ∈ supp(P_x), x = G(E(x)), with the optimal joint discriminator effectively imposing ℓ0 reconstruction costs on x and z.
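To make the joint-discriminator setup concrete, the snippet below is a minimal sketch of the BiGAN/ALI objective above, written in hypothetical PyTorch with placeholder E, G, and D callables (not the architectures used in this paper); it only illustrates which pairs are scored and which direction each player optimizes.

```python
import torch

def bigan_objective(D, G, E, x_real, z_prior):
    """Original BiGAN/ALI minimax objective, estimated on one minibatch.

    D(x, z) is assumed to output the probability that the pair came from
    the data/encoder side (x ~ P_x, z ~ E(x)) rather than the
    generator/prior side (x ~ G(z), z ~ P_z).
    """
    z_enc = E(x_real)           # encoder pair: (x, E(x))
    x_gen = G(z_prior)          # generator pair: (G(z), z)
    d_enc = D(x_real, z_enc)    # D would like this near 1
    d_gen = D(x_gen, z_prior)   # D would like this near 0

    value = (torch.log(d_enc + 1e-8) + torch.log(1.0 - d_gen + 1e-8)).mean()
    loss_d = -value             # D maximizes the value function
    loss_ge = value             # G and E jointly minimize it
    return loss_d, loss_ge
```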

While the crux of our approach, BigBiGAN, remains the same as that of BiGAN [5, 8], we have adopted the generator and discriminator architectures from the state-of-the-art BigGAN [1] generative image model. Beyond that, we have found that an improved discriminator structure leads to better representation learning results without compromising generation (Figure 1). Namely, in addition to the joint discriminator loss proposed in [5, 8] which ties the data and latent distributions together, we propose additional unary terms in the learning objective, which are functions only of either the data x or the latents z. Although [5, 8] prove that the original BiGAN objective already enforces that the learnt joint distributions match at the global optimum, implying that the marginal distributions of x and z match as well, these unary terms intuitively guide optimization in the "right direction" by explicitly enforcing this property. For example, in the context of image generation, the unary loss term on x matches the original GAN objective and provides a learning signal which steers only the generator to match the image distribution independently of its latent inputs. (In our evaluation we will demonstrate empirically that the addition of these terms results in both improved generation and representation learning.)

Concretely, the discriminator loss L_D and the encoder-generator loss L_EG are defined as follows, based on scalar discriminator "score" functions s_• and the corresponding per-sample losses ℓ_•:

\[
\begin{aligned}
s_x(x) &= \theta_x^{\top} F(x) \qquad s_z(z) = \theta_z^{\top} H(z) \qquad s_{xz}(x, z) = \theta_{xz}^{\top} J(F(x), H(z)) \\
\ell_{EG}(x, z, y) &= y \left( s_x(x) + s_z(z) + s_{xz}(x, z) \right), \qquad y \in \{-1, +1\} \\
\mathcal{L}_{EG}(P_x, P_z) &= \mathbb{E}_{x \sim P_x,\, \hat{z} \sim E(x)} \left[ \ell_{EG}(x, \hat{z}, +1) \right] + \mathbb{E}_{z \sim P_z,\, \hat{x} \sim G(z)} \left[ \ell_{EG}(\hat{x}, z, -1) \right] \\
\ell_{D}(x, z, y) &= h\!\left(y\, s_x(x)\right) + h\!\left(y\, s_z(z)\right) + h\!\left(y\, s_{xz}(x, z)\right), \qquad y \in \{-1, +1\} \\
\mathcal{L}_{D}(P_x, P_z) &= \mathbb{E}_{x \sim P_x,\, \hat{z} \sim E(x)} \left[ \ell_{D}(x, \hat{z}, +1) \right] + \mathbb{E}_{z \sim P_z,\, \hat{x} \sim G(z)} \left[ \ell_{D}(\hat{x}, z, -1) \right]
\end{aligned}
\]

where h(t) = max(0, 1 − t) is a "hinge" used to regularize the discriminator [22, 30], also used in BigGAN [1]. (We also considered an alternative discriminator loss ℓ′_D which invokes the "hinge" h just once on the sum of the three score terms, ℓ′_D(x, z, y) = h(y (s_x(x) + s_z(z) + s_{xz}(x, z))), but found that this performed significantly worse than ℓ_D above, which clamps each of the three terms separately.) The discriminator D includes three submodules: F, H, and J. F takes only x as input and H takes only z, and learned projections of their outputs with parameters θ_x and θ_z respectively give the scalar unary scores s_x and s_z. In our experiments, the data x are images and the latents z are unstructured flat vectors; accordingly, F is a ConvNet and H is an MLP. The joint score s_xz tying x and z together is given by the remaining D submodule, J, a function of the outputs of F and H.
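As a concrete reading of the definitions above, here is a minimal sketch (assumed PyTorch; the tiny F, H, and J networks are toy stand-ins, with F a linear layer over flattened inputs rather than the paper's BigGAN-scale ConvNet) of the unary and joint scores and the hinge-based losses L_D and L_EG.

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Toy stand-in for BigBiGAN's D = (F, H, J) with unary and joint scores."""
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())  # data branch (a ConvNet in the paper)
        self.H = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())  # latent branch (an MLP in the paper)
        self.J = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.theta_x = nn.Linear(hidden, 1)    # s_x(x)    = theta_x^T  F(x)
        self.theta_z = nn.Linear(hidden, 1)    # s_z(z)    = theta_z^T  H(z)
        self.theta_xz = nn.Linear(hidden, 1)   # s_xz(x,z) = theta_xz^T J(F(x), H(z))

    def scores(self, x, z):
        fx, hz = self.F(x), self.H(z)
        s_x = self.theta_x(fx).squeeze(-1)
        s_z = self.theta_z(hz).squeeze(-1)
        s_xz = self.theta_xz(self.J(torch.cat([fx, hz], dim=-1))).squeeze(-1)
        return s_x, s_z, s_xz

def hinge(t):
    """h(t) = max(0, 1 - t)."""
    return torch.clamp(1.0 - t, min=0.0)

def bigbigan_losses(D, x_real, z_enc, x_gen, z_prior):
    """L_D and L_EG over one minibatch: y = +1 for (x, E(x)) pairs, y = -1 for (G(z), z) pairs."""
    s_enc = D.scores(x_real, z_enc)    # scores for encoder pairs
    s_gen = D.scores(x_gen, z_prior)   # scores for generator pairs

    # Discriminator loss: the hinge is applied to each of the three scores separately.
    loss_d = sum(hinge(si).mean() for si in s_enc) + \
             sum(hinge(-si).mean() for si in s_gen)

    # Encoder/generator loss: y * (s_x + s_z + s_xz), which E and G minimize.
    loss_eg = sum(si.mean() for si in s_enc) - sum(si.mean() for si in s_gen)
    return loss_d, loss_eg
```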

The E and G parameters are optimized to minimize the loss L_EG, and the D parameters are optimized to minimize the loss L_D. As usual, the expectations are estimated by Monte Carlo samples taken over minibatches. Like in BiGAN [5] and ALI [8], the discriminator loss L_D intuitively trains the discriminator to distinguish between the two joint data-latent distributions from the encoder and the generator, pushing it to predict positive values for encoder input pairs (x, E(x)) and negative values for generator input pairs (G(z), z). The generator and encoder loss L_EG trains these two modules to fool the discriminator into incorrectly predicting the opposite, in effect pushing them to create matching joint data-latent distributions. (In the case of deterministic E and G, this requires the two modules to invert one another [5].)
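A schematic version of this alternating optimization, reusing the toy bigbigan_losses sketch above (hypothetical PyTorch; opt_d and opt_eg are assumed to be separate optimizers over D's and E/G's parameters), might look like:

```python
def train_step(D, G, E, opt_d, opt_eg, x_real, z_prior):
    """One alternating update: D minimizes L_D, then E and G minimize L_EG."""
    # Monte Carlo estimates over the minibatch.
    z_enc = E(x_real)
    x_gen = G(z_prior)

    # Discriminator step (E/G outputs detached so only D's parameters move).
    loss_d, _ = bigbigan_losses(D, x_real, z_enc.detach(), x_gen.detach(), z_prior)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Encoder/generator step.
    _, loss_eg = bigbigan_losses(D, x_real, E(x_real), G(z_prior), z_prior)
    opt_eg.zero_grad()
    loss_eg.backward()
    opt_eg.step()
    return loss_d.item(), loss_eg.item()
```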

3 Evaluation

Most of our experiments follow the standard protocol used to evaluate unsupervised learning techniques, first proposed in [34]. We train a BigBiGAN on unlabeled ImageNet, freeze its learned representation, and then train a linear classifier on its outputs, fully supervised using all of the training set labels. We also measure image generation performance, reporting Inception Score [28] (IS) and Fréchet Inception Distance [15] (FID) as the standard metrics there.
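A minimal sketch of this linear-evaluation protocol (assumed PyTorch; encoder, train_loader, and the feature dimension are placeholders rather than the paper's actual E or data pipeline): freeze the learned representation and fit only a linear classifier on top.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes=1000, epochs=10, lr=0.1):
    """Freeze the pretrained encoder and train a supervised linear classifier on its features."""
    encoder.eval()                        # frozen representation
    for p in encoder.parameters():
        p.requires_grad_(False)

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)   # fixed features
            loss = ce(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```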

3.1 Ablation

We begin with an extensive ablation study in which we directly evaluate a number of modeling choices, with results presented in Table 1. Where possible we performed three runs of each variant with different seeds and report the mean and standard deviation for each metric.

We start with a relatively fully-fledged version of the model at 128×128 resolution (row Base), with the G architecture and the F component of D taken from the corresponding 128×128 architectures in BigGAN, including the skip connections and shared noise embedding proposed in [1]. z is 120-dimensional, split into six groups of 20 dimensions fed into each of the six layers of G as in [1]. The remaining components of D (H and J) are 8-layer MLPs with ResNet-style skip connections (four residual blocks with two layers each) and hidden layers of size 2048. The E architecture is the ResNet-v2-50 ConvNet originally proposed for image classification in [13], followed by a 4-layer MLP (size 4096) with skip connections (two residual blocks) after the ResNet's globally average pooled output. The unconditional BigGAN training setup corresponds to the "Single Label" setup proposed in [23], where a single "dummy" label is used for all images (theoretically equivalent to learning a bias in place of the class-conditional batch norm inputs). We then ablate several aspects of the model, with results detailed in the following paragraphs. Additional architectural and optimization details are provided in Appendix A (supplementary material). Full learning curves for many results are included in Appendix D (supplementary material).
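As a concrete sketch of the MLP components of D just described, the toy module below (assumed PyTorch; the layer count and width follow the text, but the input projection and activation placement are guesses rather than the paper's exact design) builds an 8-layer residual MLP like H; J follows the same pattern on the concatenated outputs of F and H.

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """Two linear layers with a skip connection (one of four such blocks)."""
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)

    def forward(self, h):
        return h + self.fc2(torch.relu(self.fc1(torch.relu(h))))

class LatentBranchH(nn.Module):
    """Toy version of the D latent branch H: an 8-layer MLP built from four residual blocks."""
    def __init__(self, z_dim=120, width=2048, num_blocks=4):
        super().__init__()
        self.proj = nn.Linear(z_dim, width)
        self.blocks = nn.Sequential(*[ResidualMLPBlock(width) for _ in range(num_blocks)])

    def forward(self, z):
        return self.blocks(self.proj(z))
```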

Latent distribution P_z and stochastic E. As in ALI [8], the encoder E of our Base model is non-deterministic, parametrizing a distribution N(μ, σ). μ and σ̂ are given by a linear layer at the output of the model, and the final standard deviation σ is computed from σ̂ using a non-negative "softplus" non-linearity, σ = log(1 + exp(σ̂)). The final z uses the reparametrized sampling from [19], with z = μ + εσ, where ε ~ N(0, I). Compared to a deterministic encoder (row Deterministic E) which predicts z directly without sampling (effectively modeling P(z|x) as a Dirac δ distribution), the non-deterministic Base model achieves significantly better classification performance (at no cost to generation). We also compared to using a uniform P_z = U(−1, 1) (row Uniform P_z) with E deterministically predicting z = tanh(ẑ) given a linear output ẑ, as done in BiGAN [5]. This also achieves worse classification results than the non-deterministic Base model.
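A minimal sketch of the non-deterministic encoder head described above (assumed PyTorch; backbone_features is a placeholder for the ResNet output, not the paper's exact head): a linear layer produces μ and σ̂, σ is obtained with a softplus, and z is drawn with the reparametrization z = μ + εσ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticHead(nn.Module):
    """Predicts mu and sigma-hat, then samples z = mu + eps * sigma (reparametrization)."""
    def __init__(self, in_dim, z_dim=120):
        super().__init__()
        self.linear = nn.Linear(in_dim, 2 * z_dim)

    def forward(self, backbone_features):
        mu, sigma_hat = self.linear(backbone_features).chunk(2, dim=-1)
        sigma = F.softplus(sigma_hat)      # sigma = log(1 + exp(sigma_hat)), non-negative
        eps = torch.randn_like(mu)         # eps ~ N(0, I)
        return mu + eps * sigma            # sampled latent z
```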

Unary loss terms. We evaluate the effect of removing one or both unary terms of the loss function proposed in Section 2, s_x and s_z. Removing both unary terms (row No Unaries) corresponds to the original objective proposed in [5, 8]. It is clear that the x unary term has a large positive effect on generation performance, with the Base and x Unary Only rows having significantly better IS and FID than the z Unary Only and No Unaries rows. This result makes intuitive sense as it matches the standard generator loss. It also marginally improves classification performance. The z unary term makes a more marginal difference, likely due to the relative ease of modeling relatively simple distributions like isotropic Gaussians, though it also results in slightly improved classification and generation in terms of FID, especially without the x term (z Unary Only vs. No Unaries). On the other hand, IS is worse with the z term. This may be due to IS roughly measuring the generator's coverage of the major modes of the distribution (the classes) rather than the distribution in its entirety, the latter of which may be better captured by FID and more likely to be promoted by a good encoder E. The requirement of invertibility in a (Big)BiGAN could be encouraging the generator to produce distinguishable outputs across the entire latent space, rather than "collapsing" large volumes of latent space to a single mode of the data distribution.

G capacity. To address the question of the importance of the generator G in representation learning, we vary the capacity of G (with E and D fixed) in the Small G rows. With a third of the capacity of the Base G model (Small G (32)), the overall model is quite unstable and achieves significantly worse classification results than the higher-capacity Base model. With two-thirds capacity (Small G (64)), generation performance is substantially worse (matching the results in [1]) and classification performance is modestly worse. These results confirm that a powerful image generator is indeed important for learning good representations via the encoder. Assuming this relationship holds in the future, we expect that better generative models are likely to lead to further improvements in representation learning.

Standard GAN. We also compare BigBiGAN's image generation performance against a standard unconditional BigGAN with no encoder E and only the standard F ConvNet in the discriminator, with only the s_x term in the loss (row No E (GAN)). While the standard GAN achieves a marginally better IS, the BigBiGAN FID is about the same, indicating that the addition of the BigBiGAN E and joint D does not compromise generation with the newly proposed unary loss terms described in Section 2. (In comparison, the versions of the model without the unary loss term on x, rows z Unary Only and No Unaries, have substantially worse generation performance in terms of FID than the standard GAN.) We conjecture that the IS is worse for similar reasons that the s_z unary loss term leads to worse IS. Next we will show that with an enhanced E taking higher input resolutions, generation with BigBiGAN in terms of FID is substantially improved over the standard GAN.

High resolution E with varying resolution G. BiGAN [5] proposed an asymmetric setup in which E takes higher resolution images than G outputs and D takes as input, showing that an E taking 128×128 inputs with a 64×64 G outperforms a 64×64 E for downstream tasks. We experiment with this setup in BigBiGAN, raising the E input resolution to 256×256, matching the resolution used in typical supervised ImageNet classification setups, and varying the G output and D input resolution in {64, 128, 256}. Our results in Table 1 (rows High Res E (256) and Low/High Res G (*)) show that BigBiGAN achieves better representation learning results as the G resolution increases, up to the full E resolution of 256×256. However, because the overall model is much slower to train with G at 256×256 resolution, the remainder of our results use the 128×128 resolution for G. Interestingly, with the higher resolution E, generation improves significantly (especially by FID), despite G operating at the same resolution (row High Res E (256) vs. Base). This is an encouraging result for the potential of BigBiGAN as a means of improving adversarial image synthesis itself, besides its use in representation learning and inference.

E architecture. Keeping the E input resolution fixed at 256, we experiment with varied and often larger E architectures, including several of the ResNet-50 variants explored in [20]. In particular, we expand the capacity of the hidden layers by a factor of 2 or 4, as well as swap the residual block structure to a reversible variant called RevNet [10] with the
