
Distributionally Robust Optimization and Generalization in Kernel Methods

Matthew Staib (MIT CSAIL), Stefanie Jegelka (MIT CSAIL)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Distributionally robust optimization (DRO) has attracted attention in machine learning due to its connections to regularization, generalization, and robustness. Existing work has considered uncertainty sets based on $\phi$-divergences and Wasserstein distances, each of which have drawbacks. In this paper, we study DRO with uncertainty sets measured via maximum mean discrepancy (MMD). We show that MMD DRO is roughly equivalent to regularization by the Hilbert norm and, as a byproduct, reveal deep connections to classic results in statistical learning. In particular, we obtain an alternative proof of a generalization bound for Gaussian kernel ridge regression via a DRO lens. The proof also suggests a new regularizer. Our results apply beyond kernel methods: we derive a generically applicable approximation of MMD DRO, and show that it generalizes recent work on variance-based regularization.

1 Introduction

Distributionally robust optimization (DRO) is an attractive tool for improving machine learning models. Instead of choosing a model $f$ to minimize the empirical risk $\mathbb{E}_{x \sim P_n}[\ell_f(x)] = \frac{1}{n}\sum_i \ell_f(x_i)$, an adversary is allowed to perturb the sample distribution within a set $\mathcal{U}$ centered around the empirical distribution $P_n$. DRO seeks a model that performs well regardless of the perturbation: $\inf_f \sup_{Q \in \mathcal{U}} \mathbb{E}_{x \sim Q}[\ell_f(x)]$. The induced robustness can directly imply generalization: if the data that forms $P_n$ is drawn from a population distribution $P$, and $\mathcal{U}$ is large enough to contain $P$, then we implicitly optimize for $P$ too, and the DRO objective value upper bounds out-of-sample performance. More broadly, robustness has gained attention due to adversarial examples [17, 39, 26]; indeed, DRO generalizes robustness to adversarial examples [35, 36].
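
To make the two objectives concrete, here is a minimal sketch (not from the paper) that contrasts the empirical risk with a DRO-style worst case over an explicit, finite set of reweightings of the sample; the loss, the model parameter, and the candidate reweightings are illustrative assumptions, a stand-in for a genuine uncertainty set $\mathcal{U}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)          # toy sample
theta = 0.3                      # toy model parameter

def loss(theta, x):
    # illustrative squared loss of a constant predictor
    return (x - theta) ** 2

per_example = loss(theta, x)     # losses l_theta(x_i)
n = len(x)

# Empirical risk: uniform weights 1/n over the sample.
erm = per_example.mean()

# Crude DRO surrogate: worst case over a few explicit reweightings of the
# sample (a real uncertainty set U around P_n would be infinite).
candidates = [
    np.full(n, 1.0 / n),                                                # P_n itself
    np.where(per_example > np.median(per_example), 1.5 / n, 0.5 / n),   # up-weight bad points
]
candidates = [w / w.sum() for w in candidates]
dro = max(float(w @ per_example) for w in candidates)

print(f"empirical risk = {erm:.3f}, worst case over U = {dro:.3f}")
```

By construction the worst-case value is at least the empirical risk, since $P_n$ itself belongs to the candidate set.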

In machine learning, the DRO uncertainty set $\mathcal{U}$ has so far always been defined as a $\phi$-divergence ball or Wasserstein ball around the empirical distribution $P_n$. These choices are convenient, due to a number of structural results. For example, DRO with $\chi^2$-divergence is roughly equivalent to regularizing by variance [18, 22, 31], and the worst case distribution $Q \in \mathcal{U}$ can be computed exactly in $O(n \log n)$ [37]. Moreover, DRO with Wasserstein distance is asymptotically equivalent to certain common norm penalties [15], and the worst case $Q \in \mathcal{U}$ can be computed approximately in several cases [28, 14]. These structural results are key, because the most challenging part of DRO is solving (or bounding) the DRO objective.

However, there are substantial drawbacks to these two types of uncertainty sets. Any $\phi$-divergence uncertainty set $\mathcal{U}$ around $P_n$ contains only distributions with the same (finite) support as $P_n$. Hence, the population $P$ is typically not in $\mathcal{U}$, and so the DRO objective value cannot directly certify out-of-sample performance. Wasserstein uncertainty sets do not suffer from this problem. But they are more computationally expensive, and the above key results on equivalences and computation need nontrivial assumptions on the loss function and the specific ground distance metric used.

In this paper, we introduce and develop a new class of DRO problems, where the uncertainty set $\mathcal{U}$ is defined with respect to the maximum mean discrepancy (MMD) [19], a kernel-based distance between distributions. MMD DRO complements existing approaches and avoids some of their drawbacks, e.g., unlike $\phi$-divergences, the uncertainty set $\mathcal{U}$ will contain $P$ if the radius is large enough. First, we show that MMD DRO is roughly equivalent to regularizing by the Hilbert norm $\|\ell_f\|_{\mathcal{H}}$ of the loss $\ell_f$ (not the model $f$). While, in general, $\|\ell_f\|_{\mathcal{H}}$ may be difficult to compute, we show settings in which it is tractable. Specifically, for kernel ridge regression with a Gaussian kernel, we prove a bound on $\|\ell_f\|_{\mathcal{H}}$ that, as a byproduct, yields generalization bounds that match (up to a small constant) the standard ones. Second, beyond kernel methods, we show how MMD DRO generalizes variance-based regularization. Finally, we show how MMD DRO can be efficiently approximated empirically.

Overall, our results offer deeper insights into the landscape of regularization and robustness approaches, and a more complete picture of the effects of different divergences for defining robustness. In short, our contributions are:

1. We prove fundamental structural results for MMD DRO, and its rough equivalence to penalizing by the Hilbert norm of the loss.
2. We give a new generalization proof for Gaussian kernel ridge regression by way of DRO. Along the way, we prove bounds on the Hilbert norm of products of functions that may be of independent interest.
3. Our generalization proof suggests a new regularizer for Gaussian kernel ridge regression.
4. We derive a computationally tractable approximation of MMD DRO, with application to general learning problems, and we show how this approximation generalizes variance regularization.

2 Background and Related Work

Distributionally robust optimization (DRO) [16, 2], introduced by Scarf [32], asks to not only perform well on a fixed problem instance (parameterized by a distribution), but simultaneously for a range of problems, each determined by a distribution in an uncertainty set $\mathcal{U}$. This results in more robust solutions. The uncertainty set plays a key role: it implicitly defines the induced notion of robustness. The DRO problem we address asks to learn a model $f$ that solves
$$\text{(DRO)} \qquad \inf_f \sup_{Q \in \mathcal{U}} \mathbb{E}_{x \sim Q}[\ell_f(x)], \quad (1)$$
where $\ell_f(x)$ is the loss incurred under prediction $f(x)$. In this work, we focus on data-driven DRO, where $\mathcal{U}$ is centered around an empirical sample $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, and its size is determined in a data-dependent way. Data-driven DRO yields a natural approach for certifying out-of-sample performance.

Principle 2.1 (DRO Generalization Principle). Fix any model $f$. Let $\mathcal{U}$ be a set of distributions containing $P_n$. Suppose $\mathcal{U}$ is large enough so that, with probability $1 - \delta$, $\mathcal{U}$ contains the population $P$. Then with probability $1 - \delta$, the population loss $\mathbb{E}_{x \sim P}[\ell_f(x)]$ is bounded by
$$\mathbb{E}_{x \sim P}[\ell_f(x)] \le \sup_{Q \in \mathcal{U}} \mathbb{E}_{x \sim Q}[\ell_f(x)]. \quad (2)$$

Essentially, if the uncertainty set $\mathcal{U}$ is chosen appropriately, the corresponding DRO problem gives a high probability bound on population performance. The two key steps in using Principle 2.1 are: (1) arguing that $\mathcal{U}$ actually contains $P$ with high probability (e.g. via concentration); and (2) solving the DRO problem on the right hand side, or an upper bound thereof.

In practice, $\mathcal{U}$ is typically chosen as a ball of radius $\epsilon$ around the empirical sample $P_n$: $\mathcal{U} = \{Q : d(Q, P_n) \le \epsilon\}$. Here, $d$ is a discrepancy between distributions, and is of utmost significance: the choice of $d$ determines how large $\epsilon$ must be, and how tractable the DRO problem is. In machine learning, two choices of the divergence $d$ are prevalent: $\phi$-divergences [1, 11, 22], and Wasserstein distance [28, 33, 6].

The first option, $\phi$-divergences, have the form $d_\phi(P, Q) = \int \phi(dP/dQ)\, dQ$. In particular, they include the $\chi^2$-divergence, which makes the DRO problem equivalent to regularizing by variance [18, 22, 31]. Beyond better generalization, variance regularization has applications in fairness [20]. However, a major shortcoming of DRO with $\phi$-divergences is that the ball $\mathcal{U} = \{Q : d_\phi(Q, P_0) \le \epsilon\}$ only contains distributions $Q$ whose support is contained in the support of $P_0$. If $P_0 = P_n$ is an empirical distribution on $n$ points, the ball $\mathcal{U}$ only contains distributions with the same finite support. Hence, the population distribution $P$ typically cannot belong to $\mathcal{U}$, and it is not possible to certify out-of-sample performance by Principle 2.1. Though Principle 2.1 does not apply here, generalization bounds are still possible via other means [31].
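
As a concrete reference point for the variance connection above, the sketch below (an illustration, not code from the paper) evaluates a variance-regularized empirical objective of the general form "mean loss plus a scaled standard-deviation term"; the specific coefficient, loss values, and penalty form $\sqrt{2\rho\,\mathrm{Var}/n}$ are assumptions chosen for illustration in the spirit of the variance-regularization work cited above.

```python
import numpy as np

def variance_regularized_risk(losses, rho=1.0):
    """Mean loss plus a variance penalty of the form mean + sqrt(2*rho*Var/n).
    rho is a tuning parameter (illustrative choice)."""
    n = len(losses)
    mean = losses.mean()
    var = losses.var(ddof=1)
    return mean + np.sqrt(2.0 * rho * var / n)

rng = np.random.default_rng(1)
losses = rng.exponential(scale=1.0, size=100)   # toy per-example losses
print(variance_regularized_risk(losses, rho=1.0))
```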

The second option, Wasserstein distance, is based on a distance metric $g$ on the data space. The $p$-Wasserstein distance $W_p$ between measures $\mu, \nu$ is given by $W_p(\mu,\nu) = \big(\inf_{\pi \in \Pi(\mu,\nu)} \int g(x,y)^p \, d\pi(x,y)\big)^{1/p}$, where $\Pi(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$ [40]. Wasserstein DRO has a key benefit over $\phi$-divergences: the set $\mathcal{U} = \{Q : W_p(Q, P_0) \le \epsilon\}$ contains continuous distributions. Moreover, concentration results bounding $W_p(P, P_n)$ with high probability are available for many settings, e.g. [13, 23, 34, 41]. However, Wasserstein distance is much harder to work with, and nontrivial assumptions are needed to derive the necessary structural and algorithmic results for solving the associated DRO problem. To our knowledge, in all Wasserstein DRO work so far, the ground metric $g$ is limited to slight variations of either a Euclidean or Mahalanobis metric [7, 8]. Such metrics may be a poor fit for complex data such as images or distributions. These assumptions restrict the extent to which Wasserstein DRO can utilize complex, nonlinear structure in the data.

Maximum Mean Discrepancy (MMD). MMD is a distance metric between distributions that leverages kernel embeddings. Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) with kernel $k$ and norm $\|\cdot\|_{\mathcal{H}}$. MMD is defined as follows:

Definition 2.1. The maximum mean discrepancy (MMD) between distributions $P$ and $Q$ is
$$d_{\mathrm{MMD}}(P, Q) := \sup_{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]. \quad (3)$$

Fact 2.1. Define the mean embedding $\mu_P$ of the distribution $P$ by $\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)]$. Then the MMD between distributions $P$ and $Q$ can be equivalently written
$$d_{\mathrm{MMD}}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}. \quad (4)$$

MMD and (more generally) kernel mean embeddings have been used in many applications, particularly in two- and one-sample tests [19, 21, 25, 9] and in generative modeling [12, 24, 38, 5]. We refer the interested reader to the monograph by Muandet et al. [30]. MMD admits efficient estimation, as well as fast convergence properties, which are of chief importance in our work.
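
Fact 2.1 is also what makes MMD easy to estimate from samples: expanding $\|\mu_P - \mu_Q\|_{\mathcal{H}}^2$ gives a sum of pairwise kernel evaluations. The sketch below (not from the paper) computes the standard biased (V-statistic) estimate of the squared MMD between two samples under a Gaussian kernel; the bandwidth and toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of d_MMD(P, Q)^2 from samples X ~ P, Y ~ Q,
    obtained by expanding ||mu_P - mu_Q||_H^2 into kernel averages."""
    kxx = gaussian_kernel(X, X, bandwidth).mean()
    kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))   # shifted distribution
print(mmd_squared(X, Y, bandwidth=1.0))
```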

Further related work. Beyond $\phi$-divergences and Wasserstein distances, work in operations research has considered DRO problems that capture uncertainty in moments of the distribution, e.g. [10]. These approaches typically focus on first- and second-order moments; in contrast, an MMD uncertainty set allows higher-order moments to vary, depending on the choice of kernel.

Robust and adversarial machine learning have strong connections to our work and DRO more generally. Robustness to adversarial examples [39, 17], where individual inputs to the model are perturbed in a small ball, can be cast as a robust optimization problem [26]. When the ball is a norm ball, this robust formulation is a special case of Wasserstein DRO [35, 36]. Xu et al. [42] study the connection between robustness and regularization in SVMs, and perturbations within a (possibly Hilbert) norm ball. Unlike our work, their results are limited to SVMs instead of general loss minimization. Moreover, they consider only perturbation of individual data points instead of shifts in the entire distribution. Bietti et al. [4] show that many regularizers used for neural networks can also be interpreted in light of an appropriately chosen Hilbert norm [3].

3 Generalization bounds via MMD DRO

The main focus of this paper is distributionally robust optimization where the uncertainty set is defined via the MMD distance $d_{\mathrm{MMD}}$:
$$\inf_f \sup_{Q : d_{\mathrm{MMD}}(Q, P_n) \le \epsilon} \mathbb{E}_{x \sim Q}[\ell_f(x)]. \quad (5)$$

One motivation for considering MMD in this setting is its possible implications for generalization. Recall that for the DRO Generalization Principle 2.1 to apply, the uncertainty set $\mathcal{U}$ must contain the population distribution with high probability. To ensure this, the radius of $\mathcal{U}$ must be large enough. But the larger the radius, the more pessimistic is the DRO minimax problem, which may lead to over-regularization. This radius depends on how quickly $d_{\mathrm{MMD}}(P, P_n)$ shrinks to zero, i.e., on the empirical accuracy of the divergence. In contrast to Wasserstein distance, which converges at a rate of $O(n^{-1/d})$ [13], the MMD between the empirical sample $P_n$ and the population $P$ shrinks as $O(n^{-1/2})$:

Lemma 3.1 (Modified from [30, Theorem 3.4]). Suppose that $k(x, x) \le M$ for all $x$. Let $P_n$ be an $n$-sample empirical approximation to $P$. Then with probability $1 - \delta$,
$$d_{\mathrm{MMD}}(P, P_n) \le 2\sqrt{\frac{M}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}. \quad (6)$$

The constant $M$ is dimension-independent for many common universal kernels, e.g. Gaussian, Laplace, and Matérn kernels. With Lemma 3.1 in hand, we conclude a simple high probability bound on out-of-sample performance:

Corollary 3.1. Suppose that $k(x, x) \le M$ for all $x$. Set $\epsilon = 2\sqrt{M/n} + \sqrt{2\log(1/\delta)/n}$. Then with probability $1 - \delta$, we have the following bound on population risk:
$$\mathbb{E}_{x \sim P}[\ell_f(x)] \le \sup_{Q : d_{\mathrm{MMD}}(Q, P_n) \le \epsilon} \mathbb{E}_{x \sim Q}[\ell_f(x)]. \quad (7)$$

We refer to the right hand side as the DRO adversary's problem. In the next section we develop results that enable us to bound its value, and consequently bound the DRO problem (5).
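
The radius prescribed by Corollary 3.1 is explicit and cheap to evaluate. As a small illustration (not from the paper), the snippet below computes $\epsilon$ for a kernel bounded by $M$, e.g. $M = 1$ for the Gaussian kernel; the sample sizes and confidence level are placeholder values.

```python
import math

def mmd_ball_radius(n, delta, M=1.0):
    """Radius from Corollary 3.1: eps = 2*sqrt(M/n) + sqrt(2*log(1/delta)/n).
    M bounds k(x, x); M = 1 for the Gaussian kernel."""
    return 2.0 * math.sqrt(M / n) + math.sqrt(2.0 * math.log(1.0 / delta) / n)

for n in (100, 1_000, 10_000):
    print(n, round(mmd_ball_radius(n, delta=0.05), 4))
```

The $O(n^{-1/2})$ decay is visible directly: multiplying $n$ by 100 shrinks the radius by roughly a factor of 10.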

3.1 Bounding the DRO adversary's problem

The DRO adversary's problem seeks the distribution $Q$ in the MMD ball so that $\mathbb{E}_{x \sim Q}[\ell_f(x)]$ is as high as possible. Reasoning about the optimal worst-case $Q$ is the main difficulty in DRO. With MMD, we take two steps for simplification. First, instead of directly optimizing over distributions, we optimize over their mean embeddings in the Hilbert space (described in Fact 2.1). Second, while the adversary's problem (7) makes sense for general $\ell_f$, we assume that the loss $\ell_f$ is in $\mathcal{H}$. In case $\ell_f \notin \mathcal{H}$, often $k$ is a universal kernel, meaning under mild conditions $\ell_f$ can be approximated arbitrarily well by a member of $\mathcal{H}$ [30, Definition 3.3].

With the additional assumption that $\ell_f \in \mathcal{H}$, the risk $\mathbb{E}_{x \sim P}[\ell_f(x)]$ can also be written as $\langle \ell_f, \mu_P \rangle_{\mathcal{H}}$. Then we obtain
$$\sup_{Q : d_{\mathrm{MMD}}(Q, P) \le \epsilon} \mathbb{E}_{x \sim Q}[\ell_f(x)] \;\le\; \sup_{\mu_Q \in \mathcal{H} : \|\mu_Q - \mu_P\|_{\mathcal{H}} \le \epsilon} \langle \ell_f, \mu_Q \rangle_{\mathcal{H}}, \quad (8)$$
where we have an inequality because not every function in $\mathcal{H}$ is the mean embedding of some probability distribution. If $k$ is a characteristic kernel [30, Definition 3.2], the mapping $P \mapsto \mu_P$ is injective. In this case, the only looseness in the bound is due to discarding the constraints that $Q$ integrates to one and is nonnegative. However, it is difficult to constrain the mean embedding $\mu_Q$ in this way, as it is a function. The mean embedding form of the problem is simpler to work with, and leads to further interpretations.

Theorem 3.1. Let $\ell_f, \mu_P \in \mathcal{H}$. We have the following equality:
$$\sup_{\mu_Q \in \mathcal{H} : \|\mu_Q - \mu_P\|_{\mathcal{H}} \le \epsilon} \langle \ell_f, \mu_Q \rangle_{\mathcal{H}} = \langle \ell_f, \mu_P \rangle_{\mathcal{H}} + \epsilon \|\ell_f\|_{\mathcal{H}} = \mathbb{E}_{x \sim P}[\ell_f(x)] + \epsilon \|\ell_f\|_{\mathcal{H}}. \quad (9)$$
In particular, the adversary's optimal solution is $\mu_Q = \mu_P + \frac{\epsilon}{\|\ell_f\|_{\mathcal{H}}} \ell_f$.
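
The excerpt does not include the proof of Theorem 3.1, but the key step is a one-line Cauchy–Schwarz argument after recentering by $\mu_P$; the display below is a sketch of that reasoning, not the paper's own proof text:
$$\sup_{\|\mu_Q - \mu_P\|_{\mathcal{H}} \le \epsilon} \langle \ell_f, \mu_Q \rangle_{\mathcal{H}}
= \langle \ell_f, \mu_P \rangle_{\mathcal{H}} + \sup_{\|g\|_{\mathcal{H}} \le \epsilon} \langle \ell_f, g \rangle_{\mathcal{H}}
= \langle \ell_f, \mu_P \rangle_{\mathcal{H}} + \epsilon \|\ell_f\|_{\mathcal{H}},$$
with the inner supremum attained at $g = \frac{\epsilon}{\|\ell_f\|_{\mathcal{H}}} \ell_f$, which gives the stated maximizer $\mu_Q = \mu_P + \frac{\epsilon}{\|\ell_f\|_{\mathcal{H}}} \ell_f$.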

Combining Theorem 3.1 with equation (8) yields our main result for this section:

Corollary 3.2. Let $\ell_f \in \mathcal{H}$, let $P$ be a probability distribution, and fix $\epsilon \ge 0$. Then,
$$\sup_{Q : d_{\mathrm{MMD}}(P, Q) \le \epsilon} \mathbb{E}_{x \sim Q}[\ell_f(x)] \le \mathbb{E}_{x \sim P}[\ell_f(x)] + \epsilon \|\ell_f\|_{\mathcal{H}}, \quad \text{and therefore} \quad (10)$$
$$\inf_f \sup_{Q : d_{\mathrm{MMD}}(P, Q) \le \epsilon} \mathbb{E}_{x \sim Q}[\ell_f(x)] \le \inf_f \; \mathbb{E}_{x \sim P}[\ell_f(x)] + \epsilon \|\ell_f\|_{\mathcal{H}}. \quad (11)$$

Combining Corollary 3.2 with Corollary 3.1 shows that minimizing the empirical risk plus a norm on $\ell_f$ leads to a high probability bound on out-of-sample performance. This result is similar to results that equate Wasserstein DRO to norm regularization. For example, Gao et al. [15] show that under appropriate assumptions on $\ell_f$, DRO with a $p$-Wasserstein ball is asymptotically equivalent to $\mathbb{E}_{x \sim P_n}[\ell_f(x)] + \epsilon \|\nabla_x \ell_f\|_{P_n, q}$, where $\|\nabla_x \ell_f\|_{P_n, q} = \big(\frac{1}{n}\sum_{i=1}^n \|\nabla_x \ell_f(x_i)\|^q\big)^{1/q}$ measures a kind of $q$-norm average of $\|\nabla_x \ell_f(x_i)\|$ at each data point $x_i$ (here $q$ is such that $1/p + 1/q = 1$, and $\|\cdot\|$ is the dual norm of the metric defining the Wasserstein distance).

There are a few key differences between our result and that of Gao et al. [15]. First, the norms are different. Second, their result penalizes only the gradient of $\ell_f$, while ours penalizes $\ell_f$ directly. Third, except for certain special cases, the Wasserstein results cannot serve as a true upper bound; there are higher order terms that only shrink to zero as $\epsilon \to 0$. These higher order terms may not be so small: in high dimension $d$, the radius $\epsilon$ of the uncertainty set needed so that $P \in \mathcal{U}$ shrinks very slowly, as $O(n^{-1/d})$ [13].
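
For comparison with the Hilbert-norm penalty above, the sketch below (illustrative, not from the paper) evaluates the gradient-norm surrogate $\mathbb{E}_{P_n}[\ell_f] + \epsilon\,\|\nabla_x \ell_f\|_{P_n,q}$ for a linear model with squared loss and Euclidean ground metric (so the dual norm is also Euclidean); here the gradient $\nabla_x \ell_w(x_i) = 2(w^\top x_i - y_i)\,w$ is available in closed form, and the data, $\epsilon$, and $q$ are placeholder choices.

```python
import numpy as np

def wasserstein_surrogate(w, X, y, eps=0.1, q=2):
    """Empirical risk plus eps times the (P_n, q)-norm of the input gradient,
    for the squared loss l_w(x, y) = (w @ x - y)**2 of a linear model."""
    residuals = X @ w - y                          # shape (n,)
    losses = residuals ** 2
    grads = 2.0 * residuals[:, None] * w[None, :]  # d/dx of the loss, shape (n, d)
    grad_norms = np.linalg.norm(grads, axis=1)     # Euclidean norm per example
    penalty = (np.mean(grad_norms ** q)) ** (1.0 / q)
    return losses.mean() + eps * penalty

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(wasserstein_surrogate(w_true, X, y, eps=0.1, q=2))
```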

Remark 3.1. Theorem 3.1 and Corollary 3.2 require that $\ell_f$ is in the RKHS $\mathcal{H}$. Though this may seem restrictive, if the kernel $k$ is universal, as is the case for many kernels used in practice such as Gaussian and Laplace kernels, we can readily extend our results to all bounded continuous functions. Suppose $\ell_f$ is a bounded continuous function on a compact metric space $\mathcal{X}$. By definition (e.g. [30, Definition 3.3]), if $k$ is a universal kernel on $\mathcal{X}$, then for any $\epsilon_0 > 0$, there is some $\ell_0 \in \mathcal{H}$ with $\sup_{x \in \mathcal{X}} |\ell_f(x) - \ell_0(x)| \le \epsilon_0$ [...]

[...] with probability $1 - \delta$, the following holds for all functions $f$ satisfying $\|f\| \le \Lambda$:
$$R_P(f) \le R_{P_n}(f) + \frac{8\Lambda^2}{\sqrt{n}}\left(1 + \sqrt{\frac{\log(1/\delta)}{2}}\right). \quad (19)$$

Proof. We reduce to Theorem 4.2. By Theorem 4.1, we know that $\|f^2\|/2 \le \|f\|^2$, which may be bounded above by $\Lambda^2$ (and similarly for $h$). Therefore we can take $\Lambda_{f^2} = \Lambda^2$, $\Lambda_f = \Lambda$ and $\Lambda_{h^2} = \Lambda^2$, $\Lambda_h = \Lambda$ in Theorem 4.2. The result follows by bounding $\Lambda_{f^2} + \Lambda_{h^2} + 2\Lambda_f \Lambda_h \le \Lambda^2 + \Lambda^2 + 2\Lambda^2 = 4\Lambda^2$.

Generalization bounds for kernel ridge regression are of course not new; [...]
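
To see what the bound in (19) looks like numerically, the sketch below (an illustration using the constants as they appear in this excerpt, not code from the paper) fits Gaussian kernel ridge regression on toy data, takes $\Lambda = \sqrt{\alpha^\top K \alpha}$ as the RKHS norm of the fitted function, and evaluates the right-hand side of (19); the data, bandwidth, ridge parameter, and the post-hoc choice of $\Lambda$ are placeholder assumptions.

```python
import numpy as np

def gaussian_gram(X, Z, bandwidth=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

# Toy 1-D regression data (placeholder choices throughout).
rng = np.random.default_rng(3)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Gaussian kernel ridge regression: alpha = (K + n*lam*I)^{-1} y.
lam, bandwidth = 1e-2, 1.0
K = gaussian_gram(X, X, bandwidth)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

train_pred = K @ alpha
empirical_risk = np.mean((train_pred - y) ** 2)    # R_{P_n}(f) with squared loss
Lambda = np.sqrt(alpha @ K @ alpha)                # RKHS norm of the fitted f

delta = 0.05
complexity = (8.0 * Lambda ** 2 / np.sqrt(n)) * (1.0 + np.sqrt(np.log(1.0 / delta) / 2.0))
print(f"empirical risk = {empirical_risk:.4f}")
print(f"bound (19) rhs = {empirical_risk + complexity:.4f}")
```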
