iccv2019论文全集8868-variational-bayesian-decision-making-for-continuous-utilities_第1页
iccv2019论文全集8868-variational-bayesian-decision-making-for-continuous-utilities_第2页
iccv2019论文全集8868-variational-bayesian-decision-making-for-continuous-utilities_第3页
iccv2019论文全集8868-variational-bayesian-decision-making-for-continuous-utilities_第4页
iccv2019论文全集8868-variational-bayesian-decision-making-for-continuous-utilities_第5页
已阅读5页,还剩6页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Variational Bayesian Decision-making for Continuous Utilities Tomasz Ku smierczykJoseph SakayaArto Klami Helsinki Institute for Information Technology HIIT Department of Computer Science, University of Helsinki tomasz.kusmierczyk,joseph.sakaya,arto.klamihelsinki.fi Abstract Bayesian decision theory

2、outlines a rigorous framework for making optimal de- cisions based on maximizing expected utility over a model posterior. However, practitioners often do not have access to the full posterior and resort to approxi- mate inference strategies. In such cases, taking the eventual decision-making task in

3、to account while performing the inference allows for calibrating the posterior approximation to maximize the utility. We present an automatic pipeline that co-opts continuous utilities into variational inference algorithms to account for decision-making. We provide practical strategies for approxima

4、ting and maximiz- ing the gain, and empirically demonstrate consistent improvement when calibrating approximations for specifi c utilities. 1Introduction A considerable proportion of research on Bayesian machine learning concerns itself with the task of inference, developing techniques for an effi c

5、ient and accurate approximation of the posterior distributionp(|D)of the model parametersconditional on observed dataD. However, in most cases, this is not the end goal in itself. Instead, we eventually want to solve a decision problem of some kind and merely use the posterior as a summary of the in

6、formation provided by the data and the modeling assumptions. For example, we may want to decide to automatically shut down a process to avoid costs associated with its potential failure, and do not care about the exact posterior as long as we can make good decisions that still account for our uncert

7、ainty of the parameters. Focusing on inference is justifi ed by Bayesian decision theory 2 formalizing the notion that the posterior is suffi cient for making optimal decisions. This is achieved by selecting decisions that maximize the expected utility, computed by integrating over the posterior. Th

8、e theory, however, only applies when integrating over the true posterior which can be computed only for simple models. With approximate posteriors it is no longer optimal to separate inference from decision-making. Standard approximation algorithms try to represent the full posterior accurately, yet

9、 lack guarantees for high accuracy for parameter regions that are critical for decision-making. This holds for both distributional techniques, such as variational approximation 4 and expectation propagation 10,18, as well as Markov chain Monte Carlo (MCMC) even though the latter are asymptotically e

10、xact, a fi nite set of samples is still an approximation and it is often diffi cult to sample from the correct distribution. Loss-calibrated inference refers to techniques that adapt the inference process to better capture the posterior regions relevant to the decision-making task. First proposed by

11、Lacoste-Julien et al. 16in the context of variational approximation, the principle has been used also for calibrating MCMC 1, and recently for Bayesian neural networks 7. The core idea of loss calibration is to maximize the expected utility computed over the approximating distribution, instead of ma

12、ximizing the approxi- mation accuracy, while still retaining a reasonable posterior approximation. Figure 1 demonstrates how the calibration process shifts an otherwise sub-optimal approximation to improve the decisions, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancou

13、ver, Canada. qWi=577,k=1 qWi=577,k=2 qWi=577,k=3 qWi=577,k=4 qZk=1,j=Muse qZk=2,j=Muse qZk=3,j=Muse qZk=4,j=Muse Approximate Posterior Densities Marginalize out latent variables 2468 Output y 0.0 0.1 0.2 0.3 Posterior Predictive Density LCVI prediction hLCVI VI prediction hVI true user no: 577 artis

14、t: Muse LCVI VI Figure 1: Loss-calibration (red) modifi es the posterior approximation (left) so that Bayes optimal decisions for the predictive distribution (right) are better in terms of a user-defi ned loss, here squared error, while still characterizing the posterior almost as well as the standa

15、rd variational approximation (blue). See Section 6.2 for detailed description of the experiment. but still represents the uncertainty over the parameter space. That is, we are merely fi ne-tuning calibrating the approximation instead of solely optimizing for the decision. Previous work on calibratin

16、g variational approximations only deals with classifi cation problems 7, 16 making discrete decision amongst fi nitely many classes. This allows algorithms based on explicit enumeration and summation of alternative decisions, which are inapplicable for continuous spaces. Lack of tools for continuous

17、 utilities has thus far ruled out, for example, calibration for regression problems. We provide these tools. We analyse the degree of calibration under linear transformations of utility, and describe how effi cient calibration can be carried out also when the user characterises the relative quality

18、of the decisions with losses instead of utilities. To cope with the challenges imposed by moving from discrete to continuous output spaces, we replace the enumeration over possible choices by nested Monte Carlo integration combined with double reparameterization technique, and provide algorithms for

19、 learning optimal decisions for a fl exible choice of utilities. We demonstrate the technique in predictive machine learning tasks on the eight schools model 9,29 and probabilistic matrix factorization on media consumption data. 2Background 2.1Bayesian decision theory Bayesian decision theory 2,24 i

20、s the axiomatic formalization of decision-making under uncertainty. Given a posterior distributionp(|D)of a parametric model conditioned on dataD, we desire to make optimal decisionsh. The value of individual decisions depends on the utility u(,h) 0that is a function of both: (1) the state of the wo

21、rldand (2) the decisionh. The optimal decisionshp maximize the gain (=the expected utility) Gu(h) = Z p(|D) u(,h)d. An equivalent formulation is obtained by evaluating individual decisions by a loss function(,h) and solving for optimal decisions by minimizing the risk Rl(h) = R p(|D)(,h)d. Even thou

22、gh some decision problems operate directly on model parameters, it is more typical to make decisions regarding predictionsy p(y|D). For such problems the utility is expressed as u(y,h), which together with the model induces the utility u(,h) = R p(y|,D)u(y,h)dy, where we use the notationp(y|,D)to in

23、dicate the prediction may depend on some covariates inD. This complicates computation because evaluating the gain requires nested integration overp(|D)and p(y|,D). In the remainder of the paper we focus on this more challenging family of decisions, and always use u(,h) to denote the expected utility

24、 induced by the predictive utility u(y,h). 2 2.2Variational inference Variational inference approximates the posteriorp(|D)with a proxy distributionq()parameterized by , typically by maximizing a lower bound LVI() for the marginal log-likelihood logp(D) = log Z q()p(D,) q() d Z q()log p(D,) q() d =:

25、 LVI(). Traditional methods use coordinate ascent updates, mean-fi eld approximations, and conjugate priors for computational tractability 4. Recently, several gradient-based optimization algorithms 5,22,28 have made variational inference feasible for non-conjugate models and richer approximation fa

26、milies, using gradient-based optimization of Monte Carlo estimates of the bound. The most effi cient techniques use reparameterization to compute the gradientsLVI() = Eq()logp(D,) logq(),by rewriting the distributionq()using a differentiable trans- formation = f(?,)of an underlying, parameter-free s

27、tandard distributionq0(?)28. We can then use Monte Carlo integration overq0(?)for evaluating the expectations, yet the value depends onand hence we can propagate gradients throughf()for learning. The reparameterization can be carried out either explicitly 20,25 or implicitly 8; the latter strategy m

28、akes reparameterization possible for almost any distribution. Our derivations and experiments are on simple parametric approximations, but we note that the loss calibration elements can be combined with wide range of recent advances in variational inference, such as generalized VI 14, boosting VI 11

29、,17 , more effi cient structural approximations 12, and normalizing fl ows for fl exible approximations 23. 2.3Loss-calibrated variational inference The idea of calibrating a variational approximation was proposed byLacoste-Julien et al. 16, based on lower bounding the logarithmic gain using Jensens

30、 inequality as logGu(h) = log Z q() q()p(|D) u(,h) d KL(q,p) + Z q()log u(,h)d |z U(,h) - utility-dependent term . The bound consists of two terms. The fi rst term, negative Kullback-Leibler divergence between the approximation and the posterior, can be further replaced by a lower bound for the marg

31、inal likelihood 4 analogous to standard variational approximation to provide the fi nal boundL(,h) := ELBO() + U(,h) logGu(h).The second term accounts for decision making. It is independent of the observedyand only depends on the current approximationq(), favoring approximations that optimize the ut

32、ility. For effi cient optimization the bound for multiple predictions can be written as sum over individual data instances as L(,h) = X iD (ELBOi() + U(,hi) logGu(h),(1) whereELBOiaccounts for an individual data pointyifor which the hypothesis ishi. This holds for predictive models that assume i.i.d

33、. predictions and additive log-gains, which captures most practical scenarios and allows for making decisionshiseparately for individual data points. For clarity, we drop the subscript i in hiduring derivations. For optimizing the bound,Lacoste-Julien et al. 16derived an EM algorithm that alternates

34、 between learning optimal and selecting optimal decisions: E-step: := argmax L(,h),M-step: h := argmax h L(,h) = argmax h U(,h). However, they used closed-form analytic updates for, for which incorporating the utility-dependent term is diffi cult, and only demonstrated the principle in classifi cati

35、on problems with discreteh. Cobb et al. 7derived a loss-calibrated variational lower bound for Bayesian neural networks with discrete decisionsh. UnlikeLacoste-Julien et al. 16, however, their updates forare gradient-based and applicable to generic utility-dependent terms as long as the decisions ar

36、e discrete. 3Loss calibration for continuous utilities Handling continuous decisions requires both more careful treatment of utilities and losses, described below, and new algorithms for optimizing the bound, provided in Section 4. 3 3.1The calibration effect Optimal decisions are invariant to linea

37、r transformations of the utility, so thatargmaxhGu(h) = argmaxhGu0(h)foru0(y,h) = u(y,h) + for 02. However, this does not hold for the loss-calibration procedure. Instead, translating the utilities with infl uences the degree of calibration. To see this we assume (for notational simplicity)infy,hu(y

38、,h) = 0and writeUcorresponding to u0(y,h) as Eq ? log ? + Z p(y|,D)u(y,h)dy ? = Eq ? log ? 1 + Z p(y|,D)u(y,h)dy ? |z expectation term +log, where the equality holds for any 0(negative values would lead to negative utilities). Since logis constant w.r.t variational parameters, only the expectation t

39、erm is relevant for optimization. However, its impact (=magnitude) relative toELBOdepends on the ratio . In particular, as (for any fi xed) the expectation term converges to0removing the calibration effect completely. In contrast, pushing 0maximizes the effect of calibration by maximizing the magnit

40、ude ofU. Hence, maximal calibration is achieved when = 0, i.e., wheninfy,hu0(y,h) = 0. Finally, for = 0the scaling constantcan be taken out of the expectation and hence has no effect on optimal solution. In summary, the calibration effect is maximized by using utilities with zero infi mum, and for s

41、uch utilities the procedure is scale invariant. The procedure is valid also when this does not hold, but the calibration effect diminishes and depends on the scaling in an unknown manner. 3.2Utilities and losses As stated in Section 2.1, decision problems can be formulated in terms of maximizing gai

42、n defi ned by a utilityu(y,h) 0 , or in terms of minimizing risk defi ned by a loss(y,h) 0. The calibration procedure above is provided for the gain, since both the bound for the marginal likelihood and the bound for the gain need to be in the same direction in Eq.(1) . To calibrate for user-defi ne

43、d loss (which tends to be more common in practical applications), we need to convert the loss into a utility. Unfortunately, the way this is carried out infl uences the fi nal risk evaluated using the original loss. Only linear transformations retain the optimal decisions, and the simplest one provi

44、ding non-negative utilities isu(y,h) = M (y,h)whereM supy,h(y,h)2, with equality providing optimal calibration as explained above. However, we cannot evaluateM = supy,h(y,h)for continuous unbounded losses. Furthermore, this value may be excessively large due to outliers, so thatM ? (y,h) for almost

45、all instances, effectively removing the calibration even if we knew how to fi nd the optimal value. As a remedy, we propose two practical strategies to calibrate for continuous unbounded losses, both based on the intuitive idea of bringing the utilities close to zero for the losses we expect to see

46、in practice. Robust maximumForu(y,h) = M (y,h), any value of(y,h) Mmay lead to u(,h) 0 and hence to negative input tologinU. However, this problem disappears if we linearize the logarithm in U around M similar to 16. Using Taylors expansion log u(,h) = log(M (,h) = logM (,h) M + O (,h) 2 M2 ! , and

47、dropping the error term,log u(,h) logM (,h) M and the utility-dependent term Eqlog u(,h)can be re-expressed asEq h logM (,h) M i = logM 1 MEq h (,h) i . We can ignorelogMas it is constant with respect to the decisionshand variational parameters. Now that u(,h)no longer appears inside a log, we can u

48、se alsoMthat is not a strict upper bound, but instead a robust estimator for maximum that excludes the tail of the loss distribution. We propose u(y,h) = Mq (y,h),(2) where Mqis the qth quantile (e.g. 90%) of the expected loss distribution. To obtain Mq we fi rst run standard VI for suffi ciently ma

49、ny iterations (until the losses converge), and then compute the loss for 4 every training instance. We then sort the resulting losses and setMqto match the desired quantile of this empirical distribution of losses. In many cases,Mqis considerably smaller thansupy,hl(y,h), which increases the calibra

50、tion effect. Non-linear loss transformationThe other alternative is to use transformations that guarantee non-negative utilities by mapping losses into positive values. A practical example is u(y,h) = e(y,h),(3) where the rate parametercan be related to the quantiles of the loss distribution as = M1

51、 q , by solving for a value for which linearization of the utility at(y,h) = 0would be zero for(y,h) = Mq. 4Algorithms for calibrating variational inference For practical computation we need to optimize Eq.(1)w.r.t. bothandhin a model-independent manner, which can be carried out using techniques des

52、cribed next. 4.1Monte Carlo approximation of U The fi rst practical challenge concerns evaluation and optimization ofU = R q()log u(,h)d = R q()log R p(y|,D)u(y,h)dyd. Since we already reparameterizedfor optimization ofELBO, we can as well approximate the outer expectation as U(,h) 1 S X q() log u(,

53、h) = 1 S X ?q0 log u(f(?,),h),(4) whereq0is the zero-parameter distribution,ftransforms samples fromq0(?)into samples fromq(), andSis a number of samples for Monte Carlo integration. For discrete outputs the inner expectation computing u(,h)becomes a sum over possible valuesY, which makesUstraightfo

54、rward to optimize both w.r.t.(via gradient ascent) andh(by enumeration). For continuousy, however, the integral remains as the main challenge in developing effi cient algorithms. We address this challenge by a double reparametrization scheme. Besides reparameterizing the approximationq(), we reparam

55、eterize also the predictive likelihoodp(y|,D), what was made possible for most densities by implicit reparameterization gradients 8. This enables approximating the inner integral with MC samples as u(,h) 1 Sy X p0 u(g(,D),h),(5) while preserving differentiability w.r.t. bothandh. Heredenotes samples

56、 from parameter-free distributionp0used to simulate samplesy p(y|f(?,),D)via the transformationg(). Similar derivation for approximation of the utility-dependent term exists for discrete decisions 7. This however, does not require the double reparameterization scheme proposed here. For evaluating U

57、we use a naive estimator that simply plugs (5) in place of u(f(.),h) into (4): U(,h) 1 S X ?q0 log 1 Sy X p0 u(g(,f(?,),D),h) . (6) Even though this estimator is slightly biased, it works well in practice. The bias could be reduced, for example, by employing the Taylor expansion ofEploguin a manner

58、similar to 27, or removed by boundingUwith Jensens inequality asU(,h) R q() R p(y|,D)logu(y,h)dyd, but such estimators display numerical instability for utilities close to zero and are not useful in practice. Finally, we note that for linearized utility-dependent term, for problems defi ned in terms of losses, we can directly use the simple unbiased estimator U(,h) 1 MSSy X ?q0 X p0 (g(,f(?,),D),h).(7) We note that, however

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论