版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Variational Bayesian Decision-making for Continuous Utilities Tomasz Ku smierczykJoseph SakayaArto Klami Helsinki Institute for Information Technology HIIT Department of Computer Science, University of Helsinki tomasz.kusmierczyk,joseph.sakaya,arto.klamihelsinki.fi Abstract Bayesian decision theory
2、outlines a rigorous framework for making optimal de- cisions based on maximizing expected utility over a model posterior. However, practitioners often do not have access to the full posterior and resort to approxi- mate inference strategies. In such cases, taking the eventual decision-making task in
3、to account while performing the inference allows for calibrating the posterior approximation to maximize the utility. We present an automatic pipeline that co-opts continuous utilities into variational inference algorithms to account for decision-making. We provide practical strategies for approxima
4、ting and maximiz- ing the gain, and empirically demonstrate consistent improvement when calibrating approximations for specifi c utilities. 1Introduction A considerable proportion of research on Bayesian machine learning concerns itself with the task of inference, developing techniques for an effi c
5、ient and accurate approximation of the posterior distributionp(|D)of the model parametersconditional on observed dataD. However, in most cases, this is not the end goal in itself. Instead, we eventually want to solve a decision problem of some kind and merely use the posterior as a summary of the in
6、formation provided by the data and the modeling assumptions. For example, we may want to decide to automatically shut down a process to avoid costs associated with its potential failure, and do not care about the exact posterior as long as we can make good decisions that still account for our uncert
7、ainty of the parameters. Focusing on inference is justifi ed by Bayesian decision theory 2 formalizing the notion that the posterior is suffi cient for making optimal decisions. This is achieved by selecting decisions that maximize the expected utility, computed by integrating over the posterior. Th
8、e theory, however, only applies when integrating over the true posterior which can be computed only for simple models. With approximate posteriors it is no longer optimal to separate inference from decision-making. Standard approximation algorithms try to represent the full posterior accurately, yet
9、 lack guarantees for high accuracy for parameter regions that are critical for decision-making. This holds for both distributional techniques, such as variational approximation 4 and expectation propagation 10,18, as well as Markov chain Monte Carlo (MCMC) even though the latter are asymptotically e
10、xact, a fi nite set of samples is still an approximation and it is often diffi cult to sample from the correct distribution. Loss-calibrated inference refers to techniques that adapt the inference process to better capture the posterior regions relevant to the decision-making task. First proposed by
11、Lacoste-Julien et al. 16in the context of variational approximation, the principle has been used also for calibrating MCMC 1, and recently for Bayesian neural networks 7. The core idea of loss calibration is to maximize the expected utility computed over the approximating distribution, instead of ma
12、ximizing the approxi- mation accuracy, while still retaining a reasonable posterior approximation. Figure 1 demonstrates how the calibration process shifts an otherwise sub-optimal approximation to improve the decisions, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancou
13、ver, Canada. qWi=577,k=1 qWi=577,k=2 qWi=577,k=3 qWi=577,k=4 qZk=1,j=Muse qZk=2,j=Muse qZk=3,j=Muse qZk=4,j=Muse Approximate Posterior Densities Marginalize out latent variables 2468 Output y 0.0 0.1 0.2 0.3 Posterior Predictive Density LCVI prediction hLCVI VI prediction hVI true user no: 577 artis
14、t: Muse LCVI VI Figure 1: Loss-calibration (red) modifi es the posterior approximation (left) so that Bayes optimal decisions for the predictive distribution (right) are better in terms of a user-defi ned loss, here squared error, while still characterizing the posterior almost as well as the standa
15、rd variational approximation (blue). See Section 6.2 for detailed description of the experiment. but still represents the uncertainty over the parameter space. That is, we are merely fi ne-tuning calibrating the approximation instead of solely optimizing for the decision. Previous work on calibratin
16、g variational approximations only deals with classifi cation problems 7, 16 making discrete decision amongst fi nitely many classes. This allows algorithms based on explicit enumeration and summation of alternative decisions, which are inapplicable for continuous spaces. Lack of tools for continuous
17、 utilities has thus far ruled out, for example, calibration for regression problems. We provide these tools. We analyse the degree of calibration under linear transformations of utility, and describe how effi cient calibration can be carried out also when the user characterises the relative quality
18、of the decisions with losses instead of utilities. To cope with the challenges imposed by moving from discrete to continuous output spaces, we replace the enumeration over possible choices by nested Monte Carlo integration combined with double reparameterization technique, and provide algorithms for
19、 learning optimal decisions for a fl exible choice of utilities. We demonstrate the technique in predictive machine learning tasks on the eight schools model 9,29 and probabilistic matrix factorization on media consumption data. 2Background 2.1Bayesian decision theory Bayesian decision theory 2,24 i
20、s the axiomatic formalization of decision-making under uncertainty. Given a posterior distributionp(|D)of a parametric model conditioned on dataD, we desire to make optimal decisionsh. The value of individual decisions depends on the utility u(,h) 0that is a function of both: (1) the state of the wo
21、rldand (2) the decisionh. The optimal decisionshp maximize the gain (=the expected utility) Gu(h) = Z p(|D) u(,h)d. An equivalent formulation is obtained by evaluating individual decisions by a loss function(,h) and solving for optimal decisions by minimizing the risk Rl(h) = R p(|D)(,h)d. Even thou
22、gh some decision problems operate directly on model parameters, it is more typical to make decisions regarding predictionsy p(y|D). For such problems the utility is expressed as u(y,h), which together with the model induces the utility u(,h) = R p(y|,D)u(y,h)dy, where we use the notationp(y|,D)to in
23、dicate the prediction may depend on some covariates inD. This complicates computation because evaluating the gain requires nested integration overp(|D)and p(y|,D). In the remainder of the paper we focus on this more challenging family of decisions, and always use u(,h) to denote the expected utility
24、 induced by the predictive utility u(y,h). 2 2.2Variational inference Variational inference approximates the posteriorp(|D)with a proxy distributionq()parameterized by , typically by maximizing a lower bound LVI() for the marginal log-likelihood logp(D) = log Z q()p(D,) q() d Z q()log p(D,) q() d =:
25、 LVI(). Traditional methods use coordinate ascent updates, mean-fi eld approximations, and conjugate priors for computational tractability 4. Recently, several gradient-based optimization algorithms 5,22,28 have made variational inference feasible for non-conjugate models and richer approximation fa
26、milies, using gradient-based optimization of Monte Carlo estimates of the bound. The most effi cient techniques use reparameterization to compute the gradientsLVI() = Eq()logp(D,) logq(),by rewriting the distributionq()using a differentiable trans- formation = f(?,)of an underlying, parameter-free s
27、tandard distributionq0(?)28. We can then use Monte Carlo integration overq0(?)for evaluating the expectations, yet the value depends onand hence we can propagate gradients throughf()for learning. The reparameterization can be carried out either explicitly 20,25 or implicitly 8; the latter strategy m
28、akes reparameterization possible for almost any distribution. Our derivations and experiments are on simple parametric approximations, but we note that the loss calibration elements can be combined with wide range of recent advances in variational inference, such as generalized VI 14, boosting VI 11
29、,17 , more effi cient structural approximations 12, and normalizing fl ows for fl exible approximations 23. 2.3Loss-calibrated variational inference The idea of calibrating a variational approximation was proposed byLacoste-Julien et al. 16, based on lower bounding the logarithmic gain using Jensens
30、 inequality as logGu(h) = log Z q() q()p(|D) u(,h) d KL(q,p) + Z q()log u(,h)d |z U(,h) - utility-dependent term . The bound consists of two terms. The fi rst term, negative Kullback-Leibler divergence between the approximation and the posterior, can be further replaced by a lower bound for the marg
31、inal likelihood 4 analogous to standard variational approximation to provide the fi nal boundL(,h) := ELBO() + U(,h) logGu(h).The second term accounts for decision making. It is independent of the observedyand only depends on the current approximationq(), favoring approximations that optimize the ut
32、ility. For effi cient optimization the bound for multiple predictions can be written as sum over individual data instances as L(,h) = X iD (ELBOi() + U(,hi) logGu(h),(1) whereELBOiaccounts for an individual data pointyifor which the hypothesis ishi. This holds for predictive models that assume i.i.d
33、. predictions and additive log-gains, which captures most practical scenarios and allows for making decisionshiseparately for individual data points. For clarity, we drop the subscript i in hiduring derivations. For optimizing the bound,Lacoste-Julien et al. 16derived an EM algorithm that alternates
34、 between learning optimal and selecting optimal decisions: E-step: := argmax L(,h),M-step: h := argmax h L(,h) = argmax h U(,h). However, they used closed-form analytic updates for, for which incorporating the utility-dependent term is diffi cult, and only demonstrated the principle in classifi cati
35、on problems with discreteh. Cobb et al. 7derived a loss-calibrated variational lower bound for Bayesian neural networks with discrete decisionsh. UnlikeLacoste-Julien et al. 16, however, their updates forare gradient-based and applicable to generic utility-dependent terms as long as the decisions ar
36、e discrete. 3Loss calibration for continuous utilities Handling continuous decisions requires both more careful treatment of utilities and losses, described below, and new algorithms for optimizing the bound, provided in Section 4. 3 3.1The calibration effect Optimal decisions are invariant to linea
37、r transformations of the utility, so thatargmaxhGu(h) = argmaxhGu0(h)foru0(y,h) = u(y,h) + for 02. However, this does not hold for the loss-calibration procedure. Instead, translating the utilities with infl uences the degree of calibration. To see this we assume (for notational simplicity)infy,hu(y
38、,h) = 0and writeUcorresponding to u0(y,h) as Eq ? log ? + Z p(y|,D)u(y,h)dy ? = Eq ? log ? 1 + Z p(y|,D)u(y,h)dy ? |z expectation term +log, where the equality holds for any 0(negative values would lead to negative utilities). Since logis constant w.r.t variational parameters, only the expectation t
39、erm is relevant for optimization. However, its impact (=magnitude) relative toELBOdepends on the ratio . In particular, as (for any fi xed) the expectation term converges to0removing the calibration effect completely. In contrast, pushing 0maximizes the effect of calibration by maximizing the magnit
40、ude ofU. Hence, maximal calibration is achieved when = 0, i.e., wheninfy,hu0(y,h) = 0. Finally, for = 0the scaling constantcan be taken out of the expectation and hence has no effect on optimal solution. In summary, the calibration effect is maximized by using utilities with zero infi mum, and for s
41、uch utilities the procedure is scale invariant. The procedure is valid also when this does not hold, but the calibration effect diminishes and depends on the scaling in an unknown manner. 3.2Utilities and losses As stated in Section 2.1, decision problems can be formulated in terms of maximizing gai
42、n defi ned by a utilityu(y,h) 0 , or in terms of minimizing risk defi ned by a loss(y,h) 0. The calibration procedure above is provided for the gain, since both the bound for the marginal likelihood and the bound for the gain need to be in the same direction in Eq.(1) . To calibrate for user-defi ne
43、d loss (which tends to be more common in practical applications), we need to convert the loss into a utility. Unfortunately, the way this is carried out infl uences the fi nal risk evaluated using the original loss. Only linear transformations retain the optimal decisions, and the simplest one provi
44、ding non-negative utilities isu(y,h) = M (y,h)whereM supy,h(y,h)2, with equality providing optimal calibration as explained above. However, we cannot evaluateM = supy,h(y,h)for continuous unbounded losses. Furthermore, this value may be excessively large due to outliers, so thatM ? (y,h) for almost
45、all instances, effectively removing the calibration even if we knew how to fi nd the optimal value. As a remedy, we propose two practical strategies to calibrate for continuous unbounded losses, both based on the intuitive idea of bringing the utilities close to zero for the losses we expect to see
46、in practice. Robust maximumForu(y,h) = M (y,h), any value of(y,h) Mmay lead to u(,h) 0 and hence to negative input tologinU. However, this problem disappears if we linearize the logarithm in U around M similar to 16. Using Taylors expansion log u(,h) = log(M (,h) = logM (,h) M + O (,h) 2 M2 ! , and
47、dropping the error term,log u(,h) logM (,h) M and the utility-dependent term Eqlog u(,h)can be re-expressed asEq h logM (,h) M i = logM 1 MEq h (,h) i . We can ignorelogMas it is constant with respect to the decisionshand variational parameters. Now that u(,h)no longer appears inside a log, we can u
48、se alsoMthat is not a strict upper bound, but instead a robust estimator for maximum that excludes the tail of the loss distribution. We propose u(y,h) = Mq (y,h),(2) where Mqis the qth quantile (e.g. 90%) of the expected loss distribution. To obtain Mq we fi rst run standard VI for suffi ciently ma
49、ny iterations (until the losses converge), and then compute the loss for 4 every training instance. We then sort the resulting losses and setMqto match the desired quantile of this empirical distribution of losses. In many cases,Mqis considerably smaller thansupy,hl(y,h), which increases the calibra
50、tion effect. Non-linear loss transformationThe other alternative is to use transformations that guarantee non-negative utilities by mapping losses into positive values. A practical example is u(y,h) = e(y,h),(3) where the rate parametercan be related to the quantiles of the loss distribution as = M1
51、 q , by solving for a value for which linearization of the utility at(y,h) = 0would be zero for(y,h) = Mq. 4Algorithms for calibrating variational inference For practical computation we need to optimize Eq.(1)w.r.t. bothandhin a model-independent manner, which can be carried out using techniques des
52、cribed next. 4.1Monte Carlo approximation of U The fi rst practical challenge concerns evaluation and optimization ofU = R q()log u(,h)d = R q()log R p(y|,D)u(y,h)dyd. Since we already reparameterizedfor optimization ofELBO, we can as well approximate the outer expectation as U(,h) 1 S X q() log u(,
53、h) = 1 S X ?q0 log u(f(?,),h),(4) whereq0is the zero-parameter distribution,ftransforms samples fromq0(?)into samples fromq(), andSis a number of samples for Monte Carlo integration. For discrete outputs the inner expectation computing u(,h)becomes a sum over possible valuesY, which makesUstraightfo
54、rward to optimize both w.r.t.(via gradient ascent) andh(by enumeration). For continuousy, however, the integral remains as the main challenge in developing effi cient algorithms. We address this challenge by a double reparametrization scheme. Besides reparameterizing the approximationq(), we reparam
55、eterize also the predictive likelihoodp(y|,D), what was made possible for most densities by implicit reparameterization gradients 8. This enables approximating the inner integral with MC samples as u(,h) 1 Sy X p0 u(g(,D),h),(5) while preserving differentiability w.r.t. bothandh. Heredenotes samples
56、 from parameter-free distributionp0used to simulate samplesy p(y|f(?,),D)via the transformationg(). Similar derivation for approximation of the utility-dependent term exists for discrete decisions 7. This however, does not require the double reparameterization scheme proposed here. For evaluating U
57、we use a naive estimator that simply plugs (5) in place of u(f(.),h) into (4): U(,h) 1 S X ?q0 log 1 Sy X p0 u(g(,f(?,),D),h) . (6) Even though this estimator is slightly biased, it works well in practice. The bias could be reduced, for example, by employing the Taylor expansion ofEploguin a manner
58、similar to 27, or removed by boundingUwith Jensens inequality asU(,h) R q() R p(y|,D)logu(y,h)dyd, but such estimators display numerical instability for utilities close to zero and are not useful in practice. Finally, we note that for linearized utility-dependent term, for problems defi ned in terms of losses, we can directly use the simple unbiased estimator U(,h) 1 MSSy X ?q0 X p0 (g(,f(?,),D),h).(7) We note that, however
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 中考真题抢先练:历史第1期
- 豫中区小麦氮肥施用策略对生长与养分吸收的影响探究
- 谷氨酸信号通路对秀丽线虫社会性行为的神经环路及分子机制解析
- 调频式串联谐振耐压试验系统:原理、设计与应用的深度剖析
- 《非ST段抬高型急性冠脉综合征2024指南》科室专项解读
- 2026年进贤县妇幼保健所医护人员招聘笔试备考题库及答案详解
- 诺龙酶联免疫检测方法的构建与性能探究
- 语言学会议中语用视角下学术话语参与者的身份建构研究
- 语料库文体学视角下乔治·艾略特小说的多维解析
- 语境视角下《博雅汉语》教材注释设置的深度剖析与优化策略
- DB440100T114-2007城市绿化工程施工和验收规范
- 幼小衔接常规训练
- 中职思政《心理健康与职业生涯》说课大纲
- 仓库人员情绪管理制度
- 《中外历史纲要》下册综合检测试题
- 租赁模板脚手架维修保养技术规范
- 《电力管理信息系统工程初步设计文件内容深度规定》编制说明
- 2025年高考历史复习重难点框架表(超强)
- 2025人教版七年级道德与法治下册全册知识点
- (高清版)JJG(陕) 010-2024 压缩氢气加气机检定规程
- 检验科5S管理培训
评论
0/150
提交评论