Gaussian processes

Chuong B. Do (updated by Honglak Lee)

November 22, 2008

Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern: given a training set of i.i.d. examples sampled from some unknown distribution,

1. solve a convex optimization problem in order to identify the single "best fit" model for the data, and
2. use this estimated model to make "best guess" predictions for future test input points.

In these notes, we will talk about a different flavor of learning algorithms, known as Bayesian methods. Unlike classical learning algorithms, Bayesian algorithms do not attempt to identify "best-fit" models of the data (or similarly, make "best guess" predictions for new test inputs). Instead, they compute a posterior distribution over models (or similarly, compute posterior predictive distributions for new test inputs). These distributions provide a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge of this uncertainty in order to make more robust predictions on new test points.

We focus on regression problems, where the goal is to learn a mapping from some input space X = R^n of n-dimensional vectors to an output space Y = R of real-valued targets. In particular, we will talk about a kernel-based fully Bayesian regression algorithm, known as Gaussian process regression. The material covered in these notes draws heavily on many different topics that we discussed previously in class (namely, the probabilistic interpretation of linear regression^1, Bayesian methods^2, kernels^3, and properties of multivariate Gaussians^4).

The organization of these notes is as follows. In Section 1, we provide a brief review of multivariate Gaussian distributions and their properties. In Section 2, we briefly review Bayesian methods in the context of probabilistic linear regression. The central ideas underlying Gaussian processes are presented in Section 3, and we derive the full Gaussian process regression model in Section 4.

^1 See course lecture notes on "Supervised Learning, Discriminative Algorithms."
^2 See course lecture notes on "Regularization and Model Selection."
^3 See course lecture notes on "Support Vector Machines."
^4 See course lecture notes on "Factor Analysis."

1  Multivariate Gaussians

A vector-valued random variable x ∈ R^n is said to have a multivariate normal (or Gaussian) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ S^n_{++} if
    p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).        (1)

We write this as x ~ N(μ, Σ). Here, recall from the section notes on linear algebra that S^n_{++} refers to the space of symmetric positive definite n × n matrices.^5

^5 There are actually cases in which we would want to deal with multivariate Gaussian distributions where Σ is positive semidefinite but not positive definite (i.e., Σ is not full rank). In such cases, Σ^{-1} does not exist, so the definition of the Gaussian density given in (1) does not apply. For instance, see the course lecture notes on "Factor Analysis."
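As a quick check on equation (1), here is a minimal sketch (assuming numpy and scipy are available; the particular μ, Σ, and x below are made-up illustrative values) that evaluates the density directly and compares it against scipy.stats.multivariate_normal:

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_density(x, mu, Sigma):
        """Evaluate the multivariate Gaussian density of equation (1)."""
        n = mu.shape[0]
        diff = x - mu
        norm_const = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

    # Made-up illustrative values.
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])          # symmetric positive definite
    x = np.array([0.0, 0.0])

    print(gaussian_density(x, mu, Sigma))          # direct evaluation of (1)
    print(multivariate_normal(mu, Sigma).pdf(x))   # should agree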
Generally speaking, Gaussian random variables are extremely useful in machine learning and statistics for two main reasons. First, they are extremely common when modeling "noise" in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement process; by the Central Limit Theorem, summations of independent random variables will tend to "look Gaussian." Second, Gaussian random variables are convenient for many analytical manipulations, because many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. In the remainder of this section, we will review a number of useful properties of multivariate Gaussians.
Consider a random vector x ∈ R^n with x ~ N(μ, Σ). Suppose also that the variables in x have been partitioned into two sets x_A = [x_1 ⋯ x_r]^T ∈ R^r and x_B = [x_{r+1} ⋯ x_n]^T ∈ R^{n-r} (and similarly for μ and Σ), such that

    x = \begin{bmatrix} x_A \\ x_B \end{bmatrix}, \qquad
    \mu = \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \qquad
    \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}.

Here, Σ_{AB} = Σ_{BA}^T since Σ = E[(x − μ)(x − μ)^T] = Σ^T. The following properties hold:

1. Normalization. The density function normalizes, i.e.,

       \int_x p(x; \mu, \Sigma)\, dx = 1.

   This property, though seemingly trivial at first glance, turns out to be immensely useful for evaluating all sorts of integrals, even ones which appear to have no relation to probability distributions at all (see Appendix A.1)!

2. Marginalization. The marginal densities,

       p(x_A) = \int_{x_B} p(x_A, x_B; \mu, \Sigma)\, dx_B
       p(x_B) = \int_{x_A} p(x_A, x_B; \mu, \Sigma)\, dx_A

   are Gaussian:

       x_A ~ N(μ_A, Σ_{AA})
       x_B ~ N(μ_B, Σ_{BB}).

3. Conditioning. The conditional densities,

       p(x_A \mid x_B) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_A} p(x_A, x_B; \mu, \Sigma)\, dx_A}
       p(x_B \mid x_A) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_B} p(x_A, x_B; \mu, \Sigma)\, dx_B}

   are also Gaussian:

       x_A \mid x_B ~ N\!\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)
       x_B \mid x_A ~ N\!\left( \mu_B + \Sigma_{BA} \Sigma_{AA}^{-1} (x_A - \mu_A),\; \Sigma_{BB} - \Sigma_{BA} \Sigma_{AA}^{-1} \Sigma_{AB} \right).

   A proof of this property is given in Appendix A.2. (See also Appendix A.3 for an easier version of the derivation.)

4. Summation. The sum of independent Gaussian random variables (with the same dimensionality), y ~ N(μ, Σ) and z ~ N(μ′, Σ′), is also Gaussian:

       y + z ~ N(μ + μ′, Σ + Σ′).
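To make the marginalization and conditioning properties concrete, here is a small numpy sketch on a made-up 3-dimensional Gaussian, partitioned as A = {x_1, x_2} and B = {x_3}; all parameter values are illustrative only:

    import numpy as np

    # A made-up 3-D Gaussian, partitioned into A = {x1, x2} and B = {x3}.
    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.5, 0.4],
                      [0.3, 0.4, 1.0]])
    A, B = [0, 1], [2]
    S_AA, S_AB = Sigma[np.ix_(A, A)], Sigma[np.ix_(A, B)]
    S_BA, S_BB = Sigma[np.ix_(B, A)], Sigma[np.ix_(B, B)]

    # Property 3 (conditioning): x_A | x_B is Gaussian with mean
    # mu_A + S_AB S_BB^{-1} (x_B - mu_B) and covariance
    # S_AA - S_AB S_BB^{-1} S_BA.
    x_B = np.array([0.5])
    mu_cond = mu[A] + S_AB @ np.linalg.solve(S_BB, x_B - mu[B])
    Sigma_cond = S_AA - S_AB @ np.linalg.solve(S_BB, S_BA)
    print("mean of x_A | x_B:", mu_cond)
    print("cov  of x_A | x_B:\n", Sigma_cond)

    # Property 2 (marginalization): samples of x_A alone should match
    # N(mu_A, S_AA) empirically.
    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(mu, Sigma, size=200_000)
    print(samples[:, A].mean(axis=0))   # approximately mu_A
    print(np.cov(samples[:, A].T))      # approximately S_AA

Note that the conditional covariance Σ_{AA} − Σ_{AB} Σ_{BB}^{-1} Σ_{BA} does not depend on the observed value x_B; this is a special property of Gaussians.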
2  Bayesian linear regression

Let S = {(x^{(i)}, y^{(i)})}_{i=1}^m be a training set of i.i.d. examples from some unknown distribution. The standard probabilistic interpretation of linear regression states that

    y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad i = 1, \ldots, m,

where the ε^{(i)} are i.i.d. "noise" variables with independent N(0, σ²) distributions. It follows that y^{(i)} − θ^T x^{(i)} ~ N(0, σ²), or equivalently,

    p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).

For notational convenience, we define

    X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad
    \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m, \qquad
    \vec{\varepsilon} = \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(m)} \end{bmatrix} \in \mathbb{R}^m.

[Figure 1: Bayesian linear regression for a one-dimensional linear regression problem, y^{(i)} = θ x^{(i)} + ε^{(i)}, with ε^{(i)} ~ N(0, 1) i.i.d. noise. The green region denotes the 95% confidence region for predictions of the model. Note that the (vertical) width of the green region is largest at the ends but narrowest in the middle. This region reflects the uncertainty in the estimates for the parameter θ. In contrast, a classical linear regression model would display a confidence region of constant width, reflecting only the N(0, σ²) noise in the outputs.]
In Bayesian linear regression, we assume that a prior distribution over parameters is also given; a typical choice, for instance, is θ ~ N(0, τ²I). Using Bayes's rule, we obtain the parameter posterior,

    p(\theta \mid S) = \frac{p(\theta)\, p(S \mid \theta)}{\int_{\theta'} p(\theta')\, p(S \mid \theta')\, d\theta'}
                     = \frac{p(\theta) \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta)}{\int_{\theta'} p(\theta') \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta')\, d\theta'}.        (2)

Assuming the same noise model on testing points as on our training points, the "output" of Bayesian linear regression on a new test point x_* is not just a single guess "y_*", but rather an entire probability distribution over possible outputs, known as the posterior predictive distribution:

    p(y_* \mid x_*, S) = \int_\theta p(y_* \mid x_*, \theta)\, p(\theta \mid S)\, d\theta.        (3)

For many types of models, the integrals in (2) and (3) are difficult to compute, and hence, we often resort to approximations, such as MAP estimation (see course lecture notes on "Regularization and Model Selection").
In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (after much work!) that

    \theta \mid S \sim N\!\left( \frac{1}{\sigma^2} A^{-1} X^T \vec{y},\; A^{-1} \right)
    y_* \mid x_*, S \sim N\!\left( \frac{1}{\sigma^2} x_*^T A^{-1} X^T \vec{y},\; x_*^T A^{-1} x_* + \sigma^2 \right),

where A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I. The derivation of these formulas is somewhat involved.^6 Nonetheless, from these equations, we get at least a flavor of what Bayesian methods are all about: the posterior distribution over the test output y_* for a test input x_* is a Gaussian distribution; this distribution reflects the uncertainty in our predictions y_* = θ^T x_* + ε_* arising from both the randomness in ε_* and the uncertainty in our choice of parameters θ. In contrast, classical probabilistic linear regression models estimate parameters θ directly from the training data but provide no estimate of how reliable these learned parameters may be (see Figure 1).

^6 For the complete derivation, see, for instance, [1]. Alternatively, read the Appendices, which give a number of arguments based on the "completion-of-squares" trick, and derive this formula yourself!
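Since the posterior and posterior predictive above are available in closed form, they are straightforward to implement. The following numpy sketch (with made-up data and the hyperparameter choices σ = τ = 1) computes the predictive mean and variance at a few test points:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic 1-D data from y = theta * x + noise (made-up ground truth).
    m, n = 20, 1
    sigma, tau = 1.0, 1.0                 # noise std dev and prior std dev
    theta_true = np.array([0.7])
    X = rng.uniform(-5, 5, size=(m, n))
    y = X @ theta_true + sigma * rng.standard_normal(m)

    # Posterior over theta:  theta | S ~ N((1/sigma^2) A^{-1} X^T y, A^{-1}),
    # where A = (1/sigma^2) X^T X + (1/tau^2) I.
    A = (X.T @ X) / sigma**2 + np.eye(n) / tau**2
    A_inv = np.linalg.inv(A)
    theta_mean = A_inv @ X.T @ y / sigma**2

    # Posterior predictive at test points:
    # y* | x*, S ~ N((1/sigma^2) x*^T A^{-1} X^T y, x*^T A^{-1} x* + sigma^2).
    X_test = np.linspace(-7, 7, 5).reshape(-1, n)
    pred_mean = X_test @ theta_mean
    pred_var = np.sum(X_test @ A_inv * X_test, axis=1) + sigma**2

    for x_s, m_s, s_s in zip(X_test.ravel(), pred_mean, np.sqrt(pred_var)):
        print(f"x* = {x_s:+.2f}:  y* ~ N({m_s:+.3f}, {s_s:.3f}^2)")

Observe that the predictive standard deviation grows as x_* moves away from the training data, which is exactly the widening of the confidence region at the ends of Figure 1.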
3  Gaussian processes

As described in Section 1, multivariate Gaussian distributions are useful for modeling finite collections of real-valued variables because of their nice analytical properties. Gaussian processes are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension will allow us to think of
Gaussian processes as distributions not just over random vectors but in fact distributions over random functions.^7

^7 Let H be a class of functions mapping from X → Y. A random function f(·) from H is a function which is randomly drawn from H, according to some probability distribution over H. One potential source of confusion is that you may be tempted to think of random functions as functions whose outputs are in some way stochastic; this is not the case. Instead, a random function f(·), once selected from H probabilistically, implies a deterministic mapping from inputs in X to outputs in Y.

3.1  Probability distributions over functions with finite domains

To understand how one might parameterize probability distributions over functions, consider the following simple example. Let X = {x_1, . . . , x_m} be any finite set of elements. Now, consider the set H of all possible functions mapping from X to R. For instance, one example of a function f_0(·) ∈ H is given by

    f_0(x_1) = 5,  f_0(x_2) = 2.3,  f_0(x_3) = −7,  . . . ,  f_0(x_{m−1}) = −π,  f_0(x_m) = 8.

Since the domain of any f(·) ∈ H has only m elements, we can always represent f(·) compactly as an m-dimensional vector, f = [f(x_1)  f(x_2)  ⋯  f(x_m)]^T. In order to specify a probability distribution over functions f(·) ∈ H, we must associate some "probability density" with each function in H. One natural way to do this is to exploit the one-to-one correspondence between functions f(·) ∈ H and their vector representations, f. In particular, if we specify that f ~ N(μ, σ²I), then this in turn implies a probability distribution over functions f(·), whose probability density function is given by

    p(f) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(f(x_i) - \mu_i)^2}{2\sigma^2} \right).

In the example above, we showed that probability distributions over functions with finite domains can be represented using a finite-dimensional multivariate Gaussian distribution over function outputs f(x_1), . . . , f(x_m) at a finite number of input points x_1, . . . , x_m. How can we specify probability distributions over functions when the domain size may be infinite? For this, we turn to a fancier type of probability distribution known as a Gaussian process.
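As a concrete illustration of this finite-domain view, the sketch below draws a few random functions f(·) by sampling their vector representations f ~ N(μ, σ²I); the domain size m and the parameter values are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(1)

    # A made-up finite domain X = {x_1, ..., x_m} and prior parameters.
    m = 5
    mu = np.zeros(m)       # mean vector (mu_1, ..., mu_m)
    sigma = 1.0

    # Each draw of f ~ N(mu, sigma^2 I) is one random function f(.),
    # represented compactly by its m function values.
    for draw in range(3):
        f = rng.normal(mu, sigma)   # independent N(mu_i, sigma^2) per point
        print(f"f_{draw}:", np.round(f, 2))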
3.2  Probability distributions over functions with infinite domains

A stochastic process is a collection of random variables, {f(x) : x ∈ X}, indexed by elements from some set X, known as the index set.^8 A Gaussian process is a stochastic process such that any finite subcollection of random variables has a multivariate Gaussian distribution.

^8 Often, when X = R, one can interpret the indices x ∈ X as representing times, and hence the variables f(x) represent the temporal evolution of some random quantity over time. In the models that are used for Gaussian process regression, however, the index set is taken to be the input space of our regression problem.

In particular, a collection of random variables {f(x) : x ∈ X} is said to be drawn from a Gaussian process with mean function m(·) and covariance function k(·,·) if for any finite set of elements x_1, . . . , x_m ∈ X, the associated finite set of random variables f(x_1), . . . , f(x_m) have distribution,

    \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_m) \end{bmatrix}
    \sim N\!\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_m) \end{bmatrix},\;
    \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix} \right).

We denote this using the notation,

    f(·) ~ GP(m(·), k(·,·)).

Observe that the mean function and covariance function are aptly named, since the above properties imply that

    m(x) = E[f(x)]
    k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

for any x, x′ ∈ X.

Intuitively, one can think of a function f(·) drawn from a Gaussian process prior as an extremely high-dimensional vector drawn from an extremely high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element x from the index set X, and the corresponding component of the random vector represents the value of f(x). Using the marginalization property for multivariate Gaussians, we can obtain the marginal multivariate Gaussian density corresponding to any finite subcollection of variables.

What sort of functions m(·) and k(·,·) give rise to valid Gaussian processes? In general, any real-valued function m(·) is acceptable, but for k(·,·), it must be the case that for any
set of elements x_1, . . . , x_m ∈ X, the resulting matrix

    K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix}

is a valid covariance matrix corresponding to some multivariate Gaussian distribution. A standard result in probability theory states that this is true provided that K is positive semidefinite. Sound familiar?

The positive semidefiniteness requirement for covariance matrices computed based on arbitrary input points is, in fact, identical to Mercer's condition for kernels! A function k(·,·) is a valid kernel provided the resulting kernel matrix K defined as above is always positive semidefinite for any set of input points x_1, . . . , x_m ∈ X. Gaussian processes, therefore, are kernel-based probability distributions in the sense that any valid kernel function can be used as a covariance function!

[Figure 2: Samples from a zero-mean Gaussian process prior with k_SE(·,·) covariance function, using (a) τ = 0.5, (b) τ = 2, and (c) τ = 10. Note that as the bandwidth parameter τ increases, points which are farther away will have higher correlations than before, and hence the sampled functions tend to be smoother overall.]
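The following numpy sketch illustrates this correspondence: it builds the matrix K from the squared exponential kernel of Section 3.3 below on a handful of made-up input points and confirms that K is positive semidefinite:

    import numpy as np

    def k_se(x, z, tau=1.0):
        """Squared exponential kernel (see Section 3.3 below)."""
        return np.exp(-np.abs(x - z) ** 2 / (2 * tau ** 2))

    # Made-up input points.
    x_pts = np.array([0.0, 0.3, 1.0, 2.5, 4.0])

    # Kernel (covariance) matrix K with K_ij = k(x_i, x_j).
    K = k_se(x_pts[:, None], x_pts[None, :])

    # A valid covariance matrix must be positive semidefinite:
    # all eigenvalues >= 0 (up to numerical round-off).
    eigvals = np.linalg.eigvalsh(K)
    print("eigenvalues of K:", np.round(eigvals, 6))
    print("positive semidefinite:", np.all(eigvals >= -1e-10))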
3.3  The squared exponential kernel

In order to get an intuition for how Gaussian processes work, consider a simple zero-mean Gaussian process,

    f(·) ~ GP(0, k(·,·)),

defined for functions f : X → R where we take X = R. Here, we choose the kernel function k(·,·) to be the squared exponential^9 kernel function, defined as

    k_{SE}(x, x') = \exp\left( -\frac{1}{2\tau^2} |x - x'|^2 \right)

for some τ > 0. What do random functions sampled from this Gaussian process look like?

^9 In the context of SVMs, we called this the Gaussian kernel; to avoid confusion with "Gaussian" processes, we refer to this kernel here as the squared exponential kernel, even though the two are formally identical.

In our example, since we use a zero-mean Gaussian process, we would expect that the function values from our Gaussian process will tend to be distributed around zero. Furthermore, for any pair of elements x, x′ ∈ X,

- f(x) and f(x′) will tend to have high covariance when x and x′ are "nearby" in the input space (i.e., |x − x′| ≈ 0, so exp(−|x − x′|²/(2τ²)) ≈ 1);
- f(x) and f(x′) will tend to have low covariance when x and x′ are "far apart" (i.e., |x − x′| ≫ 0, so exp(−|x − x′|²/(2τ²)) ≈ 0).

More simply stated, functions drawn from a zero-mean Gaussian process prior with the squared exponential kernel will tend to be "locally smooth" with high probability; i.e., nearby function values are highly correlated, and the correlation drops off as a function of distance in the input space (see Figure 2).
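Samples like those in Figure 2 can be generated by discretizing the input space on a fine grid and drawing from a multivariate Gaussian with the corresponding covariance matrix. A minimal numpy/matplotlib sketch follows; the grid resolution, seed, and the small diagonal jitter are implementation choices, not part of the model:

    import numpy as np
    import matplotlib.pyplot as plt

    def k_se(x, z, tau):
        """Squared exponential kernel k_SE(x, z) = exp(-|x - z|^2 / (2 tau^2))."""
        return np.exp(-(x - z) ** 2 / (2 * tau ** 2))

    rng = np.random.default_rng(0)
    xs = np.linspace(0, 10, 200)     # a dense grid standing in for X = R

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, tau in zip(axes, [0.5, 2.0, 10.0]):
        K = k_se(xs[:, None], xs[None, :], tau)
        # Jitter on the diagonal keeps the sampling numerically stable.
        K += 1e-10 * np.eye(len(xs))
        for _ in range(3):           # three sample functions per panel
            f = rng.multivariate_normal(np.zeros(len(xs)), K)
            ax.plot(xs, f)
        ax.set_title(f"tau = {tau}")
    plt.show()

As the bandwidth τ grows, distant grid points become more strongly correlated and the sampled curves flatten out, reproducing the trend described in the Figure 2 caption.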
4  Gaussian process regression

As discussed in the last section, Gaussian processes provide a method for modelling probability distributions over functions. Here, we discuss how probability distributions over functions can be used in the framework of Bayesian regression.

4.1  The Gaussian process regression model

Let S = {(x^{(i)}, y^{(i)})}_{i=1}^m be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,

    y^{(i)} = f(x^{(i)}) + \varepsilon^{(i)}, \qquad i = 1, \ldots, m,

where the ε^{(i)} are i.i.d. "noise" variables with independent N(0, σ²) distributions. Like in Bayesian linear regression, we also assume a prior distribution over functions f(·); in particular, we assume a zero-mean Gaussian process prior,

    f(·) ~ GP(0, k(·,·)),

for some valid covariance function k(·,·).
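Putting the pieces together, the sketch below simulates this generative model: it draws latent function values jointly from the zero-mean GP prior with the squared exponential kernel and adds N(0, σ²) observation noise. The values of m, σ, and τ are illustrative choices, not prescribed by the notes:

    import numpy as np

    def k_se(x, z, tau=1.0):
        """Squared exponential covariance function."""
        return np.exp(-(x - z) ** 2 / (2 * tau ** 2))

    rng = np.random.default_rng(42)
    m, sigma = 15, 0.1          # illustrative training size and noise level

    # Training inputs x^(1), ..., x^(m) (made up for this example).
    x_train = np.sort(rng.uniform(0, 10, size=m))

    # Draw the latent function values f(x^(i)) jointly from the GP prior:
    # (f(x^(1)), ..., f(x^(m))) ~ N(0, K) with K_ij = k_SE(x^(i), x^(j)).
    K = k_se(x_train[:, None], x_train[None, :]) + 1e-10 * np.eye(m)
    f_train = rng.multivariate_normal(np.zeros(m), K)

    # Observed targets: y^(i) = f(x^(i)) + eps^(i),  eps^(i) ~ N(0, sigma^2).
    y_train = f_train + sigma * rng.standard_normal(m)

    for x_i, y_i in zip(x_train, y_train):
        print(f"x = {x_i:5.2f}   y = {y_i:+.3f}")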