Gaussian processes

Chuong B. Do (updated by Honglak Lee)
November 22, 2008

Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern: given a training set of i.i.d. examples sampled from some unknown distribution,

1. solve a convex optimization problem in order to identify the single “best fit” model for the data, and
2. use this estimated model to make “best guess” predictions for future test input points.

In these notes, we will talk about a different flavor of learning algorithms, known as Bayesian methods. Unlike classical learning algorithms, Bayesian algorithms do not attempt to identify “best-fit” models of the data (or similarly, make “best guess” predictions for new test inputs). Instead, they compute a posterior distribution over models (or similarly, compute posterior predictive distributions for new test inputs). These distributions provide a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge of this uncertainty in order to make more robust predictions on new test points.

We focus on regression problems, where the goal is to learn a mapping from some input space $\mathcal{X} = \mathbb{R}^n$ of $n$-dimensional vectors to an output space $\mathcal{Y} = \mathbb{R}$ of real-valued targets. In particular, we will talk about a kernel-based fully Bayesian regression algorithm, known as Gaussian process regression. The material covered in these notes draws heavily on many different topics that we discussed previously in class (namely, the probabilistic interpretation of linear regression^1, Bayesian methods^2, kernels^3, and properties of multivariate Gaussians^4).

The organization of these notes is as follows. In Section 1, we provide a brief review of multivariate Gaussian distributions and their properties. In Section 2, we briefly review Bayesian methods in the context of probabilistic linear regression. The central ideas underlying Gaussian processes are presented in Section 3, and we derive the full Gaussian process regression model in Section 4.

^1 See course lecture notes on “Supervised Learning, Discriminative Algorithms.”
^2 See course lecture notes on “Regularization and Model Selection.”
^3 See course lecture notes on “Support Vector Machines.”
^4 See course lecture notes on “Factor Analysis.”

1  Multivariate Gaussians

A vector-valued random variable $x \in \mathbb{R}^n$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{S}_{++}^n$ if

    p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).    (1)

We write this as $x \sim \mathcal{N}(\mu, \Sigma)$. Here, recall from the section notes on linear algebra that $\mathbb{S}_{++}^n$ refers to the space of symmetric positive definite $n \times n$ matrices.^5

^5 There are actually cases in which we would want to deal with multivariate Gaussian distributions where $\Sigma$ is positive semidefinite but not positive definite (i.e., $\Sigma$ is not full rank). In such cases, $\Sigma^{-1}$ does not exist, so the definition of the Gaussian density given in (1) does not apply. For instance, see the course lecture notes on “Factor Analysis.”

Generally speaking, Gaussian random variables are extremely useful in machine learning and statistics for two main reasons. First, they are extremely common when modeling “noise” in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement process; by the Central Limit Theorem, summations of independent random variables will tend to “look Gaussian.” Second, Gaussian random variables are convenient for many analytical manipulations, because many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. In the remainder of this section, we will review a number of useful properties of multivariate Gaussians.

Consider a random vector $x \in \mathbb{R}^n$ with $x \sim \mathcal{N}(\mu, \Sigma)$. Suppose also that the variables in $x$ have been partitioned into two sets $x_A = [x_1 \cdots x_r]^T \in \mathbb{R}^r$ and $x_B = [x_{r+1} \cdots x_n]^T \in \mathbb{R}^{n-r}$ (and similarly for $\mu$ and $\Sigma$), such that

    x = \begin{bmatrix} x_A \\ x_B \end{bmatrix}, \qquad
    \mu = \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \qquad
    \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}.

Here, $\Sigma_{AB} = \Sigma_{BA}^T$ since $\Sigma = E[(x - \mu)(x - \mu)^T] = \Sigma^T$. The following properties hold:

1. Normalization. The density function normalizes, i.e.,

       \int_x p(x; \mu, \Sigma) \, dx = 1.

   This property, though seemingly trivial at first glance, turns out to be immensely useful for evaluating all sorts of integrals, even ones which appear to have no relation to probability distributions at all (see Appendix A.1)!

2. Marginalization. The marginal densities,

       p(x_A) = \int_{x_B} p(x_A, x_B; \mu, \Sigma) \, dx_B
       p(x_B) = \int_{x_A} p(x_A, x_B; \mu, \Sigma) \, dx_A,

   are Gaussian:

       x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA})
       x_B \sim \mathcal{N}(\mu_B, \Sigma_{BB}).

3. Conditioning. The conditional densities

       p(x_A \mid x_B) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_A} p(x_A, x_B; \mu, \Sigma) \, dx_A}
       p(x_B \mid x_A) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_B} p(x_A, x_B; \mu, \Sigma) \, dx_B}

   are also Gaussian:

       x_A \mid x_B \sim \mathcal{N}\!\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)
       x_B \mid x_A \sim \mathcal{N}\!\left( \mu_B + \Sigma_{BA} \Sigma_{AA}^{-1} (x_A - \mu_A),\; \Sigma_{BB} - \Sigma_{BA} \Sigma_{AA}^{-1} \Sigma_{AB} \right).

   A proof of this property is given in Appendix A.2. (See also Appendix A.3 for an easier version of the derivation.)

4. Summation. The sum of independent Gaussian random variables (with the same dimensionality), $y \sim \mathcal{N}(\mu, \Sigma)$ and $z \sim \mathcal{N}(\mu', \Sigma')$, is also Gaussian:

       y + z \sim \mathcal{N}(\mu + \mu', \Sigma + \Sigma').
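Of these properties, conditioning does the heavy lifting in Gaussian process regression later in these notes, so it is worth seeing in code. The following is a minimal NumPy sketch (not part of the original notes): the helper name condition_gaussian, the 3-dimensional example, and the particular values of mu and Sigma are all illustrative assumptions.

```python
import numpy as np

def condition_gaussian(mu, Sigma, r, x_B):
    """Return (mean, cov) of x_A | x_B for x ~ N(mu, Sigma), where x_A = x[:r], x_B = x[r:].

    Implements the conditioning formulas from property 3:
        mean = mu_A + Sigma_AB Sigma_BB^{-1} (x_B - mu_B)
        cov  = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA
    """
    mu_A, mu_B = mu[:r], mu[r:]
    S_AA, S_AB = Sigma[:r, :r], Sigma[:r, r:]
    S_BA, S_BB = Sigma[r:, :r], Sigma[r:, r:]
    # Solve against Sigma_BB rather than forming an explicit inverse.
    mean = mu_A + S_AB @ np.linalg.solve(S_BB, x_B - mu_B)
    cov = S_AA - S_AB @ np.linalg.solve(S_BB, S_BA)
    return mean, cov

# Example: a 3-dimensional Gaussian partitioned into A = {x_1} and B = {x_2, x_3}.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mean, cov = condition_gaussian(mu, Sigma, r=1, x_B=np.array([2.0, 0.0]))
print(mean, cov)  # conditional mean and (1 x 1) covariance of x_1 given x_2 = 2, x_3 = 0
```

Note that observing $x_B$ shrinks the covariance of $x_A$ (the term subtracted from $\Sigma_{AA}$ is positive semidefinite), which is exactly the uncertainty reduction that conditioning on data provides.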

2  Bayesian linear regression

Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be a training set of i.i.d. examples from some unknown distribution. The standard probabilistic interpretation of linear regression states that

    y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad i = 1, \ldots, m,

where the $\varepsilon^{(i)}$ are i.i.d. “noise” variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. It follows that $y^{(i)} - \theta^T x^{(i)} \sim \mathcal{N}(0, \sigma^2)$, or equivalently,

    p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).

For notational convenience, we define

    X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad
    y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m, \qquad
    \varepsilon = \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(m)} \end{bmatrix} \in \mathbb{R}^m.

Figure 1: Bayesian linear regression for a one-dimensional linear regression problem, $y^{(i)} = \theta x^{(i)} + \varepsilon^{(i)}$, with $\varepsilon^{(i)} \sim \mathcal{N}(0, 1)$ i.i.d. noise. The green region denotes the 95% confidence region for predictions of the model. Note that the (vertical) width of the green region is largest at the ends but narrowest in the middle. This region reflects the uncertainty in the estimates for the parameter $\theta$. In contrast, a classical linear regression model would display a confidence region of constant width, reflecting only the $\mathcal{N}(0, \sigma^2)$ noise in the outputs.

In Bayesian linear regression, we assume that a prior distribution over parameters is also given; a typical choice, for instance, is $\theta \sim \mathcal{N}(0, \tau^2 I)$. Using Bayes's rule, we obtain the parameter posterior,

    p(\theta \mid S) = \frac{p(\theta)\, p(S \mid \theta)}{\int_{\theta'} p(\theta')\, p(S \mid \theta')\, d\theta'} = \frac{p(\theta) \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta)}{\int_{\theta'} p(\theta') \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta')\, d\theta'}.    (2)

Assuming the same noise model on testing points as on our training points, the “output” of Bayesian linear regression on a new test point $x_*$ is not just a single guess “$y_*$”, but rather an entire probability distribution over possible outputs, known as the posterior predictive distribution:

    p(y_* \mid x_*, S) = \int_\theta p(y_* \mid x_*, \theta)\, p(\theta \mid S)\, d\theta.    (3)

For many types of models, the integrals in (2) and (3) are difficult to compute, and hence, we often resort to approximations, such as MAP estimation (see course lecture notes on “Regularization and Model Selection”). In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (after much work!) that

    \theta \mid S \sim \mathcal{N}\!\left( \frac{1}{\sigma^2} A^{-1} X^T y,\; A^{-1} \right)
    y_* \mid x_*, S \sim \mathcal{N}\!\left( \frac{1}{\sigma^2} x_*^T A^{-1} X^T y,\; x_*^T A^{-1} x_* + \sigma^2 \right),

where $A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I$. The derivation of these formulas is somewhat involved.^6 Nonetheless, from these equations, we get at least a flavor of what Bayesian methods are all about: the posterior distribution over the test output $y_*$ for a test input $x_*$ is a Gaussian distribution; this distribution reflects the uncertainty in our predictions $y_* = \theta^T x_* + \varepsilon_*$ arising from both the randomness in $\varepsilon_*$ and the uncertainty in our choice of parameters $\theta$. In contrast, classical probabilistic linear regression models estimate parameters $\theta$ directly from the training data but provide no estimate of how reliable these learned parameters may be (see Figure 1).

^6 For the complete derivation, see, for instance, [1]. Alternatively, read the Appendices, which give a number of arguments based on the “completion-of-squares” trick, and derive this formula yourself!
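To make the closed-form result concrete, here is a minimal NumPy sketch of these two formulas. The synthetic one-dimensional data (matching the setup of Figure 1), the hyperparameter values $\sigma = 1$ and $\tau = 1$, and the function name bayesian_linreg_predict are illustrative assumptions rather than anything prescribed by the notes.

```python
import numpy as np

def bayesian_linreg_predict(X, y, x_star, sigma, tau):
    """Posterior predictive mean/variance of y_* given x_*, under the prior theta ~ N(0, tau^2 I).

    Implements  A = (1/sigma^2) X^T X + (1/tau^2) I  and
                y_* | x_*, S ~ N( (1/sigma^2) x_*^T A^{-1} X^T y,  x_*^T A^{-1} x_* + sigma^2 ).
    """
    n = X.shape[1]
    A = (X.T @ X) / sigma**2 + np.eye(n) / tau**2
    mean = (x_star @ np.linalg.solve(A, X.T @ y)) / sigma**2
    var = x_star @ np.linalg.solve(A, x_star) + sigma**2
    return mean, var

# Toy data: y = 2x + noise, one feature per example as in Figure 1.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=20)
y = 2.0 * x + rng.normal(0.0, 1.0, size=20)
X = x.reshape(-1, 1)
mean, var = bayesian_linreg_predict(X, y, np.array([3.0]), sigma=1.0, tau=1.0)
print(f"predictive mean {mean:.2f}, predictive std {np.sqrt(var):.2f}")
```

The predictive variance contains both terms described in the text: $x_*^T A^{-1} x_*$ from the uncertainty in $\theta$, and $\sigma^2$ from the output noise.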

3  Gaussian processes

As described in Section 1, multivariate Gaussian distributions are useful for modeling finite collections of real-valued variables because of their nice analytical properties. Gaussian processes are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension will allow us to think of Gaussian processes as distributions not just over random vectors but in fact distributions over random functions.^7

^7 Let $\mathcal{H}$ be a class of functions mapping from $\mathcal{X} \to \mathcal{Y}$. A random function $f(\cdot)$ from $\mathcal{H}$ is a function which is randomly drawn from $\mathcal{H}$, according to some probability distribution over $\mathcal{H}$. One potential source of confusion is that you may be tempted to think of random functions as functions whose outputs are in some way stochastic; this is not the case. Instead, a random function $f(\cdot)$, once selected from $\mathcal{H}$ probabilistically, implies a deterministic mapping from inputs in $\mathcal{X}$ to outputs in $\mathcal{Y}$.

3.1  Probability distributions over functions with finite domains

To understand how one might parameterize probability distributions over functions, consider the following simple example. Let $\mathcal{X} = \{x_1, \ldots, x_m\}$ be any finite set of elements. Now, consider the set $\mathcal{H}$ of all possible functions mapping from $\mathcal{X}$ to $\mathbb{R}$. For instance, one example of a function $f_0(\cdot) \in \mathcal{H}$ is given by

    f_0(x_1) = 5, \quad f_0(x_2) = 2.3, \quad f_0(x_3) = -7, \quad \ldots, \quad f_0(x_{m-1}) = -\pi, \quad f_0(x_m) = 8.

Since the domain of any $f(\cdot) \in \mathcal{H}$ has only $m$ elements, we can always represent $f(\cdot)$ compactly as an $m$-dimensional vector, $\vec{f} = [f(x_1)\; f(x_2)\; \cdots\; f(x_m)]^T$. In order to specify a probability distribution over functions $f(\cdot) \in \mathcal{H}$, we must associate some “probability density” with each function in $\mathcal{H}$. One natural way to do this is to exploit the one-to-one correspondence between functions $f(\cdot) \in \mathcal{H}$ and their vector representations, $\vec{f}$. In particular, if we specify that $\vec{f} \sim \mathcal{N}(\vec{\mu}, \sigma^2 I)$, then this in turn implies a probability distribution over functions $f(\cdot)$, whose probability density function is given by

    p(h) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (f(x_i) - \mu_i)^2 \right).
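Since the domain is finite, this construction is easy to simulate directly: drawing a “random function” amounts to drawing an $m$-dimensional Gaussian vector and reading it off as a lookup table. A minimal sketch, not from the notes; the particular domain, $\vec{\mu} = 0$, and $\sigma = 1$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite domain X = {x_1, ..., x_m} and the vector representation of a random f(.).
domain = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # m = 5 input points
m = len(domain)
mu, sigma = np.zeros(m), 1.0
f_vec = rng.normal(mu, sigma)                  # one draw of f ~ N(mu, sigma^2 I)

# Once drawn, the function is a deterministic lookup table from x_i to f(x_i)
# (cf. footnote 7: the randomness is in which function was picked, not in its outputs).
f = dict(zip(domain.tolist(), f_vec.tolist()))
print(f[1.0])   # the sampled value f(x_3)
```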

In the example above, we showed that probability distributions over functions with finite domains can be represented using a finite-dimensional multivariate Gaussian distribution over function outputs $f(x_1), \ldots, f(x_m)$ at a finite number of input points $x_1, \ldots, x_m$. How can we specify probability distributions over functions when the domain size may be infinite? For this, we turn to a fancier type of probability distribution known as a Gaussian process.

3.2  Probability distributions over functions with infinite domains

A stochastic process is a collection of random variables, $\{f(x) : x \in \mathcal{X}\}$, indexed by elements from some set $\mathcal{X}$, known as the index set.^8 A Gaussian process is a stochastic process such that any finite subcollection of random variables has a multivariate Gaussian distribution.

^8 Often, when $\mathcal{X} = \mathbb{R}$, one can interpret the indices $x \in \mathcal{X}$ as representing times, and hence the variables $f(x)$ represent the temporal evolution of some random quantity over time. In the models that are used for Gaussian process regression, however, the index set is taken to be the input space of our regression problem.

In particular, a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ is said to be drawn from a Gaussian process with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$ if for any finite set of elements $x_1, \ldots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \ldots, f(x_m)$ have distribution

    \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_m) \end{bmatrix} \sim
    \mathcal{N}\!\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_m) \end{bmatrix},
    \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix} \right).

We denote this using the notation

    f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)).

Observe that the mean function and covariance function are aptly named since the above properties imply that

    m(x) = E[f(x)]
    k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]

for any $x, x' \in \mathcal{X}$.

Intuitively, one can think of a function $f(\cdot)$ drawn from a Gaussian process prior as an extremely high-dimensional vector drawn from an extremely high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element $x$ from the index set $\mathcal{X}$, and the corresponding component of the random vector represents the value of $f(x)$. Using the marginalization property for multivariate Gaussians, we can obtain the marginal multivariate Gaussian density corresponding to any finite subcollection of variables.

What sort of functions $m(\cdot)$ and $k(\cdot, \cdot)$ give rise to valid Gaussian processes? In general, any real-valued function $m(\cdot)$ is acceptable, but for $k(\cdot, \cdot)$, it must be the case that for any set of elements $x_1, \ldots, x_m \in \mathcal{X}$, the resulting matrix

    K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix}

is a valid covariance matrix corresponding to some multivariate Gaussian distribution. A standard result in probability theory states that this is true provided that $K$ is positive semidefinite. Sound familiar?

The positive semidefiniteness requirement for covariance matrices computed based on arbitrary input points is, in fact, identical to Mercer's condition for kernels! A function $k(\cdot, \cdot)$ is a valid kernel provided the resulting kernel matrix $K$ defined as above is always positive semidefinite for any set of input points $x_1, \ldots, x_m \in \mathcal{X}$. Gaussian processes, therefore, are kernel-based probability distributions in the sense that any valid kernel function can be used as a covariance function!
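This equivalence is easy to check numerically: pick any finite set of input points, build $K$ from a kernel, and confirm that its eigenvalues are nonnegative (up to round-off). A minimal sketch, using the squared exponential kernel from the next subsection as the example kernel; the helper names and the particular grid of points are illustrative assumptions:

```python
import numpy as np

def k_se(x, z, tau=1.0):
    """Squared exponential kernel k_SE(x, z) = exp(-||x - z||^2 / (2 tau^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * tau**2))

def kernel_matrix(xs, k):
    """K[i, j] = k(x_i, x_j) for a collection of input points."""
    m = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(m)] for i in range(m)])

xs = np.linspace(0.0, 10.0, 8).reshape(-1, 1)   # 8 arbitrary input points in R
K = kernel_matrix(xs, k_se)
eigvals = np.linalg.eigvalsh(K)                 # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)                  # True: K is positive semidefinite (numerically)
```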

3.3  The squared exponential kernel

In order to get an intuition for how Gaussian processes work, consider a simple zero-mean Gaussian process,

    f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)),

defined for functions $h : \mathcal{X} \to \mathbb{R}$ where we take $\mathcal{X} = \mathbb{R}$. Here, we choose the kernel function $k(\cdot, \cdot)$ to be the squared exponential^9 kernel function, defined as

    k_{SE}(x, x') = \exp\!\left( -\frac{1}{2\tau^2} \|x - x'\|^2 \right)

for some $\tau > 0$. What do random functions sampled from this Gaussian process look like?

^9 In the context of SVMs, we called this the Gaussian kernel; to avoid confusion with “Gaussian” processes, we refer to this kernel here as the squared exponential kernel, even though the two are formally identical.

In our example, since we use a zero-mean Gaussian process, we would expect that the function values from our Gaussian process will tend to be distributed around zero. Furthermore, for any pair of elements $x, x' \in \mathcal{X}$,

- $f(x)$ and $f(x')$ will tend to have high covariance when $x$ and $x'$ are “nearby” in the input space (i.e., $\|x - x'\| = |x - x'| \approx 0$, so $\exp(-\frac{1}{2\tau^2}\|x - x'\|^2) \approx 1$);
- $f(x)$ and $f(x')$ will tend to have low covariance when $x$ and $x'$ are “far apart” (i.e., $\|x - x'\| \gg 0$, so $\exp(-\frac{1}{2\tau^2}\|x - x'\|^2) \approx 0$).

More simply stated, functions drawn from a zero-mean Gaussian process prior with the squared exponential kernel will tend to be “locally smooth” with high probability; i.e., nearby function values are highly correlated, and the correlation drops off as a function of distance in the input space (see Figure 2).

Figure 2: Samples from a zero-mean Gaussian process prior with $k_{SE}(\cdot, \cdot)$ covariance function, using (a) $\tau = 0.5$, (b) $\tau = 2$, and (c) $\tau = 10$. Note that as the bandwidth parameter $\tau$ increases, points which are farther away have higher correlations than before, and hence the sampled functions tend to be smoother overall.
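Samples like those in Figure 2 can be generated by discretizing the input space and drawing from the corresponding finite-dimensional Gaussian (the marginalization view from Section 3.2). The sketch below is illustrative rather than part of the notes: the grid of 200 points, the small jitter added for numerical stability, and the use of matplotlib are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_se_matrix(xs, tau):
    """Squared exponential covariance matrix on a 1-D grid of points xs."""
    d = xs[:, None] - xs[None, :]
    return np.exp(-d**2 / (2.0 * tau**2))

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 10.0, 200)                       # discretize X = R on [0, 10]

for tau in (0.5, 2.0, 10.0):                           # the three bandwidths from Figure 2
    K = k_se_matrix(xs, tau) + 1e-8 * np.eye(len(xs))  # jitter keeps the Cholesky factor well defined
    # One draw from N(0, K) is one sampled function, evaluated on the grid.
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(xs))
    plt.plot(xs, f, label=f"tau = {tau}")

plt.legend()
plt.show()
```

Larger values of $\tau$ make distant grid points more strongly correlated, so the sampled curves vary more slowly, which is exactly the smoothing effect described in the figure caption.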

4  Gaussian process regression

As discussed in the last section, Gaussian processes provide a method for modelling probability distributions over functions. Here, we discuss how probability distributions over functions can be used in the framework of Bayesian regression.

4.1  The Gaussian process regression model

Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,

    y^{(i)} = f(x^{(i)}) + \varepsilon^{(i)}, \qquad i = 1, \ldots, m,

where the $\varepsilon^{(i)}$ are i.i.d. “noise” variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. Like in Bayesian linear regression, we also assume a prior distribution over functions $f(\cdot)$; in particular, we assume a zero-mean Gaussian process prior,

    f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)),

for some valid covariance function $k(\cdot, \cdot)$.
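The rest of Section 4 derives predictions for this model; as a rough preview of where that derivation leads, the sketch below simply applies the multivariate Gaussian conditioning property from Section 1 to the joint distribution of the noisy training outputs and the test-point function values. It is an illustrative implementation under the stated model, not a formula quoted from the text above: the helper name gp_regression_predict, the toy data, and the hyperparameter values are assumptions.

```python
import numpy as np

def gp_regression_predict(X_train, y_train, X_test, k, sigma):
    """Posterior mean/covariance of f at X_test, under f ~ GP(0, k) and y = f(x) + eps, eps ~ N(0, sigma^2).

    By the GP definition, (y_train, f_test) is jointly Gaussian; conditioning on y_train
    (Section 1, property 3) gives
        mean = K_*^T (K + sigma^2 I)^{-1} y,
        cov  = K_** - K_*^T (K + sigma^2 I)^{-1} K_*.
    """
    K = np.array([[k(a, b) for b in X_train] for a in X_train])       # m x m
    K_star = np.array([[k(a, b) for b in X_test] for a in X_train])   # m x m_*
    K_ss = np.array([[k(a, b) for b in X_test] for a in X_test])      # m_* x m_*
    A = K + sigma**2 * np.eye(len(X_train))
    mean = K_star.T @ np.linalg.solve(A, y_train)
    cov = K_ss - K_star.T @ np.linalg.solve(A, K_star)
    return mean, cov

# Toy example with the squared exponential kernel from Section 3.3 (tau = 1).
k = lambda a, b: np.exp(-((a - b) ** 2) / 2.0)
X_train = np.array([1.0, 3.0, 5.0, 7.0])
y_train = np.sin(X_train) + 0.1 * np.random.default_rng(0).normal(size=4)
mean, cov = gp_regression_predict(X_train, y_train, np.array([4.0, 6.0]), k, sigma=0.1)
print(mean, np.sqrt(np.diag(cov)))   # predictive means and standard deviations of f at the test points
```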
