Scene Parsing by Integrating Function, Geometry and Appearance Models

Yibiao Zhao
Department of Statistics, University of California, Los Angeles

Song-Chun Zhu
Department of Statistics and Computer Science, University of California, Los Angeles

Abstract

Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) the functionality is the most essential property to define an indoor object, e.g. "a chair to sit on"; ii) the geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

Figure 1. Recognizing objects by appearance (a) or by functionality (b). The functional objects are defined by the affordance: how likely its 3D shape is able to afford a human action. The 3D shapes are inferred from an input 2D image in Fig.2.
1. Introduction

In recent years, object detection and labeling have made remarkable progress in the field of computer vision. However, the detection of indoor objects and the segmentation of indoor scenes are still challenging tasks. For example, in the VOC2012 Challenge [5], the state-of-the-art algorithms can only obtain an accuracy of 19.5% for the detection of chairs and 22.6% for the segmentation of chairs. Other indoor objects, like the sofa and the dining table, are among the categories with the lowest accuracies out of the twenty object categories.

As shown in Fig.1(a), it is hard to identify these object labels based solely on the appearance of the image patches (cropped from the image in Fig.2). The classic sliding-window type of object detectors, which only observe a window of the image like these, are insufficient to tell these image patches apart.

From the other point of view, in Fig.1(b), despite the appearances, people can immediately recognize objects to sit on (chair), to sleep on (bed) and to store in (cabinet) based on their 3D shapes. For example, a cuboid about 18 inches tall could be comfortable to sit on as a chair. Moreover, the functional context is helpful to identify objects with similar shapes, such as the chair on the left and the nightstand on the right. Although they have similar shapes, the nightstand is more likely to be placed beside the bed. The bed and the nightstand form a joint functional group to serve the activity of sleeping. Based on the above observations, we propose an algorithm to tackle the problem of indoor scene parsing by modeling the object function, the 3D geometry and the local appearance (FGA).

There has been a recent surge in the detection of rectangular structures, typically modeled by planar surfaces or cuboids, in the indoor environment.
(i) Hedau et al. [12, 13], Wang et al. [21], Lee et al. [17, 16] and Satkin et al. [19] adopted different approaches to model the geometric layout of the background and/or foreground blocks with the Structured SVM (or Latent SVM). (ii) Another stream of algorithms, including Han and Zhu [10], Zhao and Zhu [26] and Del Pero et al. [4, 3], built generative Bayesian models to capture the prior statistics of man-made scenes. (iii) Hu [15], Xiao et al. [24], Hejrati and Ramanan [14], Xiang and Savarese [22], and Pepik et al. [18] designed several new variants of the deformable part-based models [6] by using detectors of projected 3D parts. (iv) Bar-Aviv et al. [1] and Grabner et al. [8] detected chairs by the simulation of embodied agents in 3D CAD data and depth data respectively. Gupta et al. [9] recently proposed to infer the human workable space by adapting human poses to the scene.

Overview of our approach: On top of a series of recent studies on computing the 3D bounding boxes of indoor objects and the room layout [9, 12, 13, 21, 17, 16, 10, 26, 4, 3, 19], our model is developed based on the following observations of the function-geometry-appearance (FGA) hierarchy as shown in Fig.2.

i) Function: An indoor scene is designed to serve a handful of human activities inside it. The indoor objects (furniture) in the scene are designed to support human poses/actions, e.g. a bed to sleep on, a chair to sit on, etc.

Figure 2. A function-geometry-appearance (FGA) hierarchy. The green arrows indicate the bottom-up steps, and the cyan arrows represent the top-down steps in the inference stage.
In the functional space, we model the probabilistic derivation of functional labels including scene categories (bedroom), functional groups (sleeping area), functional objects (bed and nightstand), and functional parts (the mattress and the headboard of a bed).

ii) Geometry: The 3D size (dimension) can be sufficient to evaluate how likely an object is able to afford a human action, known as the affordance [7]. Fortunately, most furniture has regular structure, e.g. a rectangular cabinet, therefore the detection of these objects is tractable by inferring their geometric affordance. For objects like sofas and beds, we use a more fine-grained geometric model with compositional parts, i.e. a group of cuboids. For example, the bed with a headboard better explains the image signal, as shown at the bottom of Fig.2.

In the geometric space, each 3D shape is directly linked to a functional part in the functional space. Contextual relations are also involved when multiple objects are assigned to the same functional group, e.g. a bed and a nightstand for sleeping. The distributions of the 3D geometry are learned from a large set of 3D models as shown in Fig.3.

iii) Appearance: The appearance of furniture varies arbitrarily due to variations in material properties, lighting conditions, and view point. In order to land our model on the input image, we use a straight-line detection, a surface orientation estimation and a coarse foreground detection as the local evidence to support the geometry model above.
We design a four-step inference algorithm that enables an MCMC chain to travel up and down through the FGA hierarchy:

i). A bottom-up appearance-geometry (AG) step groups noisy line segments in the A space into 3D primitive shapes, i.e. cuboids and rectangles, in the G space;
ii). A bottom-up geometry-function (GF) step assigns functional labels in the F space to the detected 3D primitive shapes, e.g. to sleep on;
iii). A top-down function-geometry (FG) step further fills in the missing objects and the missing parts in the G space according to the assigned functional labels, e.g. a missing nightstand of a sleeping group, or a missing headboard of a bed;
iv). A top-down geometry-appearance (GA) step synthesizes 2D segmentation maps in the A space, and makes an accept/reject decision on the current proposal by the Metropolis-Hastings acceptance probability (a sketch of this accept/reject loop is given below).
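As a rough illustration of step iv), the sketch below shows a generic simulated-annealing Metropolis-Hastings accept/reject loop over parse-tree proposals. The `propose` and `log_posterior` callables are hypothetical placeholders standing in for the four data-driven proposal steps and for the FGA posterior P(pt|I) defined in Sect.2; this is a sketch under those assumptions, not the authors' implementation.

```python
import math
import random

def simulated_annealing_mh(init_pt, propose, log_posterior, n_iter=10000, t0=10.0):
    """Generic simulated-annealing Metropolis-Hastings loop over parse trees.

    `propose(pt)` is assumed to return (new_pt, log_q_forward, log_q_backward);
    `log_posterior(pt)` is assumed to return log P(pt|I). Both are hypothetical
    stand-ins for the paper's data-driven proposals and FGA posterior."""
    pt, log_p = init_pt, log_posterior(init_pt)
    for it in range(n_iter):
        temp = t0 / (1.0 + it)                      # simple annealing schedule (assumption)
        new_pt, log_q_fwd, log_q_bwd = propose(pt)
        new_log_p = log_posterior(new_pt)
        # Metropolis-Hastings acceptance ratio with an annealed target
        log_alpha = (new_log_p - log_p) / temp + (log_q_bwd - log_q_fwd)
        if math.log(random.random()) < min(0.0, log_alpha):
            pt, log_p = new_pt, new_log_p           # accept the proposed parse tree
    return pt
```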
Figure 3. A collection of indoor functional objects from the 3D Warehouse.

Figure 4. The distribution of the 3D sizes of the functional objects (in units of inches).

2. A stochastic scene grammar in FGA space
We present a stochastic scene grammar model [26] to compute a parse tree pt for an input image I on the FGA hierarchy. The hierarchy includes the following random variables in the FGA spaces as shown in Fig.2:

- The functional space F contains the scene categories Fs, the functional groups Fg, the functional objects Fo, and the functional parts Fp. All the variables in the functional space take discrete labels;
- The geometric space G contains the 3D geometric primitives Gp. Each Gp is parameterized by a continuous 6D variable (a 3D size and a 3D position in the scene);
- The appearance space A contains a line segment map Al, a foreground map Af, and a surface orientation map Ao. All of them can be either computed from the image, Al(I), Af(I), Ao(I), or generated by the 3D geometric model, Al(G), Af(G), Ao(G).

The probabilistic distribution of our model is defined in terms of the statistics over the derivation of the function-geometry-appearance hierarchy. The parse tree is an instance of the hierarchy, pt = (F, G, A), as illustrated in Fig.2, and it is the optimal solution of our model by the maximum a posteriori probability,

    P(pt|I) ∝ P(F) P(G|F) P(I|G).    (1)

We specify the hierarchy of an indoor scene over the functional space F, the geometric space G and the appearance space A.
2.1. The function model P(F)

The function model characterizes the prior distribution of the functional labels. We model the distribution by a probabilistic context free grammar (PCFG): G = (N, T, S, R), where N = {Fs, Fg, Fo} are the non-terminal nodes (circles in Fig.2), T = {Fp} are the functional parts as terminal nodes in the F space (squares in Fig.2), S is a start symbol, and R = {r : α → β} is a set of production rules. In our problem, we define the following production rules:

    S → Fs:   S → bedroom | living room
    Fs → Fg:  bedroom → sleeping · background | · · ·
    Fg → Fo:  sleeping → bed | bed · nightstand | · · ·
    Fo → Fp:  bed → headboard · mattress | mattress

The symbol "|" separates alternative explanations of the grammar derivation. Each alternative explanation has a branching probability q(α → β) = P(β|α). Given a functional parse containing the production rules α1 → β1, · · · , αn → βn, the probability under the PCFG is defined as,

    P(F) = ∏_{i=1}^{n} q(αi → βi).    (2)

The model is learned by simply counting the frequency of each production rule, q(α → β) = #(α → β) / #(α). In this paper, we manually designed the grammar structure and learned the parameters of the production rules from the labels of thousands of images in the SUN dataset [23] under the "bedroom" and the "living room" categories.
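To make the counting estimate and Eq.(2) concrete, here is a small Python sketch that estimates the branching probabilities from annotated derivations and scores a functional parse. The toy rule lists are invented for illustration and are not the statistics learned from the SUN dataset.

```python
from collections import Counter
import math

def estimate_pcfg(derivations):
    """q(alpha -> beta) = #(alpha -> beta) / #(alpha), estimated by counting.
    `derivations` is a list of parses, each a list of (alpha, beta) rules."""
    rule_count, lhs_count = Counter(), Counter()
    for parse in derivations:
        for alpha, beta in parse:
            rule_count[(alpha, beta)] += 1
            lhs_count[alpha] += 1
    return {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}

def log_prob_parse(parse, q):
    """log P(F) = sum_i log q(alpha_i -> beta_i), following Eq. (2)."""
    return sum(math.log(q[(alpha, beta)]) for alpha, beta in parse)

# Toy derivations (invented, not the learned SUN statistics):
toy = [[("S", "bedroom"), ("bedroom", "sleeping background"),
        ("sleeping", "bed nightstand"), ("bed", "headboard mattress")],
       [("S", "bedroom"), ("bedroom", "sleeping background"),
        ("sleeping", "bed"), ("bed", "mattress")]]
q = estimate_pcfg(toy)
print(log_prob_parse(toy[0], q))
```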
2.2. The geometric model P(G|F)

In the geometric space, we model the distribution of the 3D size (dimension) for each geometric primitive Gp given its functional label F, e.g. the size distribution of cuboid-shaped bed mattresses. The higher-level functional labels Fs, Fg, Fo introduce contextual relations among these primitives, e.g. the distance distribution between a bed and a nightstand. Suppose we have k primitives in the scene, Gp = {vi : i = 1, · · · , k}. These geometric shapes form a graph G = (V, E) in the G space, where each primitive is a graph node vi ∈ V and each contextual relation is a graph edge e ∈ E. In this way, we derive a Markov Random Field (MRF) model at the geometric level. The joint probability is factorized over the graph cliques,

    P(G|F) = ∏_{vi ∈ Gp} φ1(vi|Fp) · ∏_{ei ∈ cl(Fo)} φ2(ei|Fo) · ∏_{ei ∈ cl(Fg)} φ3(ei|Fg) · ∏_{ei ∈ cl(S)} φ4(ei),    (3)

where ei ∈ cl(X) denotes an edge whose two connecting nodes belong to the children (or descendants) of X. These four kinds of cliques are introduced by the functional parts Fp, the functional objects Fo, the functional groups Fg and the general physical constraints, respectively:

Object affordance φ1(vi|Fp) is a "unary term" which models the one-to-one correspondences between the geometric primitives Gp and the functional parts Fp. The probability measures how likely an object is able to afford the action given its geometry. As shown in Fig.1, a cube around 1.5 ft tall is comfortable to sit on despite its appearance, while a "table" 6 ft tall loses its original function of placing objects on while sitting in front of it. We model the 3D sizes of the functional parts by a mixture of Gaussians. The model characterizes the Gaussian nature of the object sizes and allows alternative canonical sizes at the same time, such as a king size bed, a full size bed, etc. We estimate the model by EM clustering, and we manually picked a few typical primitives as the initial means of the Gaussians, e.g. a coffee table, a side table and a desk from the table category.

In order to learn a better affordance model, we collected a dataset of functional indoor furniture, as shown in Fig.3. The functional objects in the dataset are modeled with real-world measurements, therefore we can generalize our model to real images by learning from this dataset. We found that the real-world 3D sizes of the objects have less variance than the projected 2D sizes. As we can see, these functional categories are quite distinguishable solely based on their sizes, as shown in Fig.4. For example, the coffee tables and side tables are very short and usually lower than the sofas, and the beds generally have large widths compared to the other objects. The object poses are aligned in the dataset. We keep four copies of the Gaussian model for four alternative orientations along the x, −x, y and −y axes to make the model rotation invariant in the testing stage.
3D compositional models of functional objects and functional groups φ2(ei|Fo), φ3(ei|Fg) are defined by the distributions of the 3D relative relations among the parts of an object Fo or the objects of a functional group Fg. We also use a high-dimensional Gaussian to model these relative relations. Fig.5 shows some typical samples drawn from our learned distribution. This term enables the top-down prediction of missing parts or missing objects, as we will discuss in Sect.3.

General physical constraints φ4(ei) avoid invalid geometric configurations that violate physical laws: two objects can not penetrate each other, and the objects must be contained in the room. The model penalizes the penetrating area between foreground objects Λf and the exceeding area beyond the background room borders Λb as φ4(ei) = (1/z) exp{−λ∞ (Λf + Λb)}, where we take λ∞ as a large number, Λf = Λ(vi) ∩ Λ(vj), and Λb = Λ(vi) − Λ(bg).
Figure 5. Samples drawn from the distributions of the 3D geometric models: (a) the functional object "sofa" and (b) the functional group "sleeping".

2.3. The appearance model P(I|G)

We define the appearance model by applying the idea of analysis-by-synthesis. In the functional space and the geometric space, we specify how the underlying causes generate a scene image. There is still a gap between the synthesized scene and the observed image, because we can not render a real image without knowing the accurate lighting conditions and material parameters. In order to fill this gap, we make use of discriminative approaches: a line segment detector [20], a foreground detector [12] and a surface orientation detector [17] produce a line map Al(I), a foreground map Af(I) and a surface orientation map Ao(I) respectively. By projecting the detected 3D geometric primitives onto the image plane, we evaluate our model by calculating the pixel-wise difference between the maps from the top-down projection and the maps from the bottom-up detection, as shown in Fig.2,

    P(I|G) ∝ exp{−d(Al(G), Al(I)) − d(Af(G), Af(I)) − d(Ao(G), Ao(I))}.    (4)

These image features have been widely used in recent studies [9, 12, 13, 21, 17, 16, 10, 26, 4, 3, 19], hence we will skip further discussion of their details.
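A minimal sketch of Eq.(4), assuming the three maps are given as equally shaped NumPy label arrays and using the fraction of disagreeing pixels as a stand-in for the distance d (the particular distance is an assumption of this sketch, not taken from the paper):

```python
import numpy as np

def log_p_appearance(gen_maps, det_maps, weights=(1.0, 1.0, 1.0)):
    """log P(I|G) up to an additive constant, following Eq. (4).

    `gen_maps` / `det_maps` are (A_l, A_f, A_o) rendered from the geometry G
    and detected from the image I, as equally shaped integer label arrays.
    The per-map pixel disagreement rate stands in for d(., .)."""
    total = 0.0
    for w, gen, det in zip(weights, gen_maps, det_maps):
        d = np.mean(gen != det)      # fraction of disagreeing pixels
        total -= w * d
    return total
```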
53、t “sofa” and (b) the functional grouptection as shown in Fig.2.lays on a ray from the camera center C = (Cx, Cy, Cz)T to the pixel p = (x, y, 1)T . The ray P () is defined by (X, Y, Z)T = C + R1K1p, where is the positive scaling factor that indicates the position of the 3D point onthe ray. Therefore
54、, the 3D position of the pixel lies at the intersection of the ray and a plane (the object surface). We assume a camera is 4.5ft high. By knowing the distance and the normal of the floor plane, we can recover the 3D posi- tion for each pixel with the math discussed above. And any other plane contact
55、ing with the floor can be inferred by its contacting point with the floor. Then we can gradually re- cover the whole scene by repeating the process from bottom up. If there is any object too close to the camera without showing its bottoms, we will put it 3 feet away from the camera.P (I|G) exp(d(Al(
The 3D point corresponding to an image pixel lies on a ray from the camera center C = (Cx, Cy, Cz)ᵀ through the pixel p = (x, y, 1)ᵀ. The ray P(λ) is defined by (X, Y, Z)ᵀ = C + λ R⁻¹ K⁻¹ p, where λ is the positive scaling factor that indicates the position of the 3D point on the ray. Therefore, the 3D position of the pixel lies at the intersection of the ray and a plane (the object surface). We assume the camera is 4.5 ft high. By knowing the distance and the normal of the floor plane, we can recover the 3D position for each pixel with the math discussed above, and any other plane contacting the floor can be inferred from its contact points with the floor. We can then gradually recover the whole scene by repeating the process from the bottom up. If an object is too close to the camera to show its bottom, we put it 3 feet away from the camera.
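The back-projection above reduces to intersecting the ray C + λR⁻¹K⁻¹p with a known supporting plane. A small sketch, assuming the calibration K, R and the camera height are already available; the helper and the example values below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def backproject_to_plane(pixel_xy, K, R, C, plane_n, plane_d=0.0):
    """Intersect a pixel's viewing ray with the plane n·X + d = 0.

    Following Sect. 2.4, the ray is X(lambda) = C + lambda * R^{-1} K^{-1} p
    with p = (x, y, 1)^T; K, R, C are the intrinsics, world-to-camera rotation
    and camera center from the single-view calibration."""
    p = np.array([pixel_xy[0], pixel_xy[1], 1.0])
    ray_dir = np.linalg.inv(R) @ np.linalg.inv(K) @ p     # ray direction in world coordinates
    lam = -(plane_n @ C + plane_d) / (plane_n @ ray_dir)  # ray-plane intersection
    if lam <= 0:
        raise ValueError("the plane is not in front of the camera for this pixel")
    return C + lam * ray_dir

# Hypothetical usage: floor plane z = 0, camera 4.5 ft above the floor,
# looking horizontally along the world x axis (assumed intrinsics and rotation).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R_cam_to_world = np.array([[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
R = R_cam_to_world.T
C = np.array([0.0, 0.0, 4.5])
floor_n = np.array([0.0, 0.0, 1.0])
print(backproject_to_plane((320.0, 400.0), K, R, C, floor_n))   # a point on the floor
```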