




A Self Validation Network for Object-Level Human Attention Estimation

Zehua Zhang (1), Chen Yu (2), David Crandall (1)
(1) Luddy School of Informatics, Computing, and Engineering
(2) Department of Psychological and Brain Sciences
Indiana University Bloomington

Abstract

Due to the foveated nature of the human vision system, people can focus their visual attention on only a small region of their visual field at a time, which usually contains a single object. Estimating this object of attention in first-person (egocentric) videos is useful for many human-centered real-world applications such as augmented reality and driver assistance systems. A straightforward solution for this problem is to first estimate the gaze with a traditional gaze estimator and generate object candidates from an off-the-shelf object detector, and then pick the object that the estimated gaze falls in. However, such an approach can fail because it addresses the where and the what problems separately, even though they are highly related, chicken-and-egg problems. In this paper, we propose a novel unified model that incorporates both spatial and temporal evidence in identifying as well as locating the attended object in first-person videos. It introduces a novel Self Validation Module that enforces and leverages consistency between the where and the what concepts. We evaluate on two public datasets, demonstrating that the Self Validation Module significantly benefits both training and testing and that our model outperforms the state-of-the-art.
1 Introduction

Humans can focus their visual attention on only a small part of their surroundings at any moment, and thus have to choose what to pay attention to in real time [43]. Driven by the tasks and intentions we have in mind, we manage attention with our foveated visual system by adjusting our head pose and our gaze point in order to focus on the most relevant object in the environment at any moment in time [8, 17, 29, 47, 62]. This close relationship between intention, attention, and semantic objects has inspired a variety of work in computer vision, including image classification [26], object detection [27, 46, 50, 52], action recognition [4, 36, 42, 48], action prediction [53], video summarization [30], visual search modeling [51], and irrelevant frame removal [38], in which the attended object estimation serves as auxiliary information. Despite being a key component of these papers, how to identify and locate the important object is seldom studied explicitly. This problem in and of itself is of broad potential use in real-world applications such as driver assistance systems and intelligent human-like robots.

In this paper, we discuss how to identify and locate the attended object in first-person videos. Recorded by head-mounted cameras along with eye trackers, first-person videos capture an approximation of what people see in their fields of view as they go about their lives, yielding interesting data for studying real-time human attention. In contrast to gaze studies of static images or pre-recorded videos, first-person video is unique in that there is exactly one correct point of attention in each frame, as a camera wearer can only gaze at one point at a time. Accordingly, one and only one gazed object exists for each frame, reflecting the camera wearer's real-time attention and intention. We will use the term object of interest to refer to the attended object in our later discussion.

Figure 1: Among the many objects appearing in an egocentric video frame of a person's field of view, we want to identify and locate the object to which the person is visually attending. Combining traditional eye gaze estimators and existing object detectors can fail when the eye gaze prediction (blue dot) is slightly incorrect, such as when (a) it falls in the intersection of two object bounding boxes or (b) it lies between two bounding boxes sharing the same class. Red boxes show the actual attended object according to the ground truth gaze, and yellow dashed boxes show incorrect predictions.
Some recent work [22, 66, 68] has discussed estimating probability maps of ego-attention or predicting gaze points in egocentric videos. However, people think not in terms of points in their field of view, but in terms of the objects that they are attending to. Of course, the object of interest could be obtained by first estimating the gaze with a gaze estimator and generating object candidates from an off-the-shelf object detector, and then picking the object that the estimated gaze falls in. Because this bottom-up approach estimates where and what separately, it could be doomed to fail if the eye gaze prediction is slightly inaccurate, such as falling between two objects or in the intersection of multiple object bounding boxes (Figure 1). To ensure consistency, one may think of performing anchor-level attention estimation and directly predicting the attended box by modifying existing object detectors. The class can be predicted either simultaneously with the anchor-level attention estimation using the same set of features, as in SSD [40], or afterwards using the features pooled within the attended box, as in Faster R-CNN [49]. Either way, these methods still do not yield satisfying performance, as we will show in Sec. 4.2, because they lack the ability to leverage this consistency to refine the results.
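For concreteness, the following minimal sketch illustrates the bottom-up baseline just described; it is not part of our model, the gaze_estimator and object_detector calls are hypothetical placeholders for arbitrary off-the-shelf components, and boxes are assumed to be (x1, y1, x2, y2) in pixels. When the gaze point lands inside several overlapping boxes, or inside none, the choice of object is essentially arbitrary.

```python
# Illustrative sketch of the bottom-up baseline: estimate gaze, detect objects,
# then pick the detection whose bounding box contains the gaze point.
# gaze_estimator and object_detector are hypothetical off-the-shelf components.

def pick_attended_object(frame, gaze_estimator, object_detector):
    gx, gy = gaze_estimator(frame)        # predicted gaze point
    detections = object_detector(frame)   # list of {"box": (x1, y1, x2, y2), "cls": ..., "score": ...}

    # Keep only detections whose bounding box contains the gaze point.
    hits = [d for d in detections
            if d["box"][0] <= gx <= d["box"][2] and d["box"][1] <= gy <= d["box"][3]]

    if not hits:
        # Failure case (b) in Figure 1: the gaze point falls between boxes.
        return None
    # Failure case (a): the gaze point falls in the intersection of several boxes;
    # breaking the tie by detector confidence is arbitrary with respect to attention.
    return max(hits, key=lambda d: d["score"])
```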
We propose to identify and locate the object of interest by jointly estimating where it is within the frame and recognizing what its identity is. In particular, we propose a novel model, which we cheekily call Mindreader Net (Mr. Net), to jointly solve the problem. Our model incorporates both spatial evidence within frames and temporal evidence across frames, in a network architecture (which we call the Cogged Spatial-Temporal Module) with separate spatial and temporal branches to avoid feature entanglement. A key feature of our model is that it explicitly enforces and leverages a simple but extremely useful constraint: our estimate of what is being attended should be located at exactly the position where we estimate the attention to be. This Self Validation Module first computes similarities between the global object of interest class prediction vector and each local anchor box class prediction vector as attention validation scores to update the anchor attention score prediction; then, with the updated anchor attention scores, we select the attended anchor and use its corresponding class prediction score to update the global object of interest class prediction. With global context originally incorporated by extracting features from the whole clip using 3D convolution, the Self Validation Module helps the network focus on the local context in a spatially-local anchor box and a temporally-local frame.

We evaluate the approach on two existing first-person video datasets that include attended object ground truth annotations. We show that our approach outperforms baselines, and that our Self Validation Module not only improves performance by refining the outputs with visual consistency during testing, but also helps bridge multiple components together during training to guide the model to learn a highly meaningful latent representation. More information is available at http://vision.soic.indiana.edu/mindreader/.
Figure 2: The architecture of our proposed Mindreader Net. Numbers indicate the output size of each component (where c is the number of object classes). Softmax is applied before computing the losses on global classification L_globalclass, anchor box classification L_boxclass, and attention L_attn (which is first flattened to be 8732-d). Please refer to the supplementary materials for details about the Cogged Spatial-Temporal Module. (Diagram not reproduced here: a single RGB frame 300 x 300 x 3, an RGB sequence 15 x 300 x 300 x 3, and optical flow 15 x 300 x 300 x 2 feed the Spatial and Temporal Gear Branches, which output anchor box class logits 8732 x c, anchor box attention logits 8732 x 1, anchor box offsets 8732 x 4, and object of interest class logits 1 x c; the Self Validation Module uses cosine similarity to update the anchor box attention logits and the object of interest class logits, yielding the attended anchor box's location and class.)

2 Related Work

Compared with the many efforts to understand human attention by modeling eye gaze [2, 7, 16, 20-22, 24, 34, 35, 45, 59, 60, 64, 66, 68] or saliency [19, 25, 31-33, 39, 55, 67, 69], there are relatively few papers that detect object-level attention. Lee et al. [30] address video summarization with hand-crafted features to detect important people and objects, while object-level reasoning plays a key role in Baradel et al.'s work on understanding videos through interactions of important objects [4].
In the particular case of egocentric video, Pirsiavash and Ramanan [48] and Ma et al. [42] detect objects in hands as a proxy for attended objects to help action recognition. However, eye gaze usually precedes hand motion, and thus objects in hand are not always those being visually attended (Fig. 1a). Shen et al. [53] combine eye gaze ground truth and detected object bounding boxes to extract attended object information for future action prediction. EgoNet [5], among the first papers to focus on important object detection in first-person videos, combines visual appearance and 3D layout information to generate probability maps of object importance. Multiple objects can be detected in a single frame, making their results more similar to saliency than human attention in egocentric videos. Perhaps the most related work to ours is Bertasius et al.'s Visual-Spatial Network (VSN) [6], which proposes an unsupervised method for important object detection in first-person videos that incorporates the idea of consistency between the where and what concepts to facilitate learning. However, VSN requires a much more complicated training strategy of switching the cascade order of the two pathways multiple times, whereas we present a unified framework that can be learned end-to-end.
3 Our approach

Given a video captured with a head-mounted camera, our goal is to detect the object that is visually attended in each frame. This is challenging because egocentric videos can be highly cluttered, with many competing objects vying for attention. We thus incorporate temporal cues that consider multiple frames at a time. We first consider performing detection for the middle frame of a short input sequence (as in [42]), and then further develop it to work online (considering only past information) by performing detection on the last frame. Our novel model consists of two main parts (Figure 2), which we call the Cogged Spatial-Temporal Module and the Self Validation Module.
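Before detailing the two modules, the following shape-level sketch (our own scaffolding, not the authors' code) summarizes the tensors that the Cogged Spatial-Temporal Module produces and the Self Validation Module consumes, with sizes taken from Figure 2.

```python
# Shape-level scaffolding: a = 8732 anchor boxes, c object classes,
# input clips of N = 15 frames at 300 x 300 pixels (Figure 2).
from dataclasses import dataclass
import numpy as np

@dataclass
class CoggedOutputs:
    """Outputs of the Cogged Spatial-Temporal Module (Section 3.1), consumed by
    the Self Validation Module (Section 3.2)."""
    anchor_offsets: np.ndarray       # (a, 4)  box offsets per anchor      ("where", spatial branch)
    anchor_class_logits: np.ndarray  # (a, c)  class logits per anchor     ("what",  spatial branch)
    anchor_attn_logits: np.ndarray   # (a,)    attention logit per anchor  ("where", temporal branch)
    global_class_logits: np.ndarray  # (c,)    clip-level class logits     ("what",  temporal branch)
```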
3.1 Cogged Spatial-Temporal Module

The Cogged Spatial-Temporal Module consists of a spatial and a temporal branch. The "cogs" refer to the way that the outputs of each layer of the two branches are combined together, reminiscent of the interlocking cogs of two gears (Figure 2). Please see the supplementary material for more details.

The Spatial Gear Branch, inspired by SSD300 [40], takes a single frame I_t of size h x w and performs spatial prediction of local anchor box offsets and anchor box classes. It is expected to work as an object detector, although we only have ground truth for the objects of interest to train it, so we do not add an extra background class as in [40], and only compute losses for the spatial-based tasks on the matched positive anchors. We use atrous [10, 65] VGG16 [54] as the backbone and follow a similar anchor box setting as [40]. We also apply the same multi-anchor matching strategy. With the spatial branch, we obtain anchor box offset predictions O ∈ R^{a x 4} and class predictions C_box ∈ R^{a x c}, where a is the number of anchor boxes and c is the number of classes in our problem. Following SSD300 [40], we have a = 8732, h = 300, and w = 300.

The Temporal Gear Branch takes N consecutive RGB frames I_{t-(N-1)/2}, ..., I_{t+(N-1)/2} as well as the N corresponding optical flow fields F_{t-(N-1)/2}, ..., F_{t+(N-1)/2}, both of spatial resolution h x w (with N = 15, set empirically). We use Inception-V1 [58] I3D [9] as the backbone of our temporal branch. With aggregated global features from 3D convolution, we obtain global object of interest class predictions C_global ∈ R^{1 x c} and anchor box attention predictions A ∈ R^{a x 1}. We match the ground truth box only to the anchor with the greatest overlap (intersection over union); this matching strategy is empirical and discussed in Section 4.3.
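For concreteness, a minimal sketch of this greatest-overlap matching is given below; the (x1, y1, x2, y2) box format and the helper names are our own assumptions, not the paper's code.

```python
import numpy as np

def iou(box, anchors):
    """IoU between one ground-truth box (4,) and an (a, 4) array of anchor boxes,
    all assumed to be in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], anchors[:, 0])
    y1 = np.maximum(box[1], anchors[:, 1])
    x2 = np.minimum(box[2], anchors[:, 2])
    y2 = np.minimum(box[3], anchors[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_anchors = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    return inter / (area_box + area_anchors - inter)

def attention_target(gt_box, anchors):
    """One-hot attention target over the a = 8732 anchors: only the single anchor
    with the greatest IoU against the ground-truth box is positive."""
    target = np.zeros(len(anchors), dtype=np.float32)
    target[np.argmax(iou(gt_box, anchors))] = 1.0
    return target
```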
3.2 Self Validation Module

The Self Validation Module connects the above branches and delivers global and local context between the two branches at both spatial (e.g., whole frame versus an anchor box) and temporal (e.g., whole sequence versus a single frame) levels. It incorporates the constraint on consistency between where and what by embedding a double validation mechanism: what → where and where → what.

What → where. With the outputs of the Cogged Spatial-Temporal Module, we compute the cosine similarities between the global class prediction C_global and the class prediction for each anchor box, C_box_i, yielding an attention validation score for each box i,

    V^i_{attn} = \frac{C_{global} \, C^T_{box_i}}{|C_{global}| \, |C_{box_i}|}.    (1)

Then the attention validation vector V_attn = [V^1_attn, V^2_attn, ..., V^a_attn] ∈ R^{a x 1} is used to update the anchor box attention scores A by element-wise summation, A' = A + V_attn. Since -1 ≤ V^i_attn ≤ 1, we make the optimization easier by rescaling each A_i to the range [-1, 1],

    A' = R(A) + V_{attn} = \frac{A - (\max(A) + \min(A))/2}{\max(A) - (\max(A) + \min(A))/2} + V_{attn},    (2)

where max(·) and min(·) are element-wise vector operations.
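The following NumPy sketch illustrates Equations 1 and 2 under the shapes above; it is our own illustration rather than a reference implementation, and a small epsilon is added for numerical safety.

```python
import numpy as np

def rescale(v):
    """R(.) from Eq. 2: linearly map a vector to the range [-1, 1].
    The epsilon is our addition for numerical safety in this sketch."""
    mid = (v.max() + v.min()) / 2.0
    return (v - mid) / (v.max() - mid + 1e-12)

def what_to_where(c_global, c_box, a_logits):
    """Eqs. 1-2: update anchor attention logits with class-consistency scores.
    c_global: (c,) global class logits; c_box: (a, c) per-anchor class logits;
    a_logits: (a,) raw anchor attention logits. Returns A' of shape (a,)."""
    v_attn = (c_box @ c_global) / (
        np.linalg.norm(c_box, axis=1) * np.linalg.norm(c_global) + 1e-12)  # Eq. 1 (cosine similarity)
    return rescale(a_logits) + v_attn                                      # Eq. 2: A' = R(A) + V_attn
```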
Where → what. Intuitively, obtaining the attended anchor box index m is a simple matter of computing m = argmax(A'), and the class validation score is simply V_class = C_box_m. Similarly, after rescaling, we take an element-wise summation of V_class and C_global to update the global object of interest class prediction (with R(·) as in Equation 2): C'_global = R(C_global) + R(V_class). However, the hard argmax is not differentiable, and thus gradients are not able to backpropagate properly during training. We therefore use a soft argmax: softmax is applied to the updated anchor box attention scores A' to produce a weighting vector Ã' for class validation score estimation,

    \tilde{V}_{class} = \sum_{i=1}^{a} \tilde{A}'_i \, C_{box_i}, \quad \text{with} \quad \tilde{A}'_i = \frac{e^{A'_i}}{\sum_{j=1}^{a} e^{A'_j}}.    (3)

Now we replace V_class with Ṽ_class to update C_global: C'_global = R(C_global) + R(Ṽ_class).

This soft where → what validation is closely related to the soft attention mechanism widely used in many recent papers [3, 11, 41, 56, 61, 63]. While soft attention learns the mapping itself inside the model, we explicitly incorporate the coherence of the where and what concepts into our model to self-validate the output during both training and testing. In contrast to soft attention, which describes relationships between, e.g., words or graph nodes, this self-validation mechanism naturally mirrors the visual consistency of our foveated vision system.
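Combining the two directions, the sketch below illustrates the complete parameter-free Self Validation step (again our own NumPy illustration, reusing rescale() and what_to_where() from the previous sketch); the soft argmax of Equation 3 replaces the hard selection so that gradients can flow.

```python
import numpy as np
# Reuses rescale() and what_to_where() from the sketch following Eq. 2.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_validation(c_global, c_box, a_logits):
    """Parameter-free Self Validation Module (sketch of Eqs. 1-3).
    Returns updated anchor attention scores A' and global class scores C'_global."""
    a_updated = what_to_where(c_global, c_box, a_logits)       # what -> where (Eqs. 1-2)
    weights = softmax(a_updated)                               # soft argmax weights (Eq. 3)
    v_class = weights @ c_box                                  # soft class validation score, shape (c,)
    c_global_updated = rescale(c_global) + rescale(v_class)    # where -> what update
    return a_updated, c_global_updated

# At test time (Section 3.3), argmax(a_updated) picks the attended anchor, its offset
# prediction gives the box, and argmax(c_global_updated) gives the object class.
```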
3.3 Implementation and training details

We implemented our model with Keras [12] and TensorFlow [1]. A batch normalization layer [23] is inserted after each layer in both the spatial and temporal backbones, and the momentum for batch normalization is 0.8. Batch normalization is not used in the four prediction heads. We found pretraining the spatial branch helps the model converge faster. No extra data is introduced, as we still only use the labels of the objects of interest for pretraining. VGG16 [54] is initialized with weights pretrained on ImageNet [14]. We use Sun et al.'s method [44, 57] to extract optical flow and follow [9] to truncate the flow maps to [-20, 20] and then rescale them to [-1, 1]. The RGB input to the Temporal Gear Branch is rescaled to [-1, 1] [9], while for the Spatial Gear Branch the RGB input is normalized to have zero mean and the channels are permuted to BGR.
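A minimal sketch of this input preprocessing is given below; the 8-bit input assumption and the zero channel_mean placeholder are ours, since the exact per-channel mean values are not listed here.

```python
import numpy as np

def preprocess_flow(flow):
    """Truncate optical flow to [-20, 20] and rescale to [-1, 1]."""
    return np.clip(flow, -20.0, 20.0) / 20.0

def preprocess_rgb_temporal(rgb):
    """Rescale RGB frames for the Temporal Gear Branch to [-1, 1], assuming 8-bit input."""
    return rgb.astype(np.float32) / 127.5 - 1.0

def preprocess_rgb_spatial(rgb, channel_mean=(0.0, 0.0, 0.0)):
    """Zero-mean the frame and permute channels to BGR for the Spatial Gear Branch.
    channel_mean is a placeholder; the paper's exact values are not given here."""
    x = rgb.astype(np.float32) - np.asarray(channel_mean, dtype=np.float32)
    return x[..., ::-1]  # RGB -> BGR
```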
When training the whole model, the spatial branch is initialized with the pretrained weights from above. The I3D backbone is initialized with weights pretrained on Kinetics [28] and ImageNet [14], while the other parts are randomly initialized. We use stochastic gradient descent with learning rate 0.03, momentum 0.9, decay 0.0001, and an L2 regularizer of 5e-5. The loss function consists of four parts: global classification L_globalclass, attention L_attn, anchor box classification L_boxclass, and box regression L_box,

    L_{total} = L_{globalclass} + \alpha L_{attn} + \frac{1}{N_{pos}} \left( \beta L_{boxclass} + \gamma L_{box} \right),    (4)

where we empirically set α = β = γ = 1, and N_pos is the total number of matched anchors for training the anchor box class predictor and anchor box offset predictor. L_globalclass and L_attn apply cross-entropy loss, computed on the updated predictions of the object of interest class and the anchor box attention. L_boxclass is the total cross-entropy loss and L_box is the total box regression loss, computed over only the matched anchors. The box regression loss follows [40, 49], and we refer readers there for details.
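As a small illustration of Equation 4's weighting, the following sketch combines four already-computed loss values; the individual loss terms (cross-entropy on the updated predictions and SSD/Faster R-CNN-style box regression) are assumed to be computed elsewhere.

```python
def total_loss(l_globalclass, l_attn, l_boxclass, l_box, n_pos,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. 4: weighted combination of the four loss terms; the two anchor-level
    terms are normalized by n_pos, the number of matched positive anchors."""
    return l_globalclass + alpha * l_attn + (beta * l_boxclass + gamma * l_box) / n_pos
```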
Our full model has 64M trainable parameters, while the Self Validation Module contains no parameters, making it very flexible: it can be added to training or testing at any time, and it is even possible to stack multiple Self Validation Modules or to use only half of it. During testing, the anchor with the highest updated anchor box attention score A'_i is selected as the attended anchor. The corresponding anchor box offset prediction O_i indicates where the object of interest is, while the argmax of the updated global object of interest class score C'_global gives its class.

4 Experiments

We evaluate our model on identifying attended objects in two first-person datasets collected in very different contexts: child and adult toy play, and adults in kitchens. ATT [68] (Adult-Toddler Toy play) consists of