2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019

ResFlow: Multi-tasking of Sequentially Pooling Spatiotemporal Features for Action Recognition and Optical Flow Estimation

Tso-Hsin Yeh1, Chuan Kuo1, An-Sheng Liu1, Yu-Hung Liu1, Yu-Huan Yang1, Zi-Jun Li1, Jui-Ting Shen1, and Li-Chen Fu1,2, Fellow, IEEE

1 All of the authors are with the Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan. 2 Li-Chen Fu is with the NTU Research Center for AI and Advanced Robotics, Taipei 10617, Taiwan.

Abstract—Since deep-learning-based methods have been generating generic models, most existing action recognition methods use either a two-stream structure, which considers spatial and temporal features separately, or C3D, which pays a high price in memory and time. We aim to design a robust system that extracts spatiotemporal features together with an aggregation mechanism that integrates local features in temporal order. In light of this, we propose ResFlow to estimate optical flow and predict action recognition simultaneously. Leveraging the characteristics of optical flow estimation, we extract spatiotemporal features via an autoencoder. Via a novel Sequentially Pooling Mechanism, which literally pools the global spatiotemporal feature sequentially, we extract a spatiotemporal feature at each time step and aggregate these local features into a global feature. This design uses only RGB images as input, with temporal information encoded and pre-trained by optical flow, and sequentially aggregates spatiotemporal features with high efficiency. We evaluate our ability to estimate optical flow on the FlyingChairs dataset and show promising action recognition results on the UCF-101 dataset through a series of experiments.

1. Introduction

Video-based action recognition has been a hot topic in recent years. Nevertheless, it is still a very challenging task with high computational cost in the field of computer vision. The pipeline can be roughly separated into three parts: spatiotemporal feature extraction, feature aggregation, and classification. Nowadays, most existing methods adopt a Convolutional Neural Network (CNN) as the feature extractor, a strong extractor that is widely used and capable of capturing the spatial relationship among images. Besides, optical flow has been a popular research topic for the action recognition problem in recent years due to its ability to capture motion. Until deep learning became popular, traditional methods had been the mainstream for predicting optical flow. More and more researchers have since devoted themselves to investigating action recognition via deep learning methods.

Generally, there are two main categories of CNN-based solutions to the action recognition problem: two-stream [1]-[8] and 3D ConvNet [9], [10]. Surprisingly, combining both, as in [11], provides a fine result. The two-stream architecture divides a video clip into two parts, the spatial relationship of pixels within an RGB image and the temporal relationship among consecutive optical flows, and feeds them separately into a spatial and a temporal stream. Generally, the two-stream architecture dominates other works, but the optical flow must be calculated beforehand, which is extremely time-consuming. Once spatial and temporal features are extracted, there are two lines of thought on integrating them: [5], [7] build a connection between the two streams to fuse them at an earlier stage, while [2], [3], [5], [6] design a mechanism to predict accurate action recognition. On the other hand, with no complex computational preprocessing, a 3D ConvNet stacks several RGB images in a pre-defined time interval as input, which is memory-consuming for 3D convolution due to the larger input size. Although a 3D ConvNet can encode spatial and temporal information at the same time via three-dimensional convolution, extracting spatiotemporal features requires a larger quantity of filters, which is not feasible under hardware limitations.

Since studies have shown that optical flow is a strong representation of temporal information and can be estimated via an autoencoder [12]-[14], why shouldn't we train an autoencoder to estimate optical flow, in which spatiotemporal information has already been encoded? We argue that spatiotemporal features are encoded due to the characteristics of the autoencoder and optical flow. Under this assumption, we can leverage these spatiotemporal features from each time step for action recognition.
However, it is challenging to predict action recognition from multiple consecutive pairs of RGB frames, unlike Two-Stream or 3D ConvNet, which use stacked RGB frames at each time step to obtain a longer period of temporal information. From the perspective of [15], it is feasible to combine a two-stream architecture in a 3D ConvNet manner. Even if each local spatiotemporal feature includes both the spatial and temporal information of a pair of consecutive RGB frames, finding a suitable mechanism to sequentially aggregate the local spatiotemporal features from the video clip, a series of consecutive RGB frames, is still another crucial task.

We propose a novel network, ResFlow, which estimates optical flow and predicts action recognition simultaneously. The architecture is shown in Fig. 1. The autoencoder is utilized to estimate optical flow and encode spatiotemporal features for action recognition. To integrate the local spatiotemporal features from each time step, we leverage a novel feature aggregation mechanism, the Sequentially Pooling Mechanism (SPM), to produce a global spatiotemporal feature. Thus, the task of action recognition can be done via a linear classifier.

Figure 1: ResFlow Overall Architecture.

In ResFlow, we leverage the benefit of generating spatiotemporal features via an autoencoder. More specifically, ResFlow treats the spatiotemporal feature at each time step as a local spatiotemporal feature and sequentially aggregates them into a global spatiotemporal feature. By leveraging SPM, which sequentially gives a confidence score to each local spatiotemporal feature, the global spatiotemporal feature has strong potential to strengthen action recognition, unlike naive methods such as average-pooling or max-pooling, which ignore the temporal order of the spatiotemporal features.

Several features of this paper are listed below:

1. ResFlow estimates optical flow and achieves action recognition simultaneously by multi-tasking.
2. ResFlow uses only RGB images to predict action recognition as well as optical flow efficiently.
3. A novel way is proposed to extract local spatiotemporal features via a modified autoencoder.
4. A sequentially aggregating mechanism for spatiotemporal features is designed and is feasible for a number of applications.

2. Related Work

2.1. Optical Flow

Optical flow estimation is one of the most fundamental problems in the computer vision field and is extremely important for analyzing human actions. However, accurate optical flow estimation still remains a challenge due to illumination variation, large displacement, motion blur, texture, etc. Therefore, a reliable optical flow estimation that yields better features is required so that its implementation is feasible for real-world applications.

Deep learning methods have been a popular solution in all kinds of areas, especially in computer vision. FlowNet [12] has been one of the state-of-the-art methods in the field of optical flow estimation. The concept of the autoencoder has been applied to train the network with a pair of RGB images as input and the corresponding optical flow map as output, and it has been implemented in other applications such as image denoising, image super-resolution, etc. Besides, FlowNet adds extra variational resolution refinements to the training process to constrain the architecture and converge to good performance.

FlowNet2.0 [13] is an advanced version of FlowNet with several sub-modules concatenated for estimating optical flow. With small-displacement sub-modules concatenated, FlowNet2.0 fuses two predicted optical flows to estimate both large-scale and small displacements. SPyNet [14] estimates optical flow by feeding resized RGB images to the corresponding stages while adding and upsampling those stage results element-wise. With the idea of residually adding one result to another, SPyNet shows a way to apply a coarse-to-fine integration process to optical flow.

2.2. Action Recognition

Action recognition has been widely researched in recent years. Methods that employ the two-stream architecture, which has two parallel streams extracting spatial and temporal features respectively, are widely used in different varieties. In the two-stream CNN [1], the fusion is inserted after the fully connected layer. Later, [16] discovered that fusion after the last convolution layer improves the performance even further. Afterwards, [5] and [7] argue that early interaction between the spatial and temporal streams, not only after the last convolution layer but also in the previous convolution layers before the fully connected layer, improves performance.

All of these works focus on integrating spatiotemporal features from a video clip. Some naive methods, such as max-pooling and mean-pooling, have been used but are highly situation-dependent. Thus, learning-based methods [2], [3], [6] have been proposed. ActionVLAD [3] utilizes an unsupervised method to separate high-dimensional features into small groups, and each group represents a certain meaningful attribute that contributes to action recognition. AdaScan [2] argues that the spatiotemporal feature should be given a weight describing its importance at each time step.

For action recognition, the integration stage plays an important role. The spatiotemporal models [1], [9] perform better than image-only methods. Apart from extracting spatiotemporal features via a single stream, [11] integrates both architectures to generate spatiotemporal features and obtains better results.

Inspired by these methods, we design a Sequentially Pooling Mechanism that aggregates the previous and current local spatiotemporal features by giving a confidence score. The final global spatiotemporal feature is generated after running through all of the local spatiotemporal features sequentially.
3. ResFlow

Action recognition relies heavily on spatiotemporal features to distinguish the accurate action class. Inspired by [12], we propose ResFlow, which extracts spatiotemporal features for action recognition and decodes optical flow via an autoencoder. Using pairs of RGB images as input, ResFlow jointly generates the corresponding optical flows and the action class. The encoded features are shared between optical flow estimation and action recognition.

In Section 3.1, we elaborate on the design concept of optical flow estimation and the proposed autoencoder architecture refinement. The new design, the Sequentially Pooling Mechanism, is introduced in Section 3.2.
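The excerpt does not include reference code, so the following is a minimal PyTorch-style sketch of the multi-task wiring described above: a shared encoder produces one local spatiotemporal feature per RGB frame pair, a decoder turns it into optical flow, and an aggregated global feature feeds a linear classifier. All module names, layer sizes, and the placeholder mean aggregation are our own assumptions, not the authors' implementation (their aggregation is the SPM of Section 3.2).

```python
# Minimal sketch (not the authors' code) of ResFlow's multi-task wiring.
import torch
import torch.nn as nn

class ResFlowSketch(nn.Module):
    def __init__(self, num_classes=101, feat_dim=256):
        super().__init__()
        # Shared encoder: takes a pair of RGB frames stacked along channels (6 ch).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.Conv2d(64, feat_dim, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
        )
        # Flow decoder: upsamples the encoded feature back to a 2-channel flow map.
        self.flow_decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),  # (u, v) displacements
        )
        # Linear classifier on the aggregated global feature.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, pairs):
        # pairs: (B, T, 6, H, W) -- T consecutive frame pairs per clip.
        B, T = pairs.shape[:2]
        feats, flows = [], []
        for t in range(T):
            f = self.encoder(pairs[:, t])            # local spatiotemporal feature
            flows.append(self.flow_decoder(f))       # per-pair optical flow estimate
            feats.append(f.mean(dim=(2, 3)))         # global-average-pool to a vector
        # Placeholder aggregation (mean over time); the paper replaces this with SPM.
        global_feat = torch.stack(feats, dim=1).mean(dim=1)
        return torch.stack(flows, dim=1), self.classifier(global_feat)

if __name__ == "__main__":
    model = ResFlowSketch()
    clip = torch.randn(2, 4, 6, 64, 64)              # 2 clips, 4 frame pairs each
    flows, logits = model(clip)
    print(flows.shape, logits.shape)                  # (2, 4, 2, 64, 64), (2, 101)
```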
3.1. Optical Flow Estimation

3.1.1. Autoencoder for Optical Flow. Inspired by [12], [15], we design an autoencoder architecture, shown in Fig. 2, with multiple pairs of RGB images as input and the corresponding optical flows as output. In the autoencoder, we use five residual units as the encoder to extract spatiotemporal features and four residual units as the decoder. Since the Residual Network [17] has shown a strong ability to capture the relationships among pixels, we leverage ResNet for optical flow estimation.

Figure 2: Autoencoder Architecture.

We argue that the optical flow generated at each stage should not be subject to a batch normalization layer [19] after the convolution layer, to avoid rescaling the values, because optical flow is a real-valued vector: each value describes the displacement of a pixel from the first image to the second image along the X and Y axes.

In the encoder and decoder, we utilize the residual unit described in [17] for our network. As shown in Fig. 3, we use the same channels for each residual unit, corresponding to the same layers in FlowNetS [12]. Considering that adding residual units to the autoencoder still falls short of generating fine-grained optical flow, we concatenate the features from both the encoder and the decoder, as well as the upsampled optical flow, to ensure that the network learns spatiotemporal features at each layer.

Figure 3: Encoder and Decoder (a)(b). Inspired by [17], we use the residual block as shown, with a slight difference: we replace ReLU with Leaky ReLU [18].
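The sketch below illustrates the two design points just described: residual units that use Leaky ReLU instead of ReLU, and a decoder stage that concatenates the encoder skip feature, the decoder feature, and the upsampled coarser flow before predicting flow, with no batch normalization on the flow-prediction convolution. Channel counts, the negative slope, and the module names are assumptions; this is not the authors' released code.

```python
# Hedged sketch of a LeakyReLU residual unit and a skip-concatenating decoder stage.
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual block in the spirit of ResNet [17], with Leaky ReLU [18]."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.conv1(x))
        out = self.conv2(out)
        return self.act(out + x)          # identity shortcut

class DecoderStage(nn.Module):
    """One decoder stage: fuse skip feature, decoder feature, and coarse flow."""
    def __init__(self, dec_ch, skip_ch, out_ch):
        super().__init__()
        self.up_feat = nn.ConvTranspose2d(dec_ch, out_ch, 4, stride=2, padding=1)
        self.up_flow = nn.ConvTranspose2d(2, 2, 4, stride=2, padding=1)
        self.refine = ResidualUnit(out_ch + skip_ch + 2)
        # Plain convolution for flow prediction -- deliberately no BatchNorm here.
        self.predict_flow = nn.Conv2d(out_ch + skip_ch + 2, 2, 3, padding=1)

    def forward(self, dec_feat, skip_feat, coarse_flow):
        x = torch.cat([self.up_feat(dec_feat), skip_feat,
                       self.up_flow(coarse_flow)], dim=1)
        x = self.refine(x)
        return x, self.predict_flow(x)    # fused feature and flow at this scale

if __name__ == "__main__":
    stage = DecoderStage(dec_ch=256, skip_ch=128, out_ch=128)
    feat, flow = stage(torch.randn(1, 256, 8, 8),
                       torch.randn(1, 128, 16, 16),
                       torch.randn(1, 2, 8, 8))
    print(feat.shape, flow.shape)          # (1, 258, 16, 16), (1, 2, 16, 16)
```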
3.1.2. Optical Flow Refinement. Now that we obtain high-quality spatiotemporal features, optical flow estimation is another important issue. To utilize variational resolution refinement, we use an upsampling layer with stride 2 to resize the smaller optical flow to the current size, and view this upsampled optical flow as a basis for the predicted optical flow at the larger resolution. Thus, the final optical flow is generated from both the upsampled and the currently predicted optical flow.

Since the value of the optical flow represents the moving displacement of each pixel between two images, we use a weighted-sum mechanism to element-wisely sum the final optical flow at each resolution stage. The process of optical flow refinement is shown in Fig. 4.

Figure 4: Optical Flow Refinement. After upsampling the previous flow, the final flow is obtained from the upsampled and current flow by the weighted-sum mechanism.

The end-point error (EPE) is used as the error indicator to evaluate the performance and is defined as

EPE(u, v, u^0, v^0) = \sum_{i,j} \sqrt{(u_{ij} - u^0_{ij})^2 + (v_{ij} - v^0_{ij})^2}    (1)

where u, v are the displacements in the X and Y directions of the predicted optical flow, respectively, and u^0, v^0 are those of the ground-truth optical flow. The error can be viewed as the magnitude error between the correct and the predicted moving distance.

Since the upsampled optical flow should also resemble the ground truth as much as possible, we use a weighted loss function to address this issue. The total loss at each stage is described as

SubLoss(u, v, u^0, v^0, u^{up}, v^{up}) = EPE(u, v, u^0, v^0) + \lambda \, EPE(u^{up}, v^{up}, u^0, v^0)    (2)

where u^{up}, v^{up} represent the displacements in the X and Y directions of the upsampled optical flow estimation, respectively, and \lambda is the coefficient that controls the tendency toward the upsampled or the newly generated optical flow. The overall error for training this network is then

FlowLoss = \sum_{k=1}^{5} \gamma_k \, SubLoss(u_k, v_k, u^0_k, v^0_k, u^{up}_k, v^{up}_k)    (3)

where \gamma_k is the coefficient of the k-th stage, set to 0.32, 0.08, 0.02, 0.01, and 0.01, respectively.
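As a concrete reading of Eqs. (1)-(3), the short sketch below computes the stage-wise training loss. It is our own illustration: the (B, 2, H, W) tensor layout, the default value of the blending coefficient `lam`, and the averaging over the batch are assumptions not stated in the excerpt.

```python
# Sketch of the stage-wise flow loss of Eqs. (1)-(3).
import torch

def epe(flow_pred, flow_gt):
    """Eq. (1): per-pixel end-point error, summed over pixels, averaged over batch."""
    return torch.sqrt(((flow_pred - flow_gt) ** 2).sum(dim=1)).sum(dim=(1, 2)).mean()

def sub_loss(flow_new, flow_up, flow_gt, lam=1.0):
    """Eq. (2): penalise both the newly predicted and the upsampled flow."""
    return epe(flow_new, flow_gt) + lam * epe(flow_up, flow_gt)

def flow_loss(stage_preds, stage_ups, stage_gts,
              gammas=(0.32, 0.08, 0.02, 0.01, 0.01), lam=1.0):
    """Eq. (3): weighted sum of SubLoss over the five resolution stages."""
    return sum(g * sub_loss(p, u, t, lam)
               for g, p, u, t in zip(gammas, stage_preds, stage_ups, stage_gts))

if __name__ == "__main__":
    sizes = [64, 32, 16, 8, 4]                       # one entry per resolution stage
    preds = [torch.randn(2, 2, s, s) for s in sizes]
    ups   = [torch.randn(2, 2, s, s) for s in sizes]
    gts   = [torch.randn(2, 2, s, s) for s in sizes]
    print(flow_loss(preds, ups, gts).item())
```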
3.2. Sequentially Pooling Mechanism

Although each local spatiotemporal feature contains the temporal information of its time interval, not all local features contribute to predicting action recognition. Thus, a confidence score for each local spatiotemporal feature should be trained and learned. Inspired by [2], we design the Sequentially Pooling Mechanism (SPM), shown in Fig. 5, which generates the confidence score S^t of each residual input R^t via three fully connected layers at time t. Each residual input is calculated by subtracting F_c^{t-1} from F^t, written as

R^t = F^t - F_c^{t-1}    (4)

Figure 5: Sequentially Pooling Mechanism. SPM sequentially addresses the input spatiotemporal features and updates the global feature F_c.

S^t = \Phi(R^t), \quad S^t \in [0, 1]    (5)

F_c^t = F_c^{t-1} + S^t R^t    (6)

where S^t and F^t denote the confidence score and the feature map generated after ResBlock 5 at time step t, t \in \{2, ..., T\}, respectively, and \Phi represents a neural network with multiple linear classifiers.

Initially, the condensed global spatiotemporal feature F_c^1 is equal to the first spatiotemporal feature F^1 at time t = 1. Under this circumstance, SPM is able to distinguish whether the current residual input is relevant or not, and gives a confidence score based on both the accumulated condensed spatiotemporal feature and the residual input. To sum up, F_c is computed by aggregating the local spatiotemporal features F^t at each time step t:

F_c^T = F^1 + \sum_{t=2}^{T} S^t R^t    (7)

which is equivalent to

F_c^T = \sum_{t=1}^{T} w^t F^t    (8)

where w^t represents the proportion of each local spatiotemporal feature in the global spatiotemporal feature, and the summation of the confidence scores corresponding to the local spatiotemporal features is equal to 1. As a result, SPM leverages shared-weight fully connected layers to fairly calculate the confidence scores.

Noticeably, SPM sequentially processes the spatiotemporal features to determine the confidence score of the current feature. Unlike directly training a weight for each local spatiotemporal feature to aggregate them, SPM judges the confidence under a fair criterion to generate the global spatiotemporal feature.
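To make Eqs. (4)-(6) concrete, here is a small sketch of the sequential update. The feature dimension, the hidden width, and the sigmoid used to keep S^t within [0, 1] are our assumptions; the excerpt only states that three fully connected layers produce the score for each residual input.

```python
# Hedged sketch of the Sequentially Pooling Mechanism update of Eqs. (4)-(6).
import torch
import torch.nn as nn

class SPM(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # Phi in Eq. (5): three shared fully connected layers -> scalar score.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.1),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.1),
            nn.Linear(hidden, 1), nn.Sigmoid(),      # keeps S_t in [0, 1]
        )

    def forward(self, local_feats):
        # local_feats: (B, T, D) local spatiotemporal features F^1 ... F^T.
        global_feat = local_feats[:, 0]              # F_c^1 = F^1
        for t in range(1, local_feats.size(1)):
            residual = local_feats[:, t] - global_feat       # Eq. (4): R^t
            s = self.score(residual)                         # Eq. (5): S^t
            global_feat = global_feat + s * residual         # Eq. (6): F_c^t update
        return global_feat                                   # condensed global feature

if __name__ == "__main__":
    spm = SPM()
    feats = torch.randn(2, 8, 256)        # 2 clips, 8 time steps, 256-d features
    print(spm(feats).shape)               # (2, 256)
```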
4. Experiments

In this section, we evaluate ResFlow on two aspects: optical flow estimation and action recognition.

Table 1: Optical Flow Estimation Comparison. We evaluate the optical flow estimation of ResFlow on the FlyingChairs dataset and compare it with the state-of-the-art.

Table 2: Refinement Evaluation. Each stage of the optical flow estimated by ResFlow on the FlyingChairs dataset.

Table 3: Action Recognition Results. Comparison with the state-of-the-art on the UCF101 dataset.

The results point out that adding the residual units improves optical flow estimation. To show the ability of ResFlow, we visualize some cases of the optical flow outcome in Fig. 6.

Refinement. As described in Section 3.1.2, we introduce a novel design that refines the upsampled optical flow and builds a connection of optical flow ...