2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019

ResFlow: Multi-tasking of Sequentially Pooling Spatiotemporal Features for Action Recognition and Optical Flow Estimation

Tso-Hsin Yeh1, Chuan Kuo1, An-Sheng Liu1, Yu-Hung Liu1, Yu-Huan Yang1, Zi-Jun Li1, Jui-Ting Shen1, and Li-Chen Fu1,2, Fellow, IEEE

1 The authors are with the Department of Electrical Engineering, National Taiwan University, Taipei 10617. 2 Li-Chen Fu is with the NTU Research Center for AI and Advanced Robotics, Taipei 10617.

Abstract: Since deep-learning-based methods have become capable of generating generic models, most existing action recognition methods use either a two-stream structure, which considers spatial and temporal features separately, or C3D, which costs a great deal of memory and time. We aim to design a robust system that extracts spatiotemporal features and uses an aggregation mechanism to integrate the local features in temporal order. In light of this, we propose ResFlow to estimate optical flow and predict action recognition simultaneously. Leveraging the characteristics of optical flow estimation, we extract spatiotemporal features via an autoencoder. Via a novel Sequentially Pooling Mechanism, which literally pools the global spatiotemporal feature sequentially, we extract a spatiotemporal feature at each time step and aggregate these local features into a global feature. This design uses only RGB images as input, with temporal information encoded and pre-trained by optical flow, and sequentially aggregates spatiotemporal features with high efficiency. We evaluate the ability to estimate optical flow on the FlyingChairs dataset and show promising action recognition results on the UCF-101 dataset through a series of experiments.

1. Introduction

Video-based action recognition has been a hot topic in recent years. Nevertheless, it is still a very challenging task with high computational cost in the field of computer vision. The pipeline can be roughly separated into three parts: spatiotemporal feature extraction, feature aggregation, and classification. Nowadays, most existing methods adopt a Convolutional Neural Network (CNN) as the feature extractor, since it is widely used and capable of capturing the spatial relationship within images. Besides, optical flow has been a popular research topic for the action recognition problem in recent years due to its ability to capture motion. Until deep learning became popular, traditional methods had been the mainstream for predicting optical flow; since then, more and more researchers have devoted themselves to investigating action recognition via deep learning methods.

Generally, there are two main categories of CNN-based solutions to the action recognition problem: two-stream [1]-[8] and 3D ConvNet [9], [10]. Surprisingly, combining both [11] provides a fine result. The two-stream architecture divides a video clip into two parts, the spatial relationship of pixels in an RGB image and the temporal relationship among consecutive optical flows, and feeds them separately into a spatial and a temporal stream. Generally, the two-stream architecture dominates other works, but the optical flows must be calculated beforehand, which is extremely time-consuming. Once spatial and temporal features are extracted, there are two lines of thought on integrating them: [5], [7] build a connection between the two streams to fuse them at an earlier stage, while [2], [3], [5], [6] design a mechanism to predict accurate action recognition. On the other hand, with no complex computational preprocessing, a 3D ConvNet stacks several RGB images within a pre-defined time interval as input, which is memory-consuming for 3D convolution due to the larger input size. Although a 3D ConvNet can encode spatial and temporal information at the same time via three-dimensional convolution, extracting spatiotemporal features requires a larger quantity of filters, which is not feasible under hardware limitations.

Since studies have shown that optical flow is a strong representation of temporal information and can be estimated via an autoencoder [12]-[14], why shouldn't we train an autoencoder to estimate optical flow, in which spatiotemporal information has already been encoded? We argue that spatiotemporal features are encoded due to the characteristics of the autoencoder and of optical flow. Under this assumption, we can leverage these spatiotemporal features from each time step for action recognition.

However, it is challenging to predict action recognition from multiple consecutive pairs of RGB frames, unlike Two-Stream or 3D ConvNet, which use stacked RGB frames at each time step to obtain a longer period of temporal information. From the perspective of [15], it is feasible to combine a two-stream architecture in a 3D ConvNet way. Even if each local spatiotemporal feature includes both the spatial and temporal information of a pair of consecutive RGB frames, finding a suitable mechanism to aggregate the local spatiotemporal features sequentially from the video clip, a series of consecutive RGB frames, is still another crucial task.

We propose a novel network, ResFlow, which estimates optical flow and predicts action recognition simultaneously. The architecture is shown in Fig. 1. An autoencoder is utilized to estimate optical flow and encode spatiotemporal features for action recognition. To integrate the local spatiotemporal features from each time step, we leverage a novel feature aggregation mechanism, the Sequentially Pooling Mechanism (SPM), to produce a global spatiotemporal feature. Thus, action recognition can be done via a linear classifier.

In our ResFlow, we leverage the benefit of generating spatiotemporal features via an autoencoder. More specifically, ResFlow treats the spatiotemporal feature at each time step as a local spatiotemporal feature and sequentially aggregates them into a global spatiotemporal feature. By leveraging SPM, which sequentially gives a confidence score to each local spatiotemporal feature, the global spatiotemporal feature has strong potential to strengthen action recognition, unlike naive methods such as average-pooling or max-pooling, which ignore the temporal order of the spatiotemporal features.

The features of this paper are listed below:
1. ResFlow estimates optical flow and achieves action recognition simultaneously by multi-tasking.
2. ResFlow uses only RGB images to predict action recognition as well as optical flows efficiently.
3. A novel way is proposed to extract local spatiotemporal features via a modified autoencoder.
4. The sequentially aggregating mechanism of spatiotemporal features is designed and is feasible for a number of applications.

Figure 1: ResFlow Overall Architecture.

2. Related Work

2.1. Optical Flow

Optical flow estimation is one of the most fundamental problems in the field of computer vision and is extremely important for analyzing human actions. However, accurate optical flow estimation still remains a challenge due to illumination variation, large displacement, motion blur, texture, etc. Therefore, reliable optical flow estimation that yields better features is required so that its implementation becomes feasible for real-world applications.

Deep learning methods have been a popular solution in all kinds of areas, especially in computer vision. FlowNet [12] has been one of the state-of-the-art methods for estimating optical flow. It applies the concept of an autoencoder, training the network with a pair of RGB images as input and the corresponding optical flow map as output, a concept also used in other applications such as image denoising and image super-resolution. Besides, FlowNet adds extra variational resolution refinements to the training process to constrain the architecture and converge to good performance.

FlowNet 2.0 [13] is an advanced version of FlowNet that concatenates several submodules for estimating optical flow. With small-displacement submodules concatenated, FlowNet 2.0 fuses two predicted optical flows to estimate both large-scale and small displacements. SPyNet [14] estimates optical flow by feeding resized RGB images to the corresponding stages while element-wisely adding and upsampling the stage results. With the idea of residually adding one result to another, SPyNet shows how to apply a coarse-to-fine integration process to optical flow.

2.2. Action Recognition

Action recognition has been widely researched in recent years. Methods employing the two-stream architecture, which has two parallel streams that extract spatial and temporal features respectively, are widely used in different varieties. In the two-stream CNN [1], the fusion is inserted after the fully connected layer. Later, [16] discovered that fusion after the last convolution layer even improves the performance. Afterwards, [5] and [7] argued that early interaction between the spatial and temporal streams, not only after the last convolution layer but also before the fully-connected layer and at the earlier convolution layers, improves performance.

All of this research focuses on integrating spatiotemporal features from a video clip. Some naive methods, such as max-pooling and mean-pooling, have been used but are highly situation-dependent. Thus, learning-based methods [2], [3], [6] have been proposed. ActionVLAD [3] utilizes an unsupervised method to separate high-dimensional features into small groups, where each group represents a certain meaningful attribute that contributes to action recognition. AdaScan [2] argues that the spatiotemporal feature should be given a weight describing its importance at each time step.

For action recognition, the integration stage plays an important role. The spatiotemporal models [1], [9] perform better than image-only methods. Apart from extracting spatiotemporal features via a single stream, [11] integrates both architectures to generate spatiotemporal features and achieves better results.

Inspired by these methods, we design a Sequentially Pooling Mechanism that aggregates the previous and current local spatiotemporal features by giving a confidence score. The final global spatiotemporal feature is generated after running through all of the local spatiotemporal features sequentially.

3. ResFlow

Action recognition relies heavily on spatiotemporal features to distinguish the correct action class. Inspired by [12], we propose ResFlow, which extracts spatiotemporal features for action recognition and decodes optical flow via an autoencoder. Using pairs of RGB images as input, ResFlow jointly generates the corresponding optical flows and the action class. The encoded features are shared for optical flow estimation and action recognition. In Section 3.1, we elaborate on the design concept of optical flow estimation and the proposed refinement of the autoencoder architecture. The new mechanism, the Sequentially Pooling Mechanism, is introduced in Section 3.2.
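As a rough illustration of this shared-encoder multi-task design, the sketch below wires one encoder to both an optical flow decoder and an action classifier. It is a minimal sketch only: the module names (Encoder, FlowDecoder, SPM), the flattened feature size, and the use of 101 output classes for UCF-101 are placeholders chosen for illustration, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn

class ResFlowSketch(nn.Module):
    """Minimal sketch of the shared-encoder multi-task idea: one encoder
    feeds both the optical flow decoder and the action classifier."""

    def __init__(self, encoder, flow_decoder, spm, feat_dim=1024, num_classes=101):
        super().__init__()
        self.encoder = encoder            # a pair of RGB frames -> local spatiotemporal feature
        self.flow_decoder = flow_decoder  # local feature -> optical flow for that pair
        self.spm = spm                    # sequential pooling over time steps (Section 3.2)
        self.classifier = nn.Linear(feat_dim, num_classes)  # linear classifier on the global feature

    def forward(self, frame_pairs):
        # frame_pairs: (B, T, 6, H, W); each time step stacks two RGB frames on the channel axis.
        flows, feats = [], []
        for t in range(frame_pairs.shape[1]):
            f = self.encoder(frame_pairs[:, t])   # local spatiotemporal feature F^t
            flows.append(self.flow_decoder(f))    # optical flow head (multi-task branch 1)
            feats.append(f.flatten(1))            # flatten for temporal aggregation
        global_feat = self.spm(torch.stack(feats, dim=1))  # global feature F_c^T (branch 2)
        return flows, self.classifier(global_feat)
```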

3.1. Optical Flow Estimation

3.1.1. Autoencoder for Optical Flow. Inspired by [12], [15], we design an autoencoder architecture, shown in Fig. 2, with multiple pairs of RGB images as input and the corresponding optical flows as output. In the autoencoder, we use five residual units as the encoder to extract spatiotemporal features and four residual units as the decoder. Since the Residual Network [17] has shown a strong ability to capture relationships among pixels, we leverage ResNet for optical flow estimation.

Figure 2: Autoencoder Architecture.

We argue that the optical flow generated at each stage should not be passed through a batch normalization layer [19] after the convolution layer, to avoid rescaling the values, because optical flow is a real-valued vector: its value describes the displacement of a pixel from the first image to the second image along the X and Y axes.

In the encoder and decoder, we utilize the residual unit described in [17] for our network. As shown in Fig. 3, we use the same channels for each residual unit as the corresponding layers in FlowNetS [12]. Considering that adding residual units to the autoencoder alone is still not enough to generate fine-grained optical flow, we concatenate the features from both the encoder and the decoder, as well as the upsampled optical flow, to ensure that the network learns spatiotemporal features at each layer.

Figure 3: Encoder and Decoder. (a)(b) Inspired by [17], we use the residual block as shown, with a slight difference: we replace ReLU with Leaky ReLU [18].
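A minimal sketch of such a residual unit with Leaky ReLU, and of a flow prediction head without batch normalization, is given below. Channel counts, kernel sizes, and the Leaky ReLU slope are illustrative assumptions; the paper only states that the channels follow FlowNetS and that ReLU is replaced by Leaky ReLU. Whether the residual units keep batch normalization internally is not specified either; it is included here following the standard ResNet block of [17].

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual block in the style of [17], with ReLU swapped for Leaky ReLU
    as described for Fig. 3. Channel counts and slope are illustrative."""

    def __init__(self, in_ch, out_ch, stride=1, slope=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(slope, inplace=True)
        # 1x1 projection so the skip connection matches the output shape when needed
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + self.skip(x))

# Flow prediction head: a plain convolution with no batch normalization,
# since rescaling would distort the real-valued (dx, dy) displacements.
def flow_head(in_ch):
    return nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
```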

3.1.2. Optical Flow Refinement. Now that we obtain a high-quality spatiotemporal feature, optical flow estimation is the other important issue. To utilize variational resolution refinement, we use an upsampling layer with stride 2 to resize the smaller optical flow to the current size and treat this upsampled optical flow as a basis for the predicted optical flow at the bigger resolution. Thus, the final optical flow is generated from both the upsampled and the currently predicted optical flow. Since the value of optical flow represents the displacement of each pixel between two images, we use a weighted-sum mechanism to element-wisely sum the final optical flow at each resolution stage. The process of optical flow refinement is shown in Fig. 4.

Figure 4: Optical Flow Refinement. After upsampling the previous flow, the final flow is obtained from the upsampled and current flows by the weighted-sum mechanism.
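A minimal sketch of one refinement stage under this description: the coarser flow is upsampled by a factor of 2 and combined element-wise with the flow predicted at the current resolution. The bilinear upsampling mode and the single learnable mixing weight are assumptions made for illustration; the paper only specifies a stride-2 upsampling layer and a weighted element-wise sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineStage(nn.Module):
    """One coarse-to-fine refinement step: upsample the previous-stage flow
    and fuse it with the current-stage prediction by a weighted element-wise sum."""

    def __init__(self):
        super().__init__()
        # learnable mixing weight between upsampled and newly predicted flow (assumed form)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, coarse_flow, current_flow):
        # coarse_flow: (B, 2, H/2, W/2), current_flow: (B, 2, H, W)
        up = F.interpolate(coarse_flow, scale_factor=2, mode="bilinear", align_corners=False)
        up = up * 2.0  # scale displacements with the resolution (common convention, not stated in the paper)
        return self.alpha * up + (1.0 - self.alpha) * current_flow
```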

The end-point error (EPE) is used as the error indicator to evaluate the performance and is defined as

EPE(u, v, u^0, v^0) = \sum_{i,j} \sqrt{ (u_{i,j} - u^0_{i,j})^2 + (v_{i,j} - v^0_{i,j})^2 }    (1)

where u, v are the displacements in the X and Y directions of the predicted optical flow, respectively, and u^0, v^0 are those of the ground-truth optical flow. The error can be viewed as the magnitude of the difference between the correct and the predicted moving distance.
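As a concrete reading of Eq. (1), the snippet below computes the end-point error between a predicted and a ground-truth flow field. Whether the paper sums or averages over pixels is not recoverable from the garbled equation, so the function exposes both through a reduction flag, which is an assumption.

```python
import torch

def end_point_error(pred_flow, gt_flow, reduction="sum"):
    """End-point error, Eq. (1): per-pixel Euclidean distance between the
    predicted displacement (u, v) and the ground truth (u^0, v^0).

    pred_flow, gt_flow: tensors of shape (B, 2, H, W); channel 0 = u, channel 1 = v.
    reduction: 'sum' follows the summation form of Eq. (1); 'mean' gives the
    average EPE commonly reported on FlyingChairs (assumption).
    """
    diff = pred_flow - gt_flow
    per_pixel = torch.sqrt(diff[:, 0] ** 2 + diff[:, 1] ** 2)  # (B, H, W)
    if reduction == "mean":
        return per_pixel.mean()
    return per_pixel.sum()
```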

Because the upsampled optical flow should also be as close to the ground truth as possible, we use a weighted loss function to address this issue. The total loss at each stage is described as

SubLoss(u, v, u^0, v^0, u^{up}, v^{up}) = EPE(u, v, u^0, v^0) + \lambda \, EPE(u^{up}, v^{up}, u^0, v^0)    (2)

where u^{up}, v^{up} represent the displacements in the X and Y directions of the upsampled optical flow, respectively, and \lambda is the coefficient that controls the tendency toward the upsampled or the newly generated optical flow. The overall error for training the network is

FlowLoss = \sum_{k=1}^{5} \lambda_k \, SubLoss(u_k, v_k, u^0_k, v^0_k, u^{up}_k, v^{up}_k)    (3)

where \lambda_k is the coefficient at the k-th stage, which is 0.32, 0.08, 0.02, 0.01, and 0.01, respectively.
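A minimal sketch of Eqs. (2) and (3) under the definitions above, reusing the end_point_error helper from the previous snippet. The per-stage weights follow the values quoted in the text; the single lambda inside SubLoss is left as a parameter because its value is not given here.

```python
# Reuses end_point_error() from the EPE sketch above.

STAGE_WEIGHTS = [0.32, 0.08, 0.02, 0.01, 0.01]  # lambda_k for the five stages, as stated in the text

def sub_loss(pred, upsampled, gt, lam):
    """Eq. (2): penalize both the stage prediction and the upsampled flow."""
    return end_point_error(pred, gt) + lam * end_point_error(upsampled, gt)

def flow_loss(preds, upsampled_flows, gts, lam):
    """Eq. (3): weighted sum of the per-stage SubLoss over the five resolution stages.

    preds, upsampled_flows, gts: lists of (B, 2, H_k, W_k) tensors, one per stage k.
    lam: the trade-off coefficient of Eq. (2) (value not specified in the text).
    """
    total = 0.0
    for w, p, up, gt in zip(STAGE_WEIGHTS, preds, upsampled_flows, gts):
        total = total + w * sub_loss(p, up, gt, lam)
    return total
```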

3.2. Sequentially Pooling Mechanism

Despite the fact that each local spatiotemporal feature contains the temporal information of its own time interval, not all of the local features contribute to predicting action recognition. Thus, a confidence score for each local spatiotemporal feature should be trained and learned. Inspired by [2], we design the Sequentially Pooling Mechanism (SPM), shown in Fig. 5, which generates the confidence score, S^t, of each residual input, R^t, via three fully connected layers at time t. Each residual input is calculated by subtracting F_c^{t-1} from F^t:

R^t = F^t - F_c^{t-1}    (4)

S^t = \phi(R^t), \quad S^t \in [0, 1]    (5)

F_c^t = F_c^{t-1} + S^t R^t    (6)

where S^t and F^t denote the confidence score and the feature map generated after Resblock f5 at time step t, t \in \{2, \dots, T\}, respectively, and \phi represents a neural network with multiple linear classifiers.

Figure 5: Sequentially Pooling Mechanism. SPM sequentially processes the input spatiotemporal features and updates the global feature, F_c.

Initially, the condensed global spatiotemporal feature, F_c^1, is equal to the first spatiotemporal feature, F^1, at time t = 1. Under this circumstance, SPM is able to distinguish whether the current residual input is relevant or not and give a confidence score based on both the accumulated condensed spatiotemporal feature and the residual input. To sum up, F_c is computed by aggregating the local spatiotemporal features, F^t, at each time step t:

F_c^T = F^1 + \sum_{t=2}^{T} S^t R^t    (7)

which is equivalent to

F_c^T = \sum_{t=1}^{T} w_t F^t    (8)

where w_t represents the proportion of each local spatiotemporal feature in the global spatiotemporal feature, and the summation of these proportions over the local spatiotemporal features is equal to 1. As a result, SPM leverages shared-weight fully connected layers to calculate the confidence scores fairly.

Noticeably, SPM processes the spatiotemporal features sequentially to determine the confidence score of the current feature. Unlike directly training a weight for each local spatiotemporal feature to aggregate them, SPM judges the confidence under a fair criterion to generate the global spatiotemporal feature.
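The sketch below implements the recursion of Eqs. (4)-(6) over a sequence of local features. The three-layer fully connected scorer with a sigmoid output is an assumption consistent with the text ("three fully connected layers", S^t in [0, 1]); the hidden size is illustrative.

```python
import torch
import torch.nn as nn

class SequentiallyPoolingMechanism(nn.Module):
    """Sequential pooling of Eqs. (4)-(6): at each step the residual input
    R^t = F^t - F_c^{t-1} is scored in [0, 1] and blended into the running
    global feature F_c^t = F_c^{t-1} + S^t * R^t."""

    def __init__(self, feat_dim, hidden_dim=512):
        super().__init__()
        # Shared-weight scorer applied at every time step; three FC layers
        # with a sigmoid so that S^t lies in [0, 1] (hidden size assumed).
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, T, D) local spatiotemporal features F^1 ... F^T
        f_c = feats[:, 0]                      # F_c^1 = F^1
        for t in range(1, feats.shape[1]):
            r_t = feats[:, t] - f_c            # Eq. (4)
            s_t = self.scorer(r_t)             # Eq. (5), shape (B, 1)
            f_c = f_c + s_t * r_t              # Eq. (6)
        return f_c                             # F_c^T
```

Because F_c^t = (1 - S^t) F_c^{t-1} + S^t F^t, the step scores unroll into weights w_t = S^t \prod_{k>t} (1 - S^k) (with S^1 taken as 1), which sum to 1; this is the equivalence between Eqs. (7) and (8) noted above.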

4. Experiments

In this section, we evaluate ResFlow on two aspects: optical flow estimation and action recognition.

Table 1: Optical Flow Estimation Comparison. We evaluate the optical flow estimation of ResFlow on the FlyingChairs dataset and compare it with the state-of-the-art.

Table 2: Refinement Evaluation. Each stage of optical flow estimated by ResFlow on the FlyingChairs dataset.

Table 3: Action Recognition Results. Comparison with the state-of-the-art on the UCF-101 dataset.

[...] that adding residual units has improved optical flow estimation. To show the ability of ResFlow, we visualize some cases of the optical flow outcome in Fig. 6.

2) Refinement: Referring to Section 3.1.2, we introduce a novel design which refines the upsampled optical flow and builds a connection of optical flow e[...]
