Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Guiyu Liu, Jiuchao Qian, Fei Wen, Xiaoguang Zhu, Rendong Ying, Peilin Liu (1)

(1) All authors are with the School of Electronic Information and Electrical Engineering, Shanghai Jiaotong University, wenfee

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019

Abstract: Action recognition has wide applications in assisted living, health monitoring, surveillance, and human-computer interaction. Among traditional action recognition methods, RGB video-based ones are effective but computationally inefficient, while skeleton-based ones are computationally efficient but do not make use of low-level detail information. This work considers action recognition based on a multimodal fusion of the 3D skeleton and the RGB image. We design a neural network that takes a 3D skeleton sequence and a single middle frame of an RGB video as input. Specifically, our method picks one frame from a video and extracts spatial features from it using two attention modules: a self-attention module and a skeleton-attention module. Temporal features are extracted from the skeleton sequence via a BI-LSTM sub-network. Finally, the spatial features and the temporal features are combined by a feature fusion network for action classification. A distinct feature of our method is that it uses only a single RGB frame rather than an RGB video. Accordingly, it has a lightweight architecture and is more efficient than RGB video-based methods. Comparative evaluation on two public datasets, NTU-RGBD and SYSU, demonstrates that our method achieves competitive performance compared with state-of-the-art methods.

I. INTRODUCTION

Human action recognition is a fundamental problem in computer vision which arises in many important applications, such as assisted living, health monitoring, surveillance, and human-computer interaction. Recently, benefiting from powerful deep learning techniques, much progress has been made in human action recognition [1], [2], [3]. In particular, with the development of depth cameras (e.g., Microsoft Kinect [4]), using 3D skeleton information for action recognition has achieved impressive performance [5], [6], [7], [8], [9], [10].

Generally, in terms of the inputs, action recognition methods can be roughly classified into three classes: 2D video-based, depth image-based, and 3D skeleton-based methods. Most of the early approaches are designed based on 2D videos, among which those built on the two-stream network architecture are the most effective [1], [11], [3]. In such methods, the two-stream architecture is designed to jointly process RGB images and optical flow images. The main idea behind these methods is that RGB images contain human-body and object information, while flow images contain the movement information of the pixels. The depth image-based method was first proposed in [12] and then extended to fuse with skeleton and RGB information in [13], [14]. Depth image-based methods are insensitive to illumination variation but vulnerable to viewpoint change. The third class uses 3D skeleton information as input. Skeleton information refers to 3D coordinates estimated from depth images. Thanks to its robustness and view-invariant property, 3D skeleton-based action analysis has become increasingly popular recently [5], [6], [7], [8], [9], [10].
However, such approaches are limited by the fact that low-level (e.g., pixel-level) information is lost in the extracted 3D skeleton; meanwhile, skeleton-only information is often insufficient to distinguish among actions that involve various human-object interactions. To further improve the robustness and accuracy of action recognition, this work proposes a 3D skeleton and RGB fused method. A neural network is designed to fuse the temporal information extracted from a 3D skeleton sequence with the spatial information extracted from a middle frame of an RGB video.

Inspired by the 2D video-based method [15], we extract the temporal information from the skeleton sequence. Just like stacked flow images, a skeleton sequence also contains the movement information of human bodies. But as a high-level feature, skeleton data has a much lower dimension than flow images and can be efficiently extracted from a 3D camera. While stacked flow images are usually processed by a convolutional neural network (CNN), we use a long short-term memory (LSTM) network to extract the temporal movement features from the skeleton sequence. In exploiting the spatial information, only the middle frame of the RGB video is used, from which the spatial information is extracted by a convolutional stream coupled with two attention modules. Fig. 1 shows the overall architecture of our model, which includes two attention modules: a self-attention module and a skeleton attention module. The skeleton attention module is directly determined by the skeleton information and does not need to be trained. Through the two attention modules, the most interesting part of an action can be extracted. Finally, the spatial features from the RGB stream and the temporal features from the skeleton stream (the LSTM sub-network) are fused.

The main contributions of this paper are as follows. First, based on the skeleton information, we propose a preprocessing strategy and design a self-attention module to determine the visual saliency in the RGB frame associated with an action. Then, skeleton movement information is utilized to generate a handcrafted attention to reinforce the attention mechanism (the skeleton attention module). Further, we build an end-to-end network to fuse the information from the 3D skeleton sequence and the RGB frame. To the best of our knowledge, this work is the first for action recognition that considers multimodal fusion between a 3D skeleton and an RGB frame.

Fig. 1. The overall architecture of our model. It has two inputs, a skeleton sequence and an RGB frame. It has three parts: an RGB stream, a skeleton stream, and a feature fusion network.

II. RELATED WORK

3D Skeleton Action Recognition: Discriminative features and temporal memory are the two main factors considered in skeleton-based action recognition. Most traditional methods utilize handcrafted features to characterize actions, such as skeleton joint positions [16], [17] and pairwise relative joint positions [18], [19]. For temporal memory, relatively simple time series models are popular [20], e.g., dynamic time warping (DTW) [21] and hidden Markov models [22], [23]. Deep learning methods for skeleton-based action recognition can be mainly divided into two categories, RNN/LSTM-based and CNN-based approaches.
RNN-based approaches directly take 3D skeleton joint coordinates as input and use an RNN/LSTM to memorize the development of actions [5], [6], [7], [8]. CNN-based approaches first transform skeleton joints into various spectrum images; these images are then sent to a CNN to extract spatial and temporal features [9], [10]. In CNN-based approaches, various kernels have been designed to exploit the temporal relationship.

Attention Mechanism: The attention mechanism was first used in image captioning [24] and then widely adopted in many fields such as speech recognition [25], machine translation [26], and image segmentation [27]. Traditional attention methods use an LSTM network to utilize the hidden states from previous time steps [28]. In 2D video action recognition, Sharma et al. [29] use a multilayer LSTM model to focus on relevant spatial parts of the input frame. Since we only use a middle frame for RGB information extraction, an LSTM is not applied in our model. Instead, similar to the method in person re-identification [30], we use a convolutional model to extract the most interesting part from an RGB image.

Fusion Methods in Action Recognition: The fusion of multiple feature streams can boost the classification performance. In 2D action recognition, Karpathy et al. [31] propose a fusion convolutional network which fuses layers corresponding to different input frames at various levels of a deep network. Besides feature fusion, score fusion such as averaging softmax scores has also achieved good results on the UCF101 [32] and HMDB51 [33] datasets. In fusing softmax scores, an SVM can yield better performance than simple averaging. Further, Wu et al. [34] propose a new score fusion strategy that utilizes the class relationships in the data after network training. In 3D skeleton action recognition, Zhang et al. [35] use softmax fusion to fuse different geometric features such as joint-joint distances and line-line angles. Rahmani et al. [13] fuse depth images and skeleton data; features from the multimodal information are concatenated before being sent to a fully connected layer for classification. In our method, the RGB feature and the skeleton feature are fused to make better use of both the spatial information and the temporal information.

III. METHODOLOGY

The methodology is mainly divided into three parts. The first part is skeleton feature extraction, implemented by a stacked LSTM network. The second part is the RGB feature extraction stream, which includes two attention modules: a self-attention module and a skeleton attention module. The last part is the fusion of the temporal skeleton features and the spatial features.

A. Stacked LSTMs for Skeleton Feature Extraction

LSTM [28] can handle sequential data with varying numbers of time steps. Compared with the RNN, the LSTM can learn long-range dependencies [36] and avoid the problem of vanishing gradients. A typical LSTM neuron contains an input gate $i_t$, a forget gate $f_t$, a cell state $c_t$, an output gate $o_t$, and an output response $h_t$. The LSTM transition equations can be expressed as:

$$
\begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}
\qquad (1)
$$

$$ c_t = f_t \odot c_{t-1} + i_t \odot u_t \qquad (2) $$

$$ h_t = o_t \odot \tanh(c_t) \qquad (3) $$

As shown in Fig. 1, the skeleton stream is used for processing the skeleton sequence. Its main part is a simple 3-layer LSTM. The input skeleton sequence is denoted by $I \in \mathbb{R}^{N \times 3 \times T}$, where $N$ is the number of skeleton joints in a frame and $T$ is the total number of time steps in the skeleton sequence. For each time step, the input is $x_t \in \mathbb{R}^{N \times 3}$. The output feature is the cell state of the last time step.
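The paper does not provide an implementation or name a deep learning framework, so the following is only a minimal PyTorch sketch of how the skeleton stream described above could look: a 3-layer stacked LSTM over flattened per-frame joint coordinates, whose last-layer cell state serves as the temporal feature. The class name, the joint count (25, as in NTU-RGBD), and the hidden size are illustrative assumptions rather than values taken from the paper.

import torch
import torch.nn as nn

class SkeletonStream(nn.Module):
    """Sketch of the skeleton stream: a 3-layer stacked LSTM (Eqs. (1)-(3))."""
    def __init__(self, num_joints=25, hidden_size=128, num_layers=3):
        super().__init__()
        # Each time step x_t in R^{N x 3} is flattened into a 3N-dimensional vector.
        self.lstm = nn.LSTM(input_size=num_joints * 3,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)

    def forward(self, skeleton):
        # skeleton: (batch, T, N, 3) joint coordinates.
        b, t, n, c = skeleton.shape
        x = skeleton.reshape(b, t, n * c)
        _, (h_n, c_n) = self.lstm(x)
        # The temporal feature is the cell state of the last layer
        # at the final time step, as stated in the text.
        return c_n[-1]                      # shape: (batch, hidden_size)

# Example: two sequences of 60 frames with 25 joints each.
feature = SkeletonStream()(torch.randn(2, 60, 25, 3))   # -> (2, 128)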
In the skeleton sequence, the raw 3D skeleton coordinates are given in the camera coordinate system, whose origin is the location of the depth camera. In order to eliminate the influence of the camera sensor position and the action position, the pre-processing method VA-pre [23] is employed. This pre-processing transforms the origin of the coordinate system to the body center of the first frame.

B. The RGB Stream and the Attention Modules

The RGB stream consists of three parts: base convolution layers, a self-attention module, and a skeleton attention module. The base convolution layers are used to extract feature maps; the self-attention module and the skeleton attention module are used to generate attention weights.

Considering both accuracy and the number of parameters, the front layers of the Xception network [37] are selected as the base convolutional layers for extracting RGB feature maps. The Xception network is trained on ImageNet for object recognition and localization over 1,000 classes. In Kinect-captured 3D datasets, however, RGB videos contain more human bodies and fewer objects, which is quite different from the ImageNet data. In order to transfer the Xception network to the action recognition task, the self-attention module and the skeleton attention module are proposed to extract the salient part of an action.

Preprocessing: Depth sensors capture RGB videos and skeleton sequences at the same time, so an RGB video has a one-to-one mapping with a skeleton sequence. Since we only intend to extract information about human limbs and objects, an entire RGB video is redundant for spatial information. In the proposed method, we choose only the middle frame of an RGB video to extract the spatial feature, while the skeleton sequence is used to capture temporal changes.

Fig. 2. Self-attention module. The input is the feature maps extracted by the Xception network. Layers in the dashed-line box are used for generating attention weights. "Pool" means global average pooling; "Linear" means a linear transformation that reduces the dimension to 256. This module has 2 repeated branches of the dashed-line box.

In the chosen RGB frame, the human action accounts for only a small part of the whole image. Especially in human-object interaction actions, background objects produce huge interference. In order to overcome this problem, we propose a projection-crop method to preprocess images. Given the 3D skeleton coordinates and the camera parameters, the 2D pixel coordinates can be calculated through the projection equation, so the bounding box of the human subject can be easily determined. Projection crop means cropping the part of the image containing the human subject according to this bounding box. We can also perform image augmentation through cropping. Let $w$ and $h$ be the width and height of the bounding box. Using the four corners of the bounding box as origins, we crop the original image with width $w + w'$ and height $h + h'$, where $w'$ and $h'$ range from 100 to 300 pixels randomly. Thus the number of images is augmented by a factor of 4.
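To make the projection-crop step concrete, the NumPy sketch below (an assumption about the implementation, not the authors' code) computes the subject bounding box from already-projected 2D joint coordinates and produces the four randomly enlarged crops; one plausible reading of the corner rule, adopted here, is that each crop window extends from its corner so that it still contains the subject.

import numpy as np

def projection_crop(image, joints_2d, rng=None):
    """Crop the human subject from an RGB frame using the bounding box of the
    2D-projected skeleton joints, returning four randomly enlarged crops.

    image:     H x W x 3 RGB frame (numpy array).
    joints_2d: N x 2 pixel coordinates, assumed already projected from the
               3D joints with the camera parameters.
    """
    rng = rng or np.random.default_rng()
    img_h, img_w = image.shape[:2]

    x_min, y_min = joints_2d.min(axis=0).astype(int)
    x_max, y_max = joints_2d.max(axis=0).astype(int)
    w, h = x_max - x_min, y_max - y_min

    crops = []
    # One crop per bounding-box corner, enlarged by random margins w', h' in [100, 300].
    for cx, cy in [(x_min, y_min), (x_max, y_min), (x_min, y_max), (x_max, y_max)]:
        w_extra = int(rng.integers(100, 301))
        h_extra = int(rng.integers(100, 301))
        x0 = cx if cx == x_min else cx - (w + w_extra)   # window extends toward the subject
        y0 = cy if cy == y_min else cy - (h + h_extra)
        x0, y0 = max(0, x0), max(0, y0)
        x1 = min(img_w, x0 + w + w_extra)
        y1 = min(img_h, y0 + h + h_extra)
        crops.append(image[y0:y1, x0:x1])
    return crops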
Self-attention module: Inspired by the method for extracting body-part features in person re-identification [30], we propose a self-attention module to process the feature maps. The self-attention module aims to extract the visual saliency of an action. It differs from the original soft attention module used in image captioning [24], as explained in the following.

The original soft attention mechanism can be expressed as (4) and (5):

$$ e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}) \qquad (4) $$

$$ \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})} \qquad (5) $$

where $f_{\mathrm{att}}$ denotes the attention module, which is a multilayer perceptron, $t$ denotes the time step, $i$ indexes the image location, $a_i$ denotes the image feature at location $i$, and $h_{t-1}$ denotes the hidden context state of the previous time step. After $e_{ti}$ goes through the softmax in (5), $\alpha_{ti}$ is the final attention weight, a probability between 0 and 1. In the action recognition task, we do not have context information as in image captioning; hence, the hidden state $h_{t-1}$ and the time step $t$ are not available. The mechanism of our self-attention module is expressed in (6) and (7), and its architecture is shown in Fig. 2:

$$ e_i = \mathrm{Conv}_{1 \times 1}(a_i), \quad \text{for } i = 1, \ldots, L \qquad (6) $$

$$ \alpha_i = \mathrm{sigmoid}(e_i), \quad \text{for } i = 1, \ldots, L \qquad (7) $$

In (6), a $1 \times 1$ convolution is used to replace the $f_{\mathrm{att}}$ attention module, where $L = w_f \times h_f$, and $w_f$ and $h_f$ denote the width and height of the feature maps. For the single input $a_i$, the $1 \times 1$ convolution has the same function as a multilayer perceptron. In (7), a sigmoid function is used to replace the softmax function; the sigmoid also yields an attention weight in the range [0, 1]. Typical results of our self-attention module are shown in Fig. 4. We can see that self-attention captures the most important part of an action as well as the global information.

Fig. 3. The generating procedure of the skeleton attention weight. The first set of coordinates corresponds to the first frame and the second to the middle frame. By this calculation, the right hand has the largest moving distance; the yellow region around the right hand is the added attention mask. Resizing the attention mask gives the final skeleton attention weight.

Skeleton attention module: In the self-attention module, class labels are the only supervision when training the attention weights. We therefore propose another, stronger attention mechanism that is generated directly from the skeleton sequence. The generation procedure of the skeleton attention is shown in Fig. 3. First, we find the skeleton joint with the largest moving distance $d_{\max}$, which is computed as

$$ d_{\max} = \| J_{1, j_{\max}} - J_{\mathrm{mid}, j_{\max}} \|_2 \qquad (8) $$

with

$$ j_{\max} = \arg\max_j \| J_{1, j} - J_{\mathrm{mid}, j} \|_2 \qquad (9) $$

In (9), $J_1$ and $J_{\mathrm{mid}}$ stand for the 3D joint locations of the first frame and the middle frame, respectively. By calculating the moving distances, the index $j_{\max}$ of the joint with the largest change is found. Second, we generate the attention mask $M$, which has the same size as the RGB frame: we set $M_p = 1$ within a square centered at joint $j_{\max}$ and $M_p = 0$ at other locations, where $p$ denotes the pixel location in the mask. Finally, the attention mask is resized to $w_f \times h_f$, the same size as the feature maps, which gives the skeleton attention weight. The skeleton attention weight is multiplied elementwise with the feature maps, followed by global average pooling and a linear transformation. Thus we obtain another attention feature of the original feature maps.
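As an illustration of the two attention modules, the PyTorch sketch below implements Eqs. (6)-(7) as a 1x1 convolution followed by a sigmoid (with the global average pooling and 256-dimensional linear layer from Fig. 2), and builds the skeleton attention weight by placing a square mask at the fastest-moving joint and resizing it to the feature-map size. The square half-width, the use of already-projected 2D joints for both the displacement and the mask center (the text computes displacement from the 3D coordinates), and all names are simplifying assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBranch(nn.Module):
    """Eqs. (6)-(7): location-wise attention from a 1x1 convolution + sigmoid."""
    def __init__(self, channels, out_dim=256):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)    # e_i for every location i
        self.linear = nn.Linear(channels, out_dim)             # dimension reduction to 256

    def forward(self, feat):                      # feat: (B, C, h_f, w_f)
        alpha = torch.sigmoid(self.score(feat))   # attention weights alpha_i in [0, 1]
        weighted = feat * alpha                   # re-weight each spatial location
        pooled = weighted.mean(dim=(2, 3))        # global average pooling -> (B, C)
        return self.linear(pooled)                # attention feature, (B, out_dim)

def skeleton_attention_weight(joints_first, joints_mid, frame_hw, feat_hw, half=40):
    """Eqs. (8)-(9): square mask at the joint with the largest displacement,
    resized to the feature-map size (joints given as projected 2D pixel coords)."""
    disp = torch.linalg.norm(joints_first - joints_mid, dim=-1)   # per-joint distance
    j_max = int(torch.argmax(disp))
    cx, cy = joints_mid[j_max].round().long().tolist()

    H, W = frame_hw
    mask = torch.zeros(1, 1, H, W)
    mask[:, :, max(0, cy - half):cy + half, max(0, cx - half):cx + half] = 1.0
    return F.interpolate(mask, size=feat_hw, mode="nearest")      # (1, 1, h_f, w_f)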
C. 3D Skeleton/2D Frame Network Fusion

After utilizing the skeleton information to supervise the RGB attention, the outputs of the two sub-network streams are combined for action classification. Inspired by the two-stream fusion methods [38], [39], [1] and the multi-task learning method [10], feature fusion is applied to improve the final performance. The fusion part is shown in Fig. 1. First, we concatenate the features from the skeleton stream and the RGB stream. Then, we apply $\ell_2$ normalization to the concatenated feature. Subsequently, a fully connected layer with leaky ReLU as the activation function is used; this layer is intended to find the non-linear relationship between the temporal skeleton feature and the spatial feature. The last two layers are fully connected softmax layers, which are used for classification.

The training procedure is as follows. All the loss functions used in the training procedure are the cross-entropy loss. For the overall architecture in Fig. 1, remove the feature fusion part and make the two stream sub-networks independent networks by adding a fully connected (FC) layer and a softmax layer on top. Train these two sub-networks
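The fusion network just described (feature concatenation, $\ell_2$ normalization, one fully connected layer with leaky ReLU, and two fully connected layers ending in a softmax for classification) might be sketched as follows. The layer widths, the class count (60, as in NTU-RGBD), and the pairing with a cross-entropy objective via log-softmax are assumptions, and the staged pre-training of the two sub-networks is omitted because the remainder of the procedure is not included in this excerpt.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Sketch of the fusion network: concat -> l2 norm -> FC(leaky ReLU) -> 2 FC -> softmax."""
    def __init__(self, skel_dim=128, rgb_dim=512, hidden=256, num_classes=60):
        super().__init__()
        self.mix = nn.Linear(skel_dim + rgb_dim, hidden)   # non-linear cross-modal mixing
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, skel_feat, rgb_feat):
        x = torch.cat([skel_feat, rgb_feat], dim=1)        # feature concatenation
        x = F.normalize(x, p=2, dim=1)                     # l2 normalization
        x = F.leaky_relu(self.mix(x))
        x = F.leaky_relu(self.fc1(x))
        # Log-softmax over classes; pair with nn.NLLLoss for a cross-entropy objective.
        return F.log_softmax(self.fc2(x), dim=1)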
