SGANVO: Unsupervised Deep Visual Odometry and Depth Estimation with Stacked Generative Adversarial Networks

Tuo Feng¹ and Dongbing Gu¹

¹ Tuo Feng and Dongbing Gu are with the School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK. {tfeng, dgu}@essex.ac.uk

IEEE Robotics and Automation Letters (RA-L) paper presented at the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019. Copyright 2019 IEEE.

Abstract— Recently, end-to-end unsupervised deep learning methods have demonstrated an impressive performance for visual depth and ego-motion estimation tasks. These data-based learning methods do not rely on the same limiting assumptions that geometry-based methods do. The encoder-decoder network has been widely used in depth estimation, and the RCNN has brought significant improvements in ego-motion estimation. Furthermore, the latest use of Generative Adversarial Networks (GANs) in depth and ego-motion estimation has demonstrated that the estimation could be further improved by generating pictures in the game learning process. This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Networks (SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamic due to the use of a recurrent representation across the layers. See Fig. 1 for details. We select the most commonly used KITTI [1] data set for evaluation. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation.

[Fig. 1: Our proposed SGANVO architecture for the depth and ego-motion estimation. It is a series of stacked GANs. The bottom layer estimates the depth and ego-motion; the other layers estimate the spatial features. Legend: El error unit, Gl Generator, Dl Discriminator, Rl recurrent representation, Al error image, Âl synthetic feature image, V view reconstructor, d depth estimator, p ego-motion estimator; It_l / It_r left/right image at step t, It-1_l left image at step t-1, Ît_r fake right image at step t; Ft_0: R0 outputs feature image.]

I. INTRODUCTION

Object depth estimation and camera ego-motion estimation are two essential tasks in autonomous robotic applications. Impressive progress has been achieved by using various geometry-based methods. Recently, unsupervised deep learning based methods have demonstrated a certain level of robustness and accuracy in some challenging scenes without the need for a labeled training data set.

Inspired by the image warping technique, the spatial transformer [2], Garg et al. [3] proposed an unsupervised Convolutional Neural Network (CNN) to estimate the depth by using the left-right photometric constraint of stereo image pairs. Godard [4] further extended this method by employing a loss function with the left and right images warping across each other. Because both left and right photometric losses are penalized, this method improved the accuracy of depth estimation. Zhou et al. [5] proposed two separate networks to infer the depth and ego-motion estimation over three temporally consecutive monocular image frames. The middle frame performs as the target frame, and the previous and the following frames serve as the source frames. Yin [6] proposed GeoNet, which estimates not only the depth and ego-motion but also the optical flow for dynamic objects. Similarly, Godard et al. [7] proceeded to merge the pose network with the depth network through sharing the network weights. Furthermore, benefiting from the recent advance of deep learning methods for single image super-resolution, Pillai et al. [8] proposed the sub-pixel convolutional layer extension for obtaining depth super-resolution. Their superior pose network is bootstrapped with much more accurate depth estimation.

Most recent works are all based on encoder-decoder neural networks, but with different loss functions based on the view reconstruction approach. Li et al. [9] designed their loss functions by not only combining the loss functions defined in [4] and [5] but also adding the 3D point transformation loss. However, their results do not perform very well in scenes where dynamic and occluded objects occupy a large part of the field of view, due to the artificially designed rigid transformation used. To deal with dynamic and occluded objects in the scenes, Ranjan [10] and Wang [11] introduced additional networks to estimate the optical flow jointly with the depth and pose networks from videos. Ranjan used two networks to learn the optical flow and dynamic object masks, and jointly computed the flow and mask loss functions to refine the depth and pose networks. Wang added the optical flow estimation network to deal with the dynamic objects and refine the depth network. Due to the use of a PWC network [12] to handle the stereo depth estimation, Wang's Undepthflow improved the unsupervised depth estimation by an order of magnitude, and its pose estimation also reached the optimal rank.

[Fig. 2: The network is unfolded in time. The temporal dynamic is captured in the recurrent representation. Panels: initial states Sl, feature extraction layers F2/F1/F0, ego-motion representation layer, state transitions over steps t = 0, 1, 2, 3, with left and right images as inputs.]

Almalioglu [13] introduced a visual odometry, GANVO, to estimate the depth map using a Generative Adversarial Network [14]. Its Generator network generates the depth map and the pose regressor infers the ego-motion. Together they build up the view reconstruction. Then its Discriminator network games the original target image and the reconstructed image to refine the depth and pose networks. This work is similar to [15], [16], [17], [18], which use the Generator to generate the depth map instead of using estimation networks. From these works it can be seen that the generative nature of GAN networks is beneficial to scenes with dynamic objects. However, these visual odometry related GANs focus on the depth estimation but not on the ego-motion estimation.

In this paper, we propose a novel unsupervised deep visual odometry system with Stacked Generative Adversarial Networks (SGANVO), see Fig. 1. Our main contributions are as follows:

- To the best of our knowledge, this is the first time the stacked generative and adversarial learning approach is used for joint ego-motion and depth map estimation.
- Our learning system is based on a novel unsupervised GAN scheme without the need for ground truth.
- Our system possesses a recurrent representation which can capture the temporal dynamic features.

The outline of this paper is organized as follows. Section II gives an overview of our proposed SGANVO system. Section III describes the various loss functions used for the Generator and Discriminator. Section IV presents our experimental results of depth and ego-motion estimation. Finally, conclusions and future work are drawn in Section V.
II. SYSTEM OVERVIEW

We address the depth and the ego-motion estimation as a whole visual odometry system in Fig. 1. The system consists of an ego-motion representation layer for ego-motion estimation and multiple feature extraction layers for feature estimation. The error image Al from the error unit E(l-1) in the current layer is propagated up to the higher layer as the input. The hidden state in the recurrent representation layer Rl is propagated down to the lower layer for capturing the temporal dynamic. There are mainly two units, Generator Gl and Discriminator Dl, in each layer.

Fig. 2 shows the network unfolded in time steps. After the initial states Sl are set to the network, the current left and right frames are fed to the network's input. The network runs for a sequence of N consecutive frames. After N steps, the initial states are set to the network again.

A. Generator

In the bottom layer, the Generator consists of the depth estimator d, the ego-motion estimator p and the view reconstructor V. It takes the hidden state from the recurrent representation R0 as its input and generates an estimated image from the view reconstructor V for the current input image. In the higher layer (l > 0), the Generator is a convolutional layer which takes the hidden state from the recurrent representation Rl as its input and generates an estimated error image Âl for the input Al from the lower layer.

The recurrent representation Rl is a convolutional Long Short-Term Memory network (ConvLSTM) [19] and is connected top-down between layers. The hidden state R^l_t is updated according to R^l_{t-1}, R^{l+1}_t and E^l_{t-1}. It is used to store the temporal dynamic from consecutive video frames, and is defined as

$R^l_t = \mathrm{ConvLSTM}(E^l_{t-1}, R^l_{t-1}, \mathrm{UpSample}(R^{l+1}_t))$   (1)

The hidden state R^l_t is updated through two passes: a top-down pass from the UpSample of R^{l+1}, and then a level pass calculated from the previous hidden state R^l_{t-1} and the previous error E^l_{t-1}.

[Fig. 3: Qualitative comparisons of our SGANVO (image size 416x128) with Monodepth [4] (image size 512x256). The depth maps show that our approach produces qualitatively better depth estimates with crisp boundaries.]

The depth estimator is an encoder-decoder network that generates the multi-scale dense depth map, like the one used in [4]. Inspired by [8], we replace the UpSample branches in the depth decoder with the sub-pixel convolutional branches used in [20] to generate the up-scaled features. The sub-pixel convolutional branches consist of a sequence of three consecutive 2D convolutional layers with 64, 32 and 4 output channels and 1-pixel stride. The depth estimator directly generates the dense depth map by using a pair of stereo images, separately concatenating the features from unit R^0_t along with the image channel to train the network.
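To make the top-down recurrence in Eq. (1) concrete, the following is a minimal PyTorch-style sketch. The names ConvLSTMCell and update_representations, the nearest-neighbour up-sampling, and the per-layer channel sizes are assumptions for illustration only; PyTorch has no built-in ConvLSTM cell, so a simple one is included. This is a sketch of the recurrence, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (illustrative only)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

def update_representations(cells, E_prev, R_prev, C_prev):
    """Top-down pass of Eq. (1): the highest layer is updated first; each lower
    layer then receives UpSample(R^{l+1}_t) together with its own previous error
    E^l_{t-1} and previous hidden state R^l_{t-1}."""
    L = len(cells)
    R_t, C_t = [None] * L, [None] * L
    for l in reversed(range(L)):
        inp = E_prev[l]
        if l < L - 1:  # concatenate the up-sampled higher-layer state for all but the top layer
            top_down = F.interpolate(R_t[l + 1], scale_factor=2, mode="nearest")
            inp = torch.cat([inp, top_down], dim=1)
        R_t[l], C_t[l] = cells[l](inp, (R_prev[l], C_prev[l]))
    return R_t, C_t
```

Updating the layers from the top down means each ConvLSTM cell can condition on the already-updated, coarser state of the layer above, which is how the stack propagates temporal context downwards.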
The ego-motion estimator is a VGG-based CNN architecture. It takes the current image and the hidden state representing the information in the previous frame as the input, and generates the 6-DoF ego-motion estimation between the current frame and the previous frame. We decouple the translation and the rotation with two separate groups of fully connected layers after the last convolutional layer for better performance. We feed the input data, concatenating two consecutive images and the features from unit R^0_t along with the image channel, to train the ego-motion estimator.

The view reconstructor V takes the depth estimate, the ego-motion estimate and the previous frame I_{t-1} as the input and generates a predicted image Î_t for the current frame I_t. By using the spatial transformer network [2] and the homogeneous coordinate of the pixel I_{t-1}(i,j) in the (t-1)-th frame, we can derive its corresponding pixel I_t(i,j) in the t-th frame through

$I_t(i,j) = K \, T_{t-1 \to t} \, d^{-1}_{t-1}(i,j) \, K^{-1} I_{t-1}(i,j)$   (2)

where K is the camera intrinsic matrix, d_t(i,j) is the estimated disparity, and T_{t-1->t} is the camera coordinate transformation matrix from the (t-1)-th frame to the t-th frame generated by the ego-motion estimator. Based on this, Î_{t-1} and Î_t can be constructed from I_t and I_{t-1}, respectively.

The error units in the bottom layer and the higher layers are computed as below:

$E^0_t = [\mathrm{ReLU}(I_t - \hat{I}_t);\ \mathrm{ReLU}(\hat{I}_t - I_t)]$   (3)

$E^l_t = [\mathrm{ReLU}(A^l_t - \hat{A}^l_t);\ \mathrm{ReLU}(\hat{A}^l_t - A^l_t)]$   (4)

where E^0_t is the error image between the image generated from the view reconstructor and the current frame at time t, E^l_t is the error image at the higher layer (l > 0), and ReLU is an activation function operation. The input error image A^l_t is defined as

$A^l_t = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}(E^{l-1}_t)))$   (5)

where Conv denotes the convolutional operation. The Generator in the higher layer (l > 0) consists of one convolutional unit with 3-dimensional filters to generate the error feature image Â^l_t:

$\hat{A}^l_t = \mathrm{ReLU}(\mathrm{Conv}(R^l_t))$   (6)

B. Discriminator

The Discriminator is a convolutional network that can categorize the images fed to it. In the bottom layer, we feed the generated image Î_t and the original image I_t into it. In the higher layer (l > 0), the inputs to the Discriminator are the generated error image Â^l_t and the error image A^l_t from the lower layer. We used WGAN in our architecture, which employs the Wasserstein distance instead of the Kullback-Leibler (KL) divergence or the Jensen-Shannon (JS) divergence for stabilizing the training process using gradient descents. The Discriminator architecture is shown in Table I. Considering the GPU's memory space, we set all the Discriminators to five simple convolutional layers connected by four SELU block units as provided in [21].

TABLE I: Discriminator Network Architecture

Block Type | Kernel Size | Strides | Filters | Input
Conv1      | 5 x 5       | 2       | 16      | Image
SELU1      | none        | none    | none    | Conv1
Conv2      | 5 x 5       | 2       | 32      | SELU1
SELU2      | none        | none    | none    | Conv2
Conv3      | 5 x 5       | 2       | 64      | SELU2
SELU3      | none        | none    | none    | Conv3
Conv4      | 5 x 5       | 2       | 128     | SELU3
SELU4      | none        | none    | none    | Conv4
Conv5      | 4 x 4       | 1       | 1       | SELU4
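As a reference, a minimal PyTorch-style sketch of the Discriminator laid out in Table I is given below. The function name build_discriminator, the padding values and the input channel count are assumptions not specified in the table; this is an illustration of the layer stack, not the authors' code.

```python
import torch.nn as nn

def build_discriminator(in_ch):
    """Sketch of Table I: five convolutional layers joined by four SELU units,
    ending in a single-channel critic output (no sigmoid, as in WGAN)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, kernel_size=5, stride=2, padding=2), nn.SELU(),  # Conv1 + SELU1
        nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.SELU(),     # Conv2 + SELU2
        nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.SELU(),     # Conv3 + SELU3
        nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.SELU(),    # Conv4 + SELU4
        nn.Conv2d(128, 1, kernel_size=4, stride=1),                           # Conv5
    )
```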
III. TRAINING PROCEDURE

The SGANVO training follows the improved WGAN training procedure [26]. We jointly compute the losses of the Generators and Discriminators in all the layers and take the weighted sum as the final G loss and D loss. We feed the stereo data sequences of the KITTI dataset into the depth estimator and the monocular data sequences of the KITTI dataset into the ego-motion estimator. The loss functions are defined as below.

TABLE II: Depth estimation results on the KITTI dataset using the split of Eigen et al. [22]. For training data: K = KITTI, CS = Cityscapes [23], M = Monocular and S = Stereo. For testing data, our SGANVO uses the monocular dataset like other works, and our view reconstruction component uses this left image sequence to generate the right-view image sequence, like [24].

Method            | Data size | Dataset | Train | Abs Rel | Sq Rel | RMSE  | RMSE log | d<1.25 | d<1.25^2 | d<1.25^3
Garg et al. [3]   | 620x188   | K       | M     | 0.169   | 1.080  | 5.104 | 0.273    | 0.740  | 0.904    | 0.962
SfMLearner [5]    | 416x128   | K       | M     | 0.208   | 1.768  | 6.856 | 0.283    | 0.678  | 0.885    | 0.957
SfMLearner [5]    | 416x128   | CS+K    | M     | 0.198   | 1.836  | 6.565 | 0.275    | 0.718  | 0.901    | 0.960
GeoNet [6]        | 416x128   | K       | M     | 0.155   | 1.296  | 5.857 | 0.233    | 0.793  | 0.931    | 0.973
GeoNet [6]        | 416x128   | CS+K    | M     | 0.153   | 1.328  | 5.737 | 0.232    | 0.802  | 0.934    | 0.972
Vid2Depth [25]    | 416x128   | K       | M     | 0.163   | 1.240  | 6.220 | 0.250    | 0.762  | 0.916    | 0.968
Vid2Depth [25]    | 416x128   | CS+K    | M     | 0.159   | 1.231  | 5.912 | 0.243    | 0.784  | 0.923    | 0.970
GANVO [13]        | 416x128   | K       | M     | 0.150   | 1.414  | 5.448 | 0.216    | 0.808  | 0.939    | 0.975
UnDeepVO [9]      | 416x128   | K       | S     | 0.183   | 1.730  | 6.570 | 0.268    | -      | -        | -
Godard et al. [4] | 640x192   | K       | S     | 0.129   | 1.112  | 5.180 | 0.205    | 0.851  | 0.952    | 0.978
SGANVO            | 416x128   | K       | S     | 0.065   | 0.673  | 4.003 | 0.136    | 0.944  | 0.979    | 0.991

A. Generator Loss

Our G losses mainly include three parts. The first one comes from the Discriminator and is defined as

$L^D_g = \sum_l \lambda_l \, \mathbb{E}[D(\hat{x}_l)]$   (7)

where lambda_l is the weight at the l-th layer, D(x̂_l) is the output from the Discriminator when its input is the generated image x̂_l in its layer, and E is the mean operation. The second one is the weighted sum of all the error images within a sequence of N consecutive frames:

$L^N_g = \sum_{t=1}^{N} \mu_t \sum_l \frac{\lambda_l}{n_l} \| E^l_t \|_1$   (8)

where mu_t is the temporal weight factor, lambda_l is the layer weight factor, n_l is the total number of ReLU units of the l-th layer, and ||.||_1 is the L1 norm. The last one is the disparity consistency loss, which is used to improve the depth smoothness. Denote d^l_t and d^r_t the left and right disparity maps, respectively. The disparity consistency loss is defined as

$L^d_g = \sum_{t=1}^{N} \sum_{i,j} \| d^l_t(i,j) - d^r_t(i,j) \|_1$   (9)

The final G loss is the weighted sum of the above three parts:

$L^{final}_g = L^D_g + \alpha L^N_g + \beta L^d_g$   (10)

where alpha and beta are the weight parameters.

[Fig. 4: Our proposed SGANVO system estimating the ego-motion on the KITTI odometry benchmark, Sequence 09 (left) and Sequence 10 (right). Axes: x (m) and z (m). Our results are rendered in blue while the ground truth is rendered in red.]

B. Discriminator Loss

The D loss is defined as

$L^D_d = \sum_l \lambda_l \, \operatorname*{arg\,max}_D \, D^d_l$   (11)

where D^d_l is the Discriminator loss in the l-th layer when its inputs are the real image x and the generated image x̂. Afterwards, D^d_l is defined as

$D^d_l = \mathbb{E}[D(\hat{x}_l)] - \mathbb{E}[D(x_l)] + \lambda_D \, \mathbb{E}\big[(\| \nabla D(x'_l) \|_2 - 1)^2\big]$   (12)

where E is the mean operation and nabla is the gradient operation. The innovation of WGAN-GP is in the last term of the above loss function, where x' is the random interpolation sample between the real value and the generated value, defined as

$x' = \epsilon x + (1 - \epsilon)\hat{x}$   (13)

In our implementation, we choose lambda_D = 10, and epsilon is a random number drawn from the uniform distribution between 0 and 1.
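For reference, a minimal PyTorch-style sketch of the per-layer critic loss in Eqs. (12)-(13) follows. The function name wgan_gp_critic_loss and the batch handling are assumptions for illustration; the gradient-penalty form itself is the standard WGAN-GP loss the text describes.

```python
import torch

def wgan_gp_critic_loss(D, real, fake, lambda_d=10.0):
    """Eqs. (12)-(13): E[D(fake)] - E[D(real)] + lambda_D * E[(||grad D(x')||_2 - 1)^2],
    with x' = eps*real + (1-eps)*fake and eps ~ U(0, 1)."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)       # Eq. (13)
    x_interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_interp = D(x_interp)
    grads = torch.autograd.grad(outputs=d_interp, inputs=x_interp,
                                grad_outputs=torch.ones_like(d_interp),
                                create_graph=True)[0]
    grad_penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return D(fake).mean() - D(real).mean() + lambda_d * grad_penalty
```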
TABLE III: Ego-motion estimation results on the KITTI dataset with our proposed SGANVO. We also compare with four learning methods (monocular: ESP-VO [27], SfMLearner [5]; stereo: UnDeepVO [9], Undepthflow [11]) and one geometry-based method, ORB-SLAM [28]. Our SGANVO, SfMLearner and UnDeepVO use images of 416x128; ESP-VO and ORB-SLAM use 1242x376 images; Undepthflow uses 832x256 images. The best results among the learning methods are marked in bold.

Seq | SGANVO        | ESP-VO [27]  | SfMLearner [5] | ORB-SLAM [28] | UnDeepVO [9]  | Undepthflow [11]
    | 416x128       | 1242x376     | 416x128        | 1242x376      | 416x128       | 832x256
    | t_rel / r_rel | t_rel / r_rel| t_rel / r_rel  | t_rel / r_rel | t_rel / r_rel | t_rel / r_rel
03  | 10.56 / 6.30  | 6.72 / 6.46  | 13.08 / 3.79   | 0.67 / 0.18   | 5.00 / 6.17   | - / -
04  | 2.40 / 0.77   | 6.33 / 6.08  | 10.68 / 5.13   | 0.65 / 0.18   | 4.49 / 2.13   | - / -
05  | 3.25 / 1.31   | 3.35 / 4.93  | 16.76 / 4.06   | 3.28 / 0.46   | 3.40 / 1.50   | - / -
06  | 3.99 / 1.46   | 7.24 / 7.29  | 23.53 / 4.80   | 6.14 / 0.17   | 6.20 / 1.98   | - / -
07  | 4.67 / 1.83   | 3.52 / 5.02  | 17.52 / 5.38   | 1.23 / 0.22   | 3.15 / 2.48   | - / -
09  | 4.95 / 2.37   | - / -        | 18.77 / 3.21   | 15.30 / 0.26  | 7.01 / 3.61   | 13.98 / 5.36
10  | 5.89 / 3.56   | 9.77 / 10.2  | 14.36 / 3.98   | 3.68 / 0.48   | 10.63 / 4.65  | 19.67 / 9.13

t_rel: average translational RMSE drift (%) on lengths of 100 m to 800 m. r_rel: average rotational RMSE drift (deg/100 m) on lengths of 100 m to 800 m. Train sequences: 03, 04, 05, 06, 07; test sequences: 09, 10.
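The drift figures in Table III follow the KITTI odometry convention of averaging errors over sub-sequences of 100 m to 800 m. The sketch below is a rough NumPy illustration of that convention, simplified to the translational component only; the function name and the position-only error are assumptions, and it is not the official KITTI devkit, which works on relative SE(3) poses and also reports the rotational drift r_rel.

```python
import numpy as np

def translational_drift_percent(gt_xyz, est_xyz, lengths=range(100, 900, 100)):
    """Average translational drift (%) over sub-sequences of 100 m - 800 m.
    gt_xyz and est_xyz are (N, 3) arrays of aligned camera positions."""
    step = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(step)])   # distance travelled at each frame
    errors = []
    for start in range(len(gt_xyz)):
        for length in lengths:
            ends = np.nonzero(dist - dist[start] >= length)[0]
            if ends.size == 0:
                continue                              # trajectory too short from this frame
            end = ends[0]
            gt_delta = gt_xyz[end] - gt_xyz[start]
            est_delta = est_xyz[end] - est_xyz[start]
            errors.append(np.linalg.norm(est_delta - gt_delta) / length * 100.0)
    return float(np.mean(errors)) if errors else float("nan")
```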