




RGB to TSDF: Direct TSDF Prediction from a Single RGB Image for Dense 3D Reconstruction

Hanjun Kim, Jiyoun Moon, and Beomhee Lee

This work was supported by the Bio-Mimetic Robot Research Center funded by the Defense Acquisition Program Administration and by the Agency for Defense Development (UD160027ID). Hanjun Kim, Jiyoun Moon, and Beomhee Lee are with the Automation and Systems Research Institute (ASRI), Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea. {k3k5good, jiyounmoon, bhlee}@snu.ac.kr

Abstract — In this paper, we present a novel method to predict 3D TSDF voxels from a single image for dense 3D reconstruction. 3D reconstruction with RGB images has two inherent problems: scale ambiguity and sparse reconstruction. With the advent of deep learning, depth prediction from a single RGB image has addressed these problems. However, as the predicted depth is typically noisy, de-noising methods such as TSDF fusion should be adopted for accurate scene reconstruction. To integrate the two-step processing of depth prediction and TSDF generation, we design an RGB-to-TSDF network that directly predicts 3D TSDF voxels from a single RGB image. The TSDF generated by our network is more efficient in terms of time and accuracy than a TSDF converted from depth prediction. We also use the predicted TSDF for more accurate and robust camera pose estimation to complete scene reconstruction. The global TSDF is updated from TSDF prediction and pose estimation, and thus a dense isosurface can be extracted. In the experiments, we evaluate our TSDF prediction and camera pose estimation results against the conventional method.

Fig. 1. TSDF prediction using the RGB-to-TSDF network. Our network takes a single RGB image and predicts TSDF: (a) a single RGB image, (b) a depth image, (c) ground-truth TSDF from the depth image, (d) our TSDF prediction result (colors are painted along the z-axis).

I. INTRODUCTION

3D reconstruction is a vital element for understanding the surroundings in many applications such as robotics, autonomous navigation, and augmented reality (AR). It can be achieved with various depth cameras such as 3D LiDAR and Kinect. However, these sensors do not work well in every environment. For example, 3D LiDAR, an example of a pulsed time-of-flight (ToF) camera, is mostly used in autonomous navigation but is cost-prohibitive and hardly portable. Another type of ToF depth camera, which emits a modulated continuous wave (e.g., Kinect v2), performs poorly outdoors due to the interference of sunlight. A structured-light camera (e.g., Kinect v1) is also difficult to use outdoors since it is sensitive to sunlight and has a short ranging distance.

On the other hand, an RGB camera is small, portable, and low-cost, and is therefore commonly used in many engineering applications. However, it remains a field of active interest because of ill-posed problems such as light sensitivity and scale ambiguity. Structure from motion (SfM) is one of the methods that estimate the camera pose between images and reconstruct sparse 3D feature points using correspondences [1]. Dense reconstruction can be performed with a multi-view stereo (MVS) method [2] after SfM, but MVS is implemented offline. In a similar manner, visual odometry using a monocular camera estimates the camera pose using feature matching or photometric error minimization and reconstructs the 3D scene in real time. However, 3D reconstruction with visual odometry suffers from sparsity and scale ambiguity, which makes it difficult to apply to fields like AR that require scale-invariant dense reconstruction.
To resolve the problems of visual odometry, depth prediction from a single RGB image has been proposed in [3], [4], [5]. Since this approach generates a dense depth image with scale, scale-invariant dense reconstruction becomes achievable. However, in contrast to seeking the modes of a probability in an occupancy grid built from depth images, the object surface can be extracted through zero-crossings using a truncated signed distance field (TSDF). Moreover, the TSDF has a de-noising effect on multiple noisy measurements through a weighted average of all individual TSDFs [6]. With these advantages of the TSDF, we propose a direct TSDF prediction method from a single RGB image, as shown in Fig. 1. Since we predict the TSDF from an RGB image, dense 3D reconstruction can be performed with only a commercial monocular camera, unlike [6]. The proposed method is also more efficient in terms of computation and prediction accuracy than TSDF computation after depth prediction.

In our work, the TSDF is predicted by our RGB-to-TSDF network, which consists of a 2D encoder and a 3D decoder with fully convolutional layers. The predicted TSDF is utilized to align two sequential frames by jointly optimizing an objective function for camera pose estimation. The aligned TSDF is then fused into the global TSDF. Finally, the global TSDF voxels are obtained using a weighted moving average, and the dense isosurface can then be extracted through a zero-crossing method.

To verify the prediction performance of the TSDF, we compare our method with a depth prediction method on the NYU Depth v2 dataset [7]. We also compare our method with a depth-image-based iterative closest point (ICP) algorithm on the Microsoft 7-Scenes dataset [8] to verify the performance of camera pose estimation.

II. RELATED WORKS

A. 3D Geometry Prediction from a Single Image using Deep Learning

Recently, estimating 3D geometric structure from a single image has greatly improved in performance with advances in the field of deep learning. [9] suggests a 3D-VAE-GAN network to reconstruct the complete shape of a 3D object from a single image. [10] also proposes a method to reconstruct a 3D object from a single RGB image by learning the T-network. However, since these methods are trained only on single images of objects, it is difficult to apply them to sequential input images or 3D scene reconstruction. Depth prediction from a single RGB image is one of the ways to reconstruct a 3D scene [11], [12], [3]. As this method can predict a dense depth image from a single RGB image, it can be applied to various fields such as RGB-D SLAM [3] or visual odometry [4], [5]. We, on the other hand, propose a method that predicts 3D TSDF voxels instead. The TSDF enables isosurfaces to be readily extracted and reduces measurement noise through the accumulation of TSDF values. Since we predict TSDF voxels directly, unlike depth prediction, the TSDF information can also be used in camera pose estimation.

B. Visual Odometry and 3D Reconstruction using RGB Images

As an active research area, visual odometry and 3D reconstruction with RGB images have recently been developed in various ways, such as direct methods using photometric error [13] and indirect methods using feature points [14]. Since the indirect method is based on feature descriptors, it is robust to brightness changes, lens attenuation, and the rolling shutter effect,
but it is limited in sparsely textured environments [13]. On the other hand, the direct method can reconstruct the 3D scene more densely than the indirect method. However, it is still difficult to achieve dense reconstruction without textures, and the direct method also depends on the camera parameters and exposure changes [14]. Moreover, neither method can estimate the absolute scale without supplementary methods [4]. To overcome these problems, methods using depth prediction by deep learning have been proposed in [11], [12], [3]. Motivated by these applications of depth prediction, we propose a method that directly predicts the TSDF from a single RGB image to maximize the advantages of the TSDF in dense reconstruction.

III. METHODOLOGY

A. Method Overview

Our purpose is to reconstruct dense 3D scenes using sequential RGB images. To this end, we first predict 3D TSDF voxels from a single RGB image using our RGB-to-TSDF network. For a training target, TSDF data is generated from the filled-in ground-truth depth images and compared with the TSDF prediction results according to the loss. After training the network, the camera pose transformation between two TSDFs is estimated sequentially by minimizing an objective function that uses the TSDF values. The aligned current TSDF is then fused into the global TSDF by a weighted running average, similar to [15], [6]. Since discrete TSDF voxels are fused, unlike [15], [6], the weights are distributed to neighbouring voxel centers by trilinear interpolation. Finally, the isosurface can be extracted via the marching cubes algorithm. The details of our method are explained in the following sections.

B. Training Data Generation

To train the RGB-to-TSDF network, 3D TSDF voxels are generated from the ground-truth depth, which is paired with the RGB image at the same viewpoint. The true signed distance field (SDF) stored in each voxel is defined by the Euclidean distance to the nearest surface. However, since finding the nearest surface from each voxel center requires high computation time, we adopt the projective SDF introduced in [15]. The projective SDF is defined by the distance to the nearest surface along each camera ray. The sign of the SDF is negative when the voxel is located behind the surface (occlusion), and positive otherwise. Since the projective SDF is inherently view-dependent due to the different perspective view from each camera pose, the distance is truncated and normalized to diminish the view dependency, as in [15].

Fig. 2. Two TSDF encoding methods: (a) examples of TSDF encodings (top: projective encoding, bottom: flipped encoding), (b) projective TSDF function, (c) flipped TSDF function.

Fig. 3. RGB-to-TSDF network structure: a Darknet-19 2D encoder followed by a 3D decoder. Taking a single RGB image (H x W) as the input, the network predicts 3D TSDF voxels (D x H x W) in the camera FOV.

The projective TSDF is obtained from a depth image using a pinhole camera model. For a given depth image, the perspective projection function and the back-projection function are defined by

\pi(\mathbf{p}) = \left( \frac{x}{z} f_x + c_x, \ \frac{y}{z} f_y + c_y \right)^T  (1)

\pi^{-1}(\mathbf{x}, D(\mathbf{x})) = D(\mathbf{x}) \left( \frac{u - c_x}{f_x}, \ \frac{v - c_y}{f_y}, \ 1 \right)^T  (2)

where f_x and f_y are the focal lengths and (c_x, c_y) is the principal point of the camera intrinsic matrix K, D(\mathbf{x}) is the depth value at a pixel \mathbf{x} = (u, v)^T, and \mathbf{p} = (x, y, z)^T is a 3D point in the camera coordinate frame. Using (1), the projective TSDF \Psi_{proj}(\mathbf{v}) at a voxel center \mathbf{v} = (v_x, v_y, v_z) is represented as follows:

\Psi_{proj}(\mathbf{v}) = \begin{cases} \Phi(\mathbf{v}) / \tau & \text{if } |\Phi(\mathbf{v})| \le \tau \\ \mathrm{sgn}(\Phi(\mathbf{v})) & \text{else} \end{cases}  (3)

\Phi(\mathbf{v}) = \left( D(\hat{\mathbf{x}}) - v_z \right) \sqrt{1 + (v_x / v_z)^2 + (v_y / v_z)^2}  (4)

\hat{\mathbf{x}} = \lfloor \pi(\mathbf{v}) \rceil  (5)

where \tau is the truncation threshold, \Phi(\mathbf{v}) is the SDF of \mathbf{v}, and \lfloor \cdot \rceil is the nearest-neighbour look-up in the pixel coordinates. If the absolute value of \Phi(\mathbf{v}) is greater than \tau, \Psi_{proj}(\mathbf{v}) is set to 1 or -1 according to the sign function sgn of \Phi(\mathbf{v}), as in (3).
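The target generation in (1)-(5) can be sketched directly in NumPy. This is a minimal illustration rather than the paper's implementation: the grid layout, the handling of invalid depth pixels, and the assumption that the whole grid lies in front of the camera (v_z > 0) are ours.

```python
import numpy as np

def projective_tsdf(depth, fx, fy, cx, cy, origin, voxel_size, dims, tau=0.3):
    """depth: (H, W) metric depth image; origin: camera-frame corner of the grid
    (with origin[2] > 0 so every voxel lies in front of the camera);
    dims: number of voxels along (z, y, x); tau: truncation threshold."""
    H, W = depth.shape
    zs, ys, xs = np.meshgrid(np.arange(dims[0]), np.arange(dims[1]),
                             np.arange(dims[2]), indexing="ij")
    # Voxel centers v = (v_x, v_y, v_z) in the camera frame.
    vx = origin[0] + (xs + 0.5) * voxel_size
    vy = origin[1] + (ys + 0.5) * voxel_size
    vz = origin[2] + (zs + 0.5) * voxel_size

    # (1) perspective projection of each voxel center, then (5) nearest-pixel look-up.
    u = np.round(vx / vz * fx + cx).astype(int)
    v = np.round(vy / vz * fy + cy).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    d = np.zeros(dims, dtype=np.float32)
    d[valid] = depth[v[valid], u[valid]]

    # (4) signed distance along the camera ray, positive in front of the surface.
    phi = (d - vz) * np.sqrt(1.0 + (vx / vz) ** 2 + (vy / vz) ** 2)

    # (3) truncate by tau and normalize; saturate to +/-1 outside the interval.
    psi = np.clip(phi / tau, -1.0, 1.0)
    psi[~valid | (d == 0)] = 1.0  # unobserved voxels treated as free space (our assumption)
    return psi
```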
On the other hand, another TSDF encoding method is suggested by Song et al. in [16]: the flipped TSDF encoding \Psi_{flip}(\mathbf{v}), defined by replacing (3) with

\Psi_{flip}(\mathbf{v}) = \mathrm{sgn}(\Phi(\mathbf{v})) \left( 1 - |\Phi(\mathbf{v})| / \tau \right) \quad \text{if } |\Phi(\mathbf{v})| \le \tau  (6)

In our experiments, we train the network using three loss functions for TSDF regression: L1, L2, and the Huber loss H, defined piecewise around a threshold c (7). The Huber loss is equal to L2 when the error is smaller than c and equal to L1 when the error is larger than c, where y is a target value, \hat{y} is a prediction value, and c = \frac{1}{5} \max_i |y_i - \hat{y}_i| is a batch-dependent parameter, i.e., 20% of the maximum absolute error in the current batch.

To deal with the sparsity of the TSDF data and to focus on the voxels within the truncation interval, we balance the data distribution between the inside of the truncation interval and the outside (the free and occlusion regions). We sample training data in the free and occlusion regions with a Bernoulli probability p = \gamma \, n_{in} / n_{out} in an online manner, where \gamma is the sampling ratio, n_{in} is the number of voxels inside the truncation interval, and n_{out} is the number of voxels outside it. Consequently, the voxels in the free and occlusion regions are sampled around the expected number \gamma \, n_{in} in each TSDF. \gamma is set to 2 in our case as a result of parameter tuning. The loss is defined over all the voxels inside the interval and the sampled voxels outside the interval; voxels outside the FOV are ignored. The total loss is expressed as follows:

L_{total}(y, \hat{y}) = L_{in}(y^{in}, \hat{y}^{in}) + L_{out}(y^{out}, \hat{y}^{out})  (8)

where L_{in} is the loss for the inside of the truncation interval and L_{out} is the loss for the outside. The superscript of a TSDF value represents the mask indicating whether the voxel is inside or outside the truncation interval; the mask is determined by the ground-truth TSDF.

D. TSDF-Supported ICP and 3D Reconstruction

1) TSDF-Supported Optimization: To reconstruct a 3D scene with the predicted TSDFs, we need to find the best alignment between consecutive TSDFs (a target TSDF and a source TSDF) inside the truncation interval. The point set P of the source TSDF and the point set Q of the target TSDF are composed of voxel centers, with \mathbf{p} \in P and \mathbf{q} \in Q, respectively. P can be aligned to Q by iteratively minimizing an objective function. Since the TSDF provides not only geometric information but also a TSDF value at each voxel center, we can utilize both pieces of information to define the objective function, similar to [22]. The joint objective function using both geometric and TSDF information is defined by

E(\mathbf{T}) = (1 - \lambda) \, E_{tsdf}(\mathbf{T}) + \lambda \, E_{geo}(\mathbf{T})  (9)

where \mathbf{T} \in SE(3) is the camera pose transformation aligning the two TSDFs and \lambda \in [0, 1] is a weight parameter balancing the two objective terms. Given the set C of all pairs of corresponding points (\mathbf{p}, \mathbf{q}), the two objective terms are defined as follows:

E_{geo}(\mathbf{T}) = \sum_{(\mathbf{p}, \mathbf{q}) \in C} \left( (\mathbf{q} - \mathbf{T}\mathbf{p}) \cdot \mathbf{n}_q \right)^2  (10)

E_{tsdf}(\mathbf{T}) = \sum_{(\mathbf{p}, \mathbf{q}) \in C} \left( \Psi_q(f_q(\mathbf{T}\mathbf{p})) - \Psi(\mathbf{p}) \right)^2  (11)

where \Psi_q is a continuous TSDF along the tangent plane at \mathbf{q} with normal \mathbf{n}_q, and f_q is the function that orthogonally projects the transformed source point \mathbf{T}\mathbf{p} onto the tangent plane at \mathbf{q}. In each iteration, the correspondence pairs in C are recreated by searching for the nearest target point \mathbf{q} of the transformed source point \mathbf{T}'\mathbf{p} under the current transformation \mathbf{T}'. (10) is equivalent to the objective of the point-to-plane ICP algorithm [23], and (11) is a modified version of the photometric objective introduced in [24]. The joint objective function (9) is minimized in each optimization iteration by the Gauss-Newton method. The estimated transformation \mathbf{T} is then obtained, and the two sequential TSDF voxel grids are aligned into the same coordinate frame accordingly. Since the TSDF provides the additional objective term for matching TSDFs, it can lead to a more accurate estimation than [23] or [25].
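As a rough illustration of how (9)-(11) can be evaluated, the sketch below stacks the point-to-plane and TSDF residuals for a given set of correspondences and hands them to a generic least-squares solver instead of the paper's Gauss-Newton loop. The pose parameterization, the interpolator psi_tgt_plane (standing in for the continuous tangent-plane TSDF \Psi_q), and all variable names are our assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def joint_residuals(xi, P, Q, N_q, psi_src, psi_tgt_plane, lam=0.5):
    """xi: 6-vector (rotation vector, translation). P, Q: (M, 3) corresponding
    source/target points. N_q: (M, 3) target normals. psi_src: (M,) TSDF values
    of the source points. psi_tgt_plane(indices, pts) -> (M,): assumed
    interpolator of the target TSDF on each correspondence's tangent plane."""
    R = Rotation.from_rotvec(xi[:3]).as_matrix()
    Tp = P @ R.T + xi[3:]

    # (10) point-to-plane geometric residuals.
    r_geo = np.einsum("ij,ij->i", Q - Tp, N_q)

    # (11) TSDF residuals: project Tp onto each tangent plane and compare TSDF values.
    proj = Tp - np.einsum("ij,ij->i", Tp - Q, N_q)[:, None] * N_q
    r_tsdf = psi_tgt_plane(np.arange(len(Q)), proj) - psi_src

    # (9) combine the two blocks; the sum of squares equals lam*E_geo + (1-lam)*E_tsdf.
    return np.concatenate([np.sqrt(lam) * r_geo, np.sqrt(1.0 - lam) * r_tsdf])

# One alignment step with fixed correspondences (re-found at every ICP iteration):
# sol = least_squares(joint_residuals, np.zeros(6),
#                     args=(P, Q, N_q, psi_src, psi_tgt_plane))
```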
2) TSDF Fusion via Trilinear Interpolation: Once the current source TSDF is aligned to the target TSDF, the source TSDF can be represented in the global coordinate frame using the accumulated transformation. The new TSDF is therefore fused into the global TSDF, which stores a weighted running average of TSDFs, similar to [6]. However, instead of fusing the new TSDF value at the voxel center as in [6], we distribute the TSDF weights over the 3D spatial voxels, because the aligned points are not located at voxel centers. For each point \mathbf{p}_j in the new TSDF, trilinear interpolation distributes the weight to the 8 neighbouring voxels \mathbf{v}_i \in N_{\mathbf{p}_j}. Consequently, for each voxel \mathbf{v}_i in the global TSDF, the distributed TSDF values and weights of all neighbouring points are summed for the current time t as follows:

\psi_t(\mathbf{v}_i) = \sum_j \psi_t(\mathbf{p}_j) (1 - d^{j,i}_x)(1 - d^{j,i}_y)(1 - d^{j,i}_z)  (12)

w_t(\mathbf{v}_i) = \sum_j (1 - d^{j,i}_x)(1 - d^{j,i}_y)(1 - d^{j,i}_z)  (13)

In (12) and (13), the TSDF value and weight contributed to each voxel \mathbf{v}_i by the point \mathbf{p}_j are defined by the normalized distances d^{j,i}_x, d^{j,i}_y, and d^{j,i}_z between \mathbf{v}_i and \mathbf{p}_j along the x-, y-, and z-axes. After summing the distributions for time t, the TSDF value and weight of each voxel are finally obtained by a weighted running average as follows:

\Phi_t(\mathbf{v}) = \frac{W_{t-1}(\mathbf{v}) \, \Phi_{t-1}(\mathbf{v}) + w_t(\mathbf{v}) \, \psi_t(\mathbf{v})}{W_{t-1}(\mathbf{v}) + w_t(\mathbf{v})}  (14)

W_t(\mathbf{v}) = W_{t-1}(\mathbf{v}) + w_t(\mathbf{v})  (15)

where \Phi(\mathbf{v}) and W(\mathbf{v}) are the TSDF value and weight at the voxel center \mathbf{v} in the global TSDF. After the TSDF fusion, the 3D object surface can be reconstructed in mesh or point-cloud form using the zero-crossings, where the sign of the TSDF value changes. The zero-crossing extraction is mostly implemented via the marching cubes [26] or raycasting [27] algorithm, and we utilize the marching cubes algorithm. For parallel computation, the trilinear interpolation and per-voxel summation are accelerated on the GPU.
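A compact sketch of this fusion step is given below. It accumulates the trilinear sums of (12)-(13) and folds them directly into the running average of (14)-(15); the grid layout and all names are illustrative assumptions, and no GPU acceleration is shown.

```python
import numpy as np

def fuse_tsdf(global_tsdf, global_w, points, values, origin, voxel_size):
    """points: (N, 3) aligned points of the new TSDF in the global frame;
    values: (N,) their predicted TSDF values; global_tsdf/global_w: 3D grids."""
    num_tsdf = np.zeros_like(global_tsdf)  # accumulates sum_j psi_t(p_j) * w_j, as in (12)
    num_w = np.zeros_like(global_w)        # accumulates sum_j w_j, as in (13)

    g = (points - origin) / voxel_size - 0.5   # continuous voxel-center coordinates
    base = np.floor(g).astype(int)
    frac = g - base                             # normalized distances d_x, d_y, d_z

    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                idx = base + np.array([dx, dy, dz])
                # Trilinear weight of this corner voxel for every point.
                w = (np.abs(1 - dx - frac[:, 0]) *
                     np.abs(1 - dy - frac[:, 1]) *
                     np.abs(1 - dz - frac[:, 2]))
                inside = np.all((idx >= 0) & (idx < global_tsdf.shape), axis=1)
                i, j, k = idx[inside].T
                np.add.at(num_tsdf, (i, j, k), values[inside] * w[inside])
                np.add.at(num_w, (i, j, k), w[inside])

    # (14)-(15): weighted running average against the accumulated global weight.
    upd = num_w > 0
    global_tsdf[upd] = (global_w[upd] * global_tsdf[upd] + num_tsdf[upd]) \
                       / (global_w[upd] + num_w[upd])
    global_w[upd] += num_w[upd]
    return global_tsdf, global_w
```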
IV. EXPERIMENTAL RESULTS

We train our network on a single NVIDIA GeForce GTX 1080 Ti with 11 GB of GPU memory. We use a batch size of 32 for training and the Adam optimizer [28] with momentum parameters \beta_1 = 0.9 and \beta_2 = 0.999. The learning rate is 10^{-4} for 50 epochs and is reduced to 10^{-5} for 25 epochs. We use the NYU Depth v2 dataset [7] to train and test our network. The NYU Depth v2 dataset consists of 464 scenes captured with a Microsoft Kinect RGB-D camera. We use the official train/test split consisting of 249 training scenes and 215 test scenes. For training, a total of 48K RGB and depth image pairs from the training scenes are used and augmented on the fly as in [5]. To evaluate the TSDF prediction results against the ground truth, we use the official 654 images from the test scenes. Each depth image is rectified to its RGB image and filled in with a cross-bilateral filter. The Microsoft 7-Scenes dataset [8] is also used, as an unseen stream of RGB images, to evaluate the performance of the TSDF-supported ICP and 3D reconstruction.

A. 3D TSDF Prediction Evaluation

In this section, we evaluate our RGB-to-TSDF network architecture and compare the TSDF prediction results with a depth prediction method. To measure TSDF prediction accuracy, we adopt four metrics: the squared relative difference Sqrel = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2 / |y_i|, the root mean squared error RMSE = \sqrt{\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2}, the absolute relative difference Absrel = \frac{1}{n} \sum_i |y_i - \hat{y}_i| / |y_i|, and the mean absolute error MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|. Errors are measured only inside the truncation interval of the target TSDF. The target TSDFs are computed from the filled-in ground-truth depth images.

TABLE II. COMPARISON OF TSDF PREDICTION RESULTS

Encode      Loss    Sqrel    RMSE    Absrel   MAE     Run time
proj        L2      6.418    0.659   8.940    0.526   0.023 s
proj        L1      7.362    0.696   9.547    0.556
proj        Huber   6.577    0.668   9.148    0.531
flip        L2      0.647    0.584   1.363    0.477
[3]+TSDF    proj    10.060   0.756   11.514   0.608   0.035 s
[3]+TSDF    flip    1.412    0.701   1.983    0.555
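The four error metrics reported in Table II can be evaluated per TSDF with a short routine such as the one below; the absolute value in the relative-error denominators and the epsilon guard against near-zero target values are our assumptions.

```python
import numpy as np

def tsdf_metrics(y, y_hat, mask, eps=1e-6):
    """y: ground-truth TSDF, y_hat: predicted TSDF, mask: voxels inside the
    truncation interval of the target TSDF (all arrays of the same shape)."""
    y, y_hat = y[mask], y_hat[mask]
    diff = y - y_hat
    denom = np.maximum(np.abs(y), eps)  # guard voxels whose target value is ~0
    return {
        "Sqrel":  float(np.mean(diff ** 2 / denom)),
        "RMSE":   float(np.sqrt(np.mean(diff ** 2))),
        "Absrel": float(np.mean(np.abs(diff) / denom)),
        "MAE":    float(np.mean(np.abs(diff))),
    }
```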