




RGB-to-TSDF: Direct TSDF Prediction from a Single RGB Image for Dense 3D Reconstruction

Hanjun Kim, Jiyoun Moon, and Beomhee Lee

*This work was supported by the Bio-Mimetic Robot Research Center funded by Defense Acquisition Program Administration and by Agency for Defense Development (UD160027ID). Hanjun Kim, Jiyoun Moon, and Beomhee Lee are with the Automation and Systems Research Institute (ASRI), Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea. {k3k5good, jiyounmoon, bhlee}@snu.ac.kr

Abstract— In this paper, we present a novel method to predict 3D TSDF voxels from a single image for dense 3D reconstruction. 3D reconstruction with RGB images has two inherent problems: scale ambiguity and sparse reconstruction. With the advent of deep learning, depth prediction from a single RGB image has addressed these problems. However, as the predicted depth is typically noisy, de-noising methods such as TSDF fusion should be adopted for accurate scene reconstruction. To integrate the two-step processing of depth prediction and TSDF generation, we design an RGB-to-TSDF network to directly predict 3D TSDF voxels from a single RGB image. The TSDF generated by our network is more efficient in terms of time and accuracy than the TSDF converted from depth prediction. We also use the predicted TSDF for a more accurate and robust camera pose estimation to complete scene reconstruction. The global TSDF is updated from TSDF prediction and pose estimation, and thus a dense isosurface can be extracted. In the experiments, we evaluate our TSDF prediction and camera pose estimation results against the conventional method.

Fig. 1. TSDF prediction using the RGB-to-TSDF network. Our network takes a single RGB image and predicts TSDF. (a) A single RGB image. (b) A depth image. (c) Ground truth TSDF from the depth image. (d) Our TSDF prediction result (colors are painted along the z axis).

I. INTRODUCTION

3D reconstruction is a vital element for understanding the surroundings in many applications, such as robotics, autonomous navigation, and augmented reality (AR). It can be achieved with various depth sensors, such as 3D LiDAR and Kinect. However, these sensors do not work reliably in every environment. For example, 3D LiDAR, which is an example of a pulsed time-of-flight (ToF) sensor, is mostly used in autonomous navigation, but it is cost-prohibitive and hardly portable. Another type of ToF depth camera, which emits a modulated continuous wave (e.g., Kinect v2), performs poorly in outdoor environments due to the interference of sunlight. A structured-light camera (e.g., Kinect v1) is also difficult to use outdoors since it is sunlight-sensitive and has a short ranging distance. On the other hand, an RGB camera is small, portable, and low-cost; therefore, it is commonly used in many engineering applications. However, 3D reconstruction with an RGB camera remains an active field of interest because of ill-posed problems such as light sensitivity and scale ambiguity. Structure-from-motion (SfM) is one of the methods that estimate the camera pose between images and reconstruct sparse 3D feature points using correspondences [1]. Dense reconstruction can be performed with a multi-view stereo (MVS) method [2] after SfM, but MVS is implemented offline. In a similar manner, visual odometry using a monocular camera estimates the camera pose using feature matching or photometric error minimization, and reconstructs the 3D scene in real time.
However, 3D reconstruction with visual odometry has problems such as sparsity and scale ambiguity, so it is difficult to apply to fields like AR that require scale-invariant dense reconstruction.

To resolve these problems of visual odometry, depth prediction from a single RGB image has been proposed in [3], [4], [5]. Since this method generates a dense depth image with a scale, scale-invariant dense reconstruction becomes achievable. However, in contrast to seeking the modes of a probability in an occupancy grid built from depth images, the object surface can be extracted through zero-crossing using the truncated signed distance field (TSDF). Moreover, the TSDF has a de-noising effect on multiple noisy measurements through a weighted average of all individual TSDFs [6].

With these advantages of TSDF, we propose a direct TSDF prediction method from a single RGB image, as shown in Fig. 1. Since we predict TSDF from an RGB image, dense 3D reconstruction can be performed with only a commercial monocular camera, unlike [6]. The proposed method is also more efficient in terms of computation and prediction accuracy than TSDF computation after depth prediction. In our work, TSDF is predicted through our RGB-to-TSDF network consisting of a 2D encoder and a 3D decoder with fully convolutional layers. The predicted TSDF is utilized in aligning two sequential frames by jointly optimizing the objective function for camera pose estimation. The aligned TSDF is then fused into the global TSDF. Finally, the global TSDF voxels are obtained using a weighted moving average, and the dense isosurface can then be extracted through a zero-crossing method.

To verify the prediction performance of TSDF, we compared our method with a depth prediction method on the NYU-Depth-v2 dataset [7]. We also compared our method with a depth-image-based iterative closest point (ICP) algorithm on the Microsoft 7-Scenes dataset [8] to verify the performance of camera pose estimation.

II. RELATED WORKS

A. 3D Geometry Prediction from a Single Image using Deep Learning

Recently, estimating 3D geometric structure from a single image has greatly improved in performance with advances in the field of deep learning. [9] suggests a 3D-VAE-GAN network to reconstruct the complete shape of a 3D object from a single image. [10] also proposes a method to reconstruct a 3D object from a single RGB image by learning the T-network. However, since these methods are trained only on a single image of an object, it is difficult to apply them to sequential input images or 3D scene reconstruction.

Depth prediction from a single RGB image is one of the ways to reconstruct a 3D scene [11], [12], [3]. As this method can predict a dense depth image from a single RGB image, it can be applied to various fields such as RGB-D SLAM [3] or visual odometry [4], [5]. On the other hand, we propose a method that predicts 3D TSDF voxels instead. The TSDF enables isosurfaces to be readily extractable and reduces the noise of measurements by the accumulation of TSDF values. Since we predict TSDF voxels directly, unlike depth prediction, TSDF information can also be used in camera pose estimation.
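To make these two properties concrete, here is a minimal 1D illustration that is not taken from the paper: several noisy TSDFs along a single camera ray are averaged, and the surface is recovered at the zero crossing of the averaged field. The voxel spacing, the noise model, and the use of a plain unweighted mean in place of the paper's weighted average are arbitrary assumptions for this sketch.

```python
import numpy as np

# Minimal 1D illustration (not from the paper): averaging noisy TSDFs
# de-noises the field, and the surface is recovered at the zero crossing.
rng = np.random.default_rng(0)
z = np.linspace(0.0, 2.0, 41)          # voxel centers along one camera ray (meters)
surface, tau = 1.0, 0.3                # true surface depth and truncation threshold

# Several noisy depth measurements of the same surface, each encoded as a TSDF.
tsdfs = [np.clip((surface + rng.normal(0.0, 0.05) - z) / tau, -1.0, 1.0)
         for _ in range(10)]
fused = np.mean(tsdfs, axis=0)         # plain average stands in for the weighted average

# Zero crossing: sign change between adjacent voxels, refined by linear interpolation.
i = np.where(np.diff(np.sign(fused)) < 0)[0][0]
z_hat = z[i] + fused[i] * (z[i + 1] - z[i]) / (fused[i] - fused[i + 1])
print(f"estimated surface depth: {z_hat:.3f} m (true: {surface:.3f} m)")
```

With independent noise, averaging the individual TSDFs shrinks the surface error roughly with the square root of the number of measurements, and the linear interpolation at the sign change localizes the surface at sub-voxel accuracy.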
B. Visual Odometry and 3D Reconstruction using RGB Images

As an active recent research area, visual odometry or 3D reconstruction with RGB images has been developed in various ways, such as direct methods using photometric error [13] and indirect methods using feature points [14]. Since the indirect method is based on feature descriptors, it is robust to brightness changes, lens attenuation, and the rolling shutter effect, but it is limited in sparsely textured environments [13]. On the other hand, the direct method can reconstruct the 3D scene more densely than the indirect method. However, it is still difficult to perform dense reconstruction without texture, and the direct method also depends on the camera parameters and exposure changes [14]. Moreover, it is impossible for both methods to estimate the absolute scale without supplementary methods [4].

To overcome these problems, methods using depth prediction by deep learning have been proposed in [11], [12], [3]. Motivated by these applications of depth prediction, we propose a method that directly predicts TSDF from a single RGB image to maximize the advantages of TSDF in dense reconstruction.

III. METHODOLOGY

A. Method Overview

Our purpose is to reconstruct dense 3D scenes using sequential RGB images. To this end, we first predict 3D TSDF voxels from a single RGB image using our RGB-to-TSDF network. For training, target TSDF data is generated from the filled-in ground truth depth images and compared with the TSDF prediction results according to the loss. After training our network, the camera pose transformation between two TSDFs is estimated sequentially by minimizing an objective function that uses the TSDF values. The aligned current TSDF is then fused into the global TSDF by a weighted running average, similar to [15], [6]. Since discrete TSDF voxels are fused, unlike [15], [6], the weights are distributed to neighbouring voxel centers by trilinear interpolation. Finally, the isosurface can be extracted via the marching cubes algorithm. The details of our method are explained in the following sections.
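The overall pipeline described above can be summarized as the following sketch. The helper functions predict_tsdf, estimate_pose, fuse_tsdf, and marching_cubes are hypothetical placeholders for the components detailed in the remaining subsections; they are not an API from the paper, and the pose-composition convention is an assumption.

```python
import numpy as np

def reconstruct(rgb_frames, predict_tsdf, estimate_pose, fuse_tsdf, marching_cubes):
    """Run the per-frame loop of Section III-A with injected components."""
    global_tsdf = None          # global TSDF volume (values and weights)
    prev_tsdf = None            # previous frame's predicted TSDF (alignment target)
    pose = np.eye(4)            # accumulated camera pose in SE(3)

    for rgb in rgb_frames:
        cur_tsdf = predict_tsdf(rgb)            # RGB-to-TSDF network inference
        if prev_tsdf is not None:
            # TSDF-supported ICP: minimize the joint geometric + TSDF objective
            # to obtain the relative transformation between consecutive frames.
            rel_T = estimate_pose(source=cur_tsdf, target=prev_tsdf)
            pose = pose @ rel_T                 # accumulate the transformation
        # Fuse the aligned TSDF into the global volume by a weighted
        # running average with trilinear weight distribution.
        global_tsdf = fuse_tsdf(global_tsdf, cur_tsdf, pose)
        prev_tsdf = cur_tsdf

    # Extract the dense isosurface at the zero crossing of the global TSDF.
    return marching_cubes(global_tsdf)
```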
B. Training Data Generation

To train the RGB-to-TSDF network, 3D TSDF voxels are generated from the ground truth depth, which is paired with the RGB image at the same viewpoint. The true signed distance field (SDF) stored in each voxel is defined by the Euclidean distance from the nearest surface. However, since finding the nearest surface from each voxel center requires high computation time, we adopt the projective SDF introduced in [15]. The projective SDF is defined by the distance from the nearest surface along each camera ray. The sign of the SDF is negative when the voxel is located behind the surface (occlusion) and positive otherwise. Since the projective SDF is inherently view-dependent due to the different perspective from each camera pose, the distance is truncated and normalized to diminish the view dependency as in [15].

The projective TSDF is obtained from a depth image using a pinhole camera model. For a given depth image, the perspective projection function and the back-projection function are defined by

\pi(\mathbf{p}) = \left( \tfrac{x}{z} f_x + c_x,\ \tfrac{y}{z} f_y + c_y \right)^T,   (1)

\pi^{-1}(\mathbf{x}, D(\mathbf{x})) = D(\mathbf{x}) \left( \tfrac{u - c_x}{f_x},\ \tfrac{v - c_y}{f_y},\ 1 \right)^T,   (2)

where f_x and f_y are the focal lengths and (c_x, c_y) is the principal point of the camera intrinsic matrix K. D(\mathbf{x}) is the depth value at a pixel \mathbf{x} := (u, v)^T, and \mathbf{p} := (x, y, z)^T is a 3D point in the camera coordinate frame. Using (1), the projective TSDF \Phi_{proj}(\mathbf{v}) at a voxel center \mathbf{v} := (v_x, v_y, v_z) is represented as follows:

\Phi_{proj}(\mathbf{v}) = \begin{cases} \Phi(\mathbf{v})/\tau & \text{if } |\Phi(\mathbf{v})| \le \tau \\ \mathrm{sgn}(\Phi(\mathbf{v})) & \text{else} \end{cases},   (3)

\Phi(\mathbf{v}) = \left( D(\mathbf{x}) - v_z \right) \sqrt{1 + (v_x/v_z)^2 + (v_y/v_z)^2},   (4)

\mathbf{x} = \lfloor \pi(\mathbf{v}) \rceil,   (5)

where \tau is a truncation threshold, \Phi(\mathbf{v}) is the SDF of \mathbf{v}, and \lfloor\cdot\rceil is the nearest-neighbour lookup in the pixel coordinates. If the absolute value of \Phi(\mathbf{v}) is greater than \tau, \Phi_{proj}(\mathbf{v}) is set to 1 or -1 according to the sign function \mathrm{sgn}(\cdot) of \Phi(\mathbf{v}) in (3).

Another TSDF encoding method, the flipped TSDF encoding, is suggested by Song et al. in [16]. \Phi_{flip}(\mathbf{v}) is defined as follows by replacing (3):

\Phi_{flip}(\mathbf{v}) = \begin{cases} \mathrm{sgn}(\Phi(\mathbf{v}))\,(1 - |\Phi(\mathbf{v})|/\tau) & \text{if } |\Phi(\mathbf{v})| \le \tau \\ 0 & \text{else} \end{cases}.   (6)

Fig. 2. Two TSDF encoding methods. (a) Example of TSDF encodings (top: projective encoding, bottom: flipped encoding). (b) Projective TSDF function. (c) Flipped TSDF function.
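As a concrete reference for (1) and (3)-(6), the sketch below computes projective and flipped TSDF values at given voxel centers from a single depth image under a pinhole model. The function name, the NaN convention for voxels outside the FOV, and the loop-based implementation are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def tsdf_from_depth(depth, K, voxel_centers, tau):
    """Projective and flipped TSDF (Eqs. (1), (3)-(6)) at given voxel centers.

    depth          : (H, W) ground-truth depth image D, in meters
    K              : (3, 3) camera intrinsic matrix
    voxel_centers  : (N, 3) voxel centers v = (v_x, v_y, v_z) in camera coordinates
    tau            : truncation threshold
    Returns NaN for voxels outside the camera FOV (these are ignored in the loss).
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth.shape
    n = len(voxel_centers)

    tsdf_proj = np.full(n, np.nan)
    tsdf_flip = np.full(n, np.nan)

    for i, (vx, vy, vz) in enumerate(voxel_centers):
        if vz <= 0.0:
            continue                                   # behind the camera
        # Eq. (1) + Eq. (5): project the voxel center and take the nearest pixel.
        u = int(round(vx / vz * fx + cx))
        v = int(round(vy / vz * fy + cy))
        if not (0 <= u < w and 0 <= v < h):
            continue                                   # outside the image / FOV
        # Eq. (4): signed distance from the surface along the camera ray.
        phi = (depth[v, u] - vz) * np.sqrt(1.0 + (vx / vz) ** 2 + (vy / vz) ** 2)
        # Eq. (3): projective TSDF, normalized by tau and saturated to +/-1.
        tsdf_proj[i] = np.clip(phi / tau, -1.0, 1.0)
        # Eq. (6): flipped TSDF, +/-1 at the surface, 0 beyond the truncation interval.
        tsdf_flip[i] = np.sign(phi) * (1.0 - abs(phi) / tau) if abs(phi) <= tau else 0.0
    return tsdf_proj, tsdf_flip
```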
C. Network Architecture and Loss

Fig. 3. RGB-to-TSDF network structure. Taking a single RGB image as the input, the network predicts 3D TSDF voxels in the camera FOV.

As shown in Fig. 3, the RGB-to-TSDF network consists of a Darknet-19 based 2D encoder followed by a fully convolutional 3D decoder that outputs the TSDF voxels within the camera FOV. The Huber loss H for TSDF regression is defined as

H(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le c \\ c\,\left( |y - \hat{y}| - \tfrac{1}{2}c \right) & \text{if } |y - \hat{y}| > c \end{cases},   (7)

where y is a target value, \hat{y} is a prediction value, and c is a batch-dependent parameter with c = \tfrac{1}{5}\max_i |y_i - \hat{y}_i|, that is, 20% of the maximum absolute error in the current batch. The Huber loss is equal to L2 when the error is smaller than c and equal to L1 when the error is larger than c. In our experiments, we train the network using three loss functions (L1, L2, and H) for TSDF regression.

To deal with the sparsity of TSDF data and to focus on the voxels within the truncation interval, we balance the data distribution between the inside of the truncation interval and the outside (free and occlusion regions). We sample training data in the free and occlusion regions with a Bernoulli probability p = \eta\, n_{in}/n_{out} in an online manner, where \eta is the sampling ratio, n_{in} is the number of voxels inside the truncation interval, and n_{out} is the number of voxels outside the truncation interval. Consequently, the voxels in the free and occlusion regions are sampled around the expected number \eta\, n_{in} in each TSDF. \eta is set to 2 in our case as a result of parameter tuning. The loss is defined over all the voxels inside the interval and the sampled voxels outside the interval, and voxels outside the FOV are ignored. The total loss is expressed as follows:

L_{total}(y, \hat{y}) = L_{in}(y^-, \hat{y}^-) + L_{out}(y^+, \hat{y}^+),   (8)

where L_{in} is the loss for the inside of the truncation interval and L_{out} is the loss for the outside of the truncation interval. The superscript of a TSDF value represents the mask indicating whether the voxel is inside or outside the truncation interval. The mask is determined by the ground truth TSDF.

D. TSDF Supported ICP and 3D Reconstruction

1) TSDF Supported Optimization: To reconstruct a 3D scene with the predicted TSDFs, we need to find the best alignment between consecutive TSDFs: the target TSDF and the source TSDF inside the truncation interval. The point set P of the source TSDF and the point set Q of the target TSDF are composed of voxel centers, with \mathbf{p} \in P and \mathbf{q} \in Q respectively. P can be aligned to Q by iteratively minimizing an objective function. Since a TSDF carries not only geometric information but also the TSDF value at each voxel center, we can utilize both pieces of information to define the objective function, similar to [22]. The joint objective function using both geometric information and TSDF information is defined by

E(\mathbf{T}) = (1 - \alpha)\, E_{tsdf}(\mathbf{T}) + \alpha\, E_{geo}(\mathbf{T}),   (9)

where \mathbf{T} \in SE(3) is the camera pose transformation for the alignment between two TSDFs and \alpha \in [0, 1] is a weight parameter balancing the two objective functions. Given all pairs of the corresponding point set (\mathbf{p}, \mathbf{q}) \in C, the two objective functions are defined as follows:

E_{geo}(\mathbf{T}) = \sum_{(\mathbf{p},\mathbf{q}) \in C} \left( (\mathbf{q} - \mathbf{T}\mathbf{p}) \cdot \mathbf{n}_q \right)^2,   (10)

E_{tsdf}(\mathbf{T}) = \sum_{(\mathbf{p},\mathbf{q}) \in C} \left( \Phi_q(f_q(\mathbf{T}\mathbf{p})) - \Phi(\mathbf{p}) \right)^2,   (11)

where \Phi_q is a continuous TSDF along the tangent plane at \mathbf{q} with normal \mathbf{n}_q, and f_q is the function that orthogonally projects the transformed source point \mathbf{T}\mathbf{p} onto the tangent plane at \mathbf{q}. In each iteration, the correspondence pairs in C are recreated by searching for the nearest target point \mathbf{q} of the transformed source point \mathbf{T}'\mathbf{p} under the current transformation \mathbf{T}'. (10) is equivalent to the objective of the point-to-plane ICP algorithm [23], and (11) is a modified version of the photometric objective introduced in [24]. The joint objective function (9) is minimized in each optimization iteration by the Gauss-Newton method. The estimated transformation \mathbf{T} is then obtained, and the two sequential TSDF voxel grids are aligned into the same coordinate frame accordingly. Since the TSDF provides an additional objective term to match TSDFs, it can lead to a more accurate estimation than [23] or [25].

2) TSDF Fusion via Trilinear Interpolation: Once the current source TSDF is aligned to the target TSDF, the source TSDF can be represented in the global coordinate frame using the accumulated transformation. Therefore, the new TSDF is fused into the global TSDF to store a weighted running average of TSDFs, similar to [6]. However, instead of fusing the new TSDF value at the voxel center like [6], we distribute TSDF weights to the 3D spatial voxels because the aligned points are not located at the voxel centers. For each point \mathbf{p}_j in the new TSDF, trilinear interpolation distributes the weight to the 8 neighboring voxels \mathbf{v}_i \in N_{\mathbf{p}_j}. Consequently, for each voxel \mathbf{v}_i in the global TSDF, the distributed TSDF values and weights of all neighbouring points are summed for the current time t as follows:

\Phi_t(\mathbf{v}_i) = \sum_j \Phi_t(\mathbf{p}_j)\,(1 - d^j_{i,x})(1 - d^j_{i,y})(1 - d^j_{i,z}),   (12)

w_t(\mathbf{v}_i) = \sum_j (1 - d^j_{i,x})(1 - d^j_{i,y})(1 - d^j_{i,z}).   (13)

In (12) and (13), the TSDF value and weight of each voxel \mathbf{v}_i for the point \mathbf{p}_j are defined by the normalized distances d^j_{i,x}, d^j_{i,y}, and d^j_{i,z} between \mathbf{v}_i and \mathbf{p}_j along the x, y, and z axes. After summing the distributions for time t, the TSDF value and weight of each voxel are finally obtained by a weighted running average as follows:

\Phi_t(\mathbf{v}) = \frac{W_{t-1}(\mathbf{v})\,\Phi_{t-1}(\mathbf{v}) + w_t(\mathbf{v})\,\Phi_t(\mathbf{v})}{W_{t-1}(\mathbf{v}) + w_t(\mathbf{v})},   (14)

W_t(\mathbf{v}) = W_{t-1}(\mathbf{v}) + w_t(\mathbf{v}),   (15)

where \Phi(\mathbf{v}) and W(\mathbf{v}) are the TSDF value and weight at the voxel center \mathbf{v} in the global TSDF. After the TSDF fusion, the 3D object surface can be reconstructed in mesh or point cloud form using a zero-crossing, where the sign of the TSDF value changes. The zero-crossing is mostly implemented via the marching cubes [26] or raycasting [27] algorithm, and we utilize the marching cubes algorithm. For parallel computation, the trilinear interpolation and summation for each voxel are accelerated on the GPU.
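The fusion step of (12)-(15) can be sketched on the CPU as follows. The grid layout, the array names, and the normalization of the per-frame contribution before the running average are simplifying assumptions; the paper's implementation accumulates these trilinear weights on the GPU.

```python
import numpy as np

def fuse_tsdf(global_tsdf, global_weight, points, values, origin, voxel_size):
    """Fuse one aligned TSDF into the global volume (Eqs. (12)-(15)).

    global_tsdf, global_weight : (X, Y, Z) arrays holding Phi(v) and W(v)
    points                     : (N, 3) aligned voxel centers p_j in the global frame
    values                     : (N,) TSDF values Phi_t(p_j) of the new frame
    origin, voxel_size         : global grid origin and voxel edge length
    """
    num = np.zeros_like(global_tsdf)      # sum of distributed TSDF values, Eq. (12)
    den = np.zeros_like(global_weight)    # sum of distributed weights,     Eq. (13)

    for p, phi in zip(points, values):
        g = (p - origin) / voxel_size                 # continuous grid coordinates
        base = np.floor(g).astype(int)                # lower corner of the 8-neighbourhood
        d = g - base                                  # fractional offsets along x, y, z
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    i, j, k = base + (dx, dy, dz)
                    if not (0 <= i < num.shape[0] and 0 <= j < num.shape[1]
                            and 0 <= k < num.shape[2]):
                        continue
                    # Trilinear weight of voxel (i, j, k) for the point p.
                    w = ((d[0] if dx else 1 - d[0]) *
                         (d[1] if dy else 1 - d[1]) *
                         (d[2] if dz else 1 - d[2]))
                    num[i, j, k] += w * phi
                    den[i, j, k] += w

    # Eqs. (14)-(15): weighted running average of the global TSDF.
    upd = den > 0
    new_tsdf = num[upd] / den[upd]                    # per-frame TSDF at updated voxels
    global_tsdf[upd] = (global_weight[upd] * global_tsdf[upd]
                        + den[upd] * new_tsdf) / (global_weight[upd] + den[upd])
    global_weight[upd] += den[upd]
    return global_tsdf, global_weight
```

Distributing each point's value to its eight neighbouring voxel centers is what handles aligned points that fall between voxel centers, which is the motivation for the trilinear weights given above.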
IV. EXPERIMENTAL RESULTS

We train our network on a single NVIDIA GeForce GTX 1080 Ti with 11 GB of GPU memory. We use a batch size of 32 for training and the Adam optimizer [28] with momentum parameters \beta_1 = 0.9, \beta_2 = 0.999. The learning rate is 10^{-4} for 50 epochs and is reduced to 10^{-5} for 25 epochs. We use the NYU-Depth-v2 dataset [7] to train and test our network. The NYU-Depth-v2 dataset consists of 464 scenes captured with a Microsoft Kinect RGB-D camera. We use the official train/test split consisting of 249 training scenes and 215 test scenes. For training, a total of 48K RGB and depth image pairs are used from the training scenes and augmented on the fly as in [5]. To evaluate the TSDF prediction results against the ground truth, we use the official 654 images from the test scenes. The depth image is rectified to the RGB image and filled in with a cross bilateral filter. The Microsoft 7-Scenes dataset [8] is also used to evaluate the performance of the TSDF-supported ICP and