Monocular Depth Estimation in New Environments With Absolute Scale

Tom Roussel, Luc Van Eycken, Tinne Tuytelaars*

Abstract— In this work we propose an unsupervised training method that finetunes a single-image depth estimation CNN towards a new environment. The network, which has been pretrained on stereo data, only requires monocular input for finetuning. Unlike other unsupervised methods, it produces depth estimates with absolute scale, a feature that is essential for most practical applications yet has mostly been overlooked in the literature. First, we show how our method allows adapting a network trained on one dataset (Cityscapes) to another (KITTI). Next, by splitting KITTI into subsets, we show the sensitivity of pretrained models to a domain shift. We then demonstrate that, by finetuning the model using our method, it is possible to improve the performance on the target subset, without using stereo or any form of ground-truth depth, while preserving the correct absolute scale.

*Authors are affiliated with KU Leuven and can be contacted via firstname.lastname@esat.kuleuven.be

I. INTRODUCTION

Unsupervised learning of single-image depth estimation models from video data has been gaining popularity over the last few years [1], [2], [3], [4], [5], [6], [7]. This family of techniques reformulates the depth estimation problem as a novel view generation problem: given the camera motion between two images, a new view can be generated using the estimated depth, by warping one image onto the other. By comparing the generated image with the actually recorded one, the quality of the estimated depth can be evaluated and the model (usually a CNN) can be updated.

These techniques can be split into two groups depending on the input data: stereo or monocular. With stereo unsupervised training [8], [9], [10], [2], the baseline between the left and right cameras is known. The novel-view objective is to recreate one image of the stereo pair given the other one. This teaches the network to predict metric-scale depth values, provided the baseline is given in metric units. On the other hand, there is monocular unsupervised training [1], [5], [6], [11], [12], [7], [2], which only requires monocular input. Depth is learnt by warping to frames in the video sequence before or after the source frame. However, this introduces the issue of scale ambiguity, as the absolute scale of the environment cannot be recovered without prior information. This means the network will produce depth values with an arbitrary scale, which in many cases is not even consistent over time.

This fatal flaw is reflected by the evaluation methods used in fully unsupervised depth estimation works [1]. Rather than evaluating the estimated depth values directly, an oracle is typically used first to correct the scale. In particular, before computing the standard error metrics, the estimated depth map is rescaled by the ratio of the median depth values of the ground-truth depth map and the estimated depth map. This is usually done per image.
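To make this per-image correction concrete, the following is a minimal NumPy sketch of the median rescaling described above; the function name median_rescale and the explicit validity mask are our own illustration, not part of the paper's code.

```python
import numpy as np

def median_rescale(depth_pred, depth_gt, mask):
    """Per-image 'oracle' scale correction used in fully unsupervised evaluations:
    rescale the predicted depth map by the ratio of median ground-truth depth
    to median predicted depth, computed over the valid pixels only."""
    scale = np.median(depth_gt[mask]) / np.median(depth_pred[mask])
    return scale * depth_pred, scale
```

A method that predicts the correct absolute scale should return scale factors close to one, while scale-ambiguous methods need widely varying per-image factors (cf. Figure 1).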
Fig. 1. Graph showing the ratios of the median ground truth depth and median estimated depth for the method of Zhou et al. [1] and our method. Our method predicts the correct absolute scale, while Zhou et al. do not. The images are examples showing the difference in estimations between both methods.

To demonstrate the importance of this correction, Figure 1 (blue) plots these individual scale factors over the KITTI test set for a state-of-the-art depth estimation network [1]. The scale correction factor varies a lot: the minimum and maximum are about 4 and 14, respectively. Our proposed method, on the other hand (in orange), consistently gives a scale factor close to one and is much more stable. The network used here is the one detailed in section IV-C.

Having only relative, and not absolute, depth estimates clearly limits the potential applications. For example, obstacle avoidance for autonomous vehicles (e.g. [13]) needs accurate, metric-scale depth to make safe decisions. This is not possible with the aforementioned unsupervised methods, which motivated us to develop our alternative scheme. This paper has two major contributions that help us overcome the scale ambiguity problem:

- We show that one can use a well-understood, off-the-shelf SLAM algorithm to replace the typically used camera pose estimation networks. Note that the latter need to be trained and hence do not generalize well beyond the circumstances seen at training time.
- We show how to leverage prior information in the form of a pretrained network, finetuning the network in an unsupervised fashion while preserving the absolute scale.

Starting from a generic, pretrained network is common practice for image classification, typically building on top of a representation learned on ImageNet. Likewise, we argue that learning depth estimation for a given domain (i.e., a particular city or neighbourhood) should not start completely from scratch, but rather re-use existing pretrained models. A straightforward implementation of this idea unfortunately fails miserably: when simply finetuning the network as a whole on the new monocular data, the absolute scale information gets lost, as we show in section IV. To preserve the scale, a more carefully designed strategy is needed. Our work is, to the best of our knowledge, the first to propose how to perform this finetuning in an unsupervised manner without re-introducing the problem of scale ambiguity.

In particular, we build on the framework of Zhou et al. [1]. We replace their pose CNN with an off-the-shelf SLAM algorithm, namely ORB-SLAM2 [14], to perform the camera relocalization. This algorithm is hand-crafted and geometry-based, making it generalize well over different sequences, and it takes the entire sequence into account, not just the temporally adjacent frames. To resolve the scale ambiguity, we run ORB-SLAM2 in RGB-D mode, using the depth estimations provided by the pretrained network in addition to the RGB input. This depth channel may be of poor quality due to the domain shift, yet it suffices to pass the scale information on to the SLAM algorithm. The pretrained network is trained with stereo data to produce metric-scale depth maps, which in turn causes the SLAM algorithm to produce metric-scale transformations between camera poses. This is key to ensuring that we keep the correct scale when using these poses in a warping loss to finetune the depth estimation network.

In section II we discuss the related work of this paper, followed by an explanation of our training pipeline in section III. We show and discuss the results of two experiments in section IV and conclude in section V.

II. RELATED WORK

A. Visual SLAM

Visual Localization and Mapping is a classic computer vision problem that has many applications in the robotics community.
The objective is to localize a moving camera in an unknown environment and to map this area. There have been many approaches to tackle this problem using monocular videos [14], [15], [16]. All of these suffer from scale ambiguity, which is typically resolved by using RGB-D or stereo cameras [14]. Another approach to reach an absolute scale with monocular SLAM is to fuse IMU data into the pipeline [17], [18]. Finally, the authors of [19] show that it is possible to resolve the scale ambiguity of monocular SLAM by using a depth estimation network that is trained to produce absolute-scale estimations. In this work we use this idea when extracting the camera trajectories.

B. Stereo Unsupervised Depth Estimation

As with any data-driven method, the amount of data available for training is critical for depth estimation, and gathering depth maps requires rather expensive sensors. A comparatively cheap depth sensor is a stereo camera. While one can find pixel correspondences between the stereo pair to construct a depth map, it was shown in [20] that it is possible to train a depth estimation network without explicitly constructing these depth maps as a supervisory signal first. Their insight was that the depth estimation problem can be reformulated as a novel view generation problem, which bypasses the need to construct an imperfect depth map from the stereo images. There has been further progress using this training pipeline, focusing on adding further constraints [8], [9], [10], [2]. A simple regularizer was used by [20], namely an L2 loss over the depth map gradient, but this performs poorly around the edges of objects. [8] showed that better results can be obtained using an edge-aware smoothing regularizer, which makes the boundaries of objects less blurry. The same authors also propose another constraint, requiring the network to produce depth maps that give consistent results when warping from left to right and vice versa.

Fig. 2. Schematic representation of our proposed scale-preserving unsupervised finetuning method.

C. Unsupervised Monocular Depth Estimation

Generating a novel view is not only possible when the camera performs a left-to-right translation, as in a stereo setup. The same idea can be applied to any arbitrary movement, as shown by [1]. This makes it possible to learn to estimate depth from monocular sequences, although there is the additional challenge of estimating an unknown pose difference of the camera. This is typically overcome by learning a separate model that estimates the 6-DoF camera motion and training both models end-to-end. This idea has again been expanded by adding extra constraints [5], [6], [11], [12]. In [7], a constraint based on standard structure from motion is added: pixels are enforced to move along the epipolar lines by estimating the essential matrix for each frame pair. One of the biggest advantages of this technique is that it can use data recorded by a single camera. On the other hand, a large downside is that, just like any monocular geometric technique, it suffers from scale ambiguity. In other words, it cannot produce depth maps with an absolute scale. The authors of [2] have combined monocular and stereo training, to jointly learn from both types of data. However, this requires both monocular videos and stereo data to be available at training time.

III. METHOD

The goal of our proposed method is to produce metric-scale depth estimates in an environment for which no depth labels from depth sensors such as LIDAR or stereo images are available.
Due to the fundamental scale ambiguity of monocular structure from motion, this requires prior knowledge about the environment. Here, we suggest leveraging the prior knowledge contained in a network that has been pretrained to produce metric-scale depth maps in a different environment, and then adapting it to the test setting. For simplicity, we use the terminology of domain adaptation: we call the initial environment, on which the model is pretrained, the source environment, and the environment we are adapting to the target environment. Our method can then be summarized as follows:

1) Pretrain on the source stereo dataset.
2) Recover the camera trajectories of the monocular target dataset using SLAM.
3) Finetune the model on the monocular target dataset using the recovered trajectories.

A. Camera trajectory

Most monocular unsupervised techniques [1], [5], [6], [12] train a CNN to estimate the pose difference between two images. Instead, we opt to use ORB-SLAM2 [14], an open-source, off-the-shelf visual SLAM algorithm, to recover the full trajectory of the camera throughout the sequence. ORB-SLAM2 can use the input of a depth sensor to circumvent the scale ambiguity inherent in monocular SLAM; we use the depth estimations generated by our pretrained model for this. These depth estimations are far from perfect, due to the domain shift, but good enough for this purpose thanks to the outlier removal mechanism integrated in ORB-SLAM2. SLAM algorithms are known to suffer from error accumulation across longer sequences, causing high errors when comparing two temporally distant frames. However, since the error increases gradually, it does not affect comparisons between local, frame-to-frame poses.

B. Training through Warping

The cornerstone of our method is training by warping images and applying a reconstruction loss. This is the same technique introduced by [1]. We do this both when pretraining on stereo data and when finetuning on monocular data. It goes as follows. To warp one frame, $I_2$, to another, $I_1$, we need the intrinsic calibration matrix $K$, the estimated depth $d_1(p_1)$ for each pixel in the target image, and the motion of the camera in the form of an SE(3) transformation matrix $T_{1\rightarrow 2}$:

$$T_{1\rightarrow 2} = \begin{bmatrix} & & & t_x \\ & R & & t_y \\ & & & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

where $R$ is the 3x3 rotation matrix and $t_x$, $t_y$ and $t_z$ are the elements of the translation vector of the camera. We then project the pixel coordinates $p_1 = (u_1, v_1)$ of the target frame $I_1$ onto the source frame $I_2$:

$$\begin{bmatrix} x_2 \\ y_2 \\ z_2 \end{bmatrix} = K \, T_{1\rightarrow 2} \, d_1(u_1, v_1) \, K^{-1} \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} \qquad (2)$$

$$p_2 = \begin{bmatrix} u_2 \\ v_2 \end{bmatrix} = \begin{bmatrix} x_2 / z_2 \\ y_2 / z_2 \end{bmatrix} \qquad (3)$$

With these transformed pixel coordinates one can then reconstruct the target frame by sampling (and linearly interpolating) from the source image. Finally, we use this reconstructed frame to create an L1 reconstruction loss:

$$\mathcal{L}_{reconstruction} = \lVert I_2(p_2) - I_1 \rVert_1 \qquad (4)$$

Simple sampling from a matrix is not a differentiable operation, so we apply the differentiable sampling operation from [21], which is a bilinear sampler.

In our training pipeline, we apply this loss in two different scenarios. During the pretraining step, we leverage the idea presented by [20]: we use the stereo pairs to train the network to produce metric depth estimations. In this case, the transformation matrix $T_{stereo}$ is simple:

$$T_{stereo} = \begin{bmatrix} 1 & 0 & 0 & b \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (5)$$

where $b$ is the stereo baseline in meters, and $I_2$ and $I_1$ are the left and right stereo images, respectively.
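To make the warping objective of Eqs. (2)-(4) concrete, here is a minimal NumPy sketch of the projection and L1 reconstruction loss. The naming is our own and nearest-neighbour sampling is used for brevity; the actual training pipeline uses the differentiable bilinear sampler of [21] inside TensorFlow.

```python
import numpy as np

def warp_and_reconstruction_loss(I1, I2, depth1, K, T_12):
    """Illustrative (non-differentiable) version of Eqs. (2)-(4): project every
    pixel of the target frame I1 into the source frame I2 using its estimated
    depth, sample I2 there, and compare the result to I1."""
    H, W = depth1.shape
    K_inv = np.linalg.inv(K)

    # Pixel grid of the target frame in homogeneous coordinates (3 x H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])

    # Back-project to 3D points in the camera frame of I1.
    cam1 = depth1.ravel() * (K_inv @ pix)

    # Transform into the camera frame of I2 and project with K (Eq. 2).
    cam1_h = np.vstack([cam1, np.ones(H * W)])
    proj = K @ (T_12 @ cam1_h)[:3]

    # Perspective division gives the sampling coordinates p2 (Eq. 3).
    u2 = proj[0] / proj[2]
    v2 = proj[1] / proj[2]

    # Nearest-neighbour sampling for brevity; training uses bilinear sampling [21].
    u2 = np.clip(np.round(u2).astype(int), 0, W - 1)
    v2 = np.clip(np.round(v2).astype(int), 0, H - 1)
    I2_warped = I2[v2, u2].reshape(I1.shape)

    # L1 reconstruction loss (Eq. 4), averaged over pixels.
    return np.abs(I2_warped.astype(np.float64) - I1.astype(np.float64)).mean()
```

For the stereo pretraining step, T_12 is simply the fixed $T_{stereo}$ of Eq. (5); when finetuning, it comes from the ORB-SLAM2 trajectory described next.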
When finetuning, this transformation matrix is unknown and needs to be estimated. We obtain this estimate by using ORB-SLAM2 to recover the trajectory, at metric scale, as described in section III-A. In this case, $I_1$ and $I_2$ are two consecutive frames. A schematic representation of how this training works can be found in Figure 2.

Because of the nature of this reconstruction loss, there is no unique solution that minimizes it. To resolve this, we add an edge-aware smoothing regularizer as in [8]. Additionally, before applying this regularization we normalize the depth using the mean depth of the current depth map, as in [3]. Note that this normalization is only applied for this regularizer; it is not applied to the reconstruction loss, nor is it applied at test time.

$$\bar{d}_i = \frac{N \, d_i}{\sum_{j=1}^{N} d_j} \qquad (6)$$

$$\mathcal{L}_{regularizer} = \sum_i \left| \frac{\partial \bar{d}_i}{\partial x} \right| e^{-\left| \frac{\partial I_i}{\partial x} \right|} + \left| \frac{\partial \bar{d}_i}{\partial y} \right| e^{-\left| \frac{\partial I_i}{\partial y} \right|} \qquad (7)$$

with $N$ the total number of pixels in the depth map and $i$ an index running over these. The total loss used to train the network is then

$$\mathcal{L} = \mathcal{L}_{reconstruction} + w_{reg} \, \mathcal{L}_{regularizer} \qquad (8)$$

with $w_{reg}$ a regularization parameter.

IV. EXPERIMENTS

A. Implementation details

All our experiments are implemented using TensorFlow [22]. We use the DispNet architecture [23] in the same configuration as [1]. We use batch normalization [24] during training with a decay of 0.95 and apply L2 weight regularization. Weights are randomly initialized when pretraining. The input and output resolution of our network is set to 128x416, with a batch size of 16. When training, we do a grid search over learning rates between 1e-5 and 1e-3 and $w_{reg}$ between 1e-4 and 1e-3, and select the best network based on the log RMSE on the validation set. We do this for both steps, pretraining and finetuning. During the stereo pretraining, we randomly swap $I_t$ and $I_s$ with a chance of 50% per batch. For camera localization, we customize the ORB-SLAM2 codebase to load the TensorFlow network and provide depth estimations for each new image as it comes in; ORB-SLAM2 then uses these depth maps as it would use the input of a depth sensor. We split the sequences up into subsequences of 200 images and run SLAM separately on each.

Fig. 4. Estimated camera trajectories of sequences within KITTI. The network used for "depth estimated" is the pretrained network that was only trained on Cityscapes.

B. Evaluation metrics

When evaluating, we use the standard evaluation metrics, also used by [25]:

Threshold: \% of $d_i$ for which $\max\!\left(\frac{d_i^{estim}}{d_i^{gt}}, \frac{d_i^{gt}}{d_i^{estim}}\right) < 1.25^n$  (9, 10)

Abs. Rel.: $\frac{1}{N} \sum_i \frac{|d_i^{estim} - d_i^{gt}|}{d_i^{gt}}$  (11)

Sq. Rel.: $\frac{1}{N} \sum_i \frac{(d_i^{estim} - d_i^{gt})^2}{d_i^{gt}}$  (12)

RMS: $\sqrt{\frac{1}{N} \sum_i (d_i^{estim} - d_i^{gt})^2}$  (13)

RMS log: $\sqrt{\frac{1}{N} \sum_i (\log d_i^{estim} - \log d_i^{gt})^2}$  (14)

We also apply the same mask as [25] before computing the evaluation metrics. This masks out areas which have no LIDAR data or where the measurements are less accurate.

C. Pretraining on Cityscapes

To show how the network handles a substantial domain shift, we set up an experiment where we pretrain the network on the Cityscapes dataset [26] using stereo data and then finetune it with our method on the KITTI dataset, using only monocular data. For the pretraining, we use both the train and train_extra sets. The aspect ratio of these images is not the same as in the KITTI dataset. Therefore, we crop out 20% from the bottom of each frame to leave out the hood of the car, and crop the rest from the top of the frame until the aspect ratio matches that of KITTI.

Fig. 3. Example images taken from Cityscapes after cropping.
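As a rough illustration of this preprocessing step, the sketch below crops a Cityscapes frame so it can be resized to the network input. The concrete numbers (2048x1024 Cityscapes frames, cropping from the top until the width/height ratio matches 416/128) are our assumptions for illustration, not values spelled out in the text above.

```python
import numpy as np

def crop_cityscapes(frame, target_ratio=416 / 128):
    """Crop a Cityscapes frame (H x W x 3, assumed 1024 x 2048): drop the bottom
    20% (car hood), then drop rows from the top until the width/height ratio
    matches the assumed network input ratio."""
    h, w = frame.shape[:2]
    frame = frame[: int(0.8 * h)]               # remove bottom 20% (car hood)
    target_h = int(round(w / target_ratio))     # height giving the target ratio
    frame = frame[frame.shape[0] - target_h:]   # remove the excess from the top
    return frame
```

The cropped frame would then be resized to the 128x416 input resolution mentioned in section IV-A.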
Fig. 5. Network output on the test set before and after finetuning on KITTI. The network was pretrained on Cityscapes. From top to bottom: network input, ground truth, pretrained network, finetuned network.

TABLE I. Scale-corrected results on the KITTI Eigen test set. The second column denotes on which datasets the network was trained:
