IROS 2019 International Conference Proceedings, Paper 0698
Adaptive Loss Balancing for Multitask Learning of Object Instance Recognition and 3D Pose Estimation

Takashi Hosono¹, Yuuna Hoshi², Jun Shimamura¹ and Atsushi Sagata¹

Abstract— Object instance recognition and 3D pose estimation are important elements of robot vision technology. State-of-the-art methods improve the accuracy of both instance recognition and pose estimation using multitask learning. These methods use unified balancing parameters to integrate the loss of each task, which implicitly assumes that the task difficulties are the same for all objects. In contrast, the method we propose adjusts the balancing parameters for each object. This idea is based on the assumption that the task difficulties differ between objects, since the distinctiveness of object instances and poses depends on their appearance and shape. Our method sequentially estimates the task difficulties for the CNN from the amount of loss change and calculates balancing parameters for each object. Our experiments on the common LineMOD dataset show that our method improves the accuracy of both object instance recognition and pose estimation compared with state-of-the-art methods.

I. INTRODUCTION

Object instance recognition and 3D pose estimation is a very challenging task because of occlusions, background clutter, and scale changes, and it serves as a foundation for many applications such as augmented reality and robot grasping. Recently, the problem is often cast as 3D object retrieval from already detected objects because of its scalability. Our work is motivated by recent successful descriptor learning using a convolutional neural network (CNN); the best-performing methods use multitask learning of feature-descriptor learning and pose regression [1], [2].

Balancing of losses is generally important in multitask learning, which integrates multiple losses. Reducing the weight of difficult tasks and increasing the weight of easy tasks has been particularly effective [3], [4].
This weighting helps difficult tasks avoid local minima and prevents unnecessary increases in the gradients calculated from difficult tasks. When considering multitask learning of object retrieval and pose estimation, our assumption is that the difficulty of object retrieval and the difficulty of pose estimation differ for each object, since the distinctiveness of object instances and poses depends on their appearance and shape. Namely, the balancing parameter should be adjusted per object. However, in the existing methods mentioned above, the balancing parameter is determined uniformly or only by object-independent task difficulties.

¹NTT Media Intelligence Laboratories, NTT Corporation, Kanagawa, Japan. {takashi.hosono.ks, jun.shimamura.ec, atsushi.sagata.hw}@hco.ntt.co.jp
²College of Computing, Georgia Institute of Technology, Atlanta, USA. yhoshi3@gatech.edu

Therefore, the differences in task difficulties cannot be taken into account for each object. To address this issue, we focus on adjusting the balancing parameters of multitask learning for each object. Figure 1 illustrates the concept of our method. When the balancing parameters are calculated without considering the task difficulty of each object, losses that do not match each object's task difficulties are propagated (Fig. 1(a)), which appears to reduce the effect of multitask learning. Figure 1(b) outlines our approach: our method adjusts the balancing parameters according to the task difficulties of each object. For example, for the holepuncher in Figure 1(b), pose estimation is intuitively more difficult than object retrieval, since the appearance change caused by a pose change is smaller than that caused by an object change. Similarly, for the cat and ape in Figure 1(b), object retrieval appears to be more difficult than pose estimation. Our method therefore considers the difference in task difficulties for each object. Furthermore, because adjusting the balancing parameter exhaustively for each object is difficult, and the task difficulty for the CNN changes as the model is updated, our method adjusts the balancing parameters automatically and sequentially.
We apply this idea to the method proposed by Bui et al. [2] and performed evaluations on the common LineMOD dataset [5]. We follow the reasonable experimental setting of Bui et al. [2], which uses only a depth image as input and only synthetic images as template images for object retrieval. Experimental evaluations demonstrate that adjusting the balancing parameters for each object affects performance. Our experiments also show that our method, which adaptively adjusts the balancing parameters, improves the accuracy of both object instance recognition and pose estimation compared with state-of-the-art methods.

The remainder of this paper is organized as follows. In Section 2, we briefly review existing methods for object instance recognition with 3D poses and for multitask learning. In Section 3, we detail our method. In Section 4, we report experimental results that confirm our method's effectiveness through comparison with state-of-the-art methods on the LineMOD dataset [5]. In Section 5, we conclude the paper with a summary of key points and mention future work.

II. RELATED WORK

A. Object Instance Recognition and 3D Pose Estimation

Recent object instance recognition and 3D pose estimation methods can be roughly classified into classification-based methods and descriptor-learning-based methods.

[2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019. 978-1-7281-4003-2/19/$31.00 ©2019 IEEE]

[Figure 1: (a) Uniform loss balancing. (b) Adaptive loss balancing according to the task difficulties of each object.]

Fig. 1. Concept of our method. The green arrows indicate the pose loss and the red arrows indicate the descriptor loss. The sizes of the circles on the arrows indicate the balancing weight for multitask learning.
(a) Previous methods determine the balancing parameters uniformly. (b) Our method determines the balancing parameters for each object. Estimating the poses of the objects in the red boxes is relatively difficult, as is instance recognition for the objects in the green boxes. The images in the dotted-line boxes intuitively show our assumption that the task difficulty differs depending on the object: (i) shows a reference image, (ii) shows a different pose or object compared to the reference image, and (iii) shows the absolute-value image of the difference between (i) and (ii). Looking at the holepuncher, pose estimation is considered more difficult, since the appearance change due to a pose change is smaller than the appearance change due to an object change. Similarly, in the case of the cat, recognizing the object instance is more difficult.

In classification-based methods, a two-stage framework of classification and pose estimation is often used [6], [7], [8], [9], [10]. To name a few, Schwarz et al. [6] use a CNN as a feature extractor and classify instances with an SVM; poses are then estimated from the CNN features and the classification results with SVR. In addition, Xiang et al. [7] propose a method that recognizes objects and estimates poses via instance segmentation. Another approach is to classify the object and pose together within an end-to-end learning framework [11], [12]. In these approaches, the object class and the quantized 3D poses are estimated by a classifier. In general, the number of classes is large for object instance recognition, and all of the methods mentioned above that rely on a classifier typically grow in complexity linearly with the number of classes. In descriptor-learning-based methods, instance recognition and pose estimation are performed by a nearest-neighbor (NN) search over feature vectors extracted from the query image and the template images using learned descriptors [13], [14], [15], [1], [2].
This approach has the advantage of high scalability because efficient NN search methods have an average complexity of O(1) [16], [17]. Furthermore, this approach can recognize objects without retraining even when objects are added to or removed from the database. For example, the method proposed by Wohlhart and Lepetit [13] uses a CNN as a descriptor and learns it with a triplet loss. The triplet loss takes three images, called the anchor, the puller, and the pusher. The anchor is a reference image, the puller is an image similar to the anchor, and the pusher is an image dissimilar to the anchor. The loss function tries to map the output feature vectors so that the anchor and the puller are close while the anchor and the pusher are far apart [18]. In [13], the puller is selected as an image of the same object with a pose similar to the anchor's; the pusher is selected as an image of an object different from the anchor, or of the same object with a very different pose. This descriptor learning, which considers the pose, outperforms methods that use hand-crafted descriptors. Zakharov et al. [15] improve this approach by dynamically changing the distance between anchors and pushers depending on the pose difference. Bui et al. [2] show that the accuracy of both object retrieval and pose estimation improves with multitask learning of descriptor learning and pose regression. In [2], the balancing parameter of multitask learning is determined experimentally as one value common to all objects, whereas our method determines the balancing parameters for each object and updates them sequentially according to the task difficulties for the CNN.

B. Multitask Learning

Multitask learning [19] learns related information across tasks to improve the prediction performance of the model. In general, many approaches use a naive weighted sum of losses, where the balancing parameters of the losses are uniform or manually tuned [20], [21]. Although the performance of multitask learning depends highly on the balancing parameters, tuning them is difficult. Therefore, several methods tune the balancing parameters automatically.
In [3], [4], the difficulty of each task is estimated automatically; the weight of more difficult tasks is set lighter and the weight of easier tasks is set heavier. This helps difficult tasks avoid local minima and prevents unnecessary increases in the gradients calculated from difficult tasks. In contrast to these methods, our method determines the balancing parameters according to the task difficulties of each object, based on the assumption that in object instance recognition and pose estimation the task difficulties differ for each object. In fact, the experiment in Section 4 shows that adjusting the balancing parameters per task and per object is better. Another approach, focal loss [22], increases the weight of more difficult samples to prevent the losses calculated from easy and numerous training samples (e.g., background samples) from becoming dominant. In this work, we study already detected patches, i.e., all samples are foreground, so we only follow the knowledge in [3], [4].

[Figure 2: An overview of our method.]

Fig. 2. An overview of our method. Given triplets $(s_i, s_j, s_k)$, multitask learning of descriptor learning and pose regression is performed with our adaptive loss balancing. At each iteration, the losses $L^c_{pose}$ and $L^c_d$ of each task are calculated for object $c$ from the feature vector $f(x)$ and the pose $p$ obtained from the CNN. These losses are integrated based on the task difficulties $r^c_{pose}$ and $r^c_d$ of each object.

III. APPROACH

We propose a multitask learning method that dynamically balances the descriptor loss and the pose loss, and we apply it to the method of Bui et al. [2]. Figure 2 gives an overview of our method. To train the CNN, a training set $S_{train}$ is given. $S_{train}$ is a set of data tuples $\{(x_1, c_1, q_1), \ldots, (x_N, c_N, q_N)\}$, where $x$ represents an image patch, $c$ is its class, and $q$ is the pose vector expressed as a quaternion; each sample is represented by $s = (x, c, q)$.
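As an illustration only, this training-set structure could be sketched as follows (the class name, field names, and the 64×64 patch size are our assumptions, not taken from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    """One training sample s = (x, c, q)."""
    x: np.ndarray  # image patch (a depth patch in this paper's setting)
    c: int         # object instance label
    q: np.ndarray  # 3D pose as a unit quaternion (w, x, y, z)

# S_train is then simply a collection of such samples
s_train = [
    Sample(x=np.zeros((64, 64)), c=0, q=np.array([1.0, 0.0, 0.0, 0.0])),
    Sample(x=np.zeros((64, 64)), c=1, q=np.array([0.0, 0.0, 1.0, 0.0])),
]
# poses must be unit quaternions
assert all(abs(np.linalg.norm(s.q) - 1.0) < 1e-9 for s in s_train)
```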
Note that we use only depth images in our experiments. In the learning step, we first create triplets $(s_i, s_j, s_k)$ from $S_{train}$, where $s_i$ represents the anchor, $s_j$ the puller, and $s_k$ the pusher. The losses for descriptor learning and pose regression are then calculated from the triplets and integrated with the balancing parameter calculated for each object. Here, following the knowledge of [3] and [4], the balancing parameters are adjusted so that the task that is more difficult for the current CNN receives a smaller weight. In the estimation step, a synthetic dataset $S_{db} = \{(x_1, c_1, q_1), \ldots, (x_M, c_M, q_M)\}$ is given for retrieval. For a query $s$, $c$ and $q$ are estimated by an NN search between the feature vector obtained from the query image and those of the $S_{db}$ images; $q$ can also be estimated by the pose regression. In this section, we first describe the adaptive loss balancing that considers the task difficulty of each object, and then our approach for calculating the descriptor-learning loss and the pose-regression loss.

A. Adaptive Loss Balancing

We calculate the balancing parameters based on the task difficulties of each object. Inspired by [4], the task difficulty is estimated from the loss reduction. First, we calculate $\kappa_\tau^c(t)$, the moving average of the current loss $L_\tau^c(t)$, as follows:

$\kappa_\tau^c(t) = \alpha L_\tau^c(t) + (1 - \alpha)\,\kappa_\tau^c(t-1)$,   (1)

where $\tau \in \{d, pose\}$ denotes either descriptor learning or pose regression, $t$ is the current training iteration, and $\alpha \in (0, 1)$ is a discount factor. Using $\kappa_\tau^c(t)$, we define the task difficulty $r_\tau^c(t)$ as follows:

$r_\tau^c(t) = \kappa_\tau^c(t) \,/\, \kappa_\tau^c(t-1)$.   (2)

A large $r_\tau^c(t)$ means that the current optimization step did not reduce the loss much; in other words, the optimization is difficult for the current CNN. In particular, if $r \geq 1$, the task appears to have stepped into a local minimum. Therefore, we adjust the balancing parameters so that a task with a relatively large $r_\tau^c(t)$ receives a light weight. Specifically, the overall loss function $L_{MTL}^c$ for each object and the balancing parameter $\lambda^c$ are defined as follows:

$L_{MTL}^c = \lambda^c L_d^c + (1 - \lambda^c) L_{pose}^c$,   (3)

$\lambda^c = \dfrac{r_{pose}^c}{r_d^c + r_{pose}^c}$.   (4)
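The update rule of Eqs. (1)-(4) can be sketched as a small bookkeeping class (a minimal illustration; the class and variable names are our own, and initializing the moving average with the first observed loss is our assumption, since the preview does not state how $\kappa$ is initialized):

```python
class AdaptiveLossBalancer:
    """Per-object loss balancing from loss-decrease ratios (Eqs. (1)-(4))."""

    def __init__(self, num_objects, alpha=0.9):
        self.alpha = alpha  # discount factor alpha in (0, 1)
        # moving average kappa per task and object; None until first update
        self.kappa = {"d": [None] * num_objects, "pose": [None] * num_objects}

    def update(self, c, loss_d, loss_pose):
        """Update the moving averages for object c; return lambda^c (Eq. (4))."""
        r = {}
        for task, loss in (("d", loss_d), ("pose", loss_pose)):
            prev = self.kappa[task][c]
            if prev is None:
                # assumed initialization: start the average at the first loss
                new, r[task] = loss, 1.0
            else:
                # Eq. (1): kappa(t) = alpha * L(t) + (1 - alpha) * kappa(t-1)
                new = self.alpha * loss + (1.0 - self.alpha) * prev
                # Eq. (2): r(t) = kappa(t) / kappa(t-1); large r -> hard task
                r[task] = new / prev
            self.kappa[task][c] = new
        # Eq. (4): the harder task (larger r) ends up with the lighter weight
        return r["pose"] / (r["d"] + r["pose"])

    def integrated_loss(self, c, loss_d, loss_pose):
        lam = self.update(c, loss_d, loss_pose)
        # Eq. (3): L_MTL = lambda * L_d + (1 - lambda) * L_pose
        return lam * loss_d + (1.0 - lam) * loss_pose
```

Note the direction of Eq. (4): if the pose loss stalls ($r_{pose}$ grows), $\lambda^c$ grows, so the weight $1-\lambda^c$ on the pose loss shrinks.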
B. Descriptor Learning

To learn good descriptors, descriptors of the same object should be mapped close together and descriptors of different objects should be mapped far apart. In addition, we have to learn features that are not affected by the differences between real and synthetic images, because the input image is real whereas $S_{db}$ contains only synthetic images. Here we use the triplet and pairwise losses introduced in [15], defined as follows:

$L_d = L_{triplets} + L_{pairs}$.   (5)

To calculate $L_{triplets}$, triplets $(s_i, s_j, s_k) \in T$ are drawn from $S_{train}$. We choose samples of the same object with a close pose as the anchor $s_i$ and puller $s_j$. In contrast, we choose the pusher $s_k$ to be a different object from the anchor, or the same object with a very different pose. Using these triplets, we define $L_{triplets}$ over a batch as follows:

$L_{triplets} = \sum_{(s_i, s_j, s_k) \in T} \max\left(0,\; 1 - \dfrac{\|f(x_i) - f(x_k)\|_2^2}{\|f(x_i) - f(x_j)\|_2^2 + \epsilon m}\right)$,   (6)

where $\|\cdot\|_2$ is the $l_2$ norm, $f(x)$ is the feature vector obtained from the CNN used as a descriptor, $m$ is the margin, and $\epsilon$ is the parameter that adjusts the effect of the margin. In order to map feature vectors that are similar in pose closer together and feature vectors that are dissimilar in pose farther apart, the margin is calculated dynamically as follows:

$m = \begin{cases} 2 \arccos(|q_i \cdot q_k|) & \text{if } c_i = c_k \\ n & \text{otherwise} \end{cases}$,   (7)

where $n$ is a constant larger than the maximum value of the pose-dependent term. The pairwise loss $L_{pairs}$ is calculated on pairs $(s_i, s_j) \in P$ and defined as follows:

$L_{pairs} = \sum_{(s_i, s_j) \in P} \|f(x_i) - f(x_j)\|_2^2$.   (8)

This term aims at producing the same descriptor for a real image and a synthetic image that share the same pose and the same object but differ in background and lighting conditions.

C. Pose Regression

For pose regression, the quaternion is computed from the feature vector $f(x)$ by a fully connected layer added after the feature-descriptor layer, and the loss function $L_{pose}$ is defined as follows:

$L_{pose} = \left\| q - \dfrac{\hat{q}}{\|\hat{q}\|_2} \right\|_2^2$,   (9)

where $q$ is the corresponding ground-truth pose and $\hat{q}$ is the predicted quaternion.

IV. EXPERIMENTS

To confirm the effectiveness of our proposed method, we conduct experiments on the LineMOD dataset [5].
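As a concrete reference for the loss terms of Eqs. (5)-(9), a minimal NumPy sketch follows. The function names are our own, and two details follow our reading of this (OCR-damaged) preview: the margin-scaling parameter $\epsilon$ multiplies $m$ in the denominator of Eq. (6), and the default for the cross-object constant $n$ is a hypothetical value just above $\pi$, the maximum of $2\arccos(|q_i \cdot q_k|)$:

```python
import numpy as np

def dynamic_margin(q_i, q_k, c_i, c_k, n=np.pi + 0.1):
    """Eq. (7): pose-dependent margin; n (assumed value) for cross-object pairs."""
    if c_i == c_k:
        return 2.0 * np.arccos(np.clip(abs(np.dot(q_i, q_k)), 0.0, 1.0))
    return n

def triplet_loss(f_a, f_p, f_n, m, eps=0.01):
    """Eq. (6): one dynamic-margin triplet term (anchor, puller, pusher)."""
    num = np.sum((f_a - f_n) ** 2)          # squared distance to the pusher
    den = np.sum((f_a - f_p) ** 2) + eps * m  # distance to the puller + margin
    return max(0.0, 1.0 - num / den)

def pair_loss(f_a, f_b):
    """Eq. (8): pulls real/synthetic descriptors of the same view together."""
    return np.sum((f_a - f_b) ** 2)

def pose_loss(q_gt, q_pred):
    """Eq. (9): regression loss against the normalized predicted quaternion."""
    return np.sum((q_gt - q_pred / np.linalg.norm(q_pred)) ** 2)
```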
To generate the dataset, we use public real images¹ that are already cropped, whereas the synthetic images are generated by ourselves. We therefore first describe how the dataset is generated, then give details of the implementation and the experimental conditions, and finally discuss the experimental results.

A. Dataset Generation

In the same manner as related work [13], [15], [2], we partition the synthetic images and the real images obtained from the LineMOD dataset into the following three sets:

$S_{train}$: Triplets for training. Anchor images are real images or synthetic images with added background noise. Puller and pusher images are synthetic images only.

$S_{db}$: Template images for object retrieval. All images are synthetic.

$S_{test}$: Real images for evaluation, used exclusively in the evaluation phase.

In the following, we detail how the synthetic images are rendered, how background noise is added, and how the dataset is partitioned.

¹https://www.tugraz.at/institute/icg/research/team-lepetit/research-projects/object-detection-and-3d-pose-estimation

[Figure 3: Viewpoints for rendering objects.]

Fig. 3. Viewpoints for rendering objects. If an object is rotation invariant, we use only the green points; if it is symmetrical, we use only the green and blue points; for other objects, we use all points. For each viewpoint, we render images by rotating the camera around the axis pointing at the object center.

1) Rendering Synthetic Images: First, we render fifteen objects from various viewpoints using their 3D mesh models. Figure 3 details the viewpoints, which we set following [23]. Symmetrical objects and rotation-invariant objects have the same appearance from different viewpoints, which produces images that adversely affect the pose regression; therefore, such objects are rendered only from viewpoints where every appearance is unique. In particular, the cup, eggbox, and glue objects are symmetrical, and the bowl object is rotation invariant. At each viewpoint, we additionally apply in-plane rotation by rotating the camera from -45 to 45 degrees with a stride of 15 degrees. Then we generate an image patch
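The in-plane rotation sweep just described (-45° to 45° in strides of 15°) can be sketched as follows. This only illustrates the angle grid and the corresponding camera-roll quaternions; composing them with the viewpoint rendering itself requires a renderer and the 3D mesh models, and the helper names are our own:

```python
import math

def inplane_angles(lo=-45, hi=45, stride=15):
    """Angle grid for the in-plane camera rotation at each viewpoint."""
    return list(range(lo, hi + 1, stride))

def roll_quaternion(deg):
    """Unit quaternion (w, x, y, z) for a rotation of `deg` degrees
    about the camera's viewing axis (taken here as the z-axis)."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

angles = inplane_angles()
# 7 in-plane rotations are rendered per viewpoint
rolls = [roll_quaternion(a) for a in angles]
```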
