Generating an Image of an Object's Appearance from Somatosensory Information during Haptic Exploration

Kento Sekiya, Yoshiyuki Ohmura, and Yasuo Kuniyoshi

Kento Sekiya is with the Faculty of Engineering, the University of Tokyo, Japan (sekiya@isi.imi.i.u-tokyo.ac.jp). Yoshiyuki Ohmura and Yasuo Kuniyoshi are with the Graduate School of Information Science and Technology, the University of Tokyo, Japan ({ohmura, kuniyosh}@isi.imi.i.u-tokyo.ac.jp).

Abstract: Visual occlusions caused by the environment or by the robot itself can be a problem for object recognition during manipulation by a robot hand. Under such conditions, tactile and somatosensory information are useful for object recognition during manipulation. Humans can visualize the appearance of invisible objects from only the somatosensory information provided by their hands. In this paper, we propose a method to generate an image of an invisible object's posture from the joint angles and touch information provided by robot fingers while touching the object. We show that the object's posture can be estimated from the time series of the joint angles of the robot hand via regression analysis. In addition, conditional generative adversarial networks can generate an image that shows the appearance of the invisible objects from their estimated postures. Our approach enables user-friendly visualization of somatosensory information in remote-control applications.

I. INTRODUCTION

Object and environmental recognition are crucial processes for object handling in the real world. Progress in computer vision has enabled robots to detect objects and recognize them. Additionally, computer vision is helpful in shape recognition and pose estimation for the purposes of robotic manipulation. However, computer vision is often useless during object manipulation because the robot hand or the surrounding environment hides part or the entirety of the object. In such situations, visual processing of the changes in the position and pose of an object that has been touched by the robot becomes difficult.

Humans can recognize and manipulate objects in situations where the visual information has been lost, e.g., in the dark or when the object is in a pocket. Klatzky et al. showed that humans can recognize the type of an object with only a few touches [1]. Furthermore, humans seem to be able to visualize an individual object's information during haptic exploration [2]. While somatosensory information mainly consists of self-motion and posture-related information, humans frequently pay attention to the object's posture and pose rather than their hand's pose. Because the object's posture and pose are more important than self-motion during manipulation, this attention bias is reasonable. However, the method used to extract the object's information from the somatosensory information is poorly understood. We believe that this ability is crucial for effective object manipulation.

Fig. 1: System used to match an image to somatosensory information via an object's posture.

In this paper, we show that the postures of several known objects can be estimated from time-series somatosensory information, and we provide a model that generates an image of the appearance of sample objects during haptic exploration. We propose a method that combines regression networks with conditional generative adversarial networks (cGANs) [3]. The regression networks estimate an object's pose from somatosensory information, and we evaluate how much of the object pose information the time-series hand data contains. A cGAN is a generative model that generates an image of an object corresponding to that object's pose, and we evaluate whether or not the generated image shows the object's pose correctly.

Our proposed approach can be used to complement the visual information of objects when they are covered by the surrounding environment. The robot can present the somatosensory information as an image that a human can understand easily, and our approach thus enables user-friendly visualization of somatosensory information in remote-control applications.

II. RELATED WORK

A. Object recognition

In the computer vision field, high-level object recognition has been achieved. Through the use of deep neural networks, techniques for feature extraction from images have improved, and the acceleration of the processing time has enabled real-time object recognition [4]-[8].

In the neuroscience field, the ability of humans to recognize objects by touch has often been discussed. Hernández-Pérez et al. showed that tactile object recognition generates patterns of activity in a multisensory area that is known to encode objects [2]. Monaco et al. also showed that the area of the brain related to visual recognition is activated during haptic exploration of shapes [9]. Furthermore, the relationship between visual perception and object manipulation has also been discussed in recent years [10]. Therefore, it is believed that humans can imagine visual information from the tactile information acquired during haptic exploration.

B. Image generation

Generative modeling has been studied in both the computer vision and natural language processing fields. Recently, deep neural networks have made a major contribution to image generation using generative modeling. Examples of the deep generative models that have been developed include the variational autoencoder (VAE) [11] and generative adversarial networks (GANs) [12]. GANs comprise two networks known as the generator and the discriminator; the generator can generate high-resolution images that we cannot discriminate from real images, but GANs have a problem with training instability. To solve this problem, various studies have proposed improved GAN models [3], [13], [14]. In this paper, we focus on cGANs [3], which can control the generated images using a conditional vector. In cGANs, conditional vectors are merged into the inputs of both the generator and the discriminator, so the generator can learn weights that represent images corresponding to the conditional vectors.

III. METHODS

A. Overview

To generate an image of an object's appearance from somatosensory information during haptic exploration with supervised learning, it is necessary to collect pairs composed of an image of the object's appearance and the somatosensory information recorded during haptic exploration. However, in the real world, the robotic hand generally covers the object during haptic exploration, so it is difficult to collect the object image and the somatosensory information simultaneously. We propose a system that matches images to the somatosensory information via the object's posture, which is measured using a rotation sensor: we collect the somatosensory information and the object posture simultaneously, and we collect the object posture and the image data simultaneously. Finally, we match the images to the somatosensory information, as shown in Fig. 1.

Fig. 2 shows the model used to generate an image of an object's appearance from somatosensory information during haptic exploration. To determine whether an object's information can be extracted from somatosensory information alone during this exploration, we use regression nets that estimate the object's pose from the somatosensory information. The cGAN trains a generator that generates images from noise and from conditional vectors constructed from the estimated object pose.

Fig. 2: Model composed of the regression nets and the cGAN. The conditional labels of the cGAN are constructed from the object pose estimated by the regression nets; "mlp" means multi-layer perceptron and the numbers are layer sizes (regression nets: mlp 145-50-10-2, outputting the cosine and sine of the pose; generator: mlp 100-256-512-1024-16384, mapping 100-dimensional noise merged with an embedded label to a 128x128 image; discriminator: mlp 16384-512-512-512-1, classifying real versus generated images under the embedded label).

B. Regression nets

We used regression nets to extract the object's pose from the somatosensory information. The regression nets were trained using a set of object postures and the corresponding somatosensory information, and then estimated the object's pose.

Posture data are cyclic: the same posture recurs after rotating through 360°. Therefore, when the regression nets are trained, raw posture data cannot be used to calculate the minimum square error. We thus used the cosine and the sine of the posture as the outputs of the regression nets.
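As a concrete illustration, a minimal sketch of these regression nets in Keras, following the layer sizes given in Fig. 2 (145-50-10-2), is shown below. The hidden activations, the tanh output, the optimizer, and the placeholder data are our assumptions, since the paper text does not specify them.

```python
# Minimal sketch of the regression nets (Fig. 2: mlp 145-50-10-2).
# Assumptions: ReLU hidden activations, tanh output, Adam optimizer,
# mean-square-error loss; the training data below are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_regression_net():
    # 145-dim input: 29-dim hand data over a 5-step time window.
    # 2-dim output: (cos, sin) of the object posture, so the cyclic
    # angle can be regressed with an ordinary squared-error loss.
    model = keras.Sequential([
        layers.Dense(50, activation="relu", input_shape=(145,)),
        layers.Dense(10, activation="relu"),
        layers.Dense(2, activation="tanh"),  # cos and sin both lie in [-1, 1]
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Placeholder training data: 1500 windows and encoder angles in radians.
X = np.random.randn(1500, 145).astype("float32")
angles = np.random.uniform(0.0, 2.0 * np.pi, size=1500)
Y = np.stack([np.cos(angles), np.sin(angles)], axis=1)

model = build_regression_net()
model.fit(X, Y, epochs=10, batch_size=32, verbose=0)

# An angle estimate is recovered from the (cos, sin) prediction with atan2.
cos_sin = model.predict(X[:1], verbose=0)[0]
theta = np.arctan2(cos_sin[1], cos_sin[0]) % (2.0 * np.pi)
```

Encoding the target as (cos, sin) avoids the discontinuity at 0°/360° that would otherwise make the squared error misleading for nearly identical postures.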
C. Conditional generative adversarial networks

A cGAN is composed of generator networks and discriminator networks. In our experiment, a conditional label is a number in the 0-9 range that classifies one round of the object's posture into 10 discrete classes. If there are too many classes, we believe that the small quantity of training data per class contributes to the instability of the cGAN's learning. If there are too few classes, we believe that images of various object poses would be included in a single class, so a conditional label could not be used to control the correct image of the object's pose.

$L_D$ is the loss function of the discriminator and $L_G$ is the loss function of the generator, described by (1) and (2), where $x$ represents real images, $y$ is a conditional label constructed from the estimated object poses, and $z$ is random noise:

$L_D = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z|y)))]$   (1)

$L_G = \mathbb{E}_{z \sim p_z}[\log D(G(z|y))]$   (2)

The discriminator maximizes $\log D(x|y)$, which means that it discriminates the real images from the generated images correctly, and it drives $D(G(z|y))$ towards zero, which means that it recognizes the generated images as fake. In contrast, the generator maximizes $\log D(G(z|y))$, which means that it makes the discriminator recognize the generated images as real images.
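A minimal sketch of this cGAN, following the MLP sizes in Fig. 2 and the losses (1)-(2), is given below. The label-embedding scheme (a multiplicative merge), the LeakyReLU/tanh activations, and the Adam optimizer are our assumptions; Fig. 2 only indicates that embedded labels enter both networks.

```python
# Minimal cGAN sketch following the MLP sizes in Fig. 2 and losses (1)-(2).
# Assumptions: multiplicative label-embedding merge, LeakyReLU/tanh
# activations, Adam with default settings; none are specified in the text.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES, NOISE_DIM, IMG_DIM = 10, 100, 128 * 128  # flattened 128x128 image

def build_generator():
    z = keras.Input(shape=(NOISE_DIM,))
    y = keras.Input(shape=(1,), dtype="int32")
    e = layers.Flatten()(layers.Embedding(NUM_CLASSES, NOISE_DIM)(y))
    h = layers.Multiply()([z, e])          # merge the label into the noise input
    for units in (256, 512, 1024):
        h = layers.Dense(units)(h)
        h = layers.LeakyReLU(0.2)(h)
    img = layers.Dense(IMG_DIM, activation="tanh")(h)
    return keras.Model([z, y], img)

def build_discriminator():
    x = keras.Input(shape=(IMG_DIM,))
    y = keras.Input(shape=(1,), dtype="int32")
    e = layers.Flatten()(layers.Embedding(NUM_CLASSES, IMG_DIM)(y))
    h = layers.Multiply()([x, e])          # merge the label into the image input
    for units in (512, 512, 512):
        h = layers.Dense(units)(h)
        h = layers.LeakyReLU(0.2)(h)
    out = layers.Dense(1, activation="sigmoid")(h)  # P(real | image, label)
    return keras.Model([x, y], out)

disc = build_discriminator()
disc.compile(optimizer="adam", loss="binary_crossentropy")

gen = build_generator()
disc.trainable = False                     # frozen only inside the combined model
z_in = keras.Input(shape=(NOISE_DIM,))
y_in = keras.Input(shape=(1,), dtype="int32")
combined = keras.Model([z_in, y_in], disc([gen([z_in, y_in]), y_in]))
combined.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_imgs, labels):
    n = len(real_imgs)
    z = np.random.normal(size=(n, NOISE_DIM))
    fake = gen.predict([z, labels], verbose=0)
    # Discriminator update, Eq. (1): real -> 1, generated -> 0.
    disc.train_on_batch([real_imgs, labels], np.ones((n, 1)))
    disc.train_on_batch([fake, labels], np.zeros((n, 1)))
    # Generator update, Eq. (2): drive D(G(z|y)) towards "real".
    combined.train_on_batch([z, labels], np.ones((n, 1)))
```

With this split, one call to `train_step` performs the discriminator update of (1) followed by the generator update of (2) on the same conditional labels.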
D. Evaluation of the generated images

To evaluate whether or not the generated images express the object's appearance correctly, we compare the image pixels of the generated images with those of the real images. The cGAN generates an image that corresponds to a conditional label, and we calculate the pixel loss between this image and the real images for each of the 10 classes. If the class with the smallest loss corresponds to the input label or to the neighbouring label on either side, the generated image expresses the object's appearance correctly.
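A sketch of this check is given below. The mean-squared pixel loss and the use of a single reference real image per class are our assumptions; the paper does not state the exact distance or how the real images of each class are aggregated.

```python
# Sketch of the evaluation in Sec. III-D. Assumptions: mean-squared pixel
# loss and one reference real image per posture class.
import numpy as np

def expresses_pose_correctly(generated, real_by_class, input_label):
    """real_by_class: list of 10 reference images, one per posture class."""
    losses = [np.mean((generated - ref) ** 2) for ref in real_by_class]
    best = int(np.argmin(losses))
    n = len(real_by_class)
    # Correct if the best-matching class is the input label or either
    # neighbouring label; the neighbours wrap around because the posture
    # classes are cyclic.
    return best in {(input_label - 1) % n, input_label, (input_label + 1) % n}
```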
E. Implementation

We implemented the regression nets and the cGAN using Keras [15], a neural network library for Python. The cGAN was trained on a DGX-1 system (NVIDIA), which contains eight Pascal P100 graphics processing units (GPUs).

IV. DATA COLLECTION

A. Hardware setup

Fig. 3 shows the experimental hardware setup. We used a robotic arm (LBR iiwa 14 R820, KUKA) that has seven degrees of freedom and a robotic hand (Shadow Dexterous Hand E Series, Shadow Robot Company) that has 24 degrees of freedom (Fig. 4). The joint angles of the robotic arm are all set at fixed positions. The robotic hand has five fingers that are equipped with touch sensors on the fingertips. The touch sensors are Pressure Sensor Tactiles (PSTs), which are single-region sensors.

Fig. 3: Experimental setup. The robotic hand, mounted on the fixed robot arm, touches the object at random. The stereo camera captures images of the object.

Fig. 4: Degrees of freedom of the robotic hand. The fingers have 22 degrees of freedom and the wrist has two degrees of freedom (joints/touch sensors per part: thumb 5/1, first finger 4/1, middle finger 4/1, ring finger 4/1, little finger 5/1, wrist 2/0). The robotic hand has five fingers that are equipped with touch sensors on the fingertips.

The test objects are set on a horizontal table and their positions are fixed. They rotate around a single pivot, and their angular positions are measured using a rotary encoder (MAS-14-262144N1, MicroTech Laboratory). A stereo camera (ZED, Stereolabs) is used to take photographs of the objects. The object images are grayscale images with a size of 128x128. We used three objects: a regular square prism, an elliptical cylinder, and a regular triangular prism (Fig. 5). The regular square prism achieves the same pose by rotating through 90°, the elliptical cylinder through 180°, and the regular triangular prism through 120°.

Fig. 5: Three objects used in the experiments. The left object is a regular square prism, the middle object is an elliptical cylinder, and the right object is a regular triangular prism.

B. Haptic exploration using the robotic hand

We controlled the robotic hand remotely using a glove (CyberGlove II, CyberGlove Systems), and the robotic hand touched the objects at random (Fig. 6). We collected five tactile values and the 24 joint angles of the robotic hand at a 10 Hz cycle and merged the tactile and somatosensory data into 29-dimensional hand data. The hand data generated when touching the objects at two or more points were extracted. To use the time-series information of the hand data, we merged the extracted hand data with several steps acquired before and after the extracted hand data. We collected 3000 extracted somatosensory data for each object.

Fig. 6: Haptic exploration with the robotic hand.
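A sketch of this merging step is given below, assuming a symmetric window around the touch frame (inferred from Sec. V-A, where five steps cover the 0.2 s periods before and after the touch at 10 Hz); it yields a 5 x 29 = 145-dimensional regression input.

```python
# Sketch of merging the extracted hand data with neighbouring time steps
# (Sec. IV-B). Assumption: a symmetric window around the touch frame that
# fits entirely inside the recorded stream.
import numpy as np

def window_hand_data(stream, touch_idx, steps=5):
    """stream: (T, 29) hand data (5 tactile values + 24 joint angles);
    touch_idx: frame where the hand touches the object at two or more points."""
    half = steps // 2
    window = stream[touch_idx - half : touch_idx + half + 1]
    return window.reshape(-1)  # (steps * 29,), e.g. (145,) for steps=5

stream = np.random.randn(100, 29)            # placeholder: 10 s at 10 Hz
x = window_hand_data(stream, touch_idx=50)   # 145-dim feature vector
```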
V. EXPERIMENTS

A. Pose estimation

We trained the regression nets on the somatosensory information and the object poses. The somatosensory information was split into two sets: we used 1500 data to train the regression nets and the other 1500 data to test the regression model and estimate the object poses. To determine how many steps of the hand data should be merged with the extracted hand data, we evaluated the minimum square error of the regression in time windows of five different sizes. Fig. 7 shows the minimum square error results for 1, 3, 5, 7, and 9 steps, taken before and after the extracted hand data, on the square prism. One step means the 29-dimensional hand data from touching the objects, while three steps means 87-dimensional hand data composed of the touching data and the hand data in the 0.1 s periods before and after touching occurred; the dimensionality continues to increase with the number of steps.

Fig. 7: Comparison of the transitions of the minimum square error when changing the span of the time-series somatosensory information, in the case of the square prism (curves for 1, 3, 5, 7, and 9 steps).

In the case of the shorter time-series hand data, the minimum square error did not decrease. In contrast, in the case of the longer time-series hand data, the weights overfit the training data, so the minimum square error increased as the number of learning epochs increased. In the case of five steps, which merges the hand data in the 0.2 s periods before and after the touching data were acquired, the minimum square error gradually decreased.

The postures estimated from the somatosensory information were classified into 10 classes: the square prism was classified every 9°, the elliptical cylinder every 18°, and the triangular prism every 12°. Table I shows the accuracy, calculated as the fraction of cases in which the predicted class corresponds to the correct class or to the neighbouring class on either side. In the elliptic cylinder case, the accuracy was 92.3%, the highest score. The accuracies for the other objects were also high, demonstrating that the object pose information can be extracted from five steps of time-series somatosensory information with regression analysis.

TABLE I: Accuracy of the estimated posture with regression analysis

    Shape               Accuracy
    Square prism        89.9%
    Elliptic cylinder   92.3%
    Triangular prism    88.7%
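A sketch of this conversion is given below: the (cos, sin) regression output is turned back into an angle with atan2 and discretized into 10 classes over one round of the object's rotational symmetry. Placing the bin boundaries at multiples of the class width is our assumption.

```python
# Sketch of converting an estimated posture into a conditional label
# (Sec. V-A): 10 classes over one symmetry period, i.e. every 9 deg for
# the square prism (90 deg), 18 deg for the elliptical cylinder (180 deg),
# and 12 deg for the triangular prism (120 deg). Bin boundaries at
# multiples of the class width are an assumption.
import numpy as np

def posture_to_label(cos_sin, symmetry_deg, num_classes=10):
    theta = np.degrees(np.arctan2(cos_sin[1], cos_sin[0])) % symmetry_deg
    return int(theta // (symmetry_deg / num_classes))

label = posture_to_label(np.array([0.97, 0.26]), symmetry_deg=90.0)  # square prism
```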
B. Image generation

We matched the object images to the somatosensory information using the estimated postures. We trained the cGAN on 1500 samples of object images and conditional labels classified based on the estimated postures. The learning time was 20000 epochs and the batch size was 32. Fig. 8 shows the image generation results for each object shape. The cGAN was able to generate visual images of the objects, and the results also show that the change in the class label corresponded visually to the change in object pose. Fig. 9 shows the loss transitions of the generator and the discriminator; the loss of the generator converged consistently for all object shapes.

Fig. 8: Ten generated images per object, corresponding to the conditional labels (classes 0-9) classified from the estimated postures. (a) square prism; (b) elliptical cylinder; (c) triangular prism.

Fig. 9: Loss transitions of the generator and the discriminator. (a) square prism; (b) elliptical cylinder; (c) triangular prism.

C. Evaluation

We evaluated the generated images quantitatively. We compared the accuracy of the results for images generated from the correct posture label with that of the results for images generated from the posture label estimated via regression analysis. The correct posture means the raw data measured by the rotary encoder. 1000 images generated by the trained model were used for evaluation per object. Fig. 10 shows the mean accuracy ...
