




Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?

Yasunori Ozaki, Tatsuya Ishihara, Narimune Matsumura, and Tadashi Nunobiki

(This work is supported by NTT Corporation. Yasunori Ozaki, Narimune Matsumura, and Tadashi Nunobiki are with the Service Evolution Lab, NTT Corporation, Yokosuka, Japan: ozaki.yasunori@hco.ntt.co.jp, narimune.matsumura.ae@hco.ntt.co.jp, tadashi.nunobiki.ew@hco.ntt.co.jp. Tatsuya Ishihara is with the R…)

Abstract: The aim of our study is to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. Social robots now function in a number of customer service roles, such as receptionists, guides, and exhibitors. However, sudden greetings from a robot can startle passersby. Therefore, we developed a method that allows social robots to adapt their mannerisms situationally, based on the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby without causing them discomfort (p < 0.01). Our field experiment in an office entrance demonstrated that our method meets this requirement.

I. INTRODUCTION

The working population in many developed countries is decreasing proportionately to the total population due to population aging, and this problem is expected to affect developing countries as well [1]. One approach to addressing this problem is to use social robots to provide customer service instead of people, and such robots are starting to be used as receptionists, guides, and exhibitors. While robots can increase the possibility of providing a service by simply greeting passersby [2], passersby can suffer discomfort when suddenly greeted by a robot [3]. The dilemma, illustrated in Figure 1, is therefore whether to have the robot behave in a manner that benefits the owner or in a manner that does not disturb passersby. We developed the user-centered reinforcement learning method, which allows robots to greet passersby without causing them discomfort. We define the hypothesis "a robot that has our proposed method can greet passersby and get their attention without causing them to suffer discomfort" as the theoretical hypothesis. In the next section, we define the problem and describe how we approached solving it by studying related work. In the Proposed Method chapter, we explain our method of solving the problem. In the Experiment chapter, we explain our field experiment to test two working hypotheses created from the original hypothesis; the results show that our method can solve the problem. In the Discussion chapter, we examine the results from the standpoints of physiology, psychology, and user experience. In the Conclusion chapter, we conclude that user-centered Q-learning can increase the chance of providing a service to a passerby without causing discomfort.

A. Related Work

[…] Instead, we tackled the problem that precedes it, which is illustrated in Figure 2. In terms of the solution, the problem we addressed is similar to machine learning, namely reinforcement learning. Reinforcement learning in robotics is a technique used to find a policy π: O → A [9] and is mostly used for robotic control tasks rather than interaction tasks. It has been applied to the learning of several complex aerobatic control tasks for radio-controlled helicopters [10] and to the learning of door-opening tasks for robot arms [11]; research on interaction tasks is less prominent. Mitsunaga et al. showed that a social robot can adapt its behavior to humans for human-robot interaction by using reinforcement learning [12], provided that human-robot engagement has already been established. Papaioannou et al. used reinforcement learning to extend the engagement time and enhance the dialogue quality [13]. The applicability of these methods to the situation before human-robot engagement is established is unclear; as shown in Figure 2, the problem we addressed occurs before engagement is established. In terms of the goal, the problem we addressed is similar to increasing the number of human-robot engagements. Macharet et al. showed in simulation that Gaussian process regression based reinforcement learning can be used to increase the number of engagements [14]. Going further, we focused on increasing the number of engagements in a field environment.
B. Problem Statement

We use a problem framework commonly used for reinforcement learning in robotics, the partially observable Markov decision process (POMDP), to define the problem [9]. The robot is the agent, and the environment is the problem; the robot can observe the environment only partially through its sensors. We choose an exhibition service area in a company entrance as the environment. The entrance consists of one automated exhibition system, one aisle, and other space, and we express the entrance as the Euclidean space R^3. Passersby can move freely around the exhibition system. The automated exhibition system consists of a tablet, a computer, a robot, and a sensor system. The sensor system senses color image data I_t and depth image data D_t, which together we call the observation O_t, and it can extract a passerby's partial action from O_t. The passerby's action consists of the passerby's position p_t = (x_t, y_t, z_t) and head angle θ_t = (yaw_t, roll_t, pitch_t). We define t = 0 as the time when the passerby enters the entrance and t = T_end as the time when the passerby leaves it, and we call the interval between t = 0 and t = T_end an episode. Let P = (p_0, …, p_T_end) be the passerby's positions in an episode, and let Θ = (θ_0, …, θ_T_end) be the passerby's head angles in the episode. Our method derives the robot's action from the passerby's actions. Let N_u be the number of people who used the service, and let N_d be the number of people who felt discomfort. We can then state the problem as: find a robot policy π: O → A such that N_u is maximized and N_d is minimized.
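To make this formulation concrete, the following Python sketch shows one possible way to represent the per-frame observation of a passerby and an episode, together with the two quantities the policy must trade off. The class and field names (PasserbyObservation, Episode, count_objectives) are illustrative choices, not identifiers from the authors' system.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PasserbyObservation:
    """One frame of the partially observed passerby action (illustrative structure)."""
    t: float                                 # seconds since the passerby entered (t = 0)
    position: Tuple[float, float, float]     # p_t = (x_t, y_t, z_t) in the entrance space R^3
    head_angle: Tuple[float, float, float]   # theta_t = (yaw_t, roll_t, pitch_t)

@dataclass
class Episode:
    """All observations from t = 0 until the passerby leaves at t = T_end."""
    frames: List[PasserbyObservation] = field(default_factory=list)
    used_service: bool = False      # counted in N_u when True
    felt_discomfort: bool = False   # counted in N_d when True

def count_objectives(episodes: List[Episode]) -> Tuple[int, int]:
    """Return (N_u, N_d); the robot's policy should maximize N_u while minimizing N_d."""
    n_u = sum(e.used_service for e in episodes)
    n_d = sum(e.felt_discomfort for e in episodes)
    return n_u, n_d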
C. Our Approach

We solve this problem by controlling the robot with reinforcement learning, ordinarily Q-learning, except for the design of the reward function. We created the reward function by focusing on the user experience of the stakeholders, and we call reinforcement learning that includes this reward function user-centered reinforcement learning. We do not use deep reinforcement learning because it is difficult to collect the huge amount of data needed for such learning.

D. Contributions

The contributions of this work are as follows:
1) We show that robots can learn actions from a passerby's responses related to the passerby's user experience.
2) We present a method of increasing the number of human-robot engagements in the field without causing discomfort.

II. PROPOSED METHOD

We based our user-centered reinforcement learning method on reinforcement learning. In this paper, we use Q-learning as the base algorithm because it makes it easy to explain why the robot chose its past actions. We call this algorithm user-centered Q-learning (UCQL). UCQL differs from the original Q-learning [15] in its action set A, state set S, Q-function Q(s, a) (s ∈ S, a ∈ A), and reward function r(s_{t+1}, a_t, s_t) (s_{t+1}, s_t ∈ S, a_t ∈ A). UCQL consists of three functions: 1) selecting an action based on a policy, 2) updating the policy based on user actions, and 3) designing a reward function and a Q-function as the initial condition.

1) Selecting an action based on a policy: Generally speaking, the robot senses an observation and chooses an action, including waiting. Let t_a [sec] be the time when the robot acts, and let t_c [sec] be the time when the robot computes the algorithm. Let s_t ∈ S be the predicted user state at time t, and let a_t ∈ A be the robot's action at time t. In UCQL, the robot chooses the action with Algorithm 1.

Algorithm 1: Select an action by UCQL (Action Selector)
  Input: t_c, s_{t_c}, Q(s, a), π, A
  Output: a_{t_a}, t_a
  a_{t_a} ← π(s_{t_c}, A, Q)
  t_a ← t_c
  return a_{t_a}, t_a

2) Updating the policy based on user actions: In UCQL, the robot updates the policy using Algorithm 2.

Algorithm 2: Update the policy by UCQL (Policy Updater)
  Input: s_{t_a}, a_{t_a}, s_{t_c}, A, Q(s, a)
  Output: Q(s, a)
  if a_{t_a} is finished then
    R ← r(s_{t_c}, a_{t_a}, s_{t_a})
    q ← Q(s_{t_a}, a_{t_a})
    Q(s_{t_a}, a_{t_a}) ← (1 - α) q + α (R + γ max_a Q(s_{t_c}, a))
  end if
  return Q(s, a)

3) Designing a reward function: In UCQL, the robot is given the reward function in Algorithm 3. Algorithm 3 divides motivation into extrinsic and intrinsic components, inspired by intrinsically motivated reinforcement learning [16]. We call the proposed method "user centered" because we design the extrinsic motivation from the user's states related to user experience. V_a(a_t) is a value that the robot obtains by taking an action a_t. V_d(s_{t_c}, s_{t_a}) is a value expressing the discomfort caused to a passerby when the user's state changes from s_{t_a} to s_{t_c}. V_g(s_{t_c}, s_{t_a}) is a value that the robot obtains when the user's state changes from s_{t_a} to s_{t_c}; its value increases in proportion to the distance to the goal.

Algorithm 3: Reward function of UCQL (r)
  Input: s_{t_a}, s_{t_c}, a_{t_a}
  Output: r
  r ← 0
  if a_{t_a} is not "wait" then r ← r + V_a(a_{t_a}) end if
  if s_{t_c} is more discomforting for the user than s_{t_a} then r ← r - V_d(s_{t_c}, s_{t_a}) end if
  if s_{t_c} is better than s_{t_a} for achieving the goal then r ← r + V_g(s_{t_c}, s_{t_a}) end if
  return r

4) Miscellaneous: Optional policies such as ε-greedy or greedy can be chosen. The Q-function can be initialized with a uniform distribution; however, if the Q-function is designed to suit the task, learning is faster than with a uniform initialization. The Q-function may also be approximated with a function approximator such as a Deep Q-Network [17]; however, learning is then much slower than with the designed function.
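The following Python sketch summarizes how these three UCQL components could fit together, assuming a dictionary-backed Q-table, the learning parameters α = 0.5 and γ = 0.999 reported later in the experiment, and externally supplied value functions V_a, V_d, and V_g. The sign conventions in the reward (adding V_a and V_g, subtracting V_d) are our reading of the descriptions above, and all identifiers are illustrative rather than the authors' code.

from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
QTable = Dict[Tuple[State, Action], float]

def select_action(s_tc: State, actions: List[Action], q: QTable,
                  policy: Callable[[State, List[Action], QTable], Action]) -> Action:
    """Algorithm 1 (sketch): delegate the choice to the current policy pi, e.g. soft-max selection."""
    return policy(s_tc, actions, q)

def update_q(q: QTable, s_ta: State, a_ta: Action, s_tc: State, actions: List[Action],
             reward: float, alpha: float = 0.5, gamma: float = 0.999) -> None:
    """Algorithm 2 (sketch): standard Q-learning update applied once action a_ta has finished."""
    best_next = max(q.get((s_tc, a), 0.0) for a in actions)
    old = q.get((s_ta, a_ta), 0.0)
    q[(s_ta, a_ta)] = (1.0 - alpha) * old + alpha * (reward + gamma * best_next)

def ucql_reward(s_ta: State, s_tc: State, a_ta: Action, wait_action: Action,
                v_a: Callable[[Action], float],
                v_d: Callable[[State, State], float],
                v_g: Callable[[State, State], float],
                is_discomforting: Callable[[State, State], bool],
                is_progress: Callable[[State, State], bool]) -> float:
    """Algorithm 3 (sketch): extrinsic reward built from the user-experience terms V_a, V_d, V_g."""
    r = 0.0
    if a_ta != wait_action:
        r += v_a(a_ta)               # value the robot obtains for acting at all
    if is_discomforting(s_ta, s_tc):
        r -= v_d(s_tc, s_ta)         # penalty when the state change discomforts the passerby
    if is_progress(s_ta, s_tc):
        r += v_g(s_tc, s_ta)         # value when the state change moves toward the goal
    return r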
III. EXPERIMENT

In this chapter, we aim to prove the hypothesis that user-centered Q-learning can allow a robot to increase the chance of providing a service to a passerby without causing discomfort.

A. Concrete Goal

First, we convert the hypothesis into a working hypothesis by operationalization, because we cannot evaluate the theoretical hypothesis quantitatively. In the Introduction, we defined this problem as "find a robot policy π: O → A such that N_u is maximized and N_d is minimized"; here we give concrete shape to N_u and N_d. Two important observations in Ozaki's study [3] are, first, that a passerby does not suffer a negative effect from a robot's call if the passerby uses the robot's service and, second, that a passerby does suffer a negative effect from a robot's call if the passerby does not use the service. Thus, this is a binary classification problem of whether passersby called by the robot use the robot's service or not, and we can define a confusion matrix to evaluate the method. We define the class in which a passerby actually used the service as Positive and the class in which a passerby did not use the service as Negative. We infer that N_u and TP + TN have a positive correlation, that N_d and FP have a positive correlation, and, conversely, that N_d and TN have a negative correlation. Therefore, we can use Accuracy = (TP + TN) / (TP + FP + TN + FN) as the evaluation index, because maximizing Accuracy is another representation of maximizing N_u and minimizing N_d. Based on the above, we define the working hypothesis (WH) as "the accuracy after learning by UCQL is better than the accuracy before learning by UCQL". In this experiment, we test WH to show that the hypothesis is sound.

B. Method

In this section, we explain how we experiment in a field environment. The method for this experiment can be divided into five steps:
1) Create the experimental equipment.
2) Construct the experimental environment.
3) Define the experimental procedure.
4) Evaluate the working hypotheses by statistical hypothesis testing.
5) Visualize the effect of UCQL.

1) Create the experimental equipment: First, we create equipment that includes UCQL. The equipment can be explained in terms of its physical structure and its logical structure. Figure 3 is a diagram of the physical structure: the experimental equipment consists of a table, a sensor, a robot, a tablet (iPad Air 2), a router, a display, and servers, connected with Ethernet cables or a wireless LAN. We use Sota (https://sota.vstone.co.jp/home/), a palm-sized social humanoid robot. Sota has a speaker to output voice, an LED to represent lip motions, a SoC to control its elements, etc.; in this experiment, Sota's elements are used to interact with a participant. The iPad Air 2 is used to start and display a movie. The Intel RealSense Depth Camera D435 is used as an RGB-D sensor to measure the actions of the passersby.

Fig. 3: The physical structure of the experimental equipment (solid lines: wired; dashed lines: wireless).

Figure 4 is a diagram of the logical structure of the equipment. The structure consists of a sensor, motion capture, a state estimator, an action selector, an action decoder, an effector, and a policy updater. We utilize Nuitrack for motion capture, and we utilize ROS (http://wiki.ros.org) as the infrastructure of the equipment to communicate variables among functions.

Fig. 4: The logical structure of the experimental equipment.

As Figures 3 and 4 show, the equipment works using Algorithm 4. E_t in Algorithm 4 denotes a physical quantity created by the effectors, such as the amplitude of an acoustic wave. We utilize Table I as the action set A and Table II as the state set. The state set in Table II is a double Markov model created from the state set of Ozaki's decision-making predictor [3], which estimates a passerby's state as one of seven states: Not Found (s0), Passing By (s1), Look At (s2), Hesitating (s3), Approaching (s4), Established (s5), and Leaving (s6). In addition, we utilize α = 0.5 and γ = 0.999 as learning parameters. We utilize soft-max selection as the policy because we want the robot both to take actions that have a high value and to find actions that have a higher value; soft-max selection is often used with Q-learning. Equation (3) gives the probability of selecting each action under this policy, and we utilize Equation (2) as the policy parameter. T_n(s) denotes the temperature after it has been updated n times on state s; it depends on the state because s00 occurs many times. We utilize k_T = 0.98 and T_min = 0.01 as learning parameters.

\[ T_0(s) = 1 \tag{1} \]
\[ T_{n+1}(s) = \begin{cases} T_n(s) & \text{if } T_n(s) \le T_{\min} \\ k_T \, T_n(s) & \text{otherwise} \end{cases} \tag{2} \]
\[ p(s, a) = \frac{\exp\left(Q(s, a) / T_n(s)\right)}{\sum_{a_i \in A} \exp\left(Q(s, a_i) / T_n(s)\right)} \tag{3} \]
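A minimal Python sketch of this soft-max selection with the per-state temperature schedule follows. The constants k_T = 0.98 and T_min = 0.01 are taken from the text; the NumPy-based implementation and the use of an update counter per state are illustrative assumptions.

import numpy as np

K_T, T_MIN = 0.98, 0.01  # decay rate and temperature floor from the experiment

def temperature(n_updates: int) -> float:
    """Eqs. (1)-(2): T starts at 1, is multiplied by k_T per update of the state, never below T_min."""
    return max(T_MIN, K_T ** n_updates)

def softmax_probs(q_values: np.ndarray, temp: float) -> np.ndarray:
    """Eq. (3): p(s, a) = exp(Q(s, a) / T_n(s)) / sum_i exp(Q(s, a_i) / T_n(s))."""
    z = q_values / temp
    z = z - z.max()              # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_select(q_values: np.ndarray, n_updates: int,
                   rng: np.random.Generator = np.random.default_rng()) -> int:
    """Sample an action index for the current state according to the soft-max policy."""
    probs = softmax_probs(q_values, temperature(n_updates))
    return int(rng.choice(len(q_values), p=probs))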
Algorithm 4: Select an action using the experimental system that includes UCQL
  Input: t, O_t
  Output: E_t
  if the system is NOT initialized then
    Q ← Q_0; Θ ← an empty list; P ← an empty list
  end if
  (I_t, D_t) ← sense(O_t)
  (θ_t, p_t) ← extract(I_t, D_t)
  Push θ_t into Θ; push p_t into P
  s_t ← estimate(Θ, P)
  a_t ← selectAction(t, s_t, Q)
  Push (t, a_t, s_t) into X
  E_t ← decode(a_t, θ_t, p_t)
  return E_t

TABLE I: Action set used in this experiment
  a0: Robot waits 5 seconds until someone comes.
  a1: Robot blinks its eyes.
  a2: Robot looks at a passerby.
  a3: Robot expresses joy through its motion.
  a4: Robot greets the passerby.
  a5: Robot says "I'm sorry" in Japanese.
  a6: Robot says "Excuse me" in Japanese.
  a7: Robot says "It's rainy today" in Japanese.
  a8: Robot tells the passerby how to start the service.
  a9: Robot says goodbye.

TABLE II: State set used in this experiment
  s00: Passerby's state changes from Not Found to Not Found.
  s10: Passerby's state changes from Not Found to Passing By.
  …
  s56: Passerby's state changes from Leaving to Established.
  s66: Passerby's state changes from Leaving to Leaving.

2) Construct the experimental environment: We first define how to construct the environment for the experiment. Figure 5 is an overhead view of the environment, which consists of an exhibition space, a wall, a seating area, and a path to a restroom in an actual company building. There are hundreds of employees in the building, and dozens of visitors come to it. Visitors often sit in the seating area for tens of minutes while waiting to meet employees. Some visitors and employees look at the exhibition space to find out about newer technologies of the company, and visitors sometimes go to the restroom while waiting.

Fig. 5: An overhead view of the experimental environment.

Algorithm 5: Create the initial Q-function for the experiment (Q_B)
  Input: void
  Output: Q(s, a)
  q_C ← 1; q_H ← 5
  Q(s, a) ← a zero 2D array over S × A
  for j = 1 to 5 do Q(s_0j, a_1) ← q_C end for
  for i = 1 to 5 do
    for j = 0 to 4 do Q(s_ij, a_4) ← q_C end for
  end for
  for i = 1 to 4 do Q(s_i5, a_9) ← q_H end for
  Q(s_56, a_8) ← q_H
  Q(s_50, a_8) ← q_H
  return Q(s, a)

3) Define the experimental procedure: We suppose two main scenarios. In the first scenario, 1) a visitor sits in the seating area and 2) the visitor gets up to go to the restroom; the visitor thus moves past the exhibition space on the way from the seating area to the restroom. In the second scenario, 1) a visitor sits in the seating area, 2) the visitor gets up because they are bored, and 3) the visitor moves from the seating area to the exhibition space to look at the robot and the equipment. We mainly want to attract the passersby in the second scenario; we do not want to attract the passersby in the first scenario because they want to go to the restroom. According to these assumptions, we define the procedure for the robot's learning. First, the robot's Q-function is initialized by Algorithm 5; we call this Q-function Q_B(s, a). The robot initialized by Algorithm 5 greets a passerby who comes within a certain distance. Then, because we want the robot to learn the rules of the environment, we let the robot learn by UCQL for several days. As a result, we obtain a learned Q-function Q_A(s, a). After the learning, we let the robot attract passersby under two conditions, Before Learning and After Learning, because we want to test the hypotheses. The robot does not learn during the test.

Fig. 6: An example Q-function represented as a heat map. The columns are the states of the agent and the rows are the action symbols of the agent. For example, Q(s01, a1) is 0, meaning that the robot calling a passerby near it obtains no action value.
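As an illustration of this designed initialization, the following Python sketch builds the hand-crafted table Q_B of Algorithm 5 as a nested dictionary over the transition states s_ij of Table II and the actions a0-a9 of Table I. The dictionary representation and the function name are assumptions made for readability, not the authors' implementation.

def build_initial_q(q_c: float = 1.0, q_h: float = 5.0) -> dict:
    """Algorithm 5 (sketch): hand-designed initial Q-function Q_B."""
    states = [f"s{i}{j}" for i in range(7) for j in range(7)]   # transition states s_ij
    actions = [f"a{k}" for k in range(10)]                      # actions a0..a9 from Table I
    q = {s: {a: 0.0 for a in actions} for s in states}

    for j in range(1, 6):            # Q(s_0j, a1) = q_C: blinking (a1)
        q[f"s0{j}"]["a1"] = q_c
    for i in range(1, 6):            # Q(s_ij, a4) = q_C for i in 1..5, j in 0..4: greeting (a4)
        for j in range(0, 5):
            q[f"s{i}{j}"]["a4"] = q_c
    for i in range(1, 5):            # Q(s_i5, a9) = q_H: saying goodbye (a9)
        q[f"s{i}5"]["a9"] = q_h
    q["s56"]["a8"] = q_h             # Q(s_56, a8) = q_H: explaining how to start the service (a8)
    q["s50"]["a8"] = q_h             # Q(s_50, a8) = q_H
    return q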
because we want to test the hypotheses The robot do not learn during the test We start collect data for the evaluation by rosbag4 Each data is recorded by rosbag We can recode all of values in ROS by rosbag during the procedure 4 Evaluate by sta