




Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?*

Yasunori Ozaki (1), Tatsuya Ishihara (2), Narimune Matsumura (1) and Tadashi Nunobiki (1)

* This work is supported by NTT Corporation.
(1) Yasunori Ozaki, Narimune Matsumura and Tadashi Nunobiki are with Service Evolution Lab., NTT Corporation, Yokosuka, Japan (ozaki.yasunori, narimune.matsumura.ae@hco.ntt.co.jp and tadashi.nunobiki.ew@hco.ntt.co.jp).
(2) Tatsuya Ishihara is with the R[...]

Abstract: The aim of our study is to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. Social robots now function in a number of customer service roles, such as receptionists, guides, and exhibitors. However, sudden greetings from a robot can startle passersby. Therefore, we developed a method, based on the results of related work, that allows social robots to adapt their mannerisms to the situation. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby without causing them discomfort (p < 0.01). Our field experiment in an office entrance demonstrated that our method meets this requirement.

I. INTRODUCTION

The working population in many developed countries is decreasing relative to the total population due to population aging, and this problem is expected to affect developing countries as well [1]. One approach to addressing this problem is to use social robots to provide customer service instead of people. Such robots are starting to be used as receptionists, guides, and exhibitors. While a robot can increase the chance of providing a service simply by greeting passersby [2], passersby can suffer discomfort when suddenly greeted by a robot [3]. The dilemma is therefore whether the robot should behave in a manner that benefits its owner or in a manner that does not disturb passersby. Figure 1 illustrates this dilemma.

Therefore, we developed the user-centered reinforcement learning method, which allows robots to greet passersby without causing them discomfort. We define the hypothesis "a robot equipped with our proposed method can greet passersby and get their attention without causing them to suffer discomfort" as the theoretical hypothesis.

In the next section, we define the problem and describe how we approached solving it by studying related work. In the Proposed Method chapter, we explain our method for solving the problem. In the Experiment chapter, we explain our field experiment to test two working hypotheses created from the original hypothesis. The results show that our method can solve the problem. In the Discussion chapter, we examine the results from the standpoints of physiology, psychology, and user experience. In the Conclusion chapter, we conclude that user-centered Q-learning can increase a [...]

A. Related Work

[...] Instead, we tackled the problem that precedes it, which is illustrated in Figure 2.

In terms of the solution, the problem we addressed is similar to machine learning, namely reinforcement learning. Reinforcement learning in robotics is a technique used to find a policy π : O → A [9] and is used mainly for robotic control tasks; it is not used much for interaction tasks. Reinforcement learning has been applied to the learning of several complex aerobatic control tasks for radio-controlled helicopters [10] and to the learning of door-opening tasks for robot arms [11].
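As a point of reference for the notation π : O → A used here and throughout the rest of the paper, the following is a minimal sketch (our own illustration; the type names are assumptions, not the paper's) of a policy as a mapping from observations to actions:

# Minimal illustration of a policy pi : O -> A (names here are ours, not the paper's).
from typing import Callable, Dict, Tuple

Observation = Dict[str, Tuple[float, float, float]]  # e.g. {"position": ..., "head_angle": ...}
Action = str                                          # e.g. "wait", "greet"
Policy = Callable[[Observation], Action]

def always_wait(observation: Observation) -> Action:
    """A trivial baseline policy: ignore the observation and wait."""
    return "wait"

pi: Policy = always_wait
print(pi({"position": (0.0, 0.0, 1.5), "head_angle": (0.0, 0.0, 0.0)}))  # -> "wait"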
Research on interaction tasks is less prominent. Mitsunaga et al. showed that a social robot can adapt its behavior to humans in human-robot interaction by using reinforcement learning [12], provided that human-robot engagement has already been established. Papaioannou et al. used reinforcement learning to extend the engagement time and enhance the dialogue quality [13]. The applicability of these methods to the situation before human-robot engagement is established is unclear. As shown in Figure 2, the problem we addressed occurs before engagement is established.

In terms of the goal, the problem we addressed is similar to increasing the number of human-robot engagements. Macharet et al. used Gaussian process regression based reinforcement learning in simulation to increase the number of engagements [14]. Going further, we focused on increasing the number of engagements in a field environment.

B. Problem Statement

We use a problem framework commonly used for reinforcement learning in robotics, the partially observable Markov decision process (POMDP), to define the problem [9]. The robot is the agent, and the environment is the problem. The robot can observe the environment only partially by using sensors.

We chose an exhibition service area in a company entrance as the environment. The entrance consists of one automated exhibition system, one aisle, and other space. In addition, we express the entrance as the Euclidean space R^3. Passersby can move freely around the exhibition system. The automated exhibition system consists of a tablet, a computer, a robot, and a sensor system. The sensor system can sense color image data I_t and depth image data D_t; we call these data the observation O_t. The sensor system can also extract a passerby's partial action from O_t. The passerby's action consists of the passerby's position p_t = (x_t, y_t, z_t) and head angle θ_t = (yaw_t, roll_t, pitch_t). We define the times when the passerby enters the entrance (t = 0) and when the passerby leaves the entrance (t = T_end), and we call the interval between t = 0 and t = T_end an episode. Let P = (p_0, ..., p_Tend) be the passerby's positions in an episode, and let Θ = (θ_0, ..., θ_Tend) be the passerby's head angles in the episode. Our method derives the robot's action from the passerby's actions (a minimal data sketch is given at the end of this section).

Let N_u be the number of people that used the service, and let N_d be the number of people that felt discomfort. Then we can state the problem as "Find a robot's policy π : O → A that achieves max(N_u) and min(N_d)."

C. Our Approach

We solve this problem by controlling the robot with reinforcement learning, specifically ordinary Q-learning, except for the design of the reward function. We created the reward function by focusing on the user experience of the stakeholders. We call reinforcement learning that includes this reward function "user-centered reinforcement learning." We do not use deep reinforcement learning because it is difficult to collect the huge amount of data needed for learning.

D. Contributions

The contributions of this work are as follows:
1) We show that robots can learn actions from a passerby's responses related to the passerby's user experience.
2) We present a method of increasing the number of human-robot engagements in the field without causing discomfort.
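To make the quantities in the problem statement concrete, here is the minimal data sketch referred to above (our own Python rendering; the class and field names are assumptions, not the paper's) of a passerby's per-frame action p_t, θ_t and of an episode running from t = 0 to t = T_end:

# Sketch (assumed names) of the per-frame passerby data and an episode, following
# the problem statement: p_t = (x_t, y_t, z_t), theta_t = (yaw_t, roll_t, pitch_t).
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class PasserbyAction:
    position: Vec3    # p_t, a point in the entrance space R^3
    head_angle: Vec3  # theta_t = (yaw_t, roll_t, pitch_t)

@dataclass
class Episode:
    """One passerby's trajectory from entering (t = 0) to leaving (t = T_end)."""
    actions: List[PasserbyAction] = field(default_factory=list)

    @property
    def P(self) -> List[Vec3]:      # positions p_0 .. p_Tend
        return [a.position for a in self.actions]

    @property
    def Theta(self) -> List[Vec3]:  # head angles theta_0 .. theta_Tend
        return [a.head_angle for a in self.actions]

# The objective is then to find a policy pi : O -> A that maximizes the number of
# episodes ending in service use (N_u) and minimizes those causing discomfort (N_d).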
II. PROPOSED METHOD

We based our user-centered reinforcement learning method on reinforcement learning. In this paper, we use Q-learning as the base algorithm because it makes it easy to explain why the robot chose its past actions. We call this algorithm user-centered Q-learning (UCQL). UCQL differs from the original Q-learning [15] in its action set A, state set S, Q-function Q(s,a) (s ∈ S, a ∈ A), and reward function r(s_{t+1}, a_t, s_t) (s_{t+1}, s_t ∈ S, a_t ∈ A). UCQL consists of three functions:
1) selecting an action based on a policy,
2) updating the policy based on user actions, and
3) designing a reward function and a Q-function as the initial condition.

1) Selecting an action based on a policy: Generally speaking, a robot senses an observation and chooses an action, including waiting. Let t_a be the time (in seconds) when the robot acts, and let t_c be the time when the robot runs the algorithm. Let s_t ∈ S be the predicted user state at time t, and let a_t ∈ A be the robot's action at time t. In UCQL, the robot chooses the action with Algorithm 1.

Algorithm 1: Select an action by UCQL (Action Selector)
  Input: t_c, s_tc, Q(s,a), π(s, A, Q)
  Output: a_t, t_a
  a_t ← π(s_tc, A, Q)
  t_a ← t_c
  return a_t, t_a

2) Updating the policy based on user actions: In UCQL, the robot updates the policy using Algorithm 2.

Algorithm 2: Update the policy by UCQL (Policy Updater)
  Input: s_ta, a_ta, s_tc, A, Q(s,a)
  Output: Q(s,a)
  if a_ta is finished then
    R ← r(s_tc, a_ta, s_ta)
    q ← Q(s_ta, a_ta)
    Q(s_ta, a_ta) ← (1 − α) q + α (R + γ max_a Q(s_tc, a))
  end if
  return Q(s,a)

3) Designing a reward function: In UCQL, the robot is given the reward function of Algorithm 3. Algorithm 3 divides motivation into extrinsic and intrinsic motivation, inspired by "Intrinsically Motivated Reinforcement Learning" [16]. We call the proposed method "user-centered" because we design the extrinsic motivation from user states related to user experience. V_a(a_t) is the value that the robot can obtain by taking an action a_t. V_d(s_tc, s_ta) is the value of the discomfort a passerby suffers when the user's state changes from s_ta to s_tc. V_g(s_tc, s_ta) is the value that the robot can obtain when the user's state changes from s_ta to s_tc; its value increases in proportion to the distance to the goal.

Algorithm 3: Reward function by UCQL (r)
  Input: s_ta, s_tc, a_tc
  Output: r
  r ← 0
  if a_tc is not wait then
    r ← r + V_a(a_tc)
  end if
  if s_tc is more uncomfortable for the user than s_ta then
    r ← r − V_d(s_tc, s_ta)
  end if
  if s_tc is better than s_ta for achieving the goal then
    r ← r + V_g(s_tc, s_ta)
  end if
  return r

4) Miscellaneous: Optional policies such as greedy or ε-greedy can be chosen. The Q-function can be initialized with a uniform distribution; however, if the Q-function is designed to be suitable for the task, learning is faster than with a uniform initialization. The Q-function may also be approximated with a function approximator such as a Deep Q-Network [17]; however, learning is then much slower than with a designed Q-function.
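To summarize the method, the following is a minimal sketch of how the reward shaping of Algorithm 3 and the update of Algorithm 2 could look in code. The dictionary-based Q-table, the "wait" sentinel, and the callables Va, Vd, Vg, is_discomfort, and is_progress are our own assumptions for illustration; the learning parameters match the α = 0.5 and γ = 0.999 used later in the experiment.

# Sketch of UCQL's reward shaping (Algorithm 3) and Q update (Algorithm 2).
# Q maps (state, action) pairs to values; Va, Vd, Vg and the two predicates are
# assumed illustrative functions, not the paper's concrete choices.
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.999          # learning rate and discount used in the experiment
Q = defaultdict(float)             # Q(s, a), zero unless initialized otherwise

def reward(s_ta, s_tc, a_tc, Va, Vd, Vg, is_discomfort, is_progress):
    """Algorithm 3: extrinsic reward built from user-experience terms."""
    r = 0.0
    if a_tc != "wait":                 # acting at all is worth Va(a_tc)
        r += Va(a_tc)
    if is_discomfort(s_ta, s_tc):      # the state change made the passerby uncomfortable
        r -= Vd(s_tc, s_ta)
    if is_progress(s_ta, s_tc):        # the state change moved closer to the goal
        r += Vg(s_tc, s_ta)
    return r

def update_policy(s_ta, a_ta, s_tc, actions, r_value):
    """Algorithm 2: standard Q-learning update once action a_ta has finished."""
    q = Q[(s_ta, a_ta)]
    best_next = max(Q[(s_tc, a)] for a in actions)
    Q[(s_ta, a_ta)] = (1 - ALPHA) * q + ALPHA * (r_value + GAMMA * best_next)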
III. EXPERIMENT

In this chapter, we aim to prove the hypothesis that user-centered Q-learning can allow a robot to increase the chance of providing a service to a passerby without causing discomfort.

A. Concrete Goal

First, we convert the hypothesis into a working hypothesis by operationalization, because the hypothesis cannot be evaluated quantitatively as stated. In the Introduction, we defined the problem as finding a robot's policy π : O → A that achieves max(N_u) and min(N_d); we give concrete form to N_u and N_d in this experiment. Ozaki's study [3] makes two important observations: first, a passerby does not suffer a negative effect from a robot's call if the passerby uses the robot's service; second, a passerby does suffer a negative effect from a robot's call if the passerby does not use the service. Thus, this is a binary classification problem of whether passersby called by the robot use the robot's service or not.

We can define a confusion matrix to evaluate the method. We define the class in which a passerby actually used the service as "Positive" and the class in which a passerby did not use the service as "Negative". We infer that N_u is positively correlated with TP and TN, that N_d is positively correlated with FP, and, conversely, that N_d is negatively correlated with TN. Therefore, we can use Accuracy = (TP + TN) / (TP + FP + TN + FN) as an evaluation index, because max(Accuracy) is another representation of max(N_u) and min(N_d). Based on the above, we define the working hypothesis WH as "the accuracy after learning by UCQL is better than the accuracy before learning by UCQL." In this experiment, we test WH to show that the hypothesis is sound.

B. Method

In this section, we explain how we ran the experiment in a field environment. The method consists of five steps:
1) Create the experimental equipment.
2) Construct the experimental environment.
3) Define the experimental procedure.
4) Evaluate the working hypothesis by statistical hypothesis testing.
5) Visualize the effect of UCQL.

1) Create the experimental equipment: First, we create equipment that includes UCQL. The equipment can be explained in terms of its physical structure and its logical structure. Figure 3 is a diagram of the physical structure. The experimental equipment consists of a table, a sensor, a robot, a tablet (iPad Air 2), a router, a display, and servers; the components are connected with Ethernet cables or a wireless LAN. We use Sota (https://sota.vstone.co.jp/home/), a palm-sized social humanoid robot. Sota has a speaker to output voice, LEDs to represent lip motions, an SoC to control its elements, and so on. In this experiment, Sota's elements are used to interact with a participant. The iPad Air 2 is used to start and display a movie. The Intel RealSense Depth Camera D435 is used as an RGB-D sensor to measure the actions of the passersby.

Fig. 3: The physical structure of the experimental equipment (solid line: wired; dashed line: wireless).

Figure 4 is a diagram of the logical structure of the equipment. The structure consists of a sensor, motion capture, a state estimator, an action selector, an action decoder, an effector, and a policy updater. We utilize Nuitrack for motion capture, and we utilize ROS as the infrastructure of the equipment to communicate variables among functions.

Fig. 4: The logical structure of the experimental equipment.

As Figures 3 and 4 show, the equipment works using Algorithm 4. E_t in Algorithm 4 denotes a physical quantity created by the effectors, such as the amplitude of an acoustic wave. We utilize Table I as the action set A and Table II as the state set S. Table II is a double Markov model created from the state set of Ozaki's decision-making predictor [3], which estimates a passerby's state as one of seven states: Not Found (s0), Passing By (s1), Look At (s2), Hesitating (s3), Approaching (s4), Established (s5), and Leaving (s6). In addition, we utilize α = 0.5 and γ = 0.999 as learning parameters.

We utilize soft-max selection as the policy because we want the robot both to take actions that already have high value and to find actions that may have higher value. Soft-max selection is often used with Q-learning. Equation (3) gives the probability of selecting an action under the policy, and we utilize Equation (2) as the policy parameter. T_n(s) denotes the temperature after it has been updated n times for state s; the temperature depends on the state because s_00 occurs many times. We utilize k_T = 0.98 and T_min = 0.01 as learning parameters.

T_0(s) = 1    (1)

T_{n+1}(s) = T_n(s)         if T_n(s) ≤ T_min
             k_T · T_n(s)   otherwise    (2)

p(s, a) = exp(Q(s, a) / T_n(s)) / Σ_{a_i ∈ A} exp(Q(s, a_i) / T_n(s))    (3)
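A minimal sketch of this soft-max selection with the per-state temperature schedule of Equations (1)-(3) follows; the dictionary bookkeeping is our own illustration, while the constants k_T = 0.98 and T_min = 0.01 are the values given above.

# Sketch of soft-max action selection with a per-state temperature T_n(s) that
# starts at 1 and decays by k_T after each update until it reaches T_min
# (Equations 1-3). The data layout (dicts) is our own illustration.
import math
import random
from collections import defaultdict

K_T, T_MIN = 0.98, 0.01
temperature = defaultdict(lambda: 1.0)   # Equation 1: T_0(s) = 1 for every state s

def decay_temperature(s):
    """Equation 2: keep T_n(s) once it reaches T_min, otherwise multiply by k_T."""
    if temperature[s] > T_MIN:
        temperature[s] = K_T * temperature[s]

def softmax_select(s, actions, Q):
    """Equation 3: p(s, a) proportional to exp(Q(s, a) / T_n(s))."""
    t = temperature[s]
    q_max = max(Q[(s, a)] for a in actions)
    # Shift by the maximum before exponentiating; the distribution is unchanged.
    weights = [math.exp((Q[(s, a)] - q_max) / t) for a in actions]
    return random.choices(actions, weights=weights)[0]

Shifting the Q-values by their maximum does not change the soft-max distribution and avoids numerical overflow when the temperature becomes small.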
Algorithm 4: Select an action using the experimental system that includes UCQL
  Input: t, O_t
  Output: E_t
  if the system is NOT initialized then
    Q ← Q_0
    Θ ← an empty list
    P ← an empty list
  end if
  I_t, D_t ← sense(O_t)
  θ_t, p_t ← extract(I_t, D_t)
  Push θ_t into Θ.
  Push p_t into P.
  s_t ← estimate(Θ, P)
  a_t ← selectAction(t, s_t, Q, π)
  Push (t, a_t, s_t) into X.
  E_t ← decode(a_t, θ_t, p_t)
  return E_t

TABLE I: Action set in this experiment
  Symbol  Details
  a0      Robot waits 5 seconds until someone comes.
  a1      Robot blinks its eyes.
  a2      Robot looks at a passerby.
  a3      Robot expresses joy with its motion.
  a4      Robot greets the passerby.
  a5      Robot says "I'm sorry." in Japanese.
  a6      Robot says "Excuse me." in Japanese.
  a7      Robot says "It's rainy today." in Japanese.
  a8      Robot explains how to start its service.
  a9      Robot says goodbye.

2) Construct the experimental environment: First, we have to define how to construct an environment for the experiment. Figure 5 is an overhead view of the environment. The environment consists of an exhibition space, a wall, a seating area, and a path to a restroom in an actual company building. There are hundreds of employees in the building, and dozens of visitors come to the building. Visitors often sit in the seating area for tens of minutes while waiting to meet employees. Some visitors and employees look at the exhibition space to find out about newer technologies of the company. Visitors sometimes go to the restroom while waiting for the employees.

TABLE II: State set in this experiment
  Symbol  Details
  s00     Passerby's state changes from "Not Found" to "Not Found".
  s10     Passerby's state changes from "Not Found" to "Passing By".
  ...     ...
  s56     Passerby's state changes from "Leaving" to "Established".
  s66     Passerby's state changes from "Leaving" to "Leaving".

Algorithm 5: Create the initial Q-function for the experiment (Q_B)
  Input: (void)
  Output: Q(s,a)
  q_C ← 1
  q_H ← 5
  Q ← a |S| × |A| zero 2D array
  for j = 1 to 5 do
    Q(s_0j, a_1) ← q_C
  end for
  for i = 1 to 5 do
    for j = 0 to 4 do
      Q(s_ij, a_4) ← q_C
    end for
  end for
  for i = 1 to 4 do
    Q(s_i5, a_9) ← q_H
  end for
  Q(s_56, a_8) ← q_H
  Q(s_50, a_8) ← q_H
  return Q(s,a)

Fig. 5: The overhead view of the experimental environment (exhibition space, seating area, wall, restroom, and equipment).
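For concreteness, the following sketch renders the state set of Table II, the action set of Table I, and the initial Q-function Q_B of Algorithm 5 in Python. The pair-of-indices state encoding and the dictionary layout are our own assumptions; the seeded entries follow the listing above.

# Sketch of the state set (Table II), action set (Table I), and the initial
# Q-function Q_B of Algorithm 5. Following Table II, state s_ij means "the
# predictor state changed from s_j to s_i"; the layout here is our own.
PREDICTOR_STATES = ["Not Found", "Passing By", "Look At", "Hesitating",
                    "Approaching", "Established", "Leaving"]          # s0 .. s6
STATES = [(i, j) for i in range(7) for j in range(7)]                 # s_ij
ACTIONS = list(range(10))                                             # a0 .. a9 (Table I)

def build_initial_q(q_c: float = 1.0, q_h: float = 5.0):
    """Algorithm 5: start from zeros and seed a few state-action pairs."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for j in range(1, 6):                 # blink (a1) for states s_0j, j = 1..5
        Q[((0, j), 1)] = q_c
    for i in range(1, 6):                 # greet (a4) for states s_ij, i = 1..5, j = 0..4
        for j in range(0, 5):
            Q[((i, j), 4)] = q_c
    for i in range(1, 5):                 # say goodbye (a9) for states s_i5, i = 1..4
        Q[((i, 5), 9)] = q_h
    Q[((5, 6), 8)] = q_h                  # explain the service (a8) for s_56 and s_50
    Q[((5, 0), 8)] = q_h
    return Q

Q_B = build_initial_q()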
3) Define the experimental procedure: We suppose two main scenarios. The first scenario is as follows:
1) A visitor sits in the seating area.
2) The visitor gets up to go to the restroom. Thus, the visitor moves past the exhibition space on the way from the seating area to the restroom.
The second scenario is as follows:
1) A visitor sits in the seating area.
2) The visitor gets up because they are bored.
3) The visitor moves from the seating area to the exhibition space to look at the robot and the equipment.
We mainly want to attract the passersby in the second scenario. We do not want to attract the passersby in the first scenario, because they want to go to the restroom.

According to the above assumptions, we define the procedure for the robot's learning. First, the robot's Q-function is initialized by Algorithm 5; we call this Q-function Q_B(s,a). The robot initialized by Algorithm 5 greets any passerby who comes within a certain distance. Then, because we want the robot to learn the rules, we let the robot learn [...]
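Finally, as an illustration of how the pieces described above could be wired together, here is a minimal sketch, under our own assumed interfaces (the sensor, estimator, selector, decoder, effector, and updater objects are placeholders, not the paper's implementation), of a per-frame loop in the spirit of Algorithm 4 combined with the policy updater of Algorithm 2:

# Sketch of the per-frame loop of the experimental system in the spirit of
# Algorithm 4: sense -> extract -> estimate -> select -> act. The collaborating
# objects are placeholders we assume for illustration.
def run_episode(sensor, estimator, selector, decoder, effector, updater):
    history_theta, history_p, X = [], [], []      # head angles, positions, (t, a, s) log
    previous = None                               # (s_ta, a_ta) of the last chosen action
    for t, observation in enumerate(sensor.frames()):
        image, depth = observation                # I_t, D_t
        theta_t, p_t = estimator.extract(image, depth)
        history_theta.append(theta_t)
        history_p.append(p_t)
        s_t = estimator.estimate(history_theta, history_p)
        if previous is not None:
            # Algorithm 2 update with the previous (state, action) pair; in the
            # paper this happens once the previous action has finished.
            updater.update(previous[0], previous[1], s_t)
        a_t = selector.select(t, s_t)             # Algorithm 1 with the soft-max policy
        X.append((t, a_t, s_t))
        effector.emit(decoder.decode(a_t, theta_t, p_t))   # E_t: sound, motion, etc.
        previous = (s_t, a_t)
    return X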