Postgraduate Degree Thesis Proposal (研究生学位论文开题报告)

Thesis title: Improving the Deep Reinforcement Learning Training Efficiency in High-Dimensional Space
Student ID and name: 2111701025, 林一炯
Discipline (major): Mechanical Engineering
Enrolment date: September 2017
School: 机电工程学院 (School of Electromechanical Engineering)
Supervisors: 刘冠峰, Juan Rojas
Prepared by the Graduate School, 28 December 2018. This form is made in triplicate and kept by the Graduate School, the student, and the student's school.

Student: 林一炯, male. Prior education: undergraduate (本科). Highest degree obtained: Bachelor (学士), awarded by 广东工业大学, June 2017.

Study and work history:
2010.09-2013.07  湛江市第一中学  student
2013.09-2017.07  广东工业大学  student
2017.09-present  广东工业大学  student

Thesis type (single choice): basic research / applied research / comprehensive research / other
Topic source (single choice): 973 or 863 program; national social science planning or fund project; Ministry of Education humanities and social sciences project; National Natural Science Foundation of China project; central or state ministry project; provincial (autonomous region, municipality) project; international cooperation project; enterprise or institution commissioned project; foreign-funded project; university self-selected project; national defence project; non-funded; other
Keywords (no more than 5): Deep Reinforcement Learning, Manipulation, Sample Efficiency, Machine Learning
Supervisor's research project: Predicting Human Behavior for Efficient Physical Human-Robot Interaction
Project source and grant number: 61750110521

Proposal committee members:
管贻生, Professor, doctoral supervisor, 广东工业大学 机电工程学院
杨勇, Professor, doctoral supervisor, 广东技术师范学院
Juan Rojas, Associate Professor, master's supervisor, 广东工业大学 机电工程学院
朱海飞, Lecturer, master's supervisor, 广东工业大学 机电工程学院
何力, Lecturer, master's supervisor, 广东工业大学 机电工程学院
张涛, Lecturer, /, 广东工业大学 机电工程学院

I. Project Support, Background, and Related Works

1.1 Project Support

This work is supported by the National Natural Science Foundation of China under grant number 61750110521.

1.2 Background

Robots can perform impressive tasks under human control, including surgery [19] and household chores [20]. The number and variety of robots used in everyday life are rapidly increasing. To date, the controllers for these robots are largely designed and tuned by hand. However, designing the perception system (for state estimation) and the control software for autonomous operation remains a major challenge even for basic tasks. Programming robots is a tedious task that requires years of experience and a high degree of expertise [18]. Programmed controllers often rely on exact models of both the robot's behavior and its environment [36]. Hence, there is a gap between how robots are designed today and the vision of fully autonomous robots.

In robot learning, machine learning methods are used to automatically extract relevant information from data in order to solve robotic tasks. Using the power and flexibility of modern machine learning techniques, the field of robot control can be further automated, allowing us to substantially close this gap towards autonomous robots in fields as diverse as household assistance, elderly care, and public services.

A principled mathematical framework for experience-driven autonomous learning is reinforcement learning (RL) [1]. Reinforcement learning allows robots to learn what to do, namely how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take; instead, it must discover which actions yield the most reward through exploration. In the most interesting and challenging cases, actions are selected based not on immediate rewards but on long-term benefits. These two characteristics, trial-and-error search and delayed reward, are the most important distinguishing features of reinforcement learning.

Although RL has had some success in the past [2,3,15,32,33], previous approaches were not scalable and were limited to low-dimensional problems. These limitations stem from memory and computational complexity and, in the case of machine learning algorithms, sample complexity [4,34]. More recently, the rise of deep learning, with its powerful function approximation and representation learning properties, has provided new tools for overcoming these problems.
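To make the trial-and-error and delayed-reward characteristics concrete, the following is a minimal sketch of tabular Q-learning on a hypothetical one-dimensional chain task, where only the final state yields reward. All names, the toy environment, and the hyper-parameter values are illustrative assumptions for this sketch, not part of the proposal.

```python
import random

# Minimal tabular Q-learning on a hypothetical 1-D chain: the agent starts at
# state 0 and only reaching the last state gives a (delayed) reward of 1.
N_STATES, ACTIONS = 10, (-1, +1)                  # move left or right
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.95, 0.1, 2000

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics with a sparse, delayed terminal reward."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

for _ in range(EPISODES):
    state = 0
    for _ in range(200):                          # cap episode length
        if random.random() < EPSILON:             # trial and error: explore
            action = random.choice(ACTIONS)
        else:                                     # exploit (random tie-break)
            action = max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))
        next_state, reward, done = step(state, action)
        # Bootstrapped target: the terminal reward propagates back through
        # earlier states, implementing credit assignment over time.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next * (not done)
                                       - Q[(state, action)])
        state = next_state
        if done:
            break
```

Because the table has one entry per state-action pair, this approach breaks down for high-dimensional observations such as camera images, which is precisely where the function approximation provided by deep learning becomes necessary.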
Deep reinforcement learning is poised to revolutionize the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Deep learning is enabling reinforcement learning (now called deep reinforcement learning, DRL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels [5,29,30,31]. DRL algorithms have recently achieved impressive results in robotics [6,7,18,28], allowing robust and human-like control policies to be learned directly from camera inputs in the real world.

1.3 Related Works

Reinforcement learning and policy search methods [8,9] have been applied in robotics to tasks such as table tennis [10], baseball [11], object manipulation [12,13,14], locomotion [15,16], and flight [17]. Recent surveys of policy search in robotics [18,35] show that policy search is typically applied to a single component of a robot control pipeline. The pipeline often sits on top of a hand-designed controller, such as a PD controller, and accepts processed input from an existing vision pipeline [14].

Grasping remains one of the most significant challenges in manipulation. Not only should a grasping system be able to pick up previously unseen objects with reliable and effective grasps while using realistic sensing and actuation, it should also be able to dexterously handle tools, perform in-hand manipulation, and cope with significant contact dynamics in its interactions. Grasping thus serves as a microcosm of the larger robotic manipulation problem, providing a challenging and practically applicable model problem for experimenting with generalization and diverse object interaction.

In [18], Andrychowicz et al. introduced a technique called Hindsight Experience Replay (HER), which makes it possible to apply RL algorithms to problems with sparse, binary rewards. Combined with deep models such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), this technique made it feasible to train policies that push, slide, and pick-and-place objects to specified positions. The authors also show that the pick-and-place policy performs well on the physical robot without fine-tuning. The work marked the first time such complicated behaviors were learned using only sparse, binary rewards. Training for 200 epochs took approximately 2.5 h for the push and pick-and-place tasks and 6 h for sliding (because physics simulation was slower for this task) using 8 CPU cores.

Figure 1: The pick-and-place policy deployed on the physical robot [18].

In [6], Rajeswaran et al. proposed Demonstration Augmented Policy Gradient (DAPG). DAPG learns policies that map visual input and joint encoder readings directly to the robot's joint torques. By learning the entire mapping from perception to control, the perception layers adapt to optimize task performance, and the motor control layers adapt to imperfect perception. After incorporating human demonstrations, DAPG acquires policies that not only exhibit more human-like motion but are also substantially more robust. Furthermore, DAPG can be up to 30x more sample-efficient than RL from scratch with shaped rewards, and it trains policies for these tasks in under 5 hours, which is likely practical to run on real systems. However, it has only been demonstrated in simulation and has not yet been applied to real-world robots.
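To illustrate the hindsight idea behind HER [18] described above, the sketch below relabels stored transitions with goals that were actually achieved later in the same episode, so that a sparse binary reward still produces informative, "successful" samples. The data layout, the function names, and the 0/-1 reward convention are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    """Binary reward: 0 if the achieved goal is within tolerance of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def her_relabel(episode, replay_buffer, k=4):
    """Store each transition once with the original goal and k more times with
    'future' goals, i.e. states actually achieved later in the same episode.
    `episode` is assumed to be a list of (obs, action, next_obs, goal) tuples,
    where obs/next_obs are dicts containing an "achieved_goal" array."""
    T = len(episode)
    for t, (obs, action, next_obs, goal) in enumerate(episode):
        # Original (usually unsuccessful, reward = -1) transition
        replay_buffer.append((obs, action,
                              sparse_reward(next_obs["achieved_goal"], goal),
                              next_obs, goal))
        # Hindsight transitions: pretend a later achieved state was the goal,
        # turning failures into successes from the relabelled goal's point of view
        for future in np.random.randint(t, T, size=k):
            new_goal = episode[future][2]["achieved_goal"]
            replay_buffer.append((obs, action,
                                  sparse_reward(next_obs["achieved_goal"], new_goal),
                                  next_obs, new_goal))
```

An off-policy learner such as DQN or DDPG can then be trained on the relabelled buffer exactly as it would on ordinary replay data, which is what gives HER its sample-efficiency benefit under sparse rewards.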
Figure 2: A wide range of dexterous manipulation skills, such as object relocation, in-hand manipulation, tool use, and opening doors, learned with DRL methods [6].

Table 1: Sample and robot-time complexity of DAPG compared to RL (Natural Policy Gradient) from scratch with shaped (sh) and sparse task-completion (sp) rewards [6].

In [7], Levine et al. present a method for learning robotic control policies that use raw input from a monocular camera. These policies are represented by a novel convolutional neural network architecture and can be trained end-to-end using the proposed guided policy search (GPS) algorithm, which decomposes the policy search problem into (i) a trajectory optimization phase that uses full state information, and (ii) a supervised learning phase that uses only the observations. This decomposition makes it possible to leverage state-of-the-art tools from supervised learning, making it straightforward to optimize extremely high-dimensional policies. However, the method does not generalize to dramatically different settings, especially when visual distractors occlude the manipulated object or when environmental disturbances differ from those seen during training.

Figure 3: Learning visuomotor policies that directly map camera image observations to motor torques on a PR2 robot [7].

In [23], Kalashnikov et al. presented a framework for scalable robotic reinforcement learning from raw sensory inputs such as images, based on an algorithm called QT-Opt, a distributed optimization framework, and a combination of off-policy and on-policy training. The paper applies QT-Opt to grasping hundreds of previously unseen objects in bins across multiple robots. QT-Opt learns closed-loop, vision-based policies that attain extremely high success rates (96%) on previously unseen objects. It also exhibits sophisticated closed-loop behavior, including singulation, prehensile manipulation, regrasping, and dynamic responses to disturbances, all of which emerged automatically from optimizing the grasp success probability. However, these policies were trained on a very large amount of robot experience (580k real-world grasps); the extreme training time required makes the approach infeasible for many use cases.

Figure 4: Distributed RL infrastructure for QT-Opt. State-action-reward tuples are loaded from offline data storage and pushed from online real-robot collection [23].

However, defining a cost function that can be optimized effectively and that encodes the correct task is challenging in practice. The passive mode of skill transfer offered by inverse reinforcement learning (IRL) [21,22,24,25,26,27] is therefore strongly appealing, because it significantly reduces the costly human effort of not only manually programming the robot but also actively teaching it through demonstrations.

All in all, designing the perception and control software for autonomous manipulation still remains a major challenge, even for basic tasks. Deep reinforcement learning provides a number of promising approaches: in recent years, since the rise of deep learning, DRL has been able to solve increasingly complex manipulation tasks. However, several challenges remain [46,47]:

1. The optimal policy must be inferred through trial-and-error interaction with the environment, and the only learning signal the agent receives is the reward. In manipulation tasks it is extremely hard for the agent to learn good policies quickly from sparse rewards, so improving sample complexity is crucial (a sketch contrasting sparse and shaped rewards follows this list).

2. The observations of the agent depend on its actions and can contain strong temporal correlations.

3. Agents must deal with long-range time dependencies: often the consequences of an action only materialize after many transitions in the environment. This is known as the (temporal) credit assignment problem.

This proposal puts forward techniques from several domains to improve sample complexity during training and thereby minimize training time.
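To make challenge 1 concrete, the following is a minimal sketch contrasting a sparse, binary task-completion reward with a hand-shaped, dense reward for a simple push-to-goal setting. The tolerance, the weighting, and the function names are illustrative assumptions rather than definitions taken from the cited works.

```python
import numpy as np

GOAL_TOLERANCE = 0.05  # assumed success threshold (metres); illustrative only

def sparse_reward(object_pos, goal_pos):
    """Binary task-completion reward: informative only once the task succeeds,
    which is why exploration under it is so sample-inefficient."""
    return 1.0 if np.linalg.norm(object_pos - goal_pos) < GOAL_TOLERANCE else 0.0

def shaped_reward(gripper_pos, object_pos, goal_pos, w_reach=0.5):
    """Dense, hand-shaped reward: penalises the gripper-to-object and
    object-to-goal distances so every step carries a learning signal.
    Designing such terms is exactly the manual engineering DRL aims to reduce."""
    reach_cost = np.linalg.norm(gripper_pos - object_pos)
    push_cost = np.linalg.norm(object_pos - goal_pos)
    return -(w_reach * reach_cost + push_cost)
```

Approaches such as HER [18] and demonstration-augmented training [6,39] aim to recover a useful learning signal under the sparse variant without this manual shaping, which is why sample complexity is the focus of this proposal.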
Bibliography

[1] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT Press, 2018.
[2] Tesauro G. Temporal difference learning and TD-Gammon[J]. Communications of the ACM, 1995, 38(3): 58-68.
[3] Kohl N, Stone P. Policy gradient reinforcement learning for fast quadrupedal locomotion[C]// Robotics and Automation, 2004. Proceedings. ICRA '04. 2004 IEEE International Conference on. IEEE, 2004, 3: 2619-2624.
[4] Strehl A L, Li L, Wiewiora E, et al. PAC model-free reinforcement learning[C]// Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006: 881-888.
[5] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529.
[6] Rajeswaran A, Kumar V, Gupta A, et al. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations[J]. arXiv preprint arXiv:1709.10087, 2017.
[7] Levine S, Finn C, Darrell T, et al. End-to-end training of deep visuomotor policies[J]. The Journal of Machine Learning Research, 2016, 17(1): 1334-1373.
[8] Gullapalli V. A stochastic reinforcement learning algorithm for learning real-valued functions[J]. Neural Networks, 1990, 3(6): 671-692.
[9] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 7(3): 123-142.
[10] Kober J, Öztop E, Peters J. Reinforcement learning to adjust robot movements to new situations[C]// IJCAI Proceedings - International Joint Conference on Artificial Intelligence. 2011, 22(3): 2650.
[11] Peters J. Reinforcement learning of motor skills with policy gradients[J]. Neural Networks, 2008, 11(1): 521-545.
[12] Kober J, Mülling K, Kroemer O, et al. Movement templates for learning of hitting and batting[C]// Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010: 853-858.
[13] Durrant-Whyte H, Roy N, Abbeel P. Learning to control a low-cost manipulator using data-efficient reinforcement learning[C]// MIT Press, 2011, 12(4): 650.
[14] Kalakrishnan M, Righetti L, Pastor P, et al. Learning force control policies for compliant manipulation[C]// Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011: 4639-4644.
[15] Kohl N, Stone P. Policy gradient reinforcement learning for fast quadrupedal locomotion[C]// IEEE International Conference on Robotics & Automation. IEEE, 2004: 345-352.
[16] Geng T, Porr B, Wörgötter F. Fast biped walking with a reflexive controller and real-time policy searching[C]// International Conference on Neural Information Processing Systems. MIT Press, 2005: 1330-1336.
[17] Kim H J, Jordan M I, Sastry S, et al. Autonomous helicopter flight via reinforcement learning[C]// Advances in Neural Information Processing Systems. 2004: 799-806.
[18] Andrychowicz M, Wolski F, Ray A, et al. Hindsight experience replay[C]// Advances in Neural Information Processing Systems. 2017: 5048-5058.
[19] Lanfranco A R, Castellanos A E, Desai J P, et al. Robotic surgery: a current perspective[J]. Annals of Surgery, 2004, 239(1): 14.
[20] Wyrobek K A, Berger E H, Van der Loos H F M, et al. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot[C]// Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on. IEEE, 2008: 2165-2170.
[21] Finn C, Levine S, Abbeel P. Guided cost learning: Deep inverse optimal control via policy optimization[C]// International Conference on Machine Learning. 2016: 49-58.
[22] Maulesh T, Prashant D. Inverse learning of robot behavior for collaborative planning[C]// IEEE International Conference on Robotics & Automation. IEEE, 2018: 134-139.
[23] Kalashnikov D, Irpan A, Pastor P, et al. Scalable deep reinforcement learning for vision-based robotic manipulation[C]// Conference on Robot Learning. 2018: 651-673.
[24] Ng A, Russell S. Algorithms for inverse reinforcement learning[C]// International Conference on Machine Learning. 2000: 599-605.
[25] Arora S, Doshi P. A survey of inverse reinforcement learning: Challenges, methods and progress[J]. arXiv preprint arXiv:1806.06877, 2018.
[26] Abbeel P, Ng A Y. Apprenticeship learning via inverse reinforcement learning[C]// Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004: 1.
[27] Natarajan S, Kunapuli G, Judah K, et al. Multi-agent inverse reinforcement learning[C]// 2010 Ninth International Conference on Machine Learning and Applications. IEEE, 2010: 395-400.
[28] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. arXiv preprint arXiv:1509.02971, 2015.
[29] Gu S, Lillicrap T, Sutskever I, et al. Continuous deep Q-learning with model-based acceleration[C]// International Conference on Machine Learning. 2016: 3220-3228.
[30] Held D, Geng X, Florensa C, et al. Automatic goal generation for reinforcement learning agents[J]. arXiv preprint arXiv:1705.06366, 2017.
[31] Levine S, Wagener N, Abbeel P. Learning contact-rich manipulation skills with guided policy search[C]// Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015: 123-130.
[32] Singh S, Litman D, Kearns M, et al. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system[J]. Journal of Artificial Intelligence Research, 2002, 16: 105-133.
[33] Ng A Y, Coates A, Diel M, et al. Autonomous inverted helicopter flight via reinforcement learning[M]// Experimental Robotics IX. Springer, Berlin, Heidelberg, 2006: 363-372.
[34] Arulkumaran K, Deisenroth M P, Brundage M, et al. A brief survey of deep reinforcement learning[J]. arXiv preprint arXiv:1708.05866, 2017.
[35] Kober J, Bagnell J A, Peters J. Reinforcement learning in robotics: A survey[J]. The International Journal of Robotics Research, 2013, 32(11): 1238-1274.
[36] Lampe T, Riedmiller M. Acquiring visual servoing reaching and grasping skills using neural reinforcement learning[C]// Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013: 823-830.
[37] Li Y. Deep reinforcement learning: An overview[J]. arXiv preprint arXiv:1701.07274, 2017.
[38] Metz L, Ibarz J, Jaitly N, et al. Discrete sequential prediction of continuous actions for deep RL[J]. 2018.
[39] Hester T, Vecerik M, Pietquin O, et al. Learning from demonstrations for real world reinforcement learning[J]. CoRR, abs/1704.03732, 2017.
[40] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]// International Conference on Machine Learning. 2016: 1928-1937.
[41] Mohamed S, Rezende D J. Variational information maximization for intrinsically motivated reinforcement learning[C]// Advances in Neural Information Processing Systems. 2015: 1933-1941.
[42] Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[J]. arXiv preprint arXiv:1506.02438, 2015.
[43] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms[C]// International Conference on Machine Learning. 2014.
