NTU - Hung-yi Lee - Machine Learning videos (Bilibili) - Lecture slides: Neural Networks and Deep Learning, RL

Deep Reinforcement Learning: Scratching the Surface

Scenario of Reinforcement Learning
An agent interacts with an environment. The environment gives the agent an observation (the state), the agent takes an action that changes the environment, and the environment returns a reward. For example, if the environment responds "Don't do that", the reward is negative; if it responds "Thank you", the reward is positive. The agent learns to take actions that maximize expected reward.

Learning to Play Go
The observation is the board position, the action is the next move, and each move changes the environment (the board). The reward is 1 if the agent wins and -1 if it loses; the reward is 0 in most cases, since most moves give no immediate feedback. Again, the agent learns to take actions that maximize expected reward.

Learning to Play Go: Supervised vs. Reinforcement
- Supervised learning is learning from a teacher: for this board position the next move is "5-5", for that one the next move is "3-3".
- Reinforcement learning is learning from experience: the agent plays a first move, then many more moves, and only finds out at the end whether it won.
AlphaGo is supervised learning plus reinforcement learning; in the reinforcement learning stage, two agents play against each other.

Learning a Chat-bot
A chat-bot is a sequence-to-sequence model that reads the input sentence and generates a reply.
- Supervised: a teacher specifies the target response, e.g. when the human says "Hello", say "Hi"; when the human says "Bye bye", say "Goodbye".
- Reinforcement: the agent holds a conversation, and at the end receives feedback such as "Bad", from which it learns.

Learning a Chat-bot: Reinforcement Learning
Let two agents talk to each other; sometimes they generate good dialogues and sometimes bad ones, e.g.
  "How old are you?" / "See you." / "See you." / "See you."
  "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
With this approach we can generate a lot of dialogues (Dialogue 1, Dialogue 2, ..., Dialogue 8, ...). Some pre-defined rules evaluate the goodness of each dialogue, and the machine learns from that evaluation. (Deep Reinforcement Learning for Dialogue Generation, arXiv:1606.01541v3)

More Applications
- Interactive retrieval (Wu & Lee, INTERSPEECH 16): the system clarifies the query through dialogue, e.g. user: "US President"; system: "Is it related to 'Election'?"; user: "Yes."; system: "Here are what you are looking for."; user: "I see!" (or the system may ask "More precisely, please.", to which the user answers "Trump").
- Flying a helicopter.
- Training sequence models with RL (Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR 2016).

Example: Playing Video Games
Widely studied, e.g. in Gym. The machine learns to play video games as human players do: it learns to take proper actions by itself, and what it observes are the raw pixels.

Example: Space Invaders
The actions are moving left, moving right, and firing. The reward is the score obtained by killing aliens (there are also shields on the screen). Termination: all the aliens are killed, or your spaceship is destroyed.

An episode: start with observation s_1, take action a_1 = "right" and obtain reward r_1 = 0; see observation s_2, take action a_2 = "fire" (kill an alien) and obtain reward r_2 = 5; see observation s_3, and so on. Usually there is some randomness in the environment. After many turns, the machine takes action a_T, obtains reward r_T, and the game is over (the spaceship is destroyed). This whole game is one episode. The machine learns to maximize the expected cumulative reward per episode.

Difficulties of Reinforcement Learning
- Reward delay: in Space Invaders, only "fire" obtains a reward, although the moves before "fire" are important; in Go, it may be better to sacrifice immediate reward to gain more long-term reward.
- The agent's actions affect the subsequent data it receives, e.g. exploration.

Outline
- Policy-based approach: learning an actor.
- Value-based approach: learning a critic.
- Actor + Critic, e.g. Asynchronous Advantage Actor-Critic (A3C): Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016.
- AlphaGo: policy-based + value-based + model-based.
To learn more about deep reinforcement learning: the textbook Reinforcement Learning: An Introduction (https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html) and the lectures of David Silver (http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html; 10 lectures, 1:30 each).

Policy-based Approach: Learning an Actor
Machine learning is looking for a function. In reinforcement learning, the observation from the environment is the function input, the reward is used to pick the best function, and the action is the function output:

Action = \pi(\text{Observation})

This function \pi is the actor (also called the policy). A concrete interaction loop is sketched below.
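As a minimal picture of the observation/action/reward loop described above, the sketch below plays one episode of Space Invaders with a placeholder actor. It assumes the classic Gym API (env.reset() returning an observation, env.step() returning (observation, reward, done, info)); newer Gymnasium releases use a slightly different interface, and the Atari environments need the optional ROM packages installed.

```python
import gym

env = gym.make("SpaceInvaders-v0")   # requires gym's Atari extras to be installed

def actor(observation):
    # Placeholder policy pi(observation) -> action.
    # A trained actor (e.g. a neural network over the pixels) would go here.
    return env.action_space.sample()

observation = env.reset()            # start with observation s_1
total_reward, done = 0.0, False
while not done:                      # play until the episode ends
    action = actor(observation)                          # a_t = pi(s_t)
    observation, reward, done, info = env.step(action)   # environment returns r_t and s_{t+1}
    total_reward += reward                                # R = r_1 + r_2 + ... + r_T
print("total reward for this episode:", total_reward)
```

Because both the actor and the environment are stochastic, the total reward differs from episode to episode, which is why the slides optimize its expected value.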
Three Steps for Deep Learning
Deep learning is so simple: define a set of functions (here, a neural network as the actor), decide the goodness of a function (here, the goodness of an actor), and pick the best function.

Step 1: Neural Network as Actor
- Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. the pixels of the game screen).
- Output of the neural network: each action corresponds to a neuron in the output layer, and the outputs are the probabilities of taking each action, e.g. left 0.1, right 0.2, fire 0.7.
- What is the benefit of using a network instead of a lookup table? Generalization: the network can produce sensible action probabilities even for observations it has never seen.

Step 2: Goodness of Actor
Review of supervised learning: given a training example with target "1", the network produces outputs y_1, y_2, ..., y_10 through a softmax, the loss l measures how far the output is from the target (as close as possible), the total loss over the training examples is L = \sum_{n=1}^{N} l^n, and given a set of parameters we find the network parameters that minimize the total loss L.

For an actor \pi_\theta with network parameters \theta, use the actor to play the video game: start with observation s_1, the machine decides to take a_1 and obtains reward r_1, sees observation s_2, takes a_2, obtains r_2, ..., takes a_T, obtains r_T, END. The total reward is

R_\theta = \sum_{t=1}^{T} r_t

Even with the same actor, R_\theta is different each time because of randomness in the actor and in the game, so we define the expected value \bar{R}_\theta of R_\theta as the measure of the goodness of the actor \pi_\theta.

An episode is considered as a trajectory \tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T\} with R(\tau) = \sum_{t=1}^{T} r_t. If you use an actor to play the game, each \tau has a probability of being sampled, and that probability depends on the actor parameters \theta: P(\tau|\theta). Summing over all possible trajectories,

\bar{R}_\theta = \sum_\tau R(\tau) P(\tau|\theta)

Use \pi_\theta to play the game N times, obtaining \tau^1, \tau^2, \ldots, \tau^N; this is sampling from P(\tau|\theta) N times, so

\bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)

Step 3: Pick the Best Actor by Gradient Ascent
Problem statement: \theta^* = \arg\max_\theta \bar{R}_\theta. Gradient ascent: start with \theta^0, then update

\theta^1 \leftarrow \theta^0 + \eta \nabla \bar{R}_{\theta^0}, \qquad \theta^2 \leftarrow \theta^1 + \eta \nabla \bar{R}_{\theta^1}, \ldots

where \theta = \{w_1, w_2, \ldots, b_1, \ldots\} and \nabla \bar{R}_\theta collects the partial derivatives \partial \bar{R}_\theta / \partial w_1, \partial \bar{R}_\theta / \partial w_2, \ldots, \partial \bar{R}_\theta / \partial b_1, \ldots

What is \nabla \bar{R}_\theta? Using \nabla P(\tau|\theta) = P(\tau|\theta) \nabla \log P(\tau|\theta),

\nabla \bar{R}_\theta = \sum_\tau R(\tau) \nabla P(\tau|\theta) = \sum_\tau R(\tau) P(\tau|\theta) \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau) P(\tau|\theta) \nabla \log P(\tau|\theta)

Note that R(\tau) does not have to be differentiable; it can even be a black box. Use \pi_\theta to play the game N times, obtain \tau^1, \tau^2, \ldots, \tau^N, and approximate the sum by sampling:

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta)

How do we compute \nabla \log P(\tau|\theta)? Since \tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots\},

P(\tau|\theta) = p(s_1) \prod_{t=1}^{T} p(a_t|s_t, \theta) \, p(r_t, s_{t+1}|s_t, a_t)

where p(a_t|s_t, \theta) is controlled by your actor, while p(s_1) and p(r_t, s_{t+1}|s_t, a_t) are not related to your actor. (For the actor above, for example, p(a_t = \text{"fire"} | s_t, \theta) = 0.7.) Taking the logarithm,

\log P(\tau|\theta) = \log p(s_1) + \sum_{t=1}^{T} \log p(a_t|s_t, \theta) + \sum_{t=1}^{T} \log p(r_t, s_{t+1}|s_t, a_t)

and ignoring the terms not related to \theta,

\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t, \theta)

Putting the pieces together,

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)

If the machine takes a_t^n when seeing s_t^n and R(\tau^n) is positive, tune \theta to increase p(a_t^n|s_t^n); if R(\tau^n) is negative, tune \theta to decrease p(a_t^n|s_t^n). It is very important to consider the cumulative reward R(\tau^n) of the whole trajectory instead of the immediate reward r_t^n: what if we replaced R(\tau^n) with r_t^n? Then only the actions that happen to obtain an immediate reward (such as "fire") would ever be encouraged.

Why divide by p(a_t^n|s_t^n, \theta), i.e. why use \nabla \log p instead of \nabla p? Consider the sampling data: the state s has been seen in \tau^{13}, \tau^{15}, \tau^{17}, \tau^{33}. In \tau^{13} the machine takes action a, with R(\tau^{13}) = 2; in \tau^{15}, \tau^{17}, \tau^{33} it takes action b, with R(\tau^{15}) = R(\tau^{17}) = R(\tau^{33}) = 1. Although a leads to a higher reward, b appears more often in the samples; dividing by the probability of the action compensates for how often it is sampled, so the update reflects the reward rather than the sampling frequency.

Add a Baseline
It is possible that R(\tau^n) is always positive (in many games the total score is never negative). In the ideal case this is still fine: the probabilities of actions a, b, c are all pushed up in proportion to their rewards and then renormalized. But we are sampling, and the output is a probability distribution, so the probability of the actions that were not sampled will decrease even if they are good. The fix is to subtract a baseline b:

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( R(\tau^n) - b \right) \nabla \log p(a_t^n|s_t^n, \theta)

so that only trajectories whose reward exceeds the baseline make their actions more probable.
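The update above is a REINFORCE-style policy gradient, and it maps almost directly onto automatic differentiation: maximize \sum_n (R(\tau^n) - b) \sum_t \log p(a_t^n|s_t^n, \theta) by minimizing its negative. Below is a minimal sketch in PyTorch; the network size, the helper name policy_gradient_step, and the episode format are illustrative assumptions, not something prescribed by the slides.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: observation vector -> probability of each discrete action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions), nn.Softmax(dim=-1))

    def forward(self, obs):
        return self.net(obs)

def policy_gradient_step(actor, optimizer, episodes, baseline=0.0):
    """One gradient-ascent step from N sampled episodes.

    episodes: list of (obs, acts, R) where obs is a (T, obs_dim) float tensor,
    acts is a (T,) long tensor of chosen actions, and R is the episode's total reward.
    Implements grad ~ 1/N sum_n sum_t (R(tau^n) - b) * grad log p(a_t^n | s_t^n, theta).
    """
    loss = 0.0
    for obs, acts, R in episodes:
        probs = actor(obs)                                           # (T, n_actions)
        log_p = torch.log(probs.gather(1, acts.unsqueeze(1)).squeeze(1) + 1e-8)
        loss = loss - (R - baseline) * log_p.sum()                   # negate for gradient ascent
    loss = loss / len(episodes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage sketch: actor = PolicyNet(obs_dim=4, n_actions=3)
#               optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
#               then repeatedly sample N episodes with the actor and call policy_gradient_step.
```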
Value-based Approach: Learning a Critic
A critic does not determine the action. Given an actor \pi, it evaluates how good the actor is. (An actor can also be derived from a critic, e.g. by Q-learning, but that is not covered today.)

Three Kinds of Critics
A critic is a function that depends on the actor \pi being evaluated, and the function is represented by a neural network.
- State value function V^\pi(s): when using actor \pi, the cumulative reward expected to be obtained after seeing observation (state) s. The input is s and the output is a scalar; V^\pi(s) is large for a promising state and smaller for a bad one.
- State-action value function Q^\pi(s, a): when using actor \pi, the cumulative reward expected to be obtained after seeing observation s and taking action a. For discrete actions only, the network can take s as input and output Q^\pi(s, a) for every action a at once (one output neuron per action).

How to Estimate V^\pi(s)
- Monte-Carlo based approach: the critic watches \pi playing the game. After seeing s_a, the cumulative reward until the end of the episode is G_a, so the critic is trained so that V^\pi(s_a) is close to G_a; after seeing s_b, the cumulative reward until the end of the episode is G_b, so V^\pi(s_b) should be close to G_b.
- Temporal-difference approach: from one step of experience s_t, a_t, r_t, s_{t+1} we have V^\pi(s_t) = V^\pi(s_{t+1}) + r_t, so the critic is trained so that V^\pi(s_t) - V^\pi(s_{t+1}) is close to r_t. Some applications have very long episodes, so delaying all learning until an episode's end is too slow; the temporal-difference approach learns from every step.

Monte-Carlo vs. Temporal-Difference (Sutton, v2, Example 6.4; the actions are ignored here)
The critic observes \pi in the following 8 episodes:
  s_a, r = 0, s_b, r = 0, END
  s_b, r = 1, END
  s_b, r = 1, END
  s_b, r = 1, END
  s_b, r = 1, END
  s_b, r = 1, END
  s_b, r = 1, END
  s_b, r = 0, END
V^\pi(s_b) = 3/4, since six of the eight visits to s_b are followed by a total reward of 1. What is V^\pi(s_a): 0, or 3/4?
- Monte-Carlo: s_a is seen only once, and the cumulative reward after it was 0, so V^\pi(s_a) = 0.
- Temporal-difference: V^\pi(s_a) = V^\pi(s_b) + r = 3/4 + 0 = 3/4.

Deep Reinforcement Learning: Actor-Critic
Recall the policy gradient

\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( R(\tau^n) - b \right) \nabla \log p(a_t^n|s_t^n, \theta)

In the actor-critic approach the term R(\tau^n) - b is instead evaluated by the critic, as r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n), the advantage function.
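To show how the advantage term slots into the earlier policy-gradient code, here is a minimal one-step advantage actor-critic sketch. It reuses the PolicyNet actor from the previous sketch, assumes batched tensors of transitions (s_t, a_t, r_t, s_{t+1}), and uses illustrative names (ValueNet, actor_critic_step); it is a sketch under these assumptions, not the A3C implementation from the cited paper.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Critic: observation vector -> scalar estimate of V^pi(s)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      obs, acts, rewards, next_obs, done):
    """One update from a batch of transitions (s_t, a_t, r_t, s_{t+1}).

    obs/next_obs: (B, obs_dim) float tensors, acts: (B,) long tensor,
    rewards/done: (B,) float tensors (done = 1.0 on the last step of an episode).
    """
    # Advantage A_t = r_t + V(s_{t+1}) - V(s_t), with V(s_{t+1}) = 0 after a terminal state.
    v = critic(obs)
    with torch.no_grad():
        v_next = critic(next_obs) * (1.0 - done)
        advantage = rewards + v_next - v.detach()

    # Critic: temporal-difference learning, pull V(s_t) toward r_t + V(s_{t+1}).
    critic_loss = ((rewards + v_next - v) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient with the advantage in place of R(tau^n) - b.
    log_p = torch.log(actor(obs).gather(1, acts.unsqueeze(1)).squeeze(1) + 1e-8)
    actor_loss = -(advantage * log_p).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```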
