第10章_强化学习学习教案

上传人：伊*** IP属地：上海上传时间：2022-06-22 格式：PPTX 页数：91 大小：1.12MB 积分：20 举报 版权申诉

已阅读5页，还剩86页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、会计学1第第10章章_强化强化(qinghu)学习学习第一页，共91页。2022-5-25强化(qinghu)学习史忠植2第1页/共90页第二页，共91页。2022-5-25强化(qinghu)学习史忠植3第2页/共90页第三页，共91页。2022-5-25强化(qinghu)学习史忠植4第3页/共90页第四页，共91页。2022-5-25强化(qinghu)学习史忠植5第4页/共90页第五页，共91页。2022-5-25强化(qinghu)学习史忠植6第5页/共90页第六页，共91页。2022-5-25强化(qinghu)学习史忠植7第6页/共90页第七页，共91页。2022-5

2、-25强化(qinghu)学习史忠植8强化学习(xux)模型i: inputr: reward s: statea: action状态 sisi+1ri+1奖励 ri动作动作 aia0a1a2s0s1s2s3第7页/共90页第八页，共91页。2022-5-25强化(qinghu)学习史忠植9The most complex general class of environments are inaccessible, non-deterministic, non-episodic, dynamic, and continuous.第8页/共90页第九页，共91页。2022-5-25强化(q

3、inghu)学习史忠植10EnvironmentactionstaterewardRLAgent1Pr, for all ,( ).asstttPssss aas sS aA s11, for all ,( ).assttttRE rss aa sss sS aA s第9页/共90页第十页，共91页。2022-5-25强化(qinghu)学习史忠植11RLSystemInputsOutputs (“actions”)Training Info = evaluations (“rewards” / “penalties”)lSupervised Learning Learn from exa

4、mples provided by a knowledgable external supervisor.第10页/共90页第十一页，共91页。2022-5-25强化(qinghu)学习史忠植12PolicyRewardValueModel ofenvironmentIs unknownIs my goalIs I can getIs my method第11页/共90页第十二页，共91页。2022-5-25强化(qinghu)学习史忠植131t1t4t23t2t1t4t33t22t1ttRrrrrrrrrrRThe basic idea: So: sssVrEssRE)s(Vt1t1tt

5、tOr, without the expectation operator: asassass)s(VRP)a, s()s(V is the discount rate第12页/共90页第十三页，共91页。2022-5-25强化(qinghu)学习史忠植1411( )( )( )max(),max( )tttta A saassssa A ssVsE rVsss aaPRVs其中(qzhng)：V*：状态值映射S：环境状态R：奖励函数P：状态转移概率函数：折扣因子第13页/共90页第十四页，共91页。2022-5-25强化(qinghu)学习史忠植15状态转移到s的概率。第14页/

6、共90页第十五页，共91页。2022-5-25强化(qinghu)学习史忠植16第15页/共90页第十六页，共91页。2022-5-25强化(qinghu)学习史忠植17Characteristics of MDP:a set of states : Sa set of actions : Aa reward function :R : S x A RA state transition function:T: S x A ( S) T (s,a,s): probability of transition from s to s using action a第16页/共90页第十七页，共9

7、1页。2022-5-25强化(qinghu)学习史忠植18第17页/共90页第十八页，共91页。2022-5-25强化(qinghu)学习史忠植19MDP EXAMPLE:TransitionfunctionStates and rewardsBellman Equation:(Greedy policy selection)第18页/共90页第十九页，共91页。2022-5-25强化(qinghu)学习史忠植20, : T (s, action, s )Similarity to Hidden Markov Models (HMMs)第19页/共90页第二十页，共91页。2022-5-

8、25强化(qinghu)学习史忠植21Deterministic transitionsStochastic transitionsaijMis the probability to reaching state j when taking action a in state istart3211234+1-1A simple environment that presents the agent with a sequential decision problem:Move cost = 0.04(Temporal) credit assignment problem sparse rei

9、nforcement problemOffline alg: action sequences determined ex anteOnline alg: action sequences is conditional on observations along the way; Important in stochastic environment (e.g. jet flying)第20页/共90页第二十一页，共91页。2022-5-25强化(qinghu)学习史忠植22M = 0.8 in direction you want to go 0.2 in perpendicular 0.

10、1 left0.1 rightPolicy: mapping from states to actions3211234+1-10.7053211234+1-1 0.8120.762 0.868 0.912 0.660 0.655 0.611 0.388An optimal policy for the stochastic environment:utilities of states:EnvironmentObservable (accessible): percept identifies the statePartially observableMarkov property: Tra

11、nsition probabilities depend on state only, not on the path to the state.Markov decision problem (MDP).Partially observable MDP (POMDP): percepts does not have enough info to identify transition probabilities.第21页/共90页第二十二页，共91页。2022-5-25强化(qinghu)学习史忠植23)()()|()|()(sVssRasssasVaas(as)是给定在随机策略下状态s时

12、动作a的概率。(ssa)是在动作a下状态s转到状态s的概率。这就是对V的Bellman(1957)等式。第22页/共90页第二十三页，共91页。2022-5-25强化(qinghu)学习史忠植2410),(juirk,10)()|(1ipiijipkijkk第23页/共90页第二十四页，共91页。2022-5-25强化(qinghu)学习史忠植25lFinite Horizon ProblemlInfinite Horizon ProblemlValue IterationiiiiiriGEiVNkkkkkkNNN0101),(,()()(iiiiirEiVNkkkkkkn0101),(,

13、(lim)(nijVjuirupiVTnjijiUu, 1,)(),()(max)(0)()()(1iVTiVkkNoImage)(min)(*iViV第24页/共90页第二十五页，共91页。2022-5-25强化(qinghu)学习史忠植2611( )argmax( )( )( , )( )( )max( )aassssasaaksssskasaaksssskassPRVsVss aPRVsVsPRVs*1010VVV policy evaluationpolicy improvement“greedification”Policy IterationValue Iteration第25页

14、/共90页第二十六页，共91页。2022-5-25强化(qinghu)学习史忠植2711( )()tttV sErV sTTTTstrt1st1TTTTTTTTT第26页/共90页第二十七页，共91页。2022-5-25强化(qinghu)学习史忠植28Idea: use the constraints (state transition probabilities) between states to speed learning.Solve jijjUMiRiU)()()(= value determination.No maximization over actions becaus

15、e agent is passive unlike in value iteration.using DPLarge state spacee.g. Backgammon: 1050 equations in 1050 variables第27页/共90页第二十八页，共91页。2022-5-25强化(qinghu)学习史忠植29AN ALTERNATIVE ITERATION: (Singh,1993)(Important for model free learning)Stop Iteration when V(s) differs less than .Policy difference

16、 ratio = 2 / (1- ) ( Williams & Baird 1993b)第28页/共90页第二十九页，共91页。2022-5-25强化(qinghu)学习史忠植30Policies converge faster than values.Why faster convergence? 第29页/共90页第三十页，共91页。2022-5-25强化(qinghu)学习史忠植31第30页/共90页第三十一页，共91页。2022-5-25强化(qinghu)学习史忠植32nQ Learning第31页/共90页第三十二页，共91页。2022-5-25强化(qinghu)学

17、习史忠植33第32页/共90页第三十三页，共91页。2022-5-25强化(qinghu)学习史忠植34. state following return actual the is where)()()(ttttttsRsVRsVsVTTTTTTTTTTstTTTTTTTTTT第33页/共90页第三十四页，共91页。2022-5-25强化(qinghu)学习史忠植35第34页/共90页第三十五页，共91页。2022-5-25强化(qinghu)学习史忠植36lGoal: learn V (s) under P and R are unknown in advancelGiven: so

18、me number of episodes under which contain slIdea: Average returns observed after visits to slEvery-Visit MC: average returns for every time s is visited in an episodelFirst-visit MC: average returns only for first time s is visited in an episodelBoth converge asymptotically12345第35页/共90页第三十六页，共91页。2

19、022-5-25强化(qinghu)学习史忠植37Problem: Unvisited pairs(problem of maintaining exploration)For every make sure that:P( selected as a start state and action) 0 (Assumption of exploring starts ) 第36页/共90页第三十七页，共91页。2022-5-25强化(qinghu)学习史忠植38 MC policy iteration: Policy evaluation using MC methods followed

20、 by policy improvement Policy improvement step: greedify with respect to value (or action-value) function第37页/共90页第三十八页，共91页。2022-5-25强化(qinghu)学习史忠植39时序差分学习中没有环境模型，根据经验学习。每步进行迭代，不需要等任务完成。预测模型的控制算法，根据历史信息判断将来的输入(shr)和输出，强调模型的函数而非模型的结构。时序差分方法和蒙特卡罗方法类似，仍然采样一次学习循环中获得的瞬时奖惩反馈，但同时类似与动态规划方法采用自举方法估计状态的值函数。

21、然后通过多次迭代学习，去逼近真实的状态值函数。第38页/共90页第三十九页，共91页。2022-5-25强化(qinghu)学习史忠植40)()()()(11tttttsVsVrsVsVTTTTTTTTTTst1rt1stTTTTTTTTTT第39页/共90页第四十页，共91页。2022-5-25强化(qinghu)学习史忠植41)()()(:methodCarlo Monte visit-every SimplettttsVRsVsV )()()()(:TD(0) method, TD simplest The11tttttsVsVrsVsVtarget: the actual retu

22、rn after time ttarget: an estimate of the return第40页/共90页第四十一页，共91页。2022-5-25强化(qinghu)学习史忠植42Idea: Do ADP backups on a per move basis, not for the whole state space.)()()()()(iUjUiRiUiUTheorem: Average value of U(i) converges to the correct value.Theorem: If is appropriately decreased as a functio

23、n of times a state is visited (=Ni), then U(i) itself converges to the correct value第41页/共90页第四十二页，共91页。2022-5-25强化(qinghu)学习史忠植43111)(1)(11)1 ()1 (tTnttTntnntnntRRRR)()(tttttsVRsV第42页/共90页第四十三页，共91页。2022-5-25强化(qinghu)学习史忠植44Idea: update from the whole epoch, not just on state transition.kmmmmkmi

24、UiUiRiUiU)()()()()(1Special cases:=1: Least-mean-square (LMS), Mont Carlo=0: TDIntermediate choice of (between 0 and 1) is best. Interplay with 第43页/共90页第四十四页，共91页。2022-5-25强化(qinghu)学习史忠植45算法算法(sun f) 10.1 TD(0)学习算法学习算法(sun f)Initialize V(s) arbitrarily, to the policy to be evaluatedRepeat (for ea

25、ch episode) Initialize s Repeat (for each step of episode) Choose a from s using policy derived from V(e.g., -greedy) Take action a, observer r, s Until s is terminal V sV srV sV sss 第44页/共90页第四十五页，共91页。2022-5-25强化(qinghu)学习史忠植46第45页/共90页第四十六页，共91页。2022-5-25强化(qinghu)学习史忠植47Theorem: Converges w.p.

26、 1 under certain boundaries conditions.Decrease i(t) s.t. )()(2tttitiIn practice, often a fixed is used for all i and t.第46页/共90页第四十七页，共91页。2022-5-25强化(qinghu)学习史忠植48第47页/共90页第四十八页，共91页。2022-5-25强化(qinghu)学习史忠植49Watkins, 1989在Q学习中，回溯从动作结点开始，最大化下一个状态的所有可能动作和它们的奖励。在完全递归定义的Q学习中，回溯树的底部结点一个从根结点开始的动作和它们

27、的后继动作的奖励的序列可以到达的所有终端结点。联机的Q学习，从可能的动作向前扩展，不需要建立一个完全的世界模型(mxng)。Q学习还可以脱机执行。我们可以看到，Q学习是一种时序差分的方法。第48页/共90页第四十九页，共91页。2022-5-25强化(qinghu)学习史忠植50在Q学习(xux)中，Q是状态-动作对到学习(xux)到的值的一个函数。对所有的状态和动作： Q: (state x action) value 对Q学习(xux)中的一步：),(),(MAX),()1 (),(11tttatttttasQasQrcasQcasQ (10.15)其中c和都1，rt+1是状态(zhun

28、gti)st+1的奖励。第49页/共90页第五十页，共91页。2022-5-25强化(qinghu)学习史忠植51第50页/共90页第五十一页，共91页。2022-5-25强化(qinghu)学习史忠植52Q (a,i),(max)(iaQiUajaaijjaQMiRiaQ), (max)(),(Direct approach (ADP) would require learning a model .Q-learning does not:aijM),(), (max)(),(),(iaQjaQiRiaQiaQaDo this update after each state trans

29、ition:第51页/共90页第五十二页，共91页。2022-5-25强化(qinghu)学习史忠植53Tradeoff between exploitation (control) and exploration (identification) Extremes: greedy vs. random acting(n-armed bandit models)Q-learning converges to optimal Q-values if* Every state is visited infinitely often (due to exploration),* The actio

30、n selection becomes greedy as time approaches infinity, and* The learning rate is decreased fast enough but not too fast (as we discussed in TD learning)第52页/共90页第五十三页，共91页。2022-5-25强化(qinghu)学习史忠植54)(sA1.In value iteration in an ADP agent: Optimistic estimate of utility U+(i)2.-greedy methodNongre

31、edy actions Greedy action3.Boltzmann explorationjaijaiaNjUMfiRiU),(),(max)()(Exploration func),(nufR+ if nNu o.w.aTjUMTjUMajaijjaijeeP)()(*)(1sA第53页/共90页第五十四页，共91页。2022-5-25强化(qinghu)学习史忠植55Until s is terminal,max,aQ s aQ s arQ s aQ s ass第54页/共90页第五十五页，共91页。2022-5-25强化(qinghu)学习史忠植56. 01TQ.);,(1mi

32、narg).;,(max, 0,.,1,21,11111niitittitttttttattQYnaQrYTTttASAS.),;,(maxarg),(1,tQttttatttQtasas第55页/共90页第五十六页，共91页。2022-5-25强化(qinghu)学习史忠植57ttttttattttaQrEQtASAS,| ),(max),(11*1*1AS*11ttQQ2*);,(ttttQQEAS211*1);,(),(max1tttttttatQaQrEtASAS第56页/共90页第五十七页，共91页。2022-5-25强化(qinghu)学习史忠植58第57页/共90页第五十八页

33、，共91页。2022-5-25强化(qinghu)学习史忠植59);,(max);,();,(.),|();,(1minarg).;,(, 0,.,1,1,21, 1,1tttttattttttttniitittitittittTtjjjjjTtjjtaEYnrYTTttASASASASASAS.),;,(maxarg),(1,tttttatttAtasas第58页/共90页第五十九页，共91页。2022-5-25强化(qinghu)学习史忠植60第59页/共90页第六十页，共91页。2022-5-25强化(qinghu)学习史忠植61第60页/共90页第六十一页，共91页。2022-5

34、-25强化(qinghu)学习史忠植62The optimal strategy might need to consider the history.第61页/共90页第六十二页，共91页。2022-5-25强化(qinghu)学习史忠植63POMDP由六元组定义。其中定义了环境潜在的马尔可夫决策模型上，是观察的集合，即系统可以感知的世界状态集合，观察函数(hnsh)：SAPD（）。系统在采取动作a转移到状态s时，观察函数(hnsh)确定其在可能观察上的概率分布。记为（s, a, o）。第62页/共90页第六十三页，共91页。2022-5-25强化(qinghu)学习史忠植64What

35、 if state information (from sensors) is noisy?Mostly the case!MDP techniques are suboptimal!Two halls are not the same.第63页/共90页第六十四页，共91页。2022-5-25强化(qinghu)学习史忠植65SE: Belief State Estimator (Can be based on HMM): MDP Techniques第64页/共90页第六十五页，共91页。2022-5-25强化(qinghu)学习史忠植66Open Problem : How to d

36、eal with the continuous distribution? 第65页/共90页第六十六页，共91页。2022-5-25强化(qinghu)学习史忠植67 , , ,Pr, ,Pr,s SO s a oP s a s b sb ss o a bo a b, ,Pr, ,Pr,o Ob a bb a b oo a b ,s Sb ab s R s a第66页/共90页第六十七页，共91页。2022-5-25强化(qinghu)学习史忠植68算法名称基本思想学习值函数Memoryless policies直接采用标准的强化学习算法直接采用标准的强化学习算法Simple memor

37、y based approaches使用使用k个历史观察表示当前状态个历史观察表示当前状态UDM(Utile Distinction Memory)分解状态，构建有限状态机模型分解状态，构建有限状态机模型NSM(Nearest Sequence Memory)存储状态历史，进行距离度量存储状态历史，进行距离度量USM(Utile Suffix Memory)综合综合UDM和和NSM两种方法两种方法Recurrent-Q使用循环神经网络进行状态预测使用循环神经网络进行状态预测策略搜索Evolutionary algorithms使用遗传算法直接进行策略搜索使用遗传算法直接进行策略搜索Gradie

38、nt ascent method使用梯度下降（上升）法搜索使用梯度下降（上升）法搜索第67页/共90页第六十八页，共91页。2022-5-25强化(qinghu)学习史忠植69RLFASubset of statesValue estimate as targetsV (s)Generalization of the value function to the entire state space00000,V M VM VMM VMM Vis the TD operator.Mis the function approximation operator.第68页/共90页第六十九页，共91

39、页。2022-5-25强化(qinghu)学习史忠植70,1, ,max,aQ s aVs ar s a sVs a,Vs aM Q s aHow to construct the M function? Using state cluster, interpolation, decision tree or neural network?第69页/共90页第七十页，共91页。2022-5-25强化(qinghu)学习史忠植71lFunction Approximator: V( s) = f( s, w)lUpdate: Gradient-descent Sarsa: w w + rt+

40、1 + Q(st+1,at+1)- Q(st,at) w f(st,at,w)weight vectorStandard gradienttarget valueestimated valueOpen Problem : How to design the non-liner FA system which can converge with the incremental instances? 第70页/共90页第七十一页，共91页。2022-5-25强化(qinghu)学习史忠植72Discrete timeHomogeneous discountContinuous timeDiscr

41、ete eventsInterval-dependent discountDiscrete timeDiscrete eventsInterval-dependent discountA discrete-time SMDP overlaid on an MDPCan be analyzed at either level. One approach to Temporal Hierarchical RLMDPSMDPOptionsover MDPStateTime第71页/共90页第七十二页，共91页。2022-5-25强化(qinghu)学习史忠植73 *,max, | ,a AsVsR

42、 s aP ss a Vs*, | ,max,aAsQs aR s aP ss aQs a1111,1,max,kktttkaAQs aQs arrrQs a 第72页/共90页第七十三页，共91页。2022-5-25强化(qinghu)学习史忠植74EnvironmentactionstaterewardRLAgentRLAgent第73页/共90页第七十四页，共91页。2022-5-25强化(qinghu)学习史忠植75问题空间主要方法算法准则合作多agent强化学习分布、同构、分布、同构、合作环境合作环境交换状态交换状态提高学习收敛速度提高学习收敛速度交换经验交换经验交换策略交换策略交换建议交换建议基于平衡解多agent强化学习同构或异构、同构或异构、合作或竞争环境合作或竞争环境极小极大极小极大-Q-Q理性和收敛性理性和收敛性NASH-QNASH-QCE-QCE-QW

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

第10章_强化学习学习教案

文档简介

温馨提示

最新文档

评论

第10章_强化学习学习教案

文档简介

温馨提示

最新文档

评论

相关文档