强化学习专题知识市公开课金奖市赛课一等奖课件

上传人：满*** IP属地：江苏上传时间：2022-10-10 格式：PPTX 页数：90 大小：1.57MB 积分：80 举报 版权申诉

已阅读5页，还剩85页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、/10/10强化学习史忠植1高级人工智能第十章史忠植中国科学院计算技术研究所强化学习第1页/10/10强化学习史忠植2内容提要引言强化学习模型动态规划蒙特卡罗方法时序差分学习Q学习强化学习中函数预计应用第2页/10/10强化学习史忠植3引言人类通常从与外界环境交互中学习。所谓强化（reinforcement）学习是指从环境状态到行为映射学习，以使系统行为从环境中取得累积奖励值最大。在强化学习中，我们设计算法来把外界环境转化为最大化奖励量方式动作。我们并没有直接告诉主体要做什么或者要采取哪个动作,而是主体经过看哪个动作得到了最多奖励来自己发觉。主体动作影响不只是马上得到奖励，而且还

2、影响接下来动作和最终奖励。试错搜索(trial-and-error search)和延期强化(delayed reinforcement)这两个特征是强化学习中两个最主要特征。第3页/10/10强化学习史忠植4引言强化学习技术是从控制理论、统计学、心理学等相关学科发展而来，最早能够追溯到巴甫洛夫条件反射试验。但直到上世纪八十年代末、九十年代初强化学习技术才在人工智能、机器学习和自动控制等领域中得到广泛研究和应用，并被认为是设计智能系统关键技术之一。尤其是伴随强化学习数学基础研究取得突破性进展后，对强化学习研究和应用日益开展起来，成为当前机器学习领域研究热点之一。第4页/10/10强化学

3、习史忠植5引言强化思想最先起源于心理学研究。19Thorndike提出了效果律（Law of Effect）：一定情景下让动物感到舒适行为，就会与此情景增强联络（强化），当此情景再现时，动物这种行为也更易再现；相反，让动物感觉不舒适行为，会减弱与情景联络，此情景再现时，此行为将极难再现。换个说法，哪种行为会“记住”，会与刺激建立联络，取决于行为产生效果。动物试错学习,包含两个含义：选择（selectional）和联络（associative），对应计算上搜索和记忆。所以，1954年，Minsky在他博士论文中实现了计算上试错学习。同年，Farley和Clark也在计算上对它进行了研究。强化学

4、习一词最早出现于科技文件是1961年Minsky 论文“Steps Toward Artificial Intelligence”，今后开始广泛使用。1969年， Minsky因在人工智能方面贡献而取得计算机图灵奖。第5页/10/10强化学习史忠植6引言1953到1957年，Bellman提出了求解最优控制问题一个有效方法：动态规划（dynamic programming） Bellman于 1957年还提出了最优控制问题随机离散版本，就是著名马尔可夫决议过程（MDP, Markov decision processe），1960年Howard提出马尔可夫决议过程策略迭代方法，这些都成为当代

5、强化学习理论基础。1972年，Klopf把试错学习和时序差分结合在一起。1978年开始，Sutton、Barto、 Moore，包含Klopf等对这二者结合开始进行深入研究。 1989年Watkins提出了Q-学习Watkins 1989，也把强化学习三条根本扭在了一起。1992年，Tesauro用强化学习成功了应用到西洋双陆棋（backgammon）中，称为TD-Gammon 。第6页/10/10强化学习史忠植7内容提要引言强化学习模型动态规划蒙特卡罗方法时序差分学习Q学习强化学习中函数预计应用第7页/10/10强化学习史忠植8主体强化学习模型i: inputr: reward s: s

6、tatea: action状态 sisi+1ri+1奖励 ri环境动作 aia0a1a2s0s1s2s3第8页/10/10强化学习史忠植9描述一个环境（问题）Accessible vs. inaccessibleDeterministic vs. non-deterministicEpisodic vs. non-episodicStatic vs. dynamicDiscrete vs. continuousThe most complex general class of environments are inaccessible, non-deterministic, non-epis

7、odic, dynamic, and continuous.第9页/10/10强化学习史忠植10强化学习问题Agent-environment interactionStates, Actions, RewardsTo define a finite MDPstate and action sets : S and Aone-step “dynamics” defined by transition probabilities (Markov Property):reward probabilities:EnvironmentactionstaterewardRLAgent第10页/10/1

8、0强化学习史忠植11与监督学习对比Reinforcement Learning Learn from interactionlearn from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.RLSystemInputsOutputs (“ac

9、tions”)Training Info = evaluations (“rewards” / “penalties”)Supervised Learning Learn from examples provided by a knowledgable external supervisor.第11页/10/10强化学习史忠植12强化学习要素Policy: stochastic rule for selecting actionsReturn/Reward: the function of future rewards agent tries to maximizeValue: what i

10、s good because it predicts rewardModel: what follows whatPolicyRewardValueModel ofenvironmentIs unknownIs my goalIs I can getIs my method第12页/10/10强化学习史忠植13在策略下Bellman公式The basic idea: So: Or, without the expectation operator: is the discount rate第13页/10/10强化学习史忠植14Bellman最优策略公式其中：V*：状态值映射S：环境状态R

11、：奖励函数P：状态转移概率函数：折扣因子第14页/10/10强化学习史忠植15马尔可夫决议过程 MARKOV DECISION PROCESS 由四元组定义。环境状态集S 系统行为集合A 奖励函数R：SA 状态转移函数P：SAPD（S）记R（s，a，s）为系统在状态s采取a动作使环境状态转移到s取得瞬时奖励值；记P（s，a，s）为系统在状态s采取a动作使环境状态转移到s概率。第15页/10/10强化学习史忠植16马尔可夫决议过程 MARKOV DECISION PROCESS马尔可夫决议过程本质是：当前状态向下一状态转移概率和奖励值只取决于当前状态和选择动作，而与历史状态和历史动

12、作无关。所以在已知状态转移概率函数P和奖励函数R环境模型知识下，能够采取动态规划技术求解最优策略。而强化学习着重研究在P函数和R函数未知情况下，系统怎样学习最优行为策略。第16页/10/10强化学习史忠植17MARKOV DECISION PROCESSCharacteristics of MDP:a set of states : Sa set of actions : Aa reward function :R : S x A RA state transition function:T: S x A ( S) T (s,a,s): probability of transition f

13、rom s to s using action a第17页/10/10强化学习史忠植18马尔可夫决议过程 MARKOV DECISION PROCESS第18页/10/10强化学习史忠植19MDP EXAMPLE:TransitionfunctionStates and rewardsBellman Equation:(Greedy policy selection)第19页/10/10强化学习史忠植20MDP Graphical Representation, : T (s, action, s )Similarity to Hidden Markov Models (HMMs)第20

14、页/10/10强化学习史忠植21Reinforcement Learning Deterministic transitionsStochastic transitionsis the probability to reaching state j when taking action a in state istart3211234+1-1A simple environment that presents the agent with a sequential decision problem:Move cost = 0.04(Temporal) credit assignment pr

15、oblem sparse reinforcement problemOffline alg: action sequences determined ex anteOnline alg: action sequences is conditional on observations along the way; Important in stochastic environment (e.g. jet flying)第21页/10/10强化学习史忠植22Reinforcement Learning M = 0.8 in direction you want to go 0.2 in perp

16、endicular 0.1 left0.1 rightPolicy: mapping from states to actions3211234+1-10.7053211234+1-1 0.8120.762 0.868 0.912 0.660 0.655 0.611 0.388An optimal policy for the stochastic environment:utilities of states:EnvironmentObservable (accessible): percept identifies the statePartially observableMarkov p

17、roperty: Transition probabilities depend on state only, not on the path to the state.Markov decision problem (MDP).Partially observable MDP (POMDP): percepts does not have enough info to identify transition probabilities.第22页/10/10强化学习史忠植23动态规划Dynamic Programming动态规划(dynamic programming)方法经过从后继状态回溯

18、到前驱状态来计算赋值函数。动态规划方法基于下一个状态分布模型来接连更新状态。强化学习动态规划方法是基于这么一个事实：对任何策略和任何状态s，有(10.9)式迭代一致等式成立(as)是给定在随机策略下状态s时动作a概率。(ssa)是在动作a下状态s转到状态s概率。这就是对VBellman(1957)等式。第23页/10/10强化学习史忠植24动态规划Dynamic Programming - ProblemA discrete-time dynamic systemStates 1, , n + termination state 0Control U(i)Transition Probabi

19、lity pij(u)Accumulative cost structurePolicies第24页/10/10强化学习史忠植25Finite Horizon ProblemInfinite Horizon ProblemValue Iteration动态规划Dynamic Programming Iterative Solution 第25页/10/10强化学习史忠植26动态规划中策略迭代/值迭代 policy evaluationpolicy improvement“greedification”Policy IterationValue Iteration第26页/10/10强化学习

20、史忠植27动态规划方法TTTTTTTTTTTTT第27页/10/10强化学习史忠植28自适应动态规划(ADP)Idea: use the constraints (state transition probabilities) between states to speed learning.Solve = value determination.No maximization over actions because agent is passive unlike in value iteration.using DPLarge state spacee.g. Backgammon: 1

21、050 equations in 1050 variables第28页/10/10强化学习史忠植29Value Iteration AlgorithmAN ALTERNATIVE ITERATION: (Singh,1993)(Important for model free learning)Stop Iteration when V(s) differs less than .Policy difference ratio = 2 / (1- ) ( Williams & Baird 1993b)第29页/10/10强化学习史忠植30Policy Iteration Algorithm

22、 Policies converge faster than values.Why faster convergence? 第30页/10/10强化学习史忠植31动态规划Dynamic Programming经典动态规划模型作用有限，很多问题极难给出环境完整模型。仿真机器人足球就是这么问题，能够采取实时动态规划方法处理这个问题。在实时动态规划中不需要事先给出环境模型，而是在真实环境中不停测试，得到环境模型。能够采取反传神经网络实现对状态泛化，网络输入单元是环境状态s, 网络输出是对该状态评价V(s)。第31页/10/10强化学习史忠植32没有模型方法Model Free MethodsMo

23、dels of the environment:T: S x A ( S) and R : S x A RDo we know them? Do we have to know them?Monte Carlo MethodsAdaptive Heuristic CriticQ Learning第32页/10/10强化学习史忠植33蒙特卡罗方法 Monte Carlo Methods 蒙特卡罗方法不需要一个完整模型。而是它们对状态整个轨道进行抽样，基于抽样点最终止果来更新赋值函数。蒙特卡罗方法不需要经验，即从与环境联机或者模拟交互中抽样状态、动作和奖励序列。联机经验是令人感兴趣，因为它不需要

24、环境先验知识，却依然能够是最优。从模拟经验中学习功效也很强大。它需要一个模型，但它能够是生成而不是分析，即一个模型能够生成轨道却不能计算明确概率。于是，它不需要产生在动态规划中要求全部可能转变完整概率分布。第33页/10/10强化学习史忠植34Monte Carlo方法TTTTTTTTTTTTTTTTTTTT第34页/10/10强化学习史忠植35蒙特卡罗方法 Monte Carlo Methods Idea: Hold statistics about rewards for each state Take the average This is the V(s)Based only on

25、 experience Assumes episodic tasks (Experience is divided into episodes and all episodes will terminate regardless of the actions selected.) Incremental in episode-by-episode sense not step-by-step sense. 第35页/10/10强化学习史忠植36Monte Carlo策略评价Goal: learn Vp(s) under P and R are unknown in advanceGiven:

26、 some number of episodes under p which contain sIdea: Average returns observed after visits to sEvery-Visit MC: average returns for every time s is visited in an episodeFirst-visit MC: average returns only for first time s is visited in an episodeBoth converge asymptotically12345第36页/10/10强化学习史忠植37

27、Problem: Unvisited pairs(problem of maintaining exploration)For every make sure that:P( selected as a start state and action) 0 (Assumption of exploring starts ) 蒙特卡罗方法第37页/10/10强化学习史忠植38蒙特卡罗控制How to select Policies:(Similar to policy evaluation) MC policy iteration: Policy evaluation using MC met

28、hods followed by policy improvement Policy improvement step: greedify with respect to value (or action-value) function第38页/10/10强化学习史忠植39时序差分学习 Temporal-Difference时序差分学习中没有环境模型，依据经验学习。每步进行迭代，不需要等任务完成。预测模型控制算法，依据历史信息判断未来输入和输出，强调模型函数而非模型结构。时序差分方法和蒙特卡罗方法类似，依然采样一次学习循环中取得瞬时奖惩反馈，但同时类似与动态规划方法采取自举方法预计状态值函数

29、。然后经过屡次迭代学习，去迫近真实状态值函数。第39页/10/10强化学习史忠植40时序差分学习 TDTTTTTTTTTTTTTTTTTTTT第40页/10/10强化学习史忠植41时序差分学习 Temporal-Differencetarget: the actual return after time ttarget: an estimate of the return第41页/10/10强化学习史忠植42时序差分学习 (TD)Idea: Do ADP backups on a per move basis, not for the whole state space.Theorem:

30、 Average value of U(i) converges to the correct value.Theorem: If is appropriately decreased as a function of times a state is visited (=Ni), then U(i) itself converges to the correct value第42页/10/10强化学习史忠植43TD(l) A Forward ViewTD(l) is a method for averaging all n-step backups weight by ln-1 (time

31、 since visitation)l-return: Backup using l-return:第43页/10/10强化学习史忠植44时序差分学习算法 TD()Idea: update from the whole epoch, not just on state transition.Special cases:=1: Least-mean-square (LMS), Mont Carlo=0: TDIntermediate choice of (between 0 and 1) is best. Interplay with 第44页/10/10强化学习史忠植45时序差分学习算法

32、TD()算法 10.1 TD(0)学习算法Initialize V(s) arbitrarily, to the policy to be evaluatedRepeat (for each episode) Initialize s Repeat (for each step of episode) Choose a from s using policy derived from V(e.g., -greedy) Take action a, observer r, s Until s is terminal 第45页/10/10强化学习史忠植46时序差分学习算法第46页/10/10强化

33、学习史忠植47时序差分学习算法收敛性TD()Theorem: Converges w.p. 1 under certain boundaries conditions.Decrease i(t) s.t. In practice, often a fixed is used for all i and t.第47页/10/10强化学习史忠植48时序差分学习 TD第48页/10/10强化学习史忠植49Q-learningWatkins, 1989在Q学习中，回溯从动作结点开始，最大化下一个状态全部可能动作和它们奖励。在完全递归定义Q学习中，回溯树底部结点一个从根结点开始动作和它们后继动作奖

34、励序列能够抵达全部终端结点。联机Q学习，从可能动作向前扩展，不需要建立一个完全世界模型。Q学习还能够脱机执行。我们能够看到，Q学习是一个时序差分方法。第49页/10/10强化学习史忠植50Q-learning在Q学习中，Q是状态-动作对到学习到值一个函数。对全部状态和动作： Q: (state x action) value 对Q学习中一步： (10.15)其中c和都1，rt+1是状态st+1奖励。第50页/10/10强化学习史忠植51Q-LearningEstimate the Q-function using some approximator (for example, linea

35、r regression or neural networks or decision trees etc.).Derive the estimated policy as an argument of the maximum of the estimated Q-function.Allow different parameter vectors at different time points.Let us illustrate the algorithm with linear regression as the approximator, and of course, squared

36、error as the appropriate loss function.第51页/10/10强化学习史忠植52Q-learningQ (a,i)Direct approach (ADP) would require learning a model .Q-learning does not:Do this update after each state transition:第52页/10/10强化学习史忠植53ExplorationTradeoff between exploitation (control) and exploration (identification) Ext

37、remes: greedy vs. random acting(n-armed bandit models)Q-learning converges to optimal Q-values if* Every state is visited infinitely often (due to exploration),* The action selection becomes greedy as time approaches infinity, and* The learning rate a is decreased fast enough but not too fast (as we

38、 discussed in TD learning)第53页/10/10强化学习史忠植54Common exploration methodsIn value iteration in an ADP agent: Optimistic estimate of utility U+(i)-greedy methodNongreedy actions Greedy actionBoltzmann explorationExploration funcR+ if nNu o.w.第54页/10/10强化学习史忠植55Q-Learning AlgorithmQ学习算法Initialize Q(s,

39、a) arbitrarilyRepeat (for each episode) Initialize s Repeat (for each step of episode) Choose a from s using policy derived from Q (e.g., -greedy)Take action a, observer r, s Until s is terminal第55页/10/10强化学习史忠植56Q-Learning AlgorithmSetForThe estimated policy satisfies第56页/10/10强化学习史忠植57What is th

40、e intuition?Bellman equation gives If and the training set were infinite, then Q-learning minimizes which is equivalent to minimizing第57页/10/10强化学习史忠植58A-Learning Murphy, and Robins, Estimate the A-function (advantages) using some approximator, as in Q-learning.Derive the estimated policy as an arg

41、ument of the maximum of the estimated A-function.Allow different parameter vectors at different time points.Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function.第58页/10/10强化学习史忠植59A-Learning Algorithm (Inefficient

42、Version)ForThe estimated policy satisfies第59页/10/10强化学习史忠植60Differences between Q and A-learningQ-learningAt time t we model the main effects of the history, (St,At-1) and the action At and their interactionOur Yt-1 is affected by how we modeled the main effect of the history in time t, (St,At-1) A

43、-learningAt time t we only model the effects of At and its interaction with (St,At-1) Our Yt-1 does not depend on a model of the main effect of the history in time t, (St,At-1) 第60页/10/10强化学习史忠植61Q-Learning Vs. A-LearningRelative merits and demerits are not completely known till now.Q-learning has

44、low variance but high bias.A-learning has high variance but low bias. Comparison of Q-learning with A-learning involves a bias-variance trade-off.第61页/10/10强化学习史忠植62POMDP部分感知马氏决议过程 Rather than observing the state we observe some function of the state.Ob Observable functiona random variable for each

45、 states.Problem : different states may look similarThe optimal strategy might need to consider the history.第62页/10/10强化学习史忠植63Framework of POMDPPOMDP由六元组定义。其中定义了环境潜在马尔可夫决议模型上，是观察集合，即系统能够感知世界状态集合，观察函数：SAPD（）。系统在采取动作a转移到状态s时，观察函数确定其在可能观察上概率分布。记为（s, a, o）。1 能够是S子集，也能够与S无关第63页/10/10强化学习史忠植64POMDPsWhat

46、 if state information (from sensors) is noisy?Mostly the case!MDP techniques are suboptimal!Two halls are not the same.第64页/10/10强化学习史忠植65POMDPs A Solution StrategySE: Belief State Estimator (Can be based on HMM): MDP Techniques第65页/10/10强化学习史忠植66POMDP_信度状态方法Idea: Given a history of actions and ob

47、servable value, we compute a posterior distribution for the state we are in (belief state)The belief-state MDPStates: distribution over S (states of the POMDP)Actions: as in POMDPTransition: the posterior distribution (given the observation)Open Problem : How to deal with the continuous distribution

48、? 第66页/10/10强化学习史忠植67The Learning Process of Belief MDP第67页/10/10强化学习史忠植68Major Methods to Solve POMDP 算法名称基本思想学习值函数Memoryless policies直接采取标准强化学习算法Simple memory based approaches使用k个历史观察表示当前状态UDM(Utile Distinction Memory)分解状态，构建有限状态机模型NSM(Nearest Sequence Memory)存放状态历史，进行距离度量USM(Utile Suffix Memory

49、)综合UDM和NSM两种方法Recurrent-Q使用循环神经网络进行状态预测策略搜索Evolutionary algorithms使用遗传算法直接进行策略搜索Gradient ascent method使用梯度下降（上升）法搜索第68页/10/10强化学习史忠植69强化学习中函数预计RLFASubset of statesValue estimate as targetsV (s)Generalization of the value function to the entire state spaceis the TD operator.is the function approxima

50、tion operator.第69页/10/10强化学习史忠植70并行两个迭代过程值函数迭代过程值函数迫近过程How to construct the M function? Using state cluster, interpolation, decision tree or neural network?第70页/10/10强化学习史忠植71Function Approximator: V( s) = f( s, w)Update: Gradient-descent Sarsa: w w + a rt+1 + g Q(st+1,at+1)- Q(st,at) w f(st,at,w)

51、weight vectorStandard gradienttarget valueestimated valueOpen Problem : How to design the non-liner FA system which can converge with the incremental instances? 并行两个迭代过程第71页/10/10强化学习史忠植72Semi-MDPDiscrete timeHomogeneous discountContinuous timeDiscrete eventsInterval-dependent discountDiscrete time

52、Discrete eventsInterval-dependent discountA discrete-time SMDP overlaid on an MDPCan be analyzed at either level. One approach to Temporal Hierarchical RL第72页/10/10强化学习史忠植73The equations第73页/10/10强化学习史忠植74Multi-agent MDPDistributed RLMarkov GameBest ResponseEnvironmentactionstaterewardRLAgentRLAge

53、nt第74页/10/10强化学习史忠植75三种观点问题空间主要方法算法准则合作多agent强化学习分布、同构、合作环境交换状态提升学习收敛速度交换经验交换策略交换提议基于平衡解多agent强化学习同构或异构、合作或竞争环境极小极大-Q理性和收敛性NASH-QCE-QWoLF最正确响应多agent强化学习异构、竞争环境PHC收敛性和不遗憾性IGAGIGAGIGA-WoLF第75页/10/10强化学习史忠植76马尔可夫对策在n个agent系统中，定义离散状态集S（即对策集合G），agent动作集Ai集合A, 联合奖赏函数Ri：SA1An和状态转移函数P：SA1AnPD（S）。第76页/10/

54、10强化学习史忠植77基于平衡解方法强化学习Open Problem : Nash equilibrium or other equilibrium is enough? The optimal policy in single game is Nash equilibrium.第77页/10/10强化学习史忠植78Applications of RLCheckers Samuel 59TD-Gammon Tesauro 92Worlds best downpeak elevator dispatcher Crites at al 95Inventory management Bertse

55、kas et al 9510-15% better than industry standardDynamic channel assignment Singh & Bertsekas, Nie&Haykin 95Outperforms best heuristics in the literatureCart-pole Michie&Chambers 68- with bang-bang controlRobotic manipulation Grupen et al. 93-Path planningRobot docking Lin 93ParkingFootball Stone98Te

56、trisMultiagent RL Tan 93, Sandholm&Crites 95, Sen 94-, Carmel&Markovitch 95-, lots of work sinceCombinatorial optimization: maintenance & repairControl of reasoning Zhang & Dietterich IJCAI-95第78页/10/10强化学习史忠植79仿真机器人足球应用Q学习算法进行仿真机器人足球2 对1 训练，训练目标是试图使主体学习取得到一个战略上意识，能够在进攻中进行配合第79页/10/10强化学习史忠植80仿真

57、机器人足球前锋A控球，而且在可射门区域内，不过A已经没有射门角度了；队友B也处于射门区域，而且B含有良好射门角度。A传球给B，射门由B来完成，那么这次进攻配合就会很成功。经过Q学习方法来进行2 对1射门训练，让A掌握在这种状态情况下传球给B动作是最优策略；主体经过大量学习训练（大数量级状态量和重复相同状态）来取得策略，所以更含有适应性。第80页/10/10强化学习史忠植81仿真机器人足球状态描述，将进攻禁区划分为个小区域，每个小区域是边长为2m正方形，一个二维数组()便可描述这个区域。使用三个Agent位置来描述2 对 1进攻时环境状态，利用图10.11所表示划分来泛化状态。可认为主体位于同一战略区域为相同状态，这么对状态描述即使不准确，但设计所需是一个战略层次描述，可认为Agent在战略区域内是主动跑动，这种方法满足了需求。如此，便描述了一个特定状态；其中，是进攻队员A区域编号，是进攻队员B区域编号，是守门员区域编号。区域编号计算公式为：。对应，所保留状态值为三个区域编号组成对。前锋A控球，而且在可射门区域内，不过A已经没有射门角度了；队友B也处于射门区域，而且B含有良好射门角度。A传球给B，射门由B来完成，那么这次进攻配合就会很成功。经过Q学习方法来进行2 对1射

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

强化学习专题知识市公开课金奖市赛课一等奖课件

文档简介

温馨提示

最新文档

评论

强化学习专题知识市公开课金奖市赛课一等奖课件

文档简介

温馨提示

最新文档

评论

相关文档