Reinforcement Learning and Optimal Control (強化學習與最優控制)

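Editor's Recommendation

Dimitri P. Bertsekas is a tenured professor at MIT, a member of the U.S. National Academy of Engineering, and a visiting professor at the Research Center for Complex and Networked Systems at Tsinghua University. An internationally known author in electrical engineering and computer science, he has written more than ten best-selling textbooks and monographs, including Nonlinear Programming, Network Optimization, and Convex Optimization. This book considers large and challenging multistage decision problems that can in principle be solved by dynamic programming and optimal control, but whose exact solution is computationally intractable. It discusses solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are known collectively as reinforcement learning, and also go by names such as approximate dynamic programming and neuro-dynamic programming.
The subject of the book arose from the interplay of ideas from optimal control and artificial intelligence. One of its aims is to explore the common boundary between these two fields and to build a bridge that is accessible to professionals with a background in either field.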

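About the Author

Dimitri P. Bertsekas is a tenured professor at MIT, a member of the U.S. National Academy of Engineering, and a visiting professor at the Research Center for Complex and Networked Systems at Tsinghua University. An internationally known author in electrical engineering and computer science, he has written more than ten best-selling textbooks and monographs, including Nonlinear Programming, Network Optimization, and Convex Optimization.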

Contents

1 Exact Dynamic Programming
1.1 Deterministic Dynamic Programming
1.1.1 Deterministic Problems
1.1.2 The Dynamic Programming Algorithm
1.1.3 Approximation in Value Space
1.2 Stochastic Dynamic Programming
1.3 Examples, Variations, and Simplifications
1.3.1 Deterministic Shortest Path Problems
1.3.2 Discrete Deterministic Optimization
1.3.3 Problems with a Termination State
1.3.4 Forecasts
1.3.5 Problems with Uncontrollable State Components
1.3.6 Partial State Information and Belief States
1.3.7 Linear Quadratic Optimal Control
1.3.8 Systems with Unknown Parameters - Adaptive Control
1.4 Reinforcement Learning and Optimal Control - Some Terminology
1.5 Notes and Sources

2 Approximation in Value Space
2.1 Approximation Approaches in Reinforcement Learning
2.1.1 General Issues of Approximation in Value Space
2.1.2 Off-Line and On-Line Methods
2.1.3 Model-Based Simplification of the Lookahead Minimization
2.1.4 Model-Free Off-Line Q-Factor Approximation
2.1.5 Approximation in Policy Space on Top of Approximation in Value Space
2.1.6 When is Approximation in Value Space Effective?
2.2 Multistep Lookahead
2.2.1 Multistep Lookahead and Rolling Horizon
2.2.2 Multistep Lookahead and Deterministic Problems
2.3 Problem Approximation
2.3.1 Enforced Decomposition
2.3.2 Probabilistic Approximation - Certainty Equivalent Control
2.4 Rollout and the Policy Improvement Principle
2.4.1 On-Line Rollout for Deterministic Discrete Optimization
2.4.2 Stochastic Rollout and Monte Carlo Tree Search
2.4.3 Rollout with an Expert
2.5 On-Line Rollout for Deterministic Infinite-Spaces Problems - Optimization Heuristics
2.5.1 Model Predictive Control
2.5.2 Target Tubes and the Constrained Controllability Condition
2.5.3 Variants of Model Predictive Control
2.6 Notes and Sources

3 Parametric Approximation
3.1 Approximation Architectures
3.1.1 Linear and Nonlinear Feature-Based Architectures
3.1.2 Training of Linear and Nonlinear Architectures
3.1.3 Incremental Gradient and Newton Methods
3.2 Neural Networks
3.2.1 Training of Neural Networks
3.2.2 Multilayer and Deep Neural Networks
3.3 Sequential Dynamic Programming Approximation
3.4 Q-Factor Parametric Approximation
3.5 Parametric Approximation in Policy Space by Classification
3.6 Notes and Sources

4 Infinite Horizon Dynamic Programming
4.1 An Overview of Infinite Horizon Problems
4.2 Stochastic Shortest Path Problems
4.3 Discounted Problems
4.4 Semi-Markov Discounted Problems
4.5 Asynchronous Distributed Value Iteration
4.6 Policy Iteration
4.6.1 Exact Policy Iteration
4.6.2 Optimistic and Multistep Lookahead Policy Iteration
4.6.3 Policy Iteration for Q-factors
4.7 Notes and Sources
4.8 Appendix: Mathematical Analysis
4.8.1 Proofs for Stochastic Shortest Path Problems
4.8.2 Proofs for Discounted Problems
4.8.3 Convergence of Exact and Optimistic Policy Iteration

5 Infinite Horizon Reinforcement Learning
5.1 Approximation in Value Space - Performance Bounds
5.1.1 Limited Lookahead
5.1.2 Rollout and Approximate Policy Improvement
5.1.3 Approximate Policy Iteration
5.2 Fitted Value Iteration
5.3 Simulation-Based Policy Iteration with Parametric Approximation
5.3.1 Self-Learning and Actor-Critic Methods
5.3.2 Model-Based Variant of a Critic-Only Method
5.3.3 Model-Free Variant of a Critic-Only Method
5.3.4 Implementation Issues of Parametric Policy Iteration
5.3.5 Convergence Issues of Parametric Policy Iteration - Oscillations
5.4 Q-Learning
5.4.1 Optimistic Policy Iteration with Parametric Q-Factor Approximation - SARSA and DQN
5.5 Additional Methods - Temporal Differences
5.6 Exact and Approximate Linear Programming
5.7 Approximation in Policy Space
5.7.1 Training by Cost Optimization - Policy Gradient, Cross-Entropy, and Random Search Methods
5.7.2 Expert-Based Supervised Learning
5.7.3 Approximate Policy Iteration, Rollout, and Approximation in Policy Space
5.8 Notes and Sources
5.9 Appendix: Mathematical Analysis
5.9.1 Performance Bounds for Multistep Lookahead
5.9.2 Performance Bounds for Rollout
5.9.3 Performance Bounds for Approximate Policy Iteration

6 Aggregation
6.1 Aggregation with Representative States
6.1.1 Continuous State and Control Space Discretization
6.1.2 Continuous State Space - POMDP Discretization
6.2 Aggregation with Representative Features
6.2.1 Hard Aggregation and Error Bounds
6.2.2 Aggregation Using Features
6.3 Methods for Solving the Aggregate Problem
6.3.1 Simulation-Based Policy Iteration
6.3.2 Simulation-Based Value Iteration
6.4 Feature-Based Aggregation with a Neural Network
6.5 Biased Aggregation
6.6 Notes and Sources
6.7 Appendix: Mathematical Analysis

References
Index

Preface

Turning to the succor of modern computing machines, let us renounce all analytic tools.
Richard Bellman [Bel57]

From a teleological point of view the particular numerical solution of any particular set of equations is of far less importance than the understanding of the nature of the solution.
Richard Bellman [Bel57]

In this book we consider large and challenging multistage decision problems, which can be solved in principle by dynamic programming (DP for short), but their exact solution is computationally intractable. We discuss solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are collectively known by several essentially equivalent names: reinforcement learning, approximate dynamic programming, and neuro-dynamic programming. We will use primarily the most popular name: reinforcement learning.

Our subject has benefited greatly from the interplay of ideas from optimal control and from artificial intelligence. One of the aims of the book is to explore the common boundary between these two fields and to form a bridge that is accessible by workers with background in either field. Another aim is to organize coherently the broad mosaic of methods that have proved successful in practice while having a solid theoretical and/or logical foundation. This may help researchers and practitioners to find their way through the maze of competing ideas that constitute the current state of the art.

There are two general approaches for DP-based suboptimal control. The first is approximation in value space, where we approximate in some way the optimal cost-to-go function with some other function. The major alternative to approximation in value space is approximation in policy space, whereby we select the policy by using optimization over a suitably restricted class of policies, usually a parametric family of some form. In some schemes these two types of approximation may be combined, aiming to capitalize on the advantages of both. Generally, approximation in value space is tied more closely to the central DP ideas of value and policy iteration than approximation in policy space, which relies on gradient-like descent, a more broadly applicable optimization mechanism.
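
As an illustration that is not taken from the book, approximation in policy space can be sketched on a toy problem. Everything here is an assumption made for the example: the scalar system `x_next = x + u`, the quadratic stage cost, the one-parameter family of linear policies `u = -theta * x`, and the helper names `policy_cost` and `train_policy`. The parameter is tuned by plain gradient descent with a finite-difference gradient, which is only one of many gradient-like schemes.

```python
# Illustrative sketch only: the system, cost, and all names are assumptions,
# not taken from the book.

def policy_cost(theta: float, x0: float = 1.0, horizon: int = 50) -> float:
    """Total cost of the linear policy u = -theta * x on x_next = x + u."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = -theta * x              # control chosen by the parametric policy
        total += x * x + u * u      # assumed quadratic stage cost
        x = x + u                   # assumed system equation
    return total


def train_policy(theta: float = 0.3, step_size: float = 0.05,
                 iterations: int = 200, eps: float = 1e-4) -> float:
    """Gradient descent on theta using a central finite-difference gradient."""
    for _ in range(iterations):
        grad = (policy_cost(theta + eps) - policy_cost(theta - eps)) / (2 * eps)
        theta -= step_size * grad
    return theta


theta_star = train_policy()
print(f"theta = {theta_star:.3f}, cost = {policy_cost(theta_star):.3f}")
```

For this particular toy problem the search should settle near theta ≈ 0.62, the best gain within the assumed linear family. The optimization never looks at a cost-to-go function; it only evaluates whole policies, which is what distinguishes it from approximation in value space.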

While we provide a substantial treatment of approximation in policy space, most of the book is focused on approximation in value space. Here, the control at each state is obtained by optimization of the cost over a limited horizon, plus an approximation of the optimal future cost. The latter cost, which we generally denote by J̃, is a function of the state where we may be. It may be computed by a variety of methods, possibly involving simulation and/or some given or separately derived heuristic/suboptimal policy. The use of simulation often allows for implementations that do not require a mathematical model, a major idea that has allowed the use of DP beyond its classical boundaries.
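
The one-step version of this idea can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the book: the toy problem (a walk on the integers 0..10 toward a goal state), the stage costs, and the names `one_step_lookahead`, `controls`, `step`, `stage_cost`, `J_tilde`, and `simulate` are all hypothetical. At each state the control minimizes the first-stage cost plus an approximate cost-to-go, here a crude distance-to-goal estimate.

```python
from typing import Callable, Iterable, List, Tuple

# Illustrative sketch only: the toy problem and all names are assumptions,
# not taken from the book.

def one_step_lookahead(
    x: int,
    controls: Callable[[int], Iterable[int]],
    step: Callable[[int, int], int],
    stage_cost: Callable[[int, int], float],
    J_tilde: Callable[[int], float],
) -> int:
    """Return the control u minimizing stage_cost(x, u) + J_tilde(step(x, u))."""
    return min(controls(x), key=lambda u: stage_cost(x, u) + J_tilde(step(x, u)))


GOAL = 0  # hypothetical goal state


def controls(x: int) -> List[int]:
    # Moves of -2, -1, or +1 that keep the state inside 0..10.
    return [u for u in (-2, -1, 1) if 0 <= x + u <= 10]


def step(x: int, u: int) -> int:
    return x + u  # deterministic system equation


def stage_cost(x: int, u: int) -> float:
    return 1.0 if abs(u) == 1 else 1.5  # a jump of 2 costs 1.5, a unit move 1.0


def J_tilde(x: int) -> float:
    return float(abs(x - GOAL))  # crude stand-in for the optimal cost-to-go


def simulate(x0: int, max_steps: int = 50) -> Tuple[List[int], float]:
    """Apply the lookahead policy from x0 until the goal is reached."""
    x, total, path = x0, 0.0, [x0]
    while x != GOAL and len(path) <= max_steps:
        u = one_step_lookahead(x, controls, step, stage_cost, J_tilde)
        total += stage_cost(x, u)
        x = step(x, u)
        path.append(x)
    return path, total


path, total = simulate(5)
print(path, total)
```

Even though `J_tilde` overestimates the true cost-to-go in this toy problem, the lookahead minimization still ranks the controls sensibly. Swapping in a different `J_tilde` (for example, one obtained by simulating a heuristic policy) changes the policy without touching the rest of the code.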

We discuss selectively four types of methods for obtaining J̃:

(a) Problem approximation: Here J̃ is the optimal cost function of a related simpler problem, which is solved by exact DP. Certainty equivalent control and enforced decomposition schemes are discussed in some detail.

(b) Rollout and model predictive control: Here J̃ is the cost function of some known heuristic policy. The needed cost values to implement a rollout policy are often calculated by simulation. While this method applies to stochastic problems, the reliance on simulation favors deterministic problems, including challenging combinatorial problems for which heuristics may be readily implemented. Rollout may also be combined with adaptive simulation and Monte Carlo tree search, which have proved very effective in the context of games such as backgammon, chess, Go, and others.

    Model predictive control was originally developed for continuous-space optimal control problems that involve some goal state, e.g., the origin in a classical control context. It can be viewed as a specialized rollout method that is based on a suboptimal optimization for reaching a goal state.

(c) Parametric cost approximation: Here J̃ is chosen from within a parametric class of functions, including neural networks, with the parameters “optimized” or “trained” by using state-cost sample pairs and some type of incremental least squares/regression algorithm. Approximate policy iteration and its variants are covered in some detail, including several actor and critic schemes. These involve policy evaluation with simulation-based training methods, and policy improve

......

Dimitri P. Bertsekas

June 2019
