[TOC]
https://www.bilibili.com/video/BV1sd4y167NS
Concept Definitions
- State: $s$
- state space: the set of states $S = \{ s_i \}$
- Action: $a$
- action space of a state: $\mathcal{A}(s_i) = \{ a_i \}$
- State Transition: $s_i \overset{a}{\rightarrow} s_j$
- In the deterministic case, the transitions can be laid out as a table indexed by state and action
- State Transition Probability: in the stochastic case, the transition is expressed as a conditional probability distribution $p(s_j|s_i, a)$
- Policy: determines what action to take at each state
- $\pi$: in the stochastic case, the policy is likewise a conditional probability $\pi(a|s)$
- Reward: a real number we get after taking an action
- positive/negative: it represents encouragement or punishment
- corner case
- zero reward: no punishment
- can a positive reward mean punishment: it depends on how the reward is designed
- The reward is essentially an interface for human-machine interaction: it is how we guide the agent toward the behavior we want
- Trajectory: a state-action-reward chain
- $s_i\xrightarrow[a_i]{r_i}s_{i+1}\xrightarrow[a_{i+1}]{r_{i+1}}s_{i+2}$
- return: the sum of all the rewards along a trajectory
- discounted return: a trajectory may be infinitely long (e.g. the agent keeps performing odd, redundant actions after reaching the target), so the undiscounted return can diverge to infinity
- discount rate: $\gamma \in [0,1)$
- prevents the return from diverging: the rewards are weighted by exponentially decaying coefficients, $\sum_{t} \gamma^{t} r_{t}$, which guarantees the return converges (see the sketch after this list)
- Episode: the agent may stop at some terminal states; the resulting trajectory is called an episode (or a trial)
- Episodic task (a task that terminates)
- Continuing task (a task that never terminates)
- Mathematically, the two kinds of tasks can be handled in a unified way by converting an episodic task into a continuing task
- Option 1: absorbing state. Once the agent reaches the target state it stays there forever, and all subsequent rewards are set to 0
- Option 2: normal state. The agent can leave the target state; the task simply continues, and the agent collects a reward every time it enters the target state
- Option 2 is more convenient and unified for the mathematics and the modeling; $\gamma$ takes care of everything
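Below is a minimal Python sketch of the discounted return on a made-up three-step trajectory; the reward values, `gamma`, and the absorbing-state remark are illustrative assumptions, not taken from the lecture.

```python
gamma = 0.9                      # discount rate, gamma in [0, 1)
rewards = [0.0, 0.0, 1.0]        # hypothetical rewards collected along one trajectory

# Discounted return: sum over t of gamma^t * r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)         # ~ 0.81

# Option 1 (absorbing state): after reaching the target the agent loops forever
# with reward 0, which adds nothing, so the discounted return stays the same.
# In general, if every reward is bounded by r_max, the infinite discounted sum
# is bounded by r_max / (1 - gamma), so gamma < 1 keeps the return finite.
r_max = 1.0
print(r_max / (1 - gamma))       # ~ 10.0
```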
Markov Decision Process · MDP
- Sets
- State: the set of states $\mathcal{S}$
- Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
- Reward: the set of rewards $\mathcal{R}(s,a)$
- Probability distribution
- State transition probability: at state $s$, taking action $a$, the probability to transit to state $s'$ is $p(s'|s, a)$
- Reward probability: at state $s$, taking action $a$, the probability to get reward $r$ is $p(r|s, a)$
- Policy: at state $s$, the probability to choose action $a$ is $\pi(a|s)$
- Markov property
- $p(s_{t+1} | a_{t+1}, s_t, \ldots, a_1, s_0) = p(s_{t+1} | a_{t+1}, s_t)$
- $p(r_{t+1} | a_{t+1}, s_t, \ldots, a_1, s_0) = p(r_{t+1} | a_{t+1}, s_t)$
A Markov decision process becomes a Markov process once the policy is given.
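To make that statement concrete, here is a minimal Python sketch, assuming a made-up two-state, two-action MDP: fixing $\pi(a|s)$ and marginalizing out the action turns $p(s'|s, a)$ into a plain Markov-chain kernel $p(s'|s) = \sum_a \pi(a|s)\, p(s'|s, a)$. All names and numbers below are illustrative.

```python
states = ["s1", "s2"]
actions = ["left", "right"]

# p(s'|s, a): state transition probability (made-up values)
P = {
    ("s1", "left"):  {"s1": 1.0, "s2": 0.0},
    ("s1", "right"): {"s1": 0.2, "s2": 0.8},
    ("s2", "left"):  {"s1": 0.9, "s2": 0.1},
    ("s2", "right"): {"s1": 0.0, "s2": 1.0},
}

# pi(a|s): a fixed stochastic policy (made-up values)
pi = {
    "s1": {"left": 0.1, "right": 0.9},
    "s2": {"left": 0.5, "right": 0.5},
}

# Marginalize out the action: p(s'|s) = sum_a pi(a|s) * p(s'|s, a)
markov_chain = {
    s: {s2: sum(pi[s][a] * P[(s, a)][s2] for a in actions) for s2 in states}
    for s in states
}
print(markov_chain)   # e.g. p(s2|s1) = 0.1*0.0 + 0.9*0.8 = 0.72
```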
The definition here seems to differ slightly from the more standard, formal definition.
An MDP is a sequential decision model built on the Markov property, defined by a five-tuple:
$$ (S, A, P, R, \gamma) $$
- State Space $S$
- Action Space $A$
- Transition Probability $P$
- Reward function $R$
- Discount Factor $\gamma$
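As one possible, purely illustrative tabular encoding of this five-tuple in Python, assuming finite state and action sets:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TabularMDP:
    states: List[str]                           # S: state space
    actions: List[str]                          # A: action space
    P: Dict[Tuple[str, str], Dict[str, float]]  # P[(s, a)][s'] = p(s'|s, a)
    R: Dict[Tuple[str, str], float]             # R[(s, a)] = expected immediate reward
    gamma: float                                # discount factor, in [0, 1)
```

Other encodings (e.g. one transition matrix per action) are equally common; the dict-of-dicts form above just mirrors the conditional-probability notation used earlier.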
The dynamics of an MDP can be described as a loop of interaction over time:
- be in some state
- take an action
- move to some state according to the state transition probability
- receive the immediate reward for that action
- repeat
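A minimal Python sketch of this loop, assuming made-up two-state dynamics (a single action, hand-picked probabilities and rewards); it truncates the continuing task at a finite horizon and accumulates the discounted return along the way.

```python
import random

P = {("s1", "go"): {"s1": 0.3, "s2": 0.7},    # p(s'|s, a), made-up values
     ("s2", "go"): {"s1": 0.0, "s2": 1.0}}
R = {("s1", "go"): 0.0, ("s2", "go"): 1.0}    # immediate reward r(s, a)
pi = {"s1": {"go": 1.0}, "s2": {"go": 1.0}}   # pi(a|s), a single action here
gamma = 0.9

def sample(dist):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def rollout(s0, horizon=200):
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):          # truncate the continuing task at a horizon
        a = sample(pi[s])             # take an action: a ~ pi(.|s)
        ret += discount * R[(s, a)]   # collect the immediate reward, discounted
        s = sample(P[(s, a)])         # transit: s' ~ p(s'|s, a)
        discount *= gamma
    return ret                        # the (truncated) discounted return

print(rollout("s1"))
```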