[TOC]
https://www.bilibili.com/video/BV1sd4y167NS
Concept Definitions
- State: $s$
- state space: the set of states $S = \{ s_i \}$
- Action: $a$
- action space of a state: $\mathcal{A}(s_i) = \{ a_i \}$
- State Transition: $s_i \overset{a}{\rightarrow} s_j$
- In the deterministic case, the transitions can be laid out as a table indexed by state and action
- State Transition Probability: in the stochastic case, the transition is expressed as a conditional probability distribution $p(s_j|s_i, a)$
- Policy: determines what action to take at each state
- $\pi$: in the stochastic case, the policy is likewise a conditional probability $\pi(a|s)$
- Reward: a real number we get after taking an action
- positive/negative: it represents encouragement or punishment
- corner case
- zero reward: no punishment
- can a positive reward mean punishment: it depends on how the reward is designed
- The reward is essentially an interface for human-machine interaction: it is how we guide the agent toward the behavior we want
- Trajectory: a state-action-reward chain
- $s_i\xrightarrow[a_i]{r_i}s_{i+1}\xrightarrow[a_{i+1}]{r_{i+1}}s_{i+2}$
- return: the sum of all the rewards along a trajectory
- discounted return: a trajectory may be infinitely long (e.g. the agent keeps performing odd, redundant actions after reaching the target), so the undiscounted return can diverge to infinity
- discount rate: $\gamma \in [0,1)$
- prevents the return from diverging: the rewards are weighted by exponentially decaying coefficients, $\sum_{t} \gamma^{t} r_{t}$, which guarantees the return converges (see the sketch after this list)
- Episode: the agent may stop at some terminal states; the resulting trajectory is called an episode (or a trial)
- Episodic task (a task that terminates)
- Continuing task (a task that never terminates)
- Mathematically, the two kinds of tasks can be handled in a unified way by converting an episodic task into a continuing task
- Option 1: absorbing state. Once the agent reaches the target state it stays there forever, and all subsequent rewards are set to 0
- Option 2: normal state. The agent can leave the target state; the task simply continues, and the agent collects a reward every time it enters the target state
- Option 2 is more convenient and unified for the mathematics and the modeling; $\gamma$ takes care of everything
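Below is a minimal Python sketch of the discounted return on a made-up three-step trajectory; the reward values, `gamma`, and the absorbing-state remark are illustrative assumptions, not taken from the lecture.

```python
gamma = 0.9                      # discount rate, gamma in [0, 1)
rewards = [0.0, 0.0, 1.0]        # hypothetical rewards collected along one trajectory

# Discounted return: sum over t of gamma^t * r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)         # ~ 0.81

# Option 1 (absorbing state): after reaching the target the agent loops forever
# with reward 0, which adds nothing, so the discounted return stays the same.
# In general, if every reward is bounded by r_max, the infinite discounted sum
# is bounded by r_max / (1 - gamma), so gamma < 1 keeps the return finite.
r_max = 1.0
print(r_max / (1 - gamma))       # ~ 10.0
```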
Markov Decision Process · MDP
- Sets
- State: the set of states $\mathcal{S}$
- Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
- Reward: the set of rewards $\mathcal{R}(s,a)$
- Probability distribution
- State transition probability: at state $s$, taking action $a$, the probability to transit to state $s'$ is $p(s'|s, a)$
- Reward probability: at state $s$, taking action $a$, the probability to get reward $r$ is $p(r|s, a)$
- Policy: at state $s$, the probability to choose action $a$ is $\pi(a|s)$
- Markov property
- $p(s_{t+1} | a_{t+1}, s_t, \ldots, a_1, s_0) = p(s_{t+1} | a_{t+1}, s_t)$
- $p(r_{t+1} | a_{t+1}, s_t, \ldots, a_1, s_0) = p(r_{t+1} | a_{t+1}, s_t)$
A Markov decision process becomes a Markov process once the policy is given.
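To make that statement concrete, here is a minimal Python sketch, assuming a made-up two-state, two-action MDP: fixing $\pi(a|s)$ and marginalizing out the action turns $p(s'|s, a)$ into a plain Markov-chain kernel $p(s'|s) = \sum_a \pi(a|s)\, p(s'|s, a)$. All names and numbers below are illustrative.

```python
states = ["s1", "s2"]
actions = ["left", "right"]

# p(s'|s, a): state transition probability (made-up values)
P = {
    ("s1", "left"):  {"s1": 1.0, "s2": 0.0},
    ("s1", "right"): {"s1": 0.2, "s2": 0.8},
    ("s2", "left"):  {"s1": 0.9, "s2": 0.1},
    ("s2", "right"): {"s1": 0.0, "s2": 1.0},
}

# pi(a|s): a fixed stochastic policy (made-up values)
pi = {
    "s1": {"left": 0.1, "right": 0.9},
    "s2": {"left": 0.5, "right": 0.5},
}

# Marginalize out the action: p(s'|s) = sum_a pi(a|s) * p(s'|s, a)
markov_chain = {
    s: {s2: sum(pi[s][a] * P[(s, a)][s2] for a in actions) for s2 in states}
    for s in states
}
print(markov_chain)   # e.g. p(s2|s1) = 0.1*0.0 + 0.9*0.8 = 0.72
```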
The definition here seems to differ slightly from the more standard, formal definition.
An MDP is a sequential decision model built on the Markov property, defined by a five-tuple:
$$ (S, A, P, R, \gamma) $$
- State Space $S$
- Action Space $A$
- Transition Probability $P$
- Reward function $R$
- Discount Factor $\gamma$
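As one possible, purely illustrative tabular encoding of this five-tuple in Python, assuming finite state and action sets:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TabularMDP:
    states: List[str]                           # S: state space
    actions: List[str]                          # A: action space
    P: Dict[Tuple[str, str], Dict[str, float]]  # P[(s, a)][s'] = p(s'|s, a)
    R: Dict[Tuple[str, str], float]             # R[(s, a)] = expected immediate reward
    gamma: float                                # discount factor, in [0, 1)
```

Other encodings (e.g. one transition matrix per action) are equally common; the dict-of-dicts form above just mirrors the conditional-probability notation used earlier.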
The dynamics of an MDP can be described as a loop of interaction over time:
- be in some state
- take an action
- move to some state according to the state transition probability
- receive the immediate reward for that action
- repeat
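A minimal Python sketch of this loop, assuming made-up two-state dynamics (a single action, hand-picked probabilities and rewards); it truncates the continuing task at a finite horizon and accumulates the discounted return along the way.

```python
import random

P = {("s1", "go"): {"s1": 0.3, "s2": 0.7},    # p(s'|s, a), made-up values
     ("s2", "go"): {"s1": 0.0, "s2": 1.0}}
R = {("s1", "go"): 0.0, ("s2", "go"): 1.0}    # immediate reward r(s, a)
pi = {"s1": {"go": 1.0}, "s2": {"go": 1.0}}   # pi(a|s), a single action here
gamma = 0.9

def sample(dist):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def rollout(s0, horizon=200):
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):          # truncate the continuing task at a horizon
        a = sample(pi[s])             # take an action: a ~ pi(.|s)
        ret += discount * R[(s, a)]   # collect the immediate reward, discounted
        s = sample(P[(s, a)])         # transit: s' ~ p(s'|s, a)
        discount *= gamma
    return ret                        # the (truncated) discounted return

print(rollout("s1"))
```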