Mathematical Foundations of Reinforcement Learning · Chap1 · Basic Concepts

Study hard.

[TOC]

https://www.bilibili.com/video/BV1sd4y167NS

Concept Definitions

  • State: $s$

    • state space: the set of states $\mathcal{S} = \{ s_i \}$
  • Action: $a$

    • action space of a state: $\mathcal{A}(s_i) = \{ a_i \}$
  • State Transition: $s_i \overset{a}{\rightarrow} s_j$

    • In the deterministic case, transitions can be represented by a table indexed by states and actions
    • State Transition Probability: in the stochastic case, transitions are described by a conditional probability distribution $p(s_j | s_i, a)$
  • Policy: determine what actions to take at a state

    • $\pi$: in the stochastic case, the policy is likewise a conditional probability $\pi(a | s)$
  • Reward: a real number we get after taking an action

    • positive/negative: it represents encouragement or punishment
    • corner case
      • zero reward: no punishment
      • can a positive reward mean punishment: it depends on how the reward is designed
    • The reward is essentially an interface for human-machine interaction
  • Trajectory: a state-action-reward chain

    • $s_i\xrightarrow[a_i]{r_i}s_{i+1}\xrightarrow[a_{i+1}]{r_{i+1}}s_{i+2}$
    • return: the sum of all the rewards along a trajectory
    • discounted return: a trajectory may be infinitely long (the agent may keep performing redundant actions after reaching the target), which can make the return diverge to infinity
      • discount rate: $\gamma \in [0,1)$
      • to keep the return from diverging, each reward is weighted by an exponentially decaying coefficient, so the discounted return $\sum_{t=0}^{\infty} \gamma^t r_{t+1}$ converges (see the Python sketch after this list)
  • Episode: the agent may stop at some terminal states; the resulting trajectory is called an episode (or a trial)

    • Episodic task (tasks with a terminal state)
    • Continuing task (tasks that go on forever)
    • Mathematically, the two kinds of tasks can be handled in a unified way by converting an episodic task into a continuing task
      • Option 1: absorbing state. After the agent reaches the target state, it stays there forever and all subsequent rewards are set to 0
      • Option 2: normal state. The agent may leave the target state and the task continues; it collects a reward every time it enters the target state
    • Option 2 is more convenient and unified for the mathematics and the modeling; the discount rate $\gamma$ takes care of everything
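
A minimal Python sketch of how the discounted return is computed. The reward sequence and $\gamma = 0.9$ are made-up values for illustration; the second call mimics the Option 2 setting, where the agent keeps re-entering the target state forever yet the discounted return stays finite.

```python
# Discounted return G = sum_t gamma^t * r_{t+1} over a (finite prefix of a) trajectory.

def discounted_return(rewards, gamma=0.9):
    """Weight each reward by an exponentially decaying factor and sum."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A short trajectory: two steps with reward 0, then reward 1 at the target.
print(discounted_return([0, 0, 1], gamma=0.9))  # 0.81

# Option 2: the agent re-enters the target state forever and collects reward 1
# each time.  The undiscounted return diverges, but the discounted return
# approaches gamma^2 / (1 - gamma) = 8.1 as the horizon grows.
print(discounted_return([0, 0] + [1] * 10_000, gamma=0.9))
```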

Markov Decision Process · MDP

  • Sets

    • State: the set of states $\mathcal{S}$
    • Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
    • Reward: the set of rewards $\mathcal{R}(s,a)$
  • Probability distribution

    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s, a)$
    • Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is $p(r|s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$

  • Markov property

    • $p(s_{t+1} | a_{t+1}, s_t, \dots, a_1, s_0) = p(s_{t+1} | a_{t+1}, s_t)$
    • $p(r_{t+1} | a_{t+1}, s_t, \dots, a_1, s_0) = p(r_{t+1} | a_{t+1}, s_t)$

A Markov decision process becomes a Markov process once the policy is given.
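
A small numerical sketch of that statement: once the policy $\pi(a|s)$ is fixed, the action can be marginalized out of $p(s'|s, a)$, leaving an ordinary Markov-process transition matrix $p_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s, a)$. All probabilities below are made-up values for illustration.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions (made-up numbers).
# p[a] is the state-transition matrix p(s' | s, a) under action a.
p = {
    0: np.array([[0.9, 0.1],
                 [0.2, 0.8]]),
    1: np.array([[0.5, 0.5],
                 [0.6, 0.4]]),
}
# Policy pi(a | s): rows index states, columns index actions.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Marginalize out the action: P_pi(s' | s) = sum_a pi(a | s) * p(s' | s, a).
P_pi = sum(pi[:, a][:, None] * p[a] for a in p)

print(P_pi)              # the induced Markov-process transition matrix
print(P_pi.sum(axis=1))  # each row still sums to 1
```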


The definition given here seems to differ somewhat from the more standard one.

An MDP is a sequential decision-making model built on the Markov property, defined by a five-tuple:

$$ (\mathcal{S}, \mathcal{A}, P, R, \gamma) $$
  • State space $\mathcal{S}$
  • Action space $\mathcal{A}$
  • State transition probability $P$
  • Reward function $R$
  • Discount factor $\gamma$
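
To make the five-tuple concrete, it can be bundled into a single structure. The field names and the tiny two-state example below are assumptions made here for illustration, not part of the lecture.

```python
from typing import NamedTuple

class MDP(NamedTuple):
    """The five-tuple (S, A, P, R, gamma)."""
    states: list    # S: state space
    actions: list   # A: action space
    P: dict         # P[(s, a)] -> {s': probability}
    R: dict         # R[(s, a)] -> immediate reward
    gamma: float    # discount factor

# A made-up two-state, two-action example.
toy = MDP(
    states=["s0", "s1"],
    actions=["left", "right"],
    P={("s0", "right"): {"s1": 1.0}, ("s0", "left"): {"s0": 1.0},
       ("s1", "right"): {"s1": 1.0}, ("s1", "left"): {"s0": 1.0}},
    R={("s0", "right"): 0.0, ("s0", "left"): 0.0,
       ("s1", "right"): 1.0, ("s1", "left"): 0.0},
    gamma=0.9,
)
```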

The dynamics of an MDP can be described as a loop of interaction over time:

  • the agent is in some state
  • it takes an action
  • it transitions to a new state according to the state transition probability
  • it receives the immediate reward for that action
  • repeat
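
A self-contained sketch of that loop, with deterministic toy dynamics, made-up rewards, and a uniform random policy (all assumed here for illustration):

```python
import random

# Deterministic toy dynamics and rewards; every other (s, a) pair gives reward 0.
P = {("s0", "right"): "s1", ("s0", "left"): "s0",
     ("s1", "right"): "s1", ("s1", "left"): "s0"}
R = {("s1", "right"): 1.0}

state, ret, gamma = "s0", 0.0, 0.9
for t in range(10):
    action = random.choice(["left", "right"])  # the agent is in a state and picks an action
    reward = R.get((state, action), 0.0)       # it receives the immediate reward
    state = P[(state, action)]                 # it transitions according to the dynamics
    ret += gamma**t * reward                   # accumulate the discounted return; then repeat
print("discounted return over 10 steps:", ret)
```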