LECTURE 1 - BASIC BANDITS

What is Reinforcement Learning?

什么是强化学习

Reinforcement Learning studies how an agent learns to make decisions by interacting with an environment to maximize reward over time.
强化学习研究智能体如何通过与环境交互来学习决策，从而最大化长期奖励。

Key elements:
核心要素：

Agent: the decision maker
Agent：做决策的主体
Action: what the agent chooses
Action：智能体选择的行为
Reward: feedback signal
Reward：反馈信号

The goal is not just immediate reward, but long-term return.
目标不是即时奖励，而是长期回报。

Return definition:
回报定义：

This means future rewards are discounted by factor γ.
这表示未来奖励会被折扣因子 γ 衰减。

If γ is small → agent cares about short-term reward.
如果 γ 很小 → 更关注短期奖励。

If γ close to 1 → agent values long-term reward.
如果 γ 接近 1 → 更关注长期奖励。

Multi-Armed Bandit

多臂赌博机

There is no state transition. Each round is independent.
没有状态转移，每一轮互相独立。

Each action has an unknown expected reward:

Meaning: average reward if we always choose action a.
含义：如果一直选择动作 a，它的平均奖励。

Optimal action:

Meaning: choose the action with highest expected reward.
含义：选择期望奖励最高的动作。

Exploration vs Exploitation

探索与利用

Exploit: choose the action with highest estimated reward.
利用：选择当前估计最好的动作。

Explore: try other actions to learn more.
探索：尝试其他动作获取更多信息。

Why necessary?
为什么必须探索？

Because estimates may be wrong early on.
因为早期估计可能是错的。

ε-greedy Strategy

ε-贪心策略

With probability 1−ε: exploit.
以概率 1−ε 利用。

With probability ε: choose random action.
以概率 ε 随机探索。

Simple but inefficient exploration.
简单但探索效率不高。

UCB (Upper Confidence Bound)

上置信界

Interpretation:
解释：

First term = estimated reward
第一项 = 当前估计奖励
Second term = uncertainty bonus
第二项 = 不确定性奖励

If an action is rarely chosen → large bonus.
如果某动作很少被选 → 奖励项大。

Encourages exploration in principled way.
以系统方式鼓励探索。

LECTURE 2 - CONTEXTUAL BANDITS

What is Contextual Bandit?

什么是上下文赌博机

At time t:
在时间 t：

Observe context x_t
观察上下文 x_t
Choose action a_t
选择动作 a_t
Receive reward r_t
获得奖励 r_t

Key difference from basic bandit:
与普通赌博机的区别：

Reward depends on both action and context.
奖励依赖于动作和上下文。

Expected reward function:

Meaning: expected reward if we take action a in context x.
含义：在上下文 x 下选择动作 a 的期望奖励。

Optimal policy:

Meaning: for each context, choose best action.
含义：对每个上下文选择最优动作。

Why Not Just Do Regression?

为什么不能只做回归？

Because we only observe reward for chosen action.
因为只观察到被选动作的奖励。

This is called partial feedback.
这叫部分反馈问题。

Other issues:
其他问题：

Selection bias: data depends on past decisions
选择偏差：数据由过去决策决定
Confirmation loop: decisions influence future data
确认循环：决策影响未来数据

Bandits are decision problems, not pure prediction problems.
赌博机是决策问题，不只是预测问题。

Linear Reward Model

线性奖励模型

Assume reward can be approximated as linear function:

Meaning:
含义：

φ(a,x) = feature vector
φ(a,x) = 特征向量
θ = parameter vector
θ = 参数向量

We learn θ by minimizing squared error with regularization:
$$
L(\theta) =
\sum (r_t - \theta^\top \phi_t)^2

\lambda |\theta|^2
$$
Regularization prevents overfitting.
正则项防止过拟合。

Closed-form solution:

Where:

Interpretation:
解释：

A captures feature covariance.
A 表示特征协方差结构。

b captures correlation with reward.
b 表示特征与奖励的相关性。

LinUCB

线性UCB

Action selection rule:

Interpretation:
解释：

First term = predicted reward.
第一项 = 预测奖励。

Second term = uncertainty.
第二项 = 不确定度。

β controls exploration strength.
β 控制探索强度。

Policy Gradient

策略梯度

Instead of estimating reward first, we directly optimize policy.
不是先估计奖励，而是直接优化策略。

Objective:

Gradient:
$$
\nabla J(\theta)

\mathbb{E}
[
r \nabla \log \pi_\theta(a|x)
]
$$
Meaning:
含义：

Increase probability of actions that give high reward.
增加高奖励动作的概率。

Sample update:

Advantage: direct decision optimization.
优点：直接优化决策。

Disadvantage: high variance.
缺点：方差高。

Off-Policy Evaluation

离策略评估

Data collected under behavior policy π_B.
数据来自行为策略 π_B。

We want to evaluate target policy π_T.
希望评估目标策略 π_T。

Importance weight:
$$
w_t

\frac{
\pi_T(a_t|x_t)
}{
\pi_B(a_t|x_t)
}

\hat V

\frac{1}{n}
\sum
w_t r_t
$$
Meaning:
含义：

Reweight samples to correct bias.
通过加权修正分布偏差。

Unbiased but high variance.
无偏但方差大。

LECTURE 3 - MARKOV DECISION PROCESSES

Sequential Decision Making 序列决策

Contextual Bandits vs MDP
Bandit：no temporal dependency
多臂老虎机：无时间依赖

MDP：current state influences next state
MDP：当前状态影响未来状态

核心难点 Core challenges:
• Delayed rewards 延迟奖励
• Credit assignment 信用分配问题
• Short-term vs long-term trade-off 短期与长期权衡

考试常考：解释 credit assignment；举例说明 delayed reward。

State 状态

Definition
State S_t = information needed to make a decision at time t
状态 S_t = 在时刻 t 做决策所需的全部信息

Good state 必须满足：
Markov property 马尔可夫性

Example (grid world):
Option 1: entire history
选项1：完整历史
Option 2: current position (x_t, y_t)
选项2：当前位置（推荐）
Option 3: only x_t
选项3：信息不足

Too much info → slower learning, possible overfitting
信息过多 → 学习慢，可能过拟合

Too little info → cannot learn optimal action
信息不足 → 无法学到最优策略

考试常考：什么是 good state；什么破坏 Markov property。

Markov Property 马尔可夫性质

Without action:
P(S_{t+1} | S_t, S_{t-1}, …) = P(S_{t+1} | S_t)

With action:
P(S_{t+1} | S_t, A_t, history) = P(S_{t+1} | S_t, A_t)

Meaning:
Future depends only on current state (and action)
未来只依赖当前状态（和动作）

Higher order dependence:
If depends on last two states → define new state Z_t = (S_t, S_{t-1})
若依赖多个历史 → 扩展状态

代价：state space explodes
代价：状态空间指数增长

考试常考：写出 Markov property 数学表达式。

Transition Dynamics 转移动态

Transition probability:
P(s’ | s, a)

For all s,a:
Sum_{s’} P(s’ | s,a) = 1

Deterministic case:
S_{t+1} = T(S_t, A_t)
确定性：下一个状态唯一确定

Stochastic case:
S_{t+1} ~ P(. | S_t, A_t)
随机：从概率分布采样

考试常考：deterministic vs stochastic 区别。

MDP Definition MDP 定义

An MDP consists of:

• S: set of states 状态空间
• A: set of actions 动作空间
• P(s’ | s,a): transition probability 转移概率
• R(s,a) or R_{t+1}: reward function 奖励函数

Interaction at time t:
Observe S_t
Take action A_t
Receive reward R_{t+1}
Transition to S_{t+1}

Trajectory:
S_0, A_0, R_1, S_1, A_1, R_2, …

考试常考：写出 MDP 四要素。

Types of Environments 环境类型

(1) Deterministic vs Stochastic
确定 vs 随机

(2) Episodic vs Continuing
Episodic: has terminal state
有终止状态
Continuing: infinite horizon
无终止状态

(3) Stationary vs Non-stationary
Stationary: P, R do not change over time
Non-stationary: change over time

(4) Finite vs Infinite state space
有限 vs 无限

(5) Fully observable vs Partially observable
完全可观测 vs 部分可观测

考试常考：分类题；举例。

Reward and Goal 奖励与目标

Reward signal defines behavior
奖励定义行为

Reward ≠ True goal
奖励 ≠ 真实目标

Misspecified reward → unintended behavior
奖励设计错误 → 异常行为

Short-term reward ≠ long-term optimal
短期奖励 ≠ 长期最优

考试常考：解释 reward design 问题。

Return 回报

Bandits optimize immediate reward
Bandit 只优化即时奖励

MDP optimize return G_t
MDP 优化累计回报

Episodic case:
G_t = R_{t+1} + R_{t+2} + … + R_T

Continuing case (discounting):
G_t = sum_{k=0 to ∞} gamma^k R_{t+1+k}

gamma ∈ [0,1)

gamma close to 0 → short-term
gamma close to 1 → long-term

Boundedness:
If R_max is maximum reward
G_t ≤ R_max / (1 - gamma)

考试高频：
• 写 discounted return
• 解释 gamma 含义
• 证明 bounded by geometric series

Value Function 价值函数

Return G_t is random variable
回报是随机变量

State-value function:
V^pi(s) = E[ G_t | S_t = s ]

Meaning:
Expected return starting in s and following policy pi
在状态 s 下按策略 pi 的期望回报

Action-value function:
Q^pi(s,a) = E[ G_t | S_t = s, A_t = a ]

Meaning:
Take action a in s, then follow pi
在 s 先执行 a，再按 pi

考试高频：
• 写出 V^pi
• 写出 Q^pi
• 解释期望含义

Policy 策略

Policy pi defines behavior
策略定义行为

Deterministic policy:
pi: S → A
a = pi(s)

Stochastic policy:
pi(a|s) = probability of taking action a in state s

Thanks to Markov property:
Only current state needed
只需要当前状态

Optimality 最优性

Optimal state-value:
V*(s) = max_pi V^pi(s)

Optimal action-value:
Q*(s,a) = max_pi Q^pi(s,a)

Optimal policy:
pi*(s) = argmax_a Q*(s,a)

Important:
V* ≠ V^pi unless pi is optimal
只有当 pi = pi* 才相等

考试高频：
• 写 optimal policy expression
• 区分 V^pi 和 V*

本章考试核心总结

必会写：

Markov property 数学表达式
Discounted return 公式
V^pi 与 Q^pi 定义
Optimal policy 表达式
Deterministic vs stochastic 区别

理解类必会：

• Credit assignment
• Reward design problem
• Short-term vs long-term trade-off
• gamma 的物理意义

LECTURE 4 - DYNAMIC PROGRAMMING

本讲核心前提 Core Assumption

已知完整 MDP（P 和 R 已知）
Assume full knowledge of the MDP (transition probabilities and rewards known)

不需要 trial-and-error 学习
No need for trial-and-error learning

目标：直接计算最优策略
Goal: compute optimal policy exactly

Bellman 思想 Bellman Principle

核心直觉
Value of a state = reward now + expected value later
一个状态的价值 = 当前奖励 + 未来期望价值

依赖 Markov property
Only current state matters
只依赖当前状态

递归结构 → 固定点问题
Recursive structure → fixed point problem

Markov Reward Process (MRP)

MRP = MDP without actions
MRP 是无动作的 MDP

组成 Components:

• 状态转移矩阵 P
Transition matrix P

• 期望奖励 r(s)
Expected reward r(s)

• 折扣因子 γ ∈ [0,1)
Discount factor γ

Bellman 方程（MRP）

矩阵形式
Matrix form

(I − γP)v = r
v = (I − γP)^{-1} r

重要结论 Important theorem:

若 0 ≤ γ < 1
Bellman equation 有唯一解
(I − γP) 可逆

考试常考：
• 推导 v = r + γPv
• 解释为什么 γ < 1 保证唯一解
• 解释 fixed point 含义

Episodic MRP (γ = 1 情况)

若 γ = 1 且存在终止状态
Terminal state value 固定为 0

只更新非终止状态
Do not update terminal state

收敛速度由到达终止状态的概率决定
Convergence depends on expected hitting time

考试常考：
γ=1 为什么仍可收敛？
因为 eventually reach terminal with prob 1

Bellman 方程（MDP）

四种版本 Four versions:

1. Policy State Value

含义
Expected return following policy π
在策略 π 下的期望回报

2. Optimal State Value

$v^(s) = \max_a [ r(s,a) + \gamma \sum_{s’} P(s’|s,a) v^(s’) ]$

区别
期望 → 最大值
Expectation → Max

非线性（因 max）
Nonlinear because of max

3. Policy Q Function

4. Optimal Q Function

$Q^(s,a) = r(s,a) + \gamma \sum_{s’} P(s’|s,a) \max_{a’} Q^(s’,a’)$

这是 Q-learning 的理论基础
This is the basis of Q-learning

考试高频：
• 写出四个 Bellman 方程
• 解释 vπ vs v* 区别
• 为什么 v* 是非线性方程

Bellman 作为固定点 Fixed Point View

定义 Bellman operator B(v)

v = B(v)

两种解法 Two approaches:

直接解线性方程
Solve linear system
迭代法（Dynamic Programming）
Iterative method

更新形式:

v_{k+1} = r + γPv

收敛速率 O(γ^k)

γ 越小收敛越快
Smaller γ → faster convergence

考试常考：
证明误差收缩性质
||v_k − v|| ≤ γ^k ||v_0 − v||

Policy Evaluation（策略评估）

目标
计算 vπ

方法：

直接法 Direct

构造 MRP:

rπ(s) = Σ_a π(a|s) r(s,a)
Pπ(s,s’) = Σ_a π(a|s) P(s’|s,a)

然后解:

vπ = rπ + γPπ vπ

迭代法 Iterative DP

初始化 v0(s)

迭代更新:

v_{k+1}(s) = Σ_a π(a|s)[ r(s,a) + γ Σ_{s’} P(s’|s,a)v_k(s’) ]

收敛到 vπ

考试常考：
• 写 policy evaluation 算法
• 说明收敛条件

Policy Improvement（策略改进）

思想
Greedy w.r.t value

改进规则:

π’(s) = argmax_a Qπ(s,a)

保证：

v_{π’}(s) ≥ vπ(s)

Policy Improvement Theorem
策略永不变差

若无法改进 → 最优策略
If no improvement possible → optimal

考试高频：
解释为什么 greedy 一定更优

Policy Iteration

步骤：

Policy Evaluation
Policy Improvement
重复直到策略不变

保证收敛到 π*

原因：

有限策略空间 + 每次严格改进

实践中迭代次数很少
Usually converges fast

Value Iteration

思想
直接用 optimal Bellman operator

更新规则:

特点：

每次更新已包含 improvement
Evaluation + improvement 合并

收敛到 v*

然后提取策略:

π*(s) = argmax_a [ r(s,a) + γ Σ P(s’|s,a)v*(s’) ]

Policy Iteration vs Value Iteration

Policy Iteration
完整评估 → 再改进
Full evaluation each round

Value Iteration
每轮只做一次 Bellman optimal 更新
Partial evaluation

两者都收敛
Both converge

大规模问题通常 VI 更快
VI often faster for large problems

考试常考比较题。

本章考试核心 Checklist

必须能写：

• 四个 Bellman 方程
• v = r + γPv
• Policy evaluation 更新式
• Policy improvement 公式
• Value iteration 更新式

必须理解：

• fixed point 概念
• contraction mapping
• γ 对收敛的影响
• greedy improvement 为什么有效
• PI 与 VI 区别

LECTURE 5 - MONTE CARLO METHODS

本讲核心转折 Core Transition

Lecture 4: 已知 MDP，用 Dynamic Programming
Lecture 4: MDP known → use Dynamic Programming

Lecture 5: 不知道 MDP，只能用经验数据
Lecture 5: MDP unknown → learn from experience

我们只看到 trajectory:
We only observe trajectories:

S₀, A₀, R₁, S₁, A₁, R₂, …

Monte Carlo 的核心思想 Core Idea of MC

用完整 episode 的 return 来估计 value
Estimate value by averaging complete episode returns

无需转移概率
No need for transition probabilities

只做一件事：
Just do one thing:

v(s) ≈ 访问 s 后观察到的 return 的平均值
v(s) ≈ average of returns after visiting s

MC 需要 episodic MDP
MC requires episodic tasks (clear terminal state)

Return 定义 Return

从时刻 t 开始的回报：
Return starting from time t:

Episodic 情况常设 γ=1
Often set γ=1 in episodic tasks

Monte Carlo 估计 vπ(s)

目标 Goal:

估计方法 Estimator:

性质 Properties:

• 无偏 Unbiased
E[ v̂(s) ] = vπ(s)

• 方差有限时 MSE = σ² / E
Error decreases with more episodes

前提：
必须有限时间到达终止状态（prob=1）

考试常考：
• 为什么是无偏估计
• 收敛速度 O(1/E)

全状态估计 Estimating All States

每个 episode 可以更新多个状态
Each episode updates many states

算法思想：

对 episode 从后往前算 G_t
Compute G backward

对每个 S_t 更新 v(S_t)

First-Visit vs Every-Visit MC

First-Visit MC

只用一个 episode 中 state 第一次出现的 return
Use first occurrence in episode

无偏估计
Unbiased

Every-Visit MC

使用每次访问
Use all occurrences

利用数据更多
Better data usage

可能存在相关性
More correlation → higher variance

两者都收敛（若无限探索）
Both converge if infinite exploration

考试常考对比题。

Monte Carlo 估计 Qπ(s,a)

目标 Goal:

关键问题：

若 π 不探索所有 (s,a)，
某些 Q 永远学不到

必须满足：
π(a|s) > 0 for all (s,a)

Exploring Starts

思想 Idea:

初始化时随机选择 (S₀,A₀)，
确保所有 (s,a) 都有正概率

收敛保证
Convergence guaranteed

缺点：
实际中难以控制初始 state-action

ε-soft 策略

定义 ε-soft policy:

作用：

保证 π(a|s) > 0

Trade-off:

ε 大 → 探索多 → 偏离最优
ε 小 → 更接近 greedy → 探索慢

Monte Carlo Control（On-policy）

核心结构：Policy Iteration

Policy Evaluation:
估计 Qπ(s,a)

Policy Improvement:
π(s) = argmax Q(s,a)

循环直到收敛

考试常考：
为什么需要学 Q 而不是 v？
因为不知道 P(s’|s,a)，不能从 v 推 Q

Off-Policy 学习

问题：

用行为策略 b 生成数据
却想评估目标策略 π

两者不同

必须修正采样偏差
Use importance sampling

Importance Sampling

定义 importance ratio:

估计公式：

要求：

b(a|s) > 0 若 π(a|s) > 0
否则无法估计（除零问题）

Ordinary IS vs Weighted IS

Ordinary IS

无偏
Unbiased

方差极高
High variance

Weighted IS

有偏
Biased

方差更低
Lower variance

考试高频：
比较 IS 与 WIS 优缺点

Incremental 更新

避免存储所有数据
Avoid storing all past returns

增量均值公式：

用于 v(s) 更新

同理可用于 Q(s,a)

Incremental WIS

定义累计权重 C:

C ← C + ρ

更新规则：

内存效率高
Online update possible

Off-Policy MC Control

行为策略 b: 固定、探索性
Behavior policy fixed and exploratory

目标策略 π: greedy(Q)

每轮：

用 b 生成 episode
用 IS 更新 Q
π ← greedy(Q)

直到收敛

本章考试 Checklist

必须掌握：

• MC 估计 vπ 的公式
• First-visit vs Every-visit
• 为什么需要 exploration
• Exploring start 与 ε-soft
• Importance sampling ratio 定义
• Ordinary IS vs Weighted IS 区别
• Incremental 更新公式
• MC control 流程