Professor Zhao,

Hello. First of all, thank you very much for your carefully prepared lecture videos and textbook. Your explanations have helped me enormously in understanding the core mathematical concepts of reinforcement learning; many of the details are explained clearly and accessibly, and I have benefited a great deal.
However, while studying Chapter 7, I noticed what may be a small wording issue in Section 7.4.3, which presents the two programming-implementation modes of Q-learning. The pseudocode for the ε-greedy version is titled "Optimal policy learning via Q-learning (on-policy version)" (in the Chinese edition: Algorithm 7.2, Q-learning (on-policy)), but I believe "on-policy" here would more accurately be written as "on-line".
My reasoning is that, in terms of the algorithm's logic, this part still uses the standard Q-learning update, which is a classic off-policy TD control algorithm. Section 6.5 of Sutton & Barto's Reinforcement Learning: An Introduction also states this explicitly: "Q-learning is an off-policy TD control algorithm... the learned action-value function, Q directly approximates q*, ..., independent of the policy being followed."
I therefore suspect that "on-policy version" here may be a slip of the pen, and that the intended phrase is "on-line version", which would also be more consistent with the earlier discussion of online versus offline learning.
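For concreteness, here is a minimal Python sketch of the update as I understand it (the toy chain MDP, the `step` helper, and the hyperparameter values are my own illustration, not taken from the textbook). The behavior policy is ε-greedy, but the TD target uses max over actions of Q(s', ·), which is what makes the algorithm off-policy; the fact that the update happens at every step during interaction is what makes it online:

```python
import random

# A tiny deterministic chain MDP, purely for illustration (not from the book):
# states 0..3, actions 0 (left) / 1 (right); reaching state 3 gives reward 1.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative hyperparameters

for episode in range(500):
    s, done = 0, False
    while not done:
        # Behavior policy: epsilon-greedy w.r.t. the current Q (exploratory).
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # The target uses max_a Q(s', a): it evaluates the greedy (target)
        # policy regardless of how the action above was actually chosen,
        # which is exactly what makes Q-learning off-policy.
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        # The update occurs at every step while interacting with the
        # environment, i.e. this is an "online" implementation.
        s = s_next

print([max(row) for row in Q])  # estimates of the optimal state values
```

So whichever exploratory policy generates the data, the learned Q still approximates q*, which is why "on-line version" seems to me the more accurate label for this pseudocode.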
This is only a suggestion; please forgive me if I have misunderstood anything.

Thank you again for your teaching and patient explanations. I have truly benefited a great deal!