Professor Zhao,

Hello. First of all, thank you very much for your carefully prepared lecture videos and textbook. Your explanations have helped me enormously in understanding the core mathematical concepts of reinforcement learning; many of the details are explained clearly and accessibly, and I have benefited a great deal.
However, while studying Chapter 7, I noticed what may be a small wording issue in Section 7.4.3, which presents the two programming-implementation modes of Q-learning. The pseudocode for the ε-greedy version is titled "Optimal policy learning via Q-learning (on-policy version)" (in the Chinese edition: Algorithm 7.2, Q-learning (on-policy)), but I believe "on-policy" here would more accurately be written as "on-line".
My reasoning is that, in terms of the algorithm's logic, this part still uses the standard Q-learning update, which is a classic off-policy TD control algorithm. Section 6.5 of Sutton & Barto's Reinforcement Learning: An Introduction also states this explicitly: "Q-learning is an off-policy TD control algorithm... the learned action-value function, Q directly approximates q*, ..., independent of the policy being followed."
I therefore suspect that "on-policy version" here may be a slip of the pen, and that the intended phrase is "on-line version", which would also be more consistent with the earlier discussion of online versus offline learning.
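For concreteness, here is a minimal Python sketch of the update as I understand it (the toy chain MDP, the `step` helper, and the hyperparameter values are my own illustration, not taken from the textbook). The behavior policy is ε-greedy, but the TD target uses max over actions of Q(s', ·), which is what makes the algorithm off-policy; the fact that the update happens at every step during interaction is what makes it online:

```python
import random

# A tiny deterministic chain MDP, purely for illustration (not from the book):
# states 0..3, actions 0 (left) / 1 (right); reaching state 3 gives reward 1.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative hyperparameters

for episode in range(500):
    s, done = 0, False
    while not done:
        # Behavior policy: epsilon-greedy w.r.t. the current Q (exploratory).
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # The target uses max_a Q(s', a): it evaluates the greedy (target)
        # policy regardless of how the action above was actually chosen,
        # which is exactly what makes Q-learning off-policy.
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        # The update occurs at every step while interacting with the
        # environment, i.e. this is an "online" implementation.
        s = s_next

print([max(row) for row in Q])  # estimates of the optimal state values
```

So whichever exploratory policy generates the data, the learned Q still approximates q*, which is why "on-line version" seems to me the more accurate label for this pseudocode.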
This is only a suggestion; please forgive me if I have misunderstood anything.

Thank you again for your teaching and patient explanations. I have truly benefited a great deal!