The environment is a Mario grid world in which Mario navigates a 4x4 grid, moving up, down, left, or right. Coin bags represent rewards (+4, +7) and bombs represent penalties (-3, -5). Mario starts in the top-left cell and must reach the bottom-right cell, which awards a trophy reward of +10. The objective is to maximize the total score by collecting rewards and avoiding penalties along the way.
[Figure: Mario grid world with coin bags as positive rewards and bombs as negative rewards]
State space: {S1 = (0,0), S2 = (0,1), ..., S16 = (3,3)}
Initial state: S1 = (0,0); Goal state: S16 = (3,3)
Action space: {Up, Down, Right, Left} => {3, 2, 0, 1}
Reward values: {-3, -5, +4, +7, +10}
Objective: Reach the goal state with maximum cumulative reward
[Visualization of the grid world and the agent's movements]
To ensure safety, agent movements are confined to the 4x4 grid using np.clip, so a move off the edge leaves the agent in place. The state space is defined explicitly, the reset_state variable returns the agent to the initial state, and constraints are enforced on actions so that invalid moves are never executed.
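The confinement described above can be sketched as follows. This is a minimal illustration, not the report's actual code: `clip_move` and `MOVES` are hypothetical names, and the action encoding follows the mapping given earlier (Right=0, Left=1, Down=2, Up=3).

```python
import numpy as np

GRID_SIZE = 4  # 4x4 grid, per the environment description

# Hypothetical (row, col) deltas for each action: Right=0, Left=1, Down=2, Up=3
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}

def clip_move(state, action):
    """Apply an action and confine the result to the grid with np.clip."""
    d_row, d_col = MOVES[action]
    row = int(np.clip(state[0] + d_row, 0, GRID_SIZE - 1))
    col = int(np.clip(state[1] + d_col, 0, GRID_SIZE - 1))
    return (row, col)
```

With this scheme, moving Up from (0,0) simply keeps the agent at (0,0) rather than raising an error.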
SARSA (State-Action-Reward-State-Action)
- Update rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]
- Key features: on-policy, TD learning, control algorithm
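The SARSA update above can be written as a one-step function. This is a sketch under the assumption that Q is stored as a NumPy array indexed by (state, action); the function name and default α, γ values are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: the target uses a', the action actually taken next (on-policy)."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Because the target depends on the next action chosen by the behavior policy, SARSA learns the value of the policy it is actually following.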
Q-learning
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)]
- Key features: off-policy, TD learning, control algorithm
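For contrast, the Q-learning step differs only in the target: it maximizes over next actions instead of using the one the policy took. Again a sketch with illustrative names, assuming the same (state, action)-indexed array.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: the target uses the greedy next action (off-policy)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

This max makes Q-learning off-policy: it learns the greedy policy's values even while exploring.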
[Plots for SARSA: Total rewards per episode, Epsilon decay]
[Plots for Q-learning: Total rewards per episode, Epsilon decay]
[Plots for SARSA and Q-learning on test data]
[Comparison of SARSA and Q-learning performance]
SARSA, Setup 1: Tuning γ
- γ = 0.8
- γ = 0.4
- γ = 1.0
SARSA, Setup 2: Tuning decay_rate
- decay_rate = 0.8
- decay_rate = 0.6
- decay_rate = 0.3
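The decay_rate values above control how quickly exploration is reduced. A common scheme, assumed here based on the report's epsilon-decay plots (the actual schedule is not shown), multiplies epsilon by decay_rate each episode and floors it at a minimum; the function name and defaults are illustrative.

```python
def decayed_epsilons(eps_start=1.0, decay_rate=0.8, eps_min=0.01, n_episodes=5):
    """Multiplicative epsilon decay: eps_t = max(eps_min, eps_start * decay_rate**t)."""
    return [max(eps_min, eps_start * decay_rate ** t) for t in range(n_episodes)]
```

Smaller decay_rate values (e.g. 0.3) collapse exploration within a few episodes, while values near 1.0 keep the agent exploring for much longer.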
Q-learning, Setup 1: Tuning γ
- γ = 0.8
- γ = 0.4
- γ = 0.1
Q-learning, Setup 2: Tuning decay_rate
- decay_rate = 0.8
- decay_rate = 0.6
- decay_rate = 0.3
After tuning, the best-performing hyperparameters for both SARSA and Q-learning were found to be γ = 0.99 and decay_rate = 0.995.