Skip to content

last() function

HBP1969 edited this page Oct 10, 2024 · 1 revision

Notes:

PettingZoo's last() function and the handling of rewards by the PPO algorithm

In the case of environments following the Farama Gymnasium interface, which is a common standard, the step method is used to advance the environment's state. The step method takes an action as input and returns four values: the new observation (state), the reward, a boolean indicating whether an agent has terminated, a boolean when the episode was truncated, and additional info. The order of these return values is fixed, so the agent knows that the second value is the reward.

Here's a simplified example:

observations, rewards, terminations, truncations, info = env.last(action)

In this line of code, reward is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple.

So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the Pettinfenvironment that represents the reward. It just needs to know the structure of the data returned by the environment's last method.

Clone this wiki locally