last() function
In environments following the Farama Gymnasium interface, which is a common standard, the step method is used to advance the environment's state. The step method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the episode terminated, a boolean indicating whether it was truncated, and a dictionary of additional info. The order of these return values is fixed, so the agent knows that the second value is the reward.
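A minimal sketch of that protocol, using a hypothetical DummyEnv class (illustrative only, not a real library class) so it runs without Gymnasium installed:

```python
# Sketch of the Gymnasium-style step protocol with a made-up environment.
class DummyEnv:
    def __init__(self):
        self._t = 0  # internal step counter

    def step(self, action):
        self._t += 1
        observation = [self._t]    # new state after applying the action
        reward = 1.0               # scalar reward for this step
        terminated = self._t >= 3  # natural end of the episode
        truncated = False          # e.g. a time-limit cutoff
        info = {}                  # auxiliary diagnostic data
        # The order of this 5-tuple is what the interface fixes.
        return observation, reward, terminated, truncated, info

env = DummyEnv()
obs, reward, terminated, truncated, info = env.step(0)
print(reward)  # 1.0
```

Any agent consuming this tuple relies only on the positions, not on what the environment calls these values internally.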
Here's a simplified example. In PettingZoo's AEC API, the last() method returns the same tuple shape; note that last() takes no action argument (actions are passed to step instead):
observation, reward, termination, truncation, info = env.last()
In this line of code, reward holds the reward for the most recent action. The variable name doesn't matter; what matters is the position of the returned value in the tuple.
So the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name inside the PettingZoo environment that represents the reward. It just needs to know the structure of the data returned by the environment's last method.
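To make the "position, not name" point concrete, here is a small sketch. The fake_last() function is a hypothetical stand-in for env.last(), returning the same tuple shape:

```python
# Unpacking is purely positional: any variable names work.
def fake_last():
    # Same tuple shape as PettingZoo's AEC last():
    # (observation, reward, termination, truncation, info)
    return [0.0], 5.0, False, False, {}

# Conventional names...
observation, reward, termination, truncation, info = fake_last()
# ...and deliberately arbitrary names receive exactly the same values:
a, b, c, d, e = fake_last()
print(reward == b)  # True: the second slot is always the reward
```

This is why an algorithm like PPO can consume rewards from any conforming environment without knowing anything about its internal variable names.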