In reinforcement learning, the environment and the agent interact through a well-defined interface, which typically includes methods for the agent to take actions and receive observations, rewards, and termination signals from the environment.
For the environments considered here, the following constraints apply:

1. All observation ranges must be equal
2. All observation spaces must be equal
3. All action spaces must be equal
4. The number of agents must remain constant
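A quick sanity check for these four constraints could look like the sketch below, assuming a PettingZoo-style environment that exposes `possible_agents`, `observation_space(agent)`, and `action_space(agent)`:

```python
def check_homogeneous(env) -> None:
    """Assert that all agents share identical observation and action spaces."""
    agents = env.possible_agents
    obs_spaces = [env.observation_space(a) for a in agents]
    act_spaces = [env.action_space(a) for a in agents]
    assert all(s == obs_spaces[0] for s in obs_spaces), "observation spaces differ"
    assert all(s == act_spaces[0] for s in act_spaces), "action spaces differ"
```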
In the case of environments following the Farama Gymnasium (the successor to OpenAI Gym) interface, which is a common standard, the `step` method is used to advance the environment's state. The `step` method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the episode has terminated, a boolean indicating whether it was truncated, and additional info. The order of these return values is fixed, so the agent knows that the second value is the reward.
Here's a simplified example:

```python
observation, reward, terminated, truncated, info = env.step(action)
```
In this line of code, `reward` is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple. So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the environment that represents the reward. It just needs to know the structure of the data returned by the environment's `step` method.
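For illustration, a complete interaction loop with the Gymnasium API looks like this (CartPole-v1 is just a stand-in; any Gymnasium environment works the same way):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

episode_over = False
while not episode_over:
    action = env.action_space.sample()  # stand-in for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_over = terminated or truncated  # episode ends on either signal
env.close()
```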
Ad 1. Implement an overall `max_observation_range` and a specific (smaller) observation range per agent by zeroing out all non-observable cells. Note that setting `max_observation_range` unnecessarily high wastes computing time.
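A minimal sketch of such masking, assuming the observation is a square grid of shape `(channels, max_observation_range, max_observation_range)` centered on the agent and that both ranges are odd (the function and parameter names are hypothetical):

```python
import numpy as np

def mask_observation(obs: np.ndarray, observation_range: int) -> np.ndarray:
    """Zero out every cell outside the agent's own (smaller) observation range."""
    masked = np.zeros_like(obs)
    center = obs.shape[-1] // 2   # grid is centered on the agent
    r = observation_range // 2    # half-width of the visible window
    masked[:, center - r:center + r + 1, center - r:center + r + 1] = \
        obs[:, center - r:center + r + 1, center - r:center + r + 1]
    return masked
```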
Ad 2. Implement an overall maximum observation space. In this case the upper bound (`high`) of a specific observation channel can be set to `max(n_predators, n_prey, n_max)`.
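For example, a shared Box space with that upper bound might be defined as follows (a sketch; the population sizes and grid shape are assumptions):

```python
import numpy as np
from gymnasium.spaces import Box

n_predators, n_prey, n_max = 4, 8, 10  # hypothetical population sizes

# One observation space shared by all agents; no channel value can ever
# exceed the largest possible agent count, so this bound is safe for everyone.
observation_space = Box(
    low=0.0,
    high=float(max(n_predators, n_prey, n_max)),
    shape=(3, 7, 7),  # (channels, range, range); shape is an assumption
    dtype=np.float32,
)
```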
Ad 3. Implement an overall (maximal) `action_range` and (heavily) penalize actions which are usually prohibited. Note that setting `action_range` unnecessarily high wastes computing time.
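One way to penalize prohibited actions, as a sketch (the penalty value and the `allowed_actions` set are assumptions, not part of the original design):

```python
INVALID_ACTION_PENALTY = -1.0  # assumed magnitude; tune per environment

def action_reward(action: int, allowed_actions: set[int], base_reward: float) -> float:
    """Return the reward for an action, penalizing actions outside the agent's set."""
    if action not in allowed_actions:
        return INVALID_ACTION_PENALTY  # action is ignored and penalized
    return base_reward
```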
Ad 4. At `reset` a fixed number of agents is initialized and remains constant in the AEC. However, each agent is either active ("alive") or inactive ("dead" or not yet "born"), which is checked at the beginning of the `step` function. Inactive agents do not change during `step`.
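A minimal sketch of this bookkeeping, assuming agents carry an `active` flag (the class and method names are hypothetical):

```python
class Agent:
    def __init__(self, name: str, active: bool = True):
        self.name = name
        self.active = active  # False means "dead" or "not born yet"

class PredatorPreyEnv:
    def reset(self):
        # Fixed-size agent list; it never grows or shrinks after reset.
        self.agents = [Agent(f"predator_{i}") for i in range(2)] + \
                      [Agent(f"prey_{i}", active=False) for i in range(2)]

    def step(self, agent: Agent, action: int) -> None:
        if not agent.active:
            return  # inactive agents are left completely unchanged
        # ...apply the action, compute rewards, possibly toggle `active`...
```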