Design restrictions and workarounds for the PPO algorithm

In reinforcement learning, the environment and the agent interact through a well-defined interface, which typically includes methods for the agent to take actions and receive observations, rewards, and termination signals from the environment.

Restrictions

  1. All observation ranges must be equal
  2. All observation spaces must be equal
  3. All action spaces must be equal
  4. The number of agents must remain constant
  5. In the case of environments following the Farama Gymnasium interface (the successor to OpenAI Gym), which is a common standard, the step method is used to advance the environment's state. The step method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the agent has terminated, a boolean indicating whether the episode was truncated, and additional info. The order of these return values is fixed, so the agent knows that the second value is the reward.

Here's a simplified example:

observation, reward, terminated, truncated, info = env.step(action)

In this line of code, reward is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple.

So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the environment that represents the reward. It just needs to know the structure of the data returned by the environment's step method.
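
For example, a minimal interaction loop, assuming a Gymnasium-style single-agent environment (CartPole is used purely as a stand-in), relies only on the position of each returned value:

import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # placeholder for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
env.close()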

Workarounds

ad 1. Implement an overall max_observation_range and give each agent a specific (smaller) observation range by zeroing all cells the agent cannot observe. Note that setting max_observation_range unnecessarily high results in wasted computing time.
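
A minimal sketch of this zero-padding, assuming a channels-first grid observation (the function name and layout are illustrative, not part of the actual environment API):

import numpy as np

def pad_observation(local_obs: np.ndarray, max_observation_range: int) -> np.ndarray:
    """Embed an agent-specific observation of shape (channels, r, r), with
    r <= max_observation_range, into a zero-filled window of shape
    (channels, max_observation_range, max_observation_range)."""
    channels, obs_range, _ = local_obs.shape
    padded = np.zeros((channels, max_observation_range, max_observation_range),
                      dtype=local_obs.dtype)
    # Center the smaller observation window inside the padded window;
    # everything outside the agent's own range stays zero (non-observable).
    offset = (max_observation_range - obs_range) // 2
    padded[:, offset:offset + obs_range, offset:offset + obs_range] = local_obs
    return padded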

ad 2. Implement an overall maximum observation space. In that case a specific observation channel can set its upper bound (high) to max(n_predators, n_prey, n_max).
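
For instance, a shared Gymnasium Box space could be declared with that upper bound; the shape and population sizes below are purely illustrative:

import numpy as np
from gymnasium import spaces

# Illustrative population sizes; the shared upper bound keeps every
# agent's observation space identical.
n_predators, n_prey, n_max = 4, 6, 10
obs_high = max(n_predators, n_prey, n_max)

observation_space = spaces.Box(
    low=0.0,
    high=float(obs_high),
    shape=(3, 7, 7),  # channels x max_observation_range x max_observation_range (assumed)
    dtype=np.float64,
)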

ad 3. Implement an overall (maximal) action_range and (heavily) penalize actions that would normally be prohibited. Note that setting the action_range unnecessarily high results in wasted computing time.
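
A sketch of the penalty approach inside a hypothetical environment method (all attribute and method names are assumptions for illustration):

def _apply_action(self, agent, action):
    """All agents share one Discrete(max_action_range) space; actions that
    are invalid for this particular agent are not executed but penalized."""
    if action not in self.valid_actions[agent]:
        # Prohibited action: leave the state unchanged and penalize heavily.
        self.rewards[agent] += self.prohibited_action_penalty  # e.g. -10.0
        return
    self._execute_action(agent, action)  # normal state transition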

ad 4. At reset, a fixed number of agents is initialized and remains constant in the AEC environment. However, each agent is either Active ("alive") or Inactive ("dead" or not yet "born"), which is checked at the beginning of the step function. Inactive agents are not changed during a step.
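
In an AEC-style step function the check could look roughly like this (the is_active bookkeeping is an assumption, not part of the PettingZoo API):

def step(self, action):
    agent = self.agent_selection
    # Inactive ("dead" or not yet "born") agents are skipped entirely:
    # their state, observation, and reward are left untouched.
    if not self.is_active[agent]:
        self.agent_selection = self._agent_selector.next()
        return
    # ... normal processing for active agents continues here ...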

The standard PettingZoo procedure of removing agents from the self.agents array cannot be used. The Knights Archers Zombies environment documentation states: "This environment allows agents to spawn and die, so it requires using SuperSuit’s Black Death wrapper, which provides blank observations to dead agents rather than removing them from the environment."

Possible workaround: maintain the self.agents array from creation onwards and implement an "alive" boolean. At death (see the sketch after this list):

- Remove the agent from the agent layer, so other agents cannot observe the dead agent.
- Change all of the agent's relevant values to zero.
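
A sketch of such a deactivation routine, assuming NumPy is imported as np and the world is stored as a grid with one layer per agent type (all attribute names are illustrative):

def _deactivate_agent(self, agent):
    """Mark an agent as dead without removing it from self.agents."""
    self.is_active[agent] = False
    x, y = self.agent_positions[agent]
    # Remove the agent from the agent layer so other agents can no longer observe it.
    self.grid[self.agent_layer[agent], x, y] = 0
    # Zero out everything the learner will still receive for this agent.
    self.observations[agent] = np.zeros_like(self.observations[agent])
    self.rewards[agent] = 0.0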

PPO in MARL

Proximal Policy Optimization (PPO) itself does not inherently distinguish between different types of agents in a multi-agent reinforcement learning (MARL) scenario. PPO is a policy optimization algorithm that can be applied in environments where multiple agents are present, but it treats each agent as an independent learner. Each agent maintains its own policy and updates it using only its own observations and rewards; the PPO algorithm runs independently for each agent, optimizing each policy to maximize that agent's individual expected cumulative reward. PPO, as a learning algorithm, therefore adapts to whatever observation and action representations the environment provides and learns a policy for each agent based on that agent's individual experience.
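
A rough sketch of this independent-learner setup over a PettingZoo AEC environment; PPOAgent and num_episodes are hypothetical placeholders standing in for any single-agent PPO implementation and training budget:

# env is any PettingZoo AEC environment; PPOAgent is a hypothetical
# wrapper around a single-agent PPO implementation.
policies = {
    agent: PPOAgent(env.observation_space(agent), env.action_space(agent))
    for agent in env.possible_agents
}

for episode in range(num_episodes):
    env.reset()
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        # PettingZoo expects None as the action for finished agents.
        action = None if termination or truncation else policies[agent].act(observation)
        env.step(action)
        # Each learner stores and trains on its own experience only.
        policies[agent].store(observation, action, reward, termination or truncation)
    for policy in policies.values():
        policy.update()  # independent PPO update per agent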
