In reinforcement learning, the environment and the agent interact through a well-defined interface, which typically includes methods for the agent to take actions and receive observations, rewards, and termination signals from the environment.
For the environments considered here, the following constraints apply:

1. All observation ranges must be equal
2. All observation spaces must be equal
3. All action spaces must be equal
4. The number of agents must remain constant
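A quick sanity check for these four constraints could look like the sketch below, assuming a PettingZoo-style environment that exposes `possible_agents`, `observation_space(agent)`, and `action_space(agent)`:

```python
def check_homogeneous(env) -> None:
    """Assert that all agents share identical observation and action spaces."""
    agents = env.possible_agents
    obs_spaces = [env.observation_space(a) for a in agents]
    act_spaces = [env.action_space(a) for a in agents]
    assert all(s == obs_spaces[0] for s in obs_spaces), "observation spaces differ"
    assert all(s == act_spaces[0] for s in act_spaces), "action spaces differ"
```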
In the case of environments following the Farama Gymnasium (the successor to OpenAI Gym) interface, which is a common standard, the `step` method is used to advance the environment's state. The `step` method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the episode has terminated, a boolean indicating whether it was truncated, and additional info. The order of these return values is fixed, so the agent knows that the second value is the reward.
Here's a simplified example:

```python
observation, reward, terminated, truncated, info = env.step(action)
```
In this line of code, `reward` is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple. So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the environment that represents the reward. It just needs to know the structure of the data returned by the environment's `step` method.
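For illustration, a complete interaction loop with the Gymnasium API looks like this (CartPole-v1 is just a stand-in; any Gymnasium environment works the same way):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

episode_over = False
while not episode_over:
    action = env.action_space.sample()  # stand-in for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_over = terminated or truncated  # episode ends on either signal
env.close()
```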
Ad 1. Implement an overall `max_observation_range` and a specific (smaller) observation range per agent by zeroing out all non-observable cells. Note that setting `max_observation_range` unnecessarily high wastes computing time.
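A minimal sketch of such masking, assuming the observation is a square grid of shape `(channels, max_observation_range, max_observation_range)` centered on the agent and that both ranges are odd (the function and parameter names are hypothetical):

```python
import numpy as np

def mask_observation(obs: np.ndarray, observation_range: int) -> np.ndarray:
    """Zero out every cell outside the agent's own (smaller) observation range."""
    masked = np.zeros_like(obs)
    center = obs.shape[-1] // 2   # grid is centered on the agent
    r = observation_range // 2    # half-width of the visible window
    masked[:, center - r:center + r + 1, center - r:center + r + 1] = \
        obs[:, center - r:center + r + 1, center - r:center + r + 1]
    return masked
```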
Ad 2. Implement an overall maximum observation space. In this case the upper bound (`high`) of a specific observation channel can be set to `max(n_predators, n_prey, n_max)`.
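For example, a shared Box space with that upper bound might be defined as follows (a sketch; the population sizes and grid shape are assumptions):

```python
import numpy as np
from gymnasium.spaces import Box

n_predators, n_prey, n_max = 4, 8, 10  # hypothetical population sizes

# One observation space shared by all agents; no channel value can ever
# exceed the largest possible agent count, so this bound is safe for everyone.
observation_space = Box(
    low=0.0,
    high=float(max(n_predators, n_prey, n_max)),
    shape=(3, 7, 7),  # (channels, range, range); shape is an assumption
    dtype=np.float32,
)
```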
Ad 3. Implement an overall (maximal) `action_range` and (heavily) penalize actions which are usually prohibited. Note that setting `action_range` unnecessarily high wastes computing time.
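One way to penalize prohibited actions, as a sketch (the penalty value and the `allowed_actions` set are assumptions, not part of the original design):

```python
INVALID_ACTION_PENALTY = -1.0  # assumed magnitude; tune per environment

def action_reward(action: int, allowed_actions: set[int], base_reward: float) -> float:
    """Return the reward for an action, penalizing actions outside the agent's set."""
    if action not in allowed_actions:
        return INVALID_ACTION_PENALTY  # action is ignored and penalized
    return base_reward
```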
Ad 4. At `reset` a fixed number of agents is initialized and remains constant in the AEC. However, each agent is either active ("alive") or inactive ("dead" or not yet "born"), which is checked at the beginning of the `step` function. Inactive agents do not change during `step`.
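A minimal sketch of this bookkeeping, assuming agents carry an `active` flag (the class and method names are hypothetical):

```python
class Agent:
    def __init__(self, name: str, active: bool = True):
        self.name = name
        self.active = active  # False means "dead" or "not born yet"

class PredatorPreyEnv:
    def reset(self):
        # Fixed-size agent list; it never grows or shrinks after reset.
        self.agents = [Agent(f"predator_{i}") for i in range(2)] + \
                      [Agent(f"prey_{i}", active=False) for i in range(2)]

    def step(self, agent: Agent, action: int) -> None:
        if not agent.active:
            return  # inactive agents are left completely unchanged
        # ...apply the action, compute rewards, possibly toggle `active`...
```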