
Commit dad00b3
Updated docs
1 parent 51f0769

28 files changed: +196 -56 lines

CHANGELOG.md

Lines changed: 4 additions & 2 deletions

@@ -1,8 +1,8 @@
 # Changelog
 
-## [0.2] TODO
+## [0.2] 2025-08-06
 
-Minor release with many significant additions, such as support for:
+This release adds several significant features, such as support for:
 
 - TorchRL and BenchMARL: several SoA MARL algorithms that support heterogeneous groups too.
 
@@ -43,6 +43,8 @@ Minor release with many significant additions, such as support for:
 
 - Options to specify how `DefaultObservationConfig` should handle dict spaces: flatten sub-spaces, ignore some keys, sort the keys, normalize values.
 
+- Optional position and orientation observations in `DefaultObservationConfig`.
+
 - Parallel environment wrappers
 - `NameWrapper` to index agents by strings, like agent_0, agent_1, ... .
 - `MaskWrapper` to mask part of the action and observation spaces.

docs/source/conf.py

Lines changed: 4 additions & 0 deletions

@@ -87,6 +87,10 @@
 
 reftarget_aliases = {}
 reftarget_aliases['py'] = {
+    'AlgorithmConfig': 'benchmarl.algorithms.common.AlgorithmConfig',
+    'ExperimentConfig': 'benchmarl.experiment.ExperimentConfig',
+    'ModelConfig': 'benchmarl.models.common.ModelConfig',
+    'EnvBase': 'torchrl.envs.EnvBase',
     'gym.Env': 'gymnasium.Env',
     'Axes': 'matplotlib.axes.Axes',
     'Path': 'pathlib.Path',

docs/source/introduction.rst

Lines changed: 128 additions & 5 deletions

@@ -40,7 +40,7 @@ which, using the API, is typically implemented like
     observation, info = environment.reset()
 
     for _ in range(1000):
-        action = evaluate_my_policy(observation)
+        action = my_policy(observation)
         observation, reward, terminated, truncated, info = environment.step(action)
 
         if terminated or truncated:
@@ -69,7 +69,7 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
     observations, infos = environment.reset()
 
     for _ in range(1000):
-        actions = {index: evaluate_my_policy(observation)
+        actions = {index: my_policy(observation)
                    for index, observation in observations.items()}
         observations, rewards, terminations, truncations, infos = environment.step(actions)
 
@@ -80,7 +80,6 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
 
     env.close()
 
-
 .. note::
 
    We can convert between environments with AEC and Parallel API using
@@ -89,6 +88,35 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
    Moreover, we can convert PettingZoo environments in which all agents share the same action and observation spaces to
    a vectorized Gymnasium environment that concatenates all the actions, observations and other infos using `SuperSuit wrappers <https://github.com/Farama-Foundation/SuperSuit/blob/master/supersuit/vector/vector_constructors.py>`_. This way, we can use ML libraries that work with Gymnasium to train distributed multi-agent systems.
 
+TorchRL
+-------
+
+`TorchRL <https://docs.pytorch.org/rl/stable/index.html>`_ is an open-source Reinforcement Learning (RL) library for PyTorch.
+
+TorchRL environments are based on the same Markov Decision Process cycle but with a different API: ``environment.step`` input and output are both dictionaries from `tensordict <https://docs.pytorch.org/tensordict/stable/index.html>`_ that hold actions, observations, rewards, ... in separate keys.
+
+TorchRL environments can be constructed from Gymnasium and PettingZoo environments (among others).
+The following is a TorchRL cycle similar to the previous ones.
+Note that TorchRL policies also operate on dictionaries of tensors (input and output).
+
+.. code-block:: python
+
+   from torchrl.envs import GymEnv
+
+   environment = GymEnv("MyEnvironment")
+   environment.set_seed(0)
+   td = environment.reset()
+
+   for _ in range(1000):
+       td = my_torchrl_policy(td)
+       td = environment.step(td)
+
+       if td['next', 'terminated'] or td['next', 'truncated']:
+           td = environment.reset()
+
+   environment.close()
+
+One important difference between PettingZoo and TorchRL environments is that agents can be grouped together. For example, in an environment with 2 green agents and 2 blue agents (where same-colored agents would share the same type of actions and observations), the dictionary ``td`` in the example above would have keys like ``("green", "next", "observation")`` and ``("blue", "next", "observation")`` that hold tensors with the observations from *both* agents of the same color.
 
 Navground
 ---------
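
As a reference for the SuperSuit conversion mentioned above, here is a minimal sketch; it assumes a PettingZoo parallel environment ``penv`` whose agents all share the same action and observation spaces, and uses the vector constructors from the linked ``supersuit`` module:

.. code-block:: python

   import supersuit as ss

   # Stack the agents of a PettingZoo parallel env into one vectorized env.
   venv = ss.pettingzoo_env_to_vec_env_v1(penv)
   # Optionally concatenate several copies, e.g. to train with Stable-Baselines3.
   venv = ss.concat_vec_envs_v1(venv, 2, num_cpus=0, base_class="stable_baselines3")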
@@ -166,6 +194,25 @@ By specifying
 - :py:class:`.ControlActionConfig` where the policy outputs a control command
 - :py:class:`.ModulationActionConfig` where the policy outputs parameters of an underlying deterministic navigation behavior.
 
+For example, to create a single-agent environment:
+
+.. code-block:: python
+
+   import gymnasium as gym
+   from navground import sim
+   from navground.learning import DefaultObservationConfig, ControlActionConfig
+   from navground.learning.rewards import SocialReward
+
+   env = gym.make('navground.learning.env:navground',
+                  scenario=scenario,
+                  sensor=sensor,
+                  action=ControlActionConfig(),
+                  observation=DefaultObservationConfig(),
+                  reward=SocialReward(),
+                  time_step=0.1,
+                  max_episode_steps=600)
+
+
 PettingZoo Navground Environment
 --------------------------------
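
For reference, the environment created above can be stepped with the standard Gymnasium loop shown earlier; a minimal sketch (assuming ``scenario`` and ``sensor`` are defined, and sampling random actions in place of a trained policy):

.. code-block:: python

   observation, info = env.reset(seed=0)

   for _ in range(600):
       action = env.action_space.sample()  # replace with a trained policy
       observation, reward, terminated, truncated, info = env.step(action)
       if terminated or truncated:
           observation, info = env.reset()

   env.close()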

@@ -174,15 +221,56 @@ Similarly, :py:class:`.parallel_env.MultiAgentNavgroundEnv` provides a environme
 :py:func:`.parallel_env.parallel_env` instantiates an environment where different agents may use different configurations (such as action spaces, rewards, ...), while
 :py:func:`.parallel_env.shared_parallel_env` instantiates an environment where all specified agents share the same configuration.
 
+.. code-block:: python
+
+   import gymnasium as gym
+   from navground import sim
+   from navground.learning.parallel_env import shared_parallel_env
+   from navground.learning import DefaultObservationConfig, ControlActionConfig
+   from navground.learning.rewards import SocialReward
+
+   penv = shared_parallel_env(scenario=scenario,
+                              sensor=sensor,
+                              action=ControlActionConfig(),
+                              observation=DefaultObservationConfig(),
+                              reward=SocialReward(),
+                              time_step=0.1,
+                              max_episode_steps=600)
+
+
 The rest of the functionality is very similar to the Gymnasium Environment (and in fact, they share the same base class), but conforms to the PettingZoo API instead.
 
+The example above creates a multi-agent environment where all agents share the same configuration.
+
+
+TorchRL Navground Environment
+-----------------------------
+
+Navground and TorchRL both support PettingZoo environments, therefore it is straightforward to create TorchRL environments with navground components:
+
+.. code-block:: python
+
+   from torchrl.envs.libs.pettingzoo import PettingZooWrapper
+   from navground.learning.parallel_env import shared_parallel_env
+   from navground.learning.wrappers.name_wrapper import NameWrapper
+
+   penv = shared_parallel_env(...)
+   env = PettingZooWrapper(NameWrapper(penv),
+                           categorical_actions=False,
+                           device='cpu',
+                           seed=0,
+                           return_state=penv.has_state)
+
+:py:class:`.wrappers.name_wrapper.NameWrapper` converts from an environment where agents are indexed by integers to one where they are indexed by strings, which TorchRL requires.
+
+Function :py:func:`.utils.benchmarl.make_env` provides the same functionality.
 
 Train ML policies in navground
 ==============================
 
 .. note::
 
-   Have a look at the tutorials to see the interaction between gymnasium and navground in action and how to use it to train a navigation policy using IL or RL.
+   Have a look at the :doc:`tutorials <tutorials/index>` to see the interaction between gymnasium and navground in action and how to use it to train a navigation policy using IL or RL.
 
 Imitation Learning
 ------------------
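
For reference, a parallel environment like ``penv`` above can be stepped with the standard PettingZoo Parallel API; a minimal sketch with random actions in place of trained policies:

.. code-block:: python

   observations, infos = penv.reset(seed=0)

   while penv.agents:
       actions = {agent: penv.action_space(agent).sample()
                  for agent in penv.agents}
       observations, rewards, terminations, truncations, infos = penv.step(actions)

   penv.close()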
@@ -258,6 +346,33 @@ We instantiate the parallel environment using :py:func:`.parallel_env.shared_par
    psac.save("PSAC")
 
 
+Multi-agent Reinforcement Learning with BenchMARL
+-------------------------------------------------
+
+`BenchMARL <https://github.com/facebookresearch/BenchMARL>`_ provides implementations of Multi-agent Reinforcement Learning algorithms that extend beyond the parallel Multi-agent Reinforcement Learning family just described. They can tackle problems that feature heterogeneous agents, which therefore do not share the same policy and cannot be "stacked" together. Moreover, MARL-specific algorithms are designed to reduce the instabilities that arise in multi-agent training, where agents learn among other agents whose behavior keeps evolving.
+
+We provide utilities that simplify training navigation policies with BenchMARL, for example using the multi-agent version of SAC:
+
+.. code-block:: python
+
+   from navground.learning.parallel_env import shared_parallel_env
+   from benchmarl.algorithms import MasacConfig
+   from benchmarl.models.mlp import MlpConfig
+   from benchmarl.experiment import ExperimentConfig
+   from navground.learning.utils.benchmarl import NavgroundExperiment
+
+   penv = shared_parallel_env(scenario=..., sensor=...,
+                              observation_config=..., action_config=...,
+                              max_episode_steps=100)
+   masac_exp = NavgroundExperiment(
+       env=penv,
+       config=ExperimentConfig.get_from_yaml(),
+       model_config=MlpConfig.get_from_yaml(),
+       algorithm_config=MasacConfig.get_from_yaml(),
+       seed=0
+   )
+   masac_exp.run_for(iterations=20)
+
 Evaluation
 ==========
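
The same pattern presumably carries over to other BenchMARL algorithms by swapping the algorithm configuration; a sketch under that assumption, reusing ``penv`` and the configs from the example above with MAPPO:

.. code-block:: python

   from benchmarl.algorithms import MappoConfig

   mappo_exp = NavgroundExperiment(
       env=penv,
       config=ExperimentConfig.get_from_yaml(),
       model_config=MlpConfig.get_from_yaml(),
       algorithm_config=MappoConfig.get_from_yaml(),
       seed=0
   )
   mappo_exp.run_for(iterations=20)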

@@ -294,6 +409,14 @@ Once we have trained a policy (and possibly exported it to onnx using :py:func:`
 
    world.run(time_step=0.1, steps=1000)
 
+
+.. note::
+
+   After training a policy using BenchMARL, we need to extract a compatible policy using
+   :py:meth:`.utils.benchmarl.NavgroundExperiment.get_single_agent_policy`,
+   which transforms a TorchRL policy into a PyTorch policy.
+
+
 In practice, we do not need to perform the configuration manually. Instead, we can load it from a YAML file (exported e.g. using :py:func:`.io.export_policy_as_behavior`), as is common in navground:
 
 .. code-block:: YAML
@@ -350,7 +473,7 @@ Acknowledgement and disclaimer
 
 The work was supported in part by `REXASI-PRO <https://rexasi-pro.spindoxlabs.com>`_ H-EU project, call HORIZON-CL4-2021-HUMAN-01-01, Grant agreement no. 101070028.
 
-.. image:: https://rexasi-pro.spindoxlabs.com/wp-content/uploads/2023/01/Bianco-Viola-Moderno-Minimalista-Logo-e1675187551324.png
+.. figure:: https://rexasi-pro.spindoxlabs.com/wp-content/uploads/2023/01/Bianco-Viola-Moderno-Minimalista-Logo-e1675187551324.png
    :width: 300
    :alt: REXASI-PRO logo

docs/source/tutorials/basics/Gymnasium.ipynb

Lines changed: 1 addition & 1 deletion

@@ -506,7 +506,7 @@
     "which we can use to define a policy that simply asks the agent to actuate the action that it has already computed. \n",
     "\n",
     "Let us collect the reward from the navground policy in this way. \n",
-    "To understand the scale, in this case, the reward assigned at each step in maximal (=0) when the agents move straight towards the goal at optimal speed. When the entire safety margin is violated, it gets a penality of at most -1, the same value it gets if it stay in-place, while moving in the opposite target direction at optimal speed gets a -2. Therefore, we can expect an average reward between -1 and 0, and possibly near to 0 for a well-performing navigation behavior."
+    "To understand the scale, in this case, the reward assigned at each step is maximal (=0) when the agent moves straight towards the goal at optimal speed. When the entire safety margin is violated, it gets a penalty of at most -1, the same value it gets if it stays in place, while moving opposite to the target direction at optimal speed gets a -2. Therefore, we can expect an average reward between -1 and 0, and possibly near 0 for a well-performing navigation behavior."
    ]
   },
   {

docs/source/tutorials/corridor_with_obstacle/Scenario.ipynb

Lines changed: 1 addition & 1 deletion

@@ -7763,7 +7763,7 @@
    "id": "458f9e92-6dde-4a5d-860e-8ef17ec87de5",
    "metadata": {},
    "source": [
-    "and in all the other runs it has to actually avoid the obstacles, incurring in a penality "
+    "and in all the other runs it has to actually avoid the obstacles, incurring a penalty "
    ]
   },
   {

docs/source/tutorials/corridor_with_obstacle/index.rst

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 Corridor with obstacle
 ======================
 
-In a series of two notebooks, we look at a simple but more interesting scenario than :doc:`../empty/empty`.
+In a series of two notebooks, we look at a simple but more interesting scenario than :doc:`../empty/index`.
 
 In the first notebook, we explore the scenario defined as
 
docs/source/tutorials/crossing/Training-MA.ipynb

Lines changed: 1 addition & 1 deletion

@@ -2825,7 +2825,7 @@
    "id": "fad18282-974b-415b-abaf-888baa42461e",
    "metadata": {},
    "source": [
-    "Training in `penv` requires significanlty more steps than in `env` but takes a much shorter time. A similar number of steps per agent are required (about 30K)."
+    "Training in `penv` requires significantly more steps than in `env` but takes a much shorter time. A similar number of steps per agent is required (about 30K)."
    ]
   },
   {

docs/source/tutorials/empty/Direction.ipynb

Lines changed: 1 addition & 1 deletion

@@ -341,7 +341,7 @@
    "id": "13d5a637-cc98-4669-bf8e-094e29ef6ae5",
    "metadata": {},
    "source": [
-    "Let us compute the reward of the \"expert\" that uses the `Dummy` behavior.\n",
+    "Let us compute the reward of the \"expert\" that uses the ``Dummy`` behavior.\n",
     "\n",
     "we import `evaluate_policy` as\n",
     "\n",

docs/source/tutorials/pad/Behaviors.ipynb

Lines changed: 6 additions & 6 deletions

@@ -9,7 +9,7 @@
     "\n",
     "In this notebook we measure the performance of 4 model-based navigation behaviors:\n",
     "\n",
-    "`Dummy`: which ignores the pad and the other agent."
+    "`Dummy`, which ignores the pad and the other agent."
    ]
   },
   {
@@ -196,9 +196,9 @@
     "We compute the reward distribution when the agents start from opposite sides of the map.\n",
     "The reward function is composed of two terms:\n",
     "- `1 - efficacy` (i.e., 0 if moving forwards at full desired speed)\n",
-    "- `pad_penality` when two agents are on the pad at the same time.\n",
+    "- `pad_penalty` when two agents are on the pad at the same time.\n",
     "\n",
-    "here, we compute the reward for `pad_penality=1`, which we will later use to pick a penality that balance the tendency to stop and cross the pad. "
+    "Here, we compute the reward for `pad_penalty=1`, which we will later use to pick a penalty that balances the tendencies to stop and to cross the pad. "
    ]
   },
   {
@@ -210,7 +210,7 @@
   "source": [
    "from navground.learning.examples.pad import PadReward\n",
    "\n",
-    "reward = PadReward(pad_penality=1)"
+    "reward = PadReward(pad_penalty=1)"
    ]
   },
   {
@@ -432,7 +432,7 @@
    "id": "94859b79-7578-4eb9-a098-c63454624394",
    "metadata": {},
    "source": [
-    "Penality for efficiency should make `Dummy` and `StopAtPad` almost even:"
+    "Penalty for efficiency should make ``Dummy`` and `StopAtPad` almost even:"
    ]
   },
   {
@@ -666,7 +666,7 @@
    "id": "68f82b3e-0659-4c83-8e3b-87cec19c56e4",
    "metadata": {},
    "source": [
-    "Penality for efficiency should make `Dummy` and `StopAtPad` almost even:"
+    "Penalty for efficiency should make ``Dummy`` and `StopAtPad` almost even:"
    ]
   },
   {

docs/source/tutorials/pad/Communication/Comm-SAC-Split.ipynb

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@
   "source": [
    "# Distributed policy with comm trained using parallel SAC: split model\n",
    "\n",
-    "In this notebook, we test a variant of :doc:`Comm-SAC`, where we split the communication and acceleration models in a way that the transmitted communication *does not* depend on the received message but only on the other observations. \n",
+    "In this notebook, we test a variant of [the previous notebook](./Comm-SAC.ipynb), where we split the communication and acceleration models so that the transmitted communication *does not* depend on the received message but only on the other observations. \n",
    "\n",
    "The agents exchange 1 float and we share part of the reward."
   ]
