
Commit dad00b3
Updated docs
1 parent 51f0769

28 files changed: +196 -56 lines

CHANGELOG.md

Lines changed: 4 additions & 2 deletions

@@ -1,8 +1,8 @@
 # Changelog
 
-## [0.2] TODO
+## [0.2] 2025-08-06
 
-Minor release with many significant additions, such as support for:
+This release adds several significant features, such as support for:
 
 - TorchRL and BenchMARL: several SoA MARL algorithms that support heterogeneous groups too.
 
@@ -43,6 +43,8 @@ Minor release with many significant additions, such as support for:
 
 - Options to specify how `DefaultObservationConfig` should handle dict spaces: flatten sub-spaces, ignore some keys, sort the keys, normalize values.
 
+- Optional position and orientation observations in `DefaultObservationConfig`.
+
 - Parallel environment wrappers
 - `NameWrapper` to index agents by strings, like agent_0, agent_1, ... .
 - `MaskWrapper` to mask part of the action and observation spaces.

docs/source/conf.py

Lines changed: 4 additions & 0 deletions

@@ -87,6 +87,10 @@
 
 reftarget_aliases = {}
 reftarget_aliases['py'] = {
+    'AlgorithmConfig': 'benchmarl.algorithms.common.AlgorithmConfig',
+    'ExperimentConfig': 'benchmarl.experiment.ExperimentConfig',
+    'ModelConfig': 'benchmarl.models.common.ModelConfig',
+    'EnvBase': 'torchrl.envs.EnvBase',
     'gym.Env': 'gymnasium.Env',
     'Axes': 'matplotlib.axes.Axes',
     'Path': 'pathlib.Path',

docs/source/introduction.rst

Lines changed: 128 additions & 5 deletions

@@ -40,7 +40,7 @@ which, using the API, is typically implemented like
     observation, info = environment.reset()
 
     for _ in range(1000):
-        action = evaluate_my_policy(observation)
+        action = my_policy(observation)
         observation, reward, terminated, truncated, info = environment.step(action)
 
         if terminated or truncated:
@@ -69,7 +69,7 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
     observations, infos = environment.reset()
 
     for _ in range(1000):
-        actions = {index: evaluate_my_policy(observation)
+        actions = {index: my_policy(observation)
                    for index, observation in observations.items()}
         observations, rewards, terminations, truncations, infos = environment.step(actions)
 
@@ -80,7 +80,6 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
 
     env.close()
 
-
 .. note::
 
    We can convert between environments with AEC and Parallel API using
@@ -89,6 +88,35 @@ The Parallel API is similar to Gymnasium, with the difference that actions, rewa
    Moreover, we can convert PettingZoo environments in which all agents share the same action and observation spaces to
    a vectorized Gymnasium environment that concatenates all the actions, observations and other infos using `SuperSuit wrappers <https://github.com/Farama-Foundation/SuperSuit/blob/master/supersuit/vector/vector_constructors.py>`_. This way, we can use ML libraries that work with Gymnasium to train distributed multi-agent systems.
 
+TorchRL
+-------
+
+`TorchRL <https://docs.pytorch.org/rl/stable/index.html>`_ is an open-source Reinforcement Learning (RL) library for PyTorch.
+
+TorchRL environments are based on the same Markov Decision Process cycle but with a different API: ``environment.step`` input and output are both dictionaries from `tensordict <https://docs.pytorch.org/tensordict/stable/index.html>`_ that hold actions, observations, rewards, ... in separate keys.
+
+TorchRL environments can be constructed from Gymnasium and PettingZoo environments (among others).
+The following is a TorchRL cycle similar to the previous ones.
+Note that TorchRL policies also operate on dictionaries of tensors (input and output).
+
+.. code-block:: python
+
+   from torchrl.envs import GymEnv
+
+   environment = GymEnv("MyEnvironment")
+   environment.set_seed(0)
+   td = environment.reset()
+
+   for _ in range(1000):
+       td = my_torchrl_policy(td)
+       td = environment.step(td)
+
+       if td['next', 'terminated'] or td['next', 'truncated']:
+           td = environment.reset()
+
+   environment.close()
+
+One important difference between PettingZoo and TorchRL environments is that agents can be grouped together. For example, in an environment with 2 green agents and 2 blue agents (where same-colored agents would share the same type of actions and observations), the dictionary ``td`` in the example above would have keys like ``("green", "next", "observation")`` and ``("blue", "next", "observation")`` that hold tensors with the observations from *both* agents of the same color.
 
 Navground
 ---------
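
As a reference for the SuperSuit conversion mentioned above, here is a minimal sketch; it assumes a PettingZoo parallel environment ``penv`` whose agents all share the same action and observation spaces, and uses the vector constructors from the linked ``supersuit`` module:

.. code-block:: python

   import supersuit as ss

   # Stack the agents of a PettingZoo parallel env into one vectorized env.
   venv = ss.pettingzoo_env_to_vec_env_v1(penv)
   # Optionally concatenate several copies, e.g. to train with Stable-Baselines3.
   venv = ss.concat_vec_envs_v1(venv, 2, num_cpus=0, base_class="stable_baselines3")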
@@ -166,6 +194,25 @@ By specifying
 - :py:class:`.ControlActionConfig` where the policy outputs a control command
 - :py:class:`.ModulationActionConfig` where the policy outputs parameters of an underlying deterministic navigation behavior.
 
+For example, to create a single-agent environment:
+
+.. code-block:: python
+
+   import gymnasium as gym
+   from navground import sim
+   from navground.learning import DefaultObservationConfig, ControlActionConfig
+   from navground.learning.rewards import SocialReward
+
+   env = gym.make('navground.learning.env:navground',
+                  scenario=scenario,
+                  sensor=sensor,
+                  action=ControlActionConfig(),
+                  observation=DefaultObservationConfig(),
+                  reward=SocialReward(),
+                  time_step=0.1,
+                  max_episode_steps=600)
+
+
 PettingZoo Navground Environment
 --------------------------------
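
For reference, the environment created above can be stepped with the standard Gymnasium loop shown earlier; a minimal sketch (assuming ``scenario`` and ``sensor`` are defined, and sampling random actions in place of a trained policy):

.. code-block:: python

   observation, info = env.reset(seed=0)

   for _ in range(600):
       action = env.action_space.sample()  # replace with a trained policy
       observation, reward, terminated, truncated, info = env.step(action)
       if terminated or truncated:
           observation, info = env.reset()

   env.close()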

@@ -174,15 +221,56 @@ Similarly, :py:class:`.parallel_env.MultiAgentNavgroundEnv` provides a environme
 :py:func:`.parallel_env.parallel_env` instantiates an environment where different agents may use different configurations (such as action spaces, rewards, ...), while
 :py:func:`.parallel_env.shared_parallel_env` instantiates an environment where all specified agents share the same configuration.
 
+.. code-block:: python
+
+   import gymnasium as gym
+   from navground import sim
+   from navground.learning.parallel_env import shared_parallel_env
+   from navground.learning import DefaultObservationConfig, ControlActionConfig
+   from navground.learning.rewards import SocialReward
+
+   penv = shared_parallel_env(scenario=scenario,
+                              sensor=sensor,
+                              action=ControlActionConfig(),
+                              observation=DefaultObservationConfig(),
+                              reward=SocialReward(),
+                              time_step=0.1,
+                              max_episode_steps=600)
+
+
 The rest of the functionality is very similar to the Gymnasium Environment (and in fact, they share the same base class), but conforms to the PettingZoo API instead.
 
+The example above creates a multi-agent environment where all agents share the same configuration.
+
+
+TorchRL Navground Environment
+-----------------------------
+
+Navground and TorchRL both support PettingZoo environments, therefore it is straightforward to create TorchRL environments with navground components:
+
+.. code-block:: python
+
+   from torchrl.envs.libs.pettingzoo import PettingZooWrapper
+   from navground.learning.parallel_env import shared_parallel_env
+   from navground.learning.wrappers.name_wrapper import NameWrapper
+
+   penv = shared_parallel_env(...)
+   env = PettingZooWrapper(NameWrapper(penv),
+                           categorical_actions=False,
+                           device='cpu',
+                           seed=0,
+                           return_state=penv.has_state)
+
+:py:class:`.wrappers.name_wrapper.NameWrapper` converts from an environment where agents are indexed by integers to one where they are indexed by strings, which TorchRL requires.
+
+Function :py:func:`.utils.benchmarl.make_env` provides the same functionality.
 
 Train ML policies in navground
 ==============================
 
 .. note::
 
-   Have a look at the tutorials to see the interaction between gymnasium and navground in action and how to use it to train a navigation policy using IL or RL.
+   Have a look at the :doc:`tutorials <tutorials/index>` to see the interaction between gymnasium and navground in action and how to use it to train a navigation policy using IL or RL.
 
 Imitation Learning
 ------------------
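
For reference, a parallel environment like ``penv`` above can be stepped with the standard PettingZoo Parallel API; a minimal sketch with random actions in place of trained policies:

.. code-block:: python

   observations, infos = penv.reset(seed=0)

   while penv.agents:
       actions = {agent: penv.action_space(agent).sample()
                  for agent in penv.agents}
       observations, rewards, terminations, truncations, infos = penv.step(actions)

   penv.close()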
@@ -258,6 +346,33 @@ We instantiate the parallel environment using :py:func:`.parallel_env.shared_par
    psac.save("PSAC")
 
 
+Multi-agent Reinforcement Learning with BenchMARL
+-------------------------------------------------
+
+`BenchMARL <https://github.com/facebookresearch/BenchMARL>`_ provides implementations of Multi-agent Reinforcement Learning algorithms that extend beyond the parallel Multi-agent Reinforcement Learning family just described. They can tackle problems that feature heterogeneous agents, which therefore do not share the same policy and cannot be "stacked" together. Moreover, MARL-specific algorithms are designed to reduce the instabilities that arise in multi-agent training, where agents learn among other agents whose behavior keeps evolving.
+
+We provide utilities that simplify training navigation policies with BenchMARL, for example using the multi-agent version of SAC:
+
+.. code-block:: python
+
+   from navground.learning.parallel_env import shared_parallel_env
+   from benchmarl.algorithms import MasacConfig
+   from benchmarl.models.mlp import MlpConfig
+   from benchmarl.experiment import ExperimentConfig
+   from navground.learning.utils.benchmarl import NavgroundExperiment
+
+   penv = shared_parallel_env(scenario=..., sensor=...,
+                              observation_config=..., action_config=...,
+                              max_episode_steps=100)
+   masac_exp = NavgroundExperiment(
+       env=penv,
+       config=ExperimentConfig.get_from_yaml(),
+       model_config=MlpConfig.get_from_yaml(),
+       algorithm_config=MasacConfig.get_from_yaml(),
+       seed=0
+   )
+   masac_exp.run_for(iterations=20)
+
 Evaluation
 ==========
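
The same pattern presumably carries over to other BenchMARL algorithms by swapping the algorithm configuration; a sketch under that assumption, reusing ``penv`` and the configs from the example above with MAPPO:

.. code-block:: python

   from benchmarl.algorithms import MappoConfig

   mappo_exp = NavgroundExperiment(
       env=penv,
       config=ExperimentConfig.get_from_yaml(),
       model_config=MlpConfig.get_from_yaml(),
       algorithm_config=MappoConfig.get_from_yaml(),
       seed=0
   )
   mappo_exp.run_for(iterations=20)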

@@ -294,6 +409,14 @@ Once we have trained a policy (and possibly exported it to onnx using :py:func:`
 
    world.run(time_step=0.1, steps=1000)
 
+
+.. note::
+
+   After training a policy using BenchMARL, we need to extract a compatible policy using
+   :py:meth:`.utils.benchmarl.NavgroundExperiment.get_single_agent_policy`,
+   which transforms a TorchRL policy into a PyTorch policy.
+
+
 In practice, we do not need to perform the configuration manually. Instead, we can load it from a YAML file (exported e.g. using :py:func:`.io.export_policy_as_behavior`), as is common in navground:
 
 .. code-block:: YAML
@@ -350,7 +473,7 @@ Acknowledgement and disclaimer
 
 The work was supported in part by `REXASI-PRO <https://rexasi-pro.spindoxlabs.com>`_ H-EU project, call HORIZON-CL4-2021-HUMAN-01-01, Grant agreement no. 101070028.
 
-.. image:: https://rexasi-pro.spindoxlabs.com/wp-content/uploads/2023/01/Bianco-Viola-Moderno-Minimalista-Logo-e1675187551324.png
+.. figure:: https://rexasi-pro.spindoxlabs.com/wp-content/uploads/2023/01/Bianco-Viola-Moderno-Minimalista-Logo-e1675187551324.png
    :width: 300
    :alt: REXASI-PRO logo

docs/source/tutorials/basics/Gymnasium.ipynb

Lines changed: 1 addition & 1 deletion

@@ -506,7 +506,7 @@
     "which we can use to define a policy that simply asks the agent to actuate the action that it has already computed. \n",
     "\n",
     "Let us collect the reward from the navground policy in this way. \n",
-    "To understand the scale, in this case, the reward assigned at each step in maximal (=0) when the agents move straight towards the goal at optimal speed. When the entire safety margin is violated, it gets a penality of at most -1, the same value it gets if it stay in-place, while moving in the opposite target direction at optimal speed gets a -2. Therefore, we can expect an average reward between -1 and 0, and possibly near to 0 for a well-performing navigation behavior."
+    "To understand the scale, in this case, the reward assigned at each step is maximal (=0) when the agent moves straight towards the goal at optimal speed. When the entire safety margin is violated, it gets a penalty of at most -1, the same value it gets if it stays in place, while moving opposite to the target direction at optimal speed gets a -2. Therefore, we can expect an average reward between -1 and 0, and possibly near 0 for a well-performing navigation behavior."
    ]
   },
   {

docs/source/tutorials/corridor_with_obstacle/Scenario.ipynb

Lines changed: 1 addition & 1 deletion

@@ -7763,7 +7763,7 @@
    "id": "458f9e92-6dde-4a5d-860e-8ef17ec87de5",
    "metadata": {},
    "source": [
-    "and in all the other runs it has to actually avoid the obstacles, incurring in a penality "
+    "and in all the other runs it has to actually avoid the obstacles, incurring a penalty "
    ]
   },
   {

docs/source/tutorials/corridor_with_obstacle/index.rst

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 Corridor with obstacle
 ======================
 
-In a series of two notebooks, we look at a simple but more interesting scenario than :doc:`../empty/empty`.
+In a series of two notebooks, we look at a simple but more interesting scenario than :doc:`../empty/index`.
 
 In the first notebook, we explore the scenario defined as
 
docs/source/tutorials/crossing/Training-MA.ipynb

Lines changed: 1 addition & 1 deletion

@@ -2825,7 +2825,7 @@
    "id": "fad18282-974b-415b-abaf-888baa42461e",
    "metadata": {},
    "source": [
-    "Training in `penv` requires significanlty more steps than in `env` but takes a much shorter time. A similar number of steps per agent are required (about 30K)."
+    "Training in `penv` requires significantly more steps than in `env` but takes a much shorter time. A similar number of steps per agent is required (about 30K)."
    ]
   },
   {

docs/source/tutorials/empty/Direction.ipynb

Lines changed: 1 addition & 1 deletion

@@ -341,7 +341,7 @@
    "id": "13d5a637-cc98-4669-bf8e-094e29ef6ae5",
    "metadata": {},
    "source": [
-    "Let us compute the reward of the \"expert\" that uses the `Dummy` behavior.\n",
+    "Let us compute the reward of the \"expert\" that uses the ``Dummy`` behavior.\n",
     "\n",
     "we import `evaluate_policy` as\n",
     "\n",

docs/source/tutorials/pad/Behaviors.ipynb

Lines changed: 6 additions & 6 deletions

@@ -9,7 +9,7 @@
     "\n",
     "In this notebook we measure the performance of 4 model-based navigation behaviors:\n",
     "\n",
-    "`Dummy`: which ignores the pad and the other agent."
+    "`Dummy`, which ignores the pad and the other agent."
    ]
   },
   {
@@ -196,9 +196,9 @@
     "We compute the reward distribution when the agents start from opposite sides of the map.\n",
     "The reward function is composed of two terms:\n",
     "- `1 - efficacy` (i.e., 0 if moving forwards at full desired speed)\n",
-    "- `pad_penality` when two agents are on the pad at the same time.\n",
+    "- `pad_penalty` when two agents are on the pad at the same time.\n",
     "\n",
-    "here, we compute the reward for `pad_penality=1`, which we will later use to pick a penality that balance the tendency to stop and cross the pad. "
+    "Here, we compute the reward for `pad_penalty=1`, which we will later use to pick a penalty that balances the tendencies to stop and to cross the pad. "
    ]
   },
   {
@@ -210,7 +210,7 @@
   "source": [
    "from navground.learning.examples.pad import PadReward\n",
    "\n",
-    "reward = PadReward(pad_penality=1)"
+    "reward = PadReward(pad_penalty=1)"
    ]
   },
   {
@@ -432,7 +432,7 @@
    "id": "94859b79-7578-4eb9-a098-c63454624394",
    "metadata": {},
    "source": [
-    "Penality for efficiency should make `Dummy` and `StopAtPad` almost even:"
+    "Penalty for efficiency should make ``Dummy`` and `StopAtPad` almost even:"
    ]
   },
   {
@@ -666,7 +666,7 @@
    "id": "68f82b3e-0659-4c83-8e3b-87cec19c56e4",
    "metadata": {},
    "source": [
-    "Penality for efficiency should make `Dummy` and `StopAtPad` almost even:"
+    "Penalty for efficiency should make ``Dummy`` and `StopAtPad` almost even:"
    ]
   },
   {

docs/source/tutorials/pad/Communication/Comm-SAC-Split.ipynb

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@
   "source": [
    "# Distributed policy with comm trained using parallel SAC: split model\n",
    "\n",
-    "In this notebook, we test a variant of :doc:`Comm-SAC`, where we split the communication and acceleration models in a way that the transmitted communication *does not* depend on the received message but only on the other observations. \n",
+    "In this notebook, we test a variant of [the previous notebook](./Comm-SAC.ipynb), where we split the communication and acceleration models so that the transmitted communication *does not* depend on the received message but only on the other observations. \n",
    "\n",
    "The agents exchange 1 float and we share part of the reward."
   ]
