
Commit 3141723

[Doc] More doc on trainers (#663)
1 parent d28a8c3 commit 3141723

File tree

3 files changed: +174 −33 lines changed


README.md

Lines changed: 7 additions & 8 deletions
@@ -11,14 +11,6 @@
 
 # TorchRL
 
-## Disclaimer
-
-This library is not officially released yet and is subject to change.
-
-The features are available before an official release so that users and collaborators can get early access and provide feedback. No guarantee of stability, robustness or backward compatibility is provided.
-
----
-
 **TorchRL** is an open-source Reinforcement Learning (RL) library for PyTorch.
 
 It provides pytorch and **python-first**, low and high level abstractions for RL that are intended to be **efficient**, **modular**, **documented** and properly **tested**.
@@ -536,5 +528,12 @@ In the near future, we plan to:
 We welcome any contribution, should you want to contribute to these new features
 or any other, listed or not, in the issues section of this repository.
 
+
+## Disclaimer
+
+This library is not officially released yet and is subject to change.
+
+The features are available before an official release so that users and collaborators can get early access and provide feedback. No guarantee of stability, robustness or backward compatibility is provided.
+
 # License
 TorchRL is licensed under the MIT License. See [LICENSE](LICENSE) for details.

docs/source/reference/trainers.rst

Lines changed: 125 additions & 0 deletions
@@ -3,6 +3,131 @@
torchrl.trainers package
========================

The trainer package provides utilities to write re-usable training scripts. The core idea is to use a
trainer that implements a nested loop, where the outer loop runs the data collection steps and the inner
loop the optimization steps. We believe this fits multiple RL training schemes, such as
on-policy, off-policy, model-based and model-free solutions, offline RL and others.
More particular cases, such as meta-RL algorithms, may have training schemes that differ substantially.

The :obj:`trainer.train()` method can be sketched as follows:

.. code-block::
    :caption: Trainer loops

    >>> for batch in collector:
    ...     batch = self._process_batch_hook(batch)  # "batch_process"
    ...     self._pre_steps_log_hook(batch)  # "pre_steps_log"
    ...     self._pre_optim_hook()  # "pre_optim_steps"
    ...     for j in range(self.optim_steps_per_batch):
    ...         sub_batch = self._process_optim_batch_hook(batch)  # "process_optim_batch"
    ...         losses = self.loss_module(sub_batch)
    ...         self._post_loss_hook(sub_batch)  # "post_loss"
    ...         self.optimizer.step()
    ...         self.optimizer.zero_grad()
    ...         self._post_optim_hook()  # "post_optim"
    ...         self._post_optim_log(sub_batch)  # "post_optim_log"
    ...     self._post_steps_hook()  # "post_steps"
    ...     self._post_steps_log_hook(batch)  # "post_steps_log"

There are 9 hooks that can be used in a trainer loop: :obj:`"batch_process"`, :obj:`"pre_optim_steps"`,
:obj:`"process_optim_batch"`, :obj:`"post_loss"`, :obj:`"post_steps"`, :obj:`"post_optim"`, :obj:`"pre_steps_log"`,
:obj:`"post_steps_log"` and :obj:`"post_optim_log"`. They are indicated in the comments where they are applied.
Hooks can be split into 3 categories: **data processing** (:obj:`"batch_process"` and :obj:`"process_optim_batch"`),
**logging** (:obj:`"pre_steps_log"`, :obj:`"post_optim_log"` and :obj:`"post_steps_log"`) and **operation** hooks
(:obj:`"pre_optim_steps"`, :obj:`"post_loss"`, :obj:`"post_optim"` and :obj:`"post_steps"`).

- **Data processing** hooks update a tensordict of data. A hook's :obj:`__call__` method should accept
  a :obj:`TensorDict` object as input and update it given some strategy.
  Examples of such hooks include replay buffer extension (:obj:`ReplayBufferTrainer.extend`), data normalization (including normalization
  constants update), data subsampling (:obj:`BatchSubSampler`) and such. A minimal sketch is given after this list.

- **Logging** hooks take a batch of data presented as a :obj:`TensorDict` and write to the logger
  some information retrieved from that data. Examples include the :obj:`Recorder` hook, the reward
  logger (:obj:`LogReward`) and such. Hooks should return a dictionary (or a None value) containing the
  data to log. The key :obj:`"log_pbar"` is reserved for boolean values indicating whether the logged value
  should be displayed on the progress bar printed on the training log.

- **Operation** hooks are hooks that execute specific operations over the models, data collectors,
  target network updates and such. For instance, syncing the weights of the collectors using :obj:`UpdateWeights`
  or updating the priority of the replay buffer using :obj:`ReplayBufferTrainer.update_priority` are examples
  of operation hooks. They are data-independent (they do not require a :obj:`TensorDict`
  input); they are simply expected to be executed once at every iteration (or every N iterations).

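For illustration, here is a minimal sketch of a data-processing hook registered directly on the
:obj:`"batch_process"` entry point. The :obj:`"observation"` key, the scaling constant and the
:obj:`trainer` instance are assumptions made for the example, not part of the library:

.. code-block::

    >>> def scale_observation(batch):
    ...     # a data-processing hook: receives the collected TensorDict and
    ...     # returns the updated batch (key and constant are illustrative)
    ...     batch.set("observation", batch.get("observation") / 255.0)
    ...     return batch
    ...
    >>> trainer.register_op("batch_process", scale_observation)
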
The hooks provided by TorchRL usually inherit from a common abstract class :obj:`TrainerHookBase`,
and all implement three base methods: a :obj:`state_dict` and a :obj:`load_state_dict` method for
checkpointing, and a :obj:`register` method that registers the hook under its default name in the
trainer. This method takes a trainer and a module name as input. For instance, the following logging
hook is executed every 10 calls to :obj:`"post_optim_log"`:

.. code-block::

    >>> class LoggingHook(TrainerHookBase):
    ...     def __init__(self):
    ...         self.counter = 0
    ...
    ...     def register(self, trainer, name):
    ...         trainer.register_module(name, self)
    ...         trainer.register_op("post_optim_log", self)
    ...
    ...     def state_dict(self):
    ...         return {"counter": self.counter}
    ...
    ...     def load_state_dict(self, state_dict):
    ...         self.counter = state_dict["counter"]
    ...
    ...     def __call__(self, batch):
    ...         if self.counter % 10 == 0:
    ...             self.counter += 1
    ...             out = {"some_value": batch["some_value"].item(), "log_pbar": False}
    ...         else:
    ...             out = None
    ...             self.counter += 1
    ...         return out

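Once a trainer has been built (see the checkpointing example below for a full constructor call),
attaching the hook amounts to the following sketch; the module name is arbitrary:

.. code-block::

    >>> logging_hook = LoggingHook()
    >>> logging_hook.register(trainer, name="logging_hook")
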
Checkpointing
-------------

The trainer class and hooks support checkpointing, which can be achieved either
using the `torchsnapshot <https://github.com/pytorch/torchsnapshot/>`_ backend or
the regular torch backend. This can be controlled via the global variable :obj:`CKPT_BACKEND`:

.. code-block::

    $ CKPT_BACKEND=torch python script.py

which defaults to :obj:`torchsnapshot`. The advantage of torchsnapshot over pytorch
is that it is a more flexible API, which supports distributed checkpointing and
also allows users to load tensors from a file stored on disk to a tensor with a
physical storage (which pytorch currently does not support). This allows, for instance,
loading tensors to and from a replay buffer that would otherwise not fit in memory.

When building a trainer, one can provide a path where the checkpoints are to
be written. With the :obj:`torchsnapshot` backend, a directory path is expected,
whereas the :obj:`torch` backend expects a file path (typically a :obj:`.pt` file).

.. code-block::

    >>> filepath = "path/to/dir/"
    >>> trainer = Trainer(
    ...     collector=collector,
    ...     total_frames=total_frames,
    ...     frame_skip=frame_skip,
    ...     loss_module=loss_module,
    ...     optimizer=optimizer,
    ...     save_trainer_file=filepath,
    ... )
    >>> select_keys = SelectKeys(["action", "observation"])
    >>> select_keys.register(trainer)
    >>> # to save to a path
    >>> trainer.save_trainer(True)
    >>> # to load from a path
    >>> trainer.load_from_file(filepath)

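With the :obj:`torch` backend the same construction applies, except that :obj:`save_trainer_file`
should point to a single file; a sketch (the filename below is arbitrary):

.. code-block::

    >>> trainer = Trainer(
    ...     collector=collector,
    ...     total_frames=total_frames,
    ...     frame_skip=frame_skip,
    ...     loss_module=loss_module,
    ...     optimizer=optimizer,
    ...     save_trainer_file="path/to/checkpoint.pt",
    ... )
    >>> trainer.save_trainer(True)
    >>> trainer.load_from_file("path/to/checkpoint.pt")
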
The :obj:`Trainer.train()` method can be used to execute the above loop with all of
its hooks, although using the :obj:`Trainer` class only for its checkpointing capability
is also a perfectly valid use.


Trainer and hooks
-----------------

torchrl/trainers/trainers.py

Lines changed: 42 additions & 25 deletions
@@ -5,6 +5,7 @@
 
 from __future__ import annotations
 
+import abc
 import pathlib
 import warnings
 from collections import OrderedDict, defaultdict
@@ -60,6 +61,22 @@
 TYPE_DESCR = {float: "4.4f", int: ""}
 
 
+class TrainerHookBase:
+    """An abstract hooking class for torchrl Trainer class."""
+
+    @abc.abstractmethod
+    def state_dict(self) -> Dict[str, Any]:
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def register(self, trainer: Trainer, name: str):
+        raise NotImplementedError
+
+
 class Trainer:
     """A generic Trainer class.
 
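A minimal concrete hook built on this base class might look like the following sketch (names are
hypothetical; the built-in hooks updated in the hunks below follow the same pattern):

class OptimStepCounter(TrainerHookBase):
    """Illustrative hook: counts optimization passes via the "post_optim" hook."""

    def __init__(self):
        self.count = 0

    def __call__(self):
        # "post_optim" hooks are called without arguments, once per optim step
        self.count += 1

    def state_dict(self):
        return {"count": self.count}

    def load_state_dict(self, state_dict):
        self.count = state_dict["count"]

    def register(self, trainer, name: str = "optim_step_counter"):
        trainer.register_module(name, self)
        trainer.register_op("post_optim", self)
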
@@ -540,7 +557,7 @@ def _load_list_state_dict(list_state_dict, hook_list):
             hook_list[i] = (item, kwargs)
 
 
-class SelectKeys:
+class SelectKeys(TrainerHookBase):
     """Selects keys in a TensorDict batch.
 
     Args:
@@ -580,12 +597,12 @@ def state_dict(self) -> Dict[str, Any]:
     def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
         pass
 
-    def register(self, trainer) -> None:
+    def register(self, trainer, name="select_keys") -> None:
         trainer.register_op("batch_process", self)
-        trainer.register_module("select_keys", self)
+        trainer.register_module(name, self)
 
 
-class ReplayBufferTrainer:
+class ReplayBufferTrainer(TrainerHookBase):
     """Replay buffer hook provider.
 
     Args:
@@ -673,14 +690,14 @@ def state_dict(self) -> Dict[str, Any]:
     def load_state_dict(self, state_dict) -> None:
         self.replay_buffer.load_state_dict(state_dict["replay_buffer"])
 
-    def register(self, trainer: Trainer):
+    def register(self, trainer: Trainer, name: str = "replay_buffer"):
         trainer.register_op("batch_process", self.extend)
         trainer.register_op("process_optim_batch", self.sample)
         trainer.register_op("post_loss", self.update_priority)
-        trainer.register_module("replay_buffer", self)
+        trainer.register_module(name, self)
 
 
-class ClearCudaCache:
+class ClearCudaCache(TrainerHookBase):
     """Clears cuda cache at a given interval.
 
     Examples:
@@ -699,7 +716,7 @@ def __call__(self, *args, **kwargs):
         torch.cuda.empty_cache()
 
 
-class LogReward:
+class LogReward(TrainerHookBase):
     """Reward logger hook.
 
     Args:
@@ -730,12 +747,12 @@ def __call__(self, batch: TensorDictBase) -> Dict:
             "log_pbar": self.log_pbar,
         }
 
-    def register(self, trainer: Trainer):
+    def register(self, trainer: Trainer, name: str = "log_reward"):
         trainer.register_op("pre_steps_log", self)
-        trainer.register_module("log_reward", self)
+        trainer.register_module(name, self)
 
 
-class RewardNormalizer:
+class RewardNormalizer(TrainerHookBase):
     """Reward normalizer hook.
 
     Args:
@@ -822,10 +839,10 @@ def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
         for key, value in state_dict.items():
             setattr(self, key, value)
 
-    def register(self, trainer: Trainer):
+    def register(self, trainer: Trainer, name: str = "reward_normalizer"):
         trainer.register_op("batch_process", self.update_reward_stats)
         trainer.register_op("process_optim_batch", self.normalize_reward)
-        trainer.register_module("reward_normalizer", self)
+        trainer.register_module(name, self)
 
 
 def mask_batch(batch: TensorDictBase) -> TensorDictBase:
@@ -849,7 +866,7 @@ def mask_batch(batch: TensorDictBase) -> TensorDictBase:
     return batch
 
 
-class BatchSubSampler:
+class BatchSubSampler(TrainerHookBase):
     """Data subsampler for online RL algorithms.
 
     This class subsamples a part of a whole batch of data just collected from the
@@ -969,15 +986,15 @@ def state_dict(self) -> Dict[str, Any]:
     def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
         pass
 
-    def register(self, trainer):
+    def register(self, trainer: Trainer, name: str = "batch_subsampler"):
         trainer.register_op(
             "process_optim_batch",
             self,
         )
-        trainer.register_module("batch_subsampler", self)
+        trainer.register_module(name, self)
 
 
-class Recorder:
+class Recorder(TrainerHookBase):
     """Recorder hook for Trainer.
 
     Args:
@@ -1092,15 +1109,15 @@ def load_state_dict(self, state_dict: Dict) -> None:
         self._count = state_dict["_count"]
         self.recorder.load_state_dict(state_dict["recorder_state_dict"])
 
-    def register(self, trainer: Trainer):
-        trainer.register_module("recorder", self)
+    def register(self, trainer: Trainer, name: str = "recorder"):
+        trainer.register_module(name, self)
         trainer.register_op(
             "post_steps_log",
             self,
         )
 
 
-class UpdateWeights:
+class UpdateWeights(TrainerHookBase):
     """A collector weights update hook class.
 
     This hook must be used whenever the collector policy weights sit on a
@@ -1130,8 +1147,8 @@ def __call__(self):
         if self.counter % self.update_weights_interval == 0:
             self.collector.update_policy_weights_()
 
-    def register(self, trainer: Trainer):
-        trainer.register_module("update_weights", self)
+    def register(self, trainer: Trainer, name: str = "update_weights"):
+        trainer.register_module(name, self)
         trainer.register_op(
             "post_steps",
             self,
@@ -1144,7 +1161,7 @@ def load_state_dict(self, state_dict) -> None:
         return
 
 
-class CountFramesLog:
+class CountFramesLog(TrainerHookBase):
     """A frame counter hook.
 
     Args:
@@ -1178,8 +1195,8 @@ def __call__(self, batch: TensorDictBase) -> Dict:
         self.frame_count += current_frames
         return {"n_frames": self.frame_count, "log_pbar": self.log_pbar}
 
-    def register(self, trainer: Trainer):
-        trainer.register_module("count_frames_log", self)
+    def register(self, trainer: Trainer, name: str = "count_frames_log"):
+        trainer.register_module(name, self)
         trainer.register_op(
             "pre_steps_log",
             self,