Releases: DLR-RM/stable-baselines3
Stable-Baselines3 v1.7.0: non-shared features extractor, bug fixes and quality of life improvements
SB3 Contrib (more algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
RL Zoo3 (training framework): https://github.com/DLR-RM/rl-baselines3-zoo
To upgrade:
pip install stable_baselines3 sb3_contrib rl_zoo3 --upgrade
or simply (RL Zoo3 depends on SB3 and SB3 Contrib):
pip install rl_zoo3 --upgrade
Warning
Shared layers in the MLP policy (`mlp_extractor`) are now deprecated for PPO, A2C and TRPO.
This feature will be removed in SB3 v1.8.0, and `net_arch=[64, 64]` will then create separate networks with the same architecture, to be consistent with the off-policy algorithms.
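A minimal sketch of how to request separate actor and critic networks explicitly today, assuming the pre-1.8 `net_arch` convention where a trailing dict specifies the policy (`pi`) and value (`vf`) branches:

```python
import gym

from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

# Separate 64-64 networks for the policy (pi) and the value function (vf),
# i.e. no shared layers in the mlp_extractor.
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[dict(pi=[64, 64], vf=[64, 64])]),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```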
Note
A2C and PPO models saved with SB3 < 1.7.0 will show a warning about
missing keys in the state dict when loaded with SB3 >= 1.7.0.
To suppress the warning, simply save the model again.
You can find more info in issue #1233
Breaking Changes:
- Removed deprecated `create_eval_env`, `eval_env`, `eval_log_path`, `n_eval_episodes` and `eval_freq` parameters, please use an `EvalCallback` instead (see the sketch after this list)
- Removed deprecated `sde_net_arch` parameter
- Removed `ret` attributes in `VecNormalize`, please use `returns` instead
- `VecNormalize` now updates the observation space when normalizing images
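A minimal sketch of the recommended replacement for the removed evaluation parameters, using `EvalCallback` (the environment id, frequencies and paths below are illustrative):

```python
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Periodic evaluation replaces the removed eval_env/eval_freq/... parameters.
eval_callback = EvalCallback(
    eval_env,
    eval_freq=1_000,
    n_eval_episodes=5,
    best_model_save_path="./logs/",
    log_path="./logs/",
    deterministic=True,
)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=20_000, callback=eval_callback)
```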
New Features:
- Introduced mypy type checking
- Added option to have non-shared features extractor between actor and critic in on-policy algorithms (@AlexPasqua) (see the sketch after this list)
- Added `with_bias` argument to `create_mlp`
- Added support for multidimensional `spaces.MultiBinary` observations
- Features extractors now properly support unnormalized image-like observations (3D tensor) when passing `normalize_images=False`
- Added `normalized_image` parameter to `NatureCNN` and `CombinedExtractor`
- Added support for Python 3.10
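A minimal sketch of the non-shared features extractor option for on-policy algorithms, assuming it is exposed through the `share_features_extractor` entry of `policy_kwargs`:

```python
import gym

from stable_baselines3 import A2C

env = gym.make("CartPole-v1")

# Give the actor and the critic their own features extractor
# instead of sharing a single one.
model = A2C(
    "MlpPolicy",
    env,
    policy_kwargs=dict(share_features_extractor=False),
    verbose=1,
)
model.learn(total_timesteps=5_000)
```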
SB3-Contrib
- Fixed a bug in `RecurrentPPO` where the LSTM states were incorrectly reshaped for `n_lstm_layers > 1` (thanks @kolbytn)
- Fixed `RuntimeError: rnn: hx is not contiguous` while predicting terminal values for `RecurrentPPO` when `n_lstm_layers > 1`
RL Zoo
- Added support for Python files for configuration
- Added `monitor_kwargs` parameter
Bug Fixes:
- Fixed `ProgressBarCallback` under-reporting (@dominicgkerr)
- Fixed return type of `evaluate_actions` in `ActorCriticPolicy` to reflect that entropy is an optional tensor (@Rocamonde)
- Fixed type annotation of `policy` in `BaseAlgorithm` and `OffPolicyAlgorithm`
- Allowed model trained with Python 3.7 to be loaded with Python 3.8+ without the `custom_objects` workaround
- Raise an error when the same gym environment instance is passed as separate environments when creating a vectorized environment with more than one environment (@Rocamonde)
- Fixed type annotation of `model` in `evaluate_policy`
- Fixed `Self` return type using `TypeVar`
- Fixed the env checker: the key was not passed when checking images from Dict observation space
- Fixed `normalize_images` which was not passed to parent class in some cases
- Fixed `load_from_vector` that was broken with newer PyTorch versions when passing a PyTorch tensor
Deprecations:
- You should now explicitly pass a `features_extractor` parameter when calling `extract_features()`
- Deprecated shared layers in `MlpExtractor` (@AlexPasqua)
Others:
- Used issue forms instead of issue templates
- Updated the PR template to associate each PR with its peer in RL-Zoo3 and SB3-Contrib
- Fixed flake8 config to be compatible with flake8 6+
- Goal-conditioned environments are now characterized by the availability of the `compute_reward` method, rather than by their inheritance from `gym.GoalEnv`
- Replaced `CartPole-v0` by `CartPole-v1` in tests
- Fixed `tests/test_distributions.py` type hints
- Fixed `stable_baselines3/common/type_aliases.py` type hints
- Fixed `stable_baselines3/common/torch_layers.py` type hints
- Fixed `stable_baselines3/common/env_util.py` type hints
- Fixed `stable_baselines3/common/preprocessing.py` type hints
- Fixed `stable_baselines3/common/atari_wrappers.py` type hints
- Fixed `stable_baselines3/common/vec_env/vec_check_nan.py` type hints
- Exposed modules in `__init__.py` with the `__all__` attribute (@ZikangXiong)
- Upgraded GitHub CI/setup-python to v4 and checkout to v3
- Set tensors construction directly on the device (~8% speed boost on GPU)
- Monkey-patched `np.bool = bool` so gym 0.21 is compatible with NumPy 1.24+
- Standardized the use of `from gym import spaces`
- Modified `get_system_info` to avoid issue linked to copy-pasting on GitHub issue
Documentation:
- Updated Hugging Face Integration page (@simoninithomas)
- Changed `env` to `vec_env` when the environment is vectorized
- Updated custom policy docs to better explain the `mlp_extractor`'s dimensions (@AlexPasqua)
- Updated custom policy documentation (@athatheo)
- Improved tensorboard callback doc
- Clarify doc when using image-like input
- Added RLeXplore to the project page (@yuanmingqi)
SB3 v1.6.2: Progress bar and RL Zoo3 package
SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
RL Zoo3: https://github.com/DLR-RM/rl-baselines3-zoo
New Features:
- Added `progress_bar` argument in the `learn()` method, displayed using TQDM and rich packages (see the sketch after this list)
- Added progress bar callback
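A minimal usage sketch of the new `progress_bar` argument (requires the TQDM and rich packages to be installed):

```python
import gym

from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)

# Display a progress bar (with estimated remaining time) during training.
model.learn(total_timesteps=50_000, progress_bar=True)
```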
RL Zoo3
- The RL Zoo can now be installed as a package (`pip install rl_zoo3`)
Bug Fixes:
- Fixed the issue that `self.num_timesteps` was initialized properly only after the first call to `on_step()` for callbacks
- Set importlib-metadata version to `~=4.13` to be compatible with `gym=0.21`
Deprecations:
- Added deprecation warning if parameters `eval_env`, `eval_freq` or `create_eval_env` are used (see #925) (@tobirohrer)
Others:
- Fixed type hint of the `env_id` parameter in `make_vec_env` and `make_atari_env` (@AlexPasqua)
Documentation:
- Extended docstring of the `wrapper_class` parameter in `make_vec_env` (@AlexPasqua)
SB3 v1.6.1: Bug fix release
SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Breaking Changes:
- Switched minimum tensorboard version to 2.9.1
New Features:
- Support logging hyperparameters to tensorboard (@timothe-chaumont)
- Added checkpoints for replay buffer and `VecNormalize` statistics (@anand-bala) (see the sketch after this list)
- Added option for `Monitor` to append to existing file instead of overriding (@sidney-tio)
- The env checker now raises an error when using dict observation spaces and observation keys don't match observation space keys
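A minimal sketch of checkpointing the replay buffer and `VecNormalize` statistics alongside the model, assuming the new options are exposed as the `save_replay_buffer` and `save_vecnormalize` arguments of `CheckpointCallback`:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

vec_env = VecNormalize(make_vec_env("Pendulum-v1", n_envs=1))

# Assumption: these two flags also save the replay buffer and the
# VecNormalize statistics at every checkpoint.
checkpoint_callback = CheckpointCallback(
    save_freq=10_000,
    save_path="./checkpoints/",
    name_prefix="sac_pendulum",
    save_replay_buffer=True,
    save_vecnormalize=True,
)

model = SAC("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=50_000, callback=checkpoint_callback)
```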
SB3-Contrib
- Fixed the issue of wrongly passing policy arguments when using `CnnLstmPolicy` or `MultiInputLstmPolicy` with `RecurrentPPO` (@mlodel)
Bug Fixes:
- Fixed issue where `PPO` gives NaN if rollout buffer provides a batch of size 1 (@hughperkins)
- Fixed the issue that `predict` does not always return action as `np.ndarray` (@qgallouedec)
- Fixed division by zero error when computing FPS when a small amount of time has elapsed in operating systems with low-precision timers
- Added multidimensional action space support (@qgallouedec)
- Fixed missing verbose parameter passing in the `EvalCallback` constructor (@BurakDmb)
- Fixed the issue that when updating the target network in DQN, SAC, TD3, the `running_mean` and `running_var` properties of batch norm layers are not updated (@honglu2875)
- Fixed incorrect type annotation of the `replay_buffer_class` argument in `common.OffPolicyAlgorithm` initializer, where an instance instead of a class was required (@Rocamonde)
- Fixed loading saved model with different number of environments
- Removed `forward()` abstract method declaration from `common.policies.BaseModel` (already defined in `torch.nn.Module`) to fix type errors in subclasses (@Rocamonde)
- Fixed the return type of `.load()` and `.learn()` methods in `BaseAlgorithm` so that they now use `TypeVar` (@Rocamonde)
- Fixed an issue where keys with different tags but the same key raised an error in `common.logger.HumanOutputFormat` (@Rocamonde and @AdamGleave)
Others:
- Fixed `DictReplayBuffer.next_observations` typing (@qgallouedec)
- Added support for `device="auto"` in buffers and made it default (@qgallouedec)
- Updated `ResultsWriter` (used internally by the `Monitor` wrapper) to automatically create missing directories when `filename` is a path (@dominicgkerr)
Documentation:
- Added an example of callback that logs hyperparameters to tensorboard. (@timothe-chaumont)
- Fixed typo in docstring "nature" -> "Nature" (@Melanol)
- Added info on split tensorboard logs into (@Melanol)
- Fixed typo in ppo doc (@francescoluciano)
- Fixed typo in install doc (@jlp-ue)
- Clarified and standardized verbosity documentation
- Added link to a GitHub issue in the custom policy documentation (@AlexPasqua)
- Fixed typos (@Akhilez)
SB3 v1.6.0: Recurrent PPO (PPO LSTM), better defaults for learning from pixels with SAC/TD3
SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Breaking Changes:
- Changed the way policy "aliases" are handled ("MlpPolicy", "CnnPolicy", ...), removing the former `register_policy` helper and `policy_base` parameter and using `policy_aliases` static attributes instead (@Gregwar)
- SB3 now requires PyTorch >= 1.11
- Changed the default network architecture when using `CnnPolicy` or `MultiInputPolicy` with SAC or DDPG/TD3: `share_features_extractor` is now set to False by default and `net_arch=[256, 256]` (instead of `net_arch=[]` as before); see the sketch after this list for how to restore the previous defaults
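A minimal sketch of restoring the previous defaults for an image-based SAC agent via `policy_kwargs` (the environment id is illustrative, and this assumes the SAC policy accepts `share_features_extractor` and `net_arch` as shown):

```python
from stable_baselines3 import SAC

# Restore the pre-1.6.0 behavior: shared features extractor between
# actor and critic, and no extra fully-connected layers after the CNN.
model = SAC(
    "CnnPolicy",
    "CarRacing-v0",
    policy_kwargs=dict(share_features_extractor=True, net_arch=[]),
    buffer_size=100_000,
    verbose=1,
)
```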
SB3-Contrib
- Added Recurrent PPO (PPO LSTM). See Stable-Baselines-Team/stable-baselines3-contrib#53
Bug Fixes:
- Fixed saving and loading large policies greater than 2GB (@jkterry1, @ycheng517)
- Fixed final goal selection strategy that did not sample the final achieved goal (@qgallouedec)
- Fixed a bug with special characters in the tensorboard log name (@quantitative-technologies)
- Fixed a bug in `DummyVecEnv`'s and `SubprocVecEnv`'s seeding function: the None value was unchecked (@ScheiklP)
- Fixed a bug where `EvalCallback` would crash when trying to synchronize `VecNormalize` stats when observation normalization was disabled
- Added a check for unbounded actions
- Fixed issues due to newer version of protobuf (tensorboard) and sphinx
- Fix exception causes all over the codebase (@cool-RR)
- Prohibit simultaneous use of `optimize_memory_usage` and `handle_timeout_termination` due to a bug (@MWeltevrede)
- Fixed a bug in the `kl_divergence` check that would fail when using numpy arrays with the MultiCategorical distribution
Others:
- Upgraded to Python 3.7+ syntax using `pyupgrade`
- Removed redundant double-check for nested observations from `BaseAlgorithm._wrap_env` (@TibiGG)
Documentation:
- Added link to gym doc and gym env checker
- Fix typo in PPO doc (@bcollazo)
- Added link to PPO ICLR blog post
- Added remark about breaking Markov assumption and timeout handling
- Added doc about MLFlow integration via custom logger (@git-thor)
- Updated Huggingface integration doc
- Added copy button for code snippets
- Added doc about EnvPool and Isaac Gym support
SB3 v1.5.0: Bug fixes, early stopping callback
SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Breaking Changes:
- Switched minimum Gym version to 0.21.0.
New Features:
- Added `StopTrainingOnNoModelImprovement` to callback collection (@caburu) (see the sketch after this list)
- Makes the length of keys and values in `HumanOutputFormat` configurable, depending on desired maximum width of output
- Allow PPO to turn off advantage normalization (see PR #763) (@vwxyzjn)
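A minimal sketch of early stopping with the new callback, assuming it is combined with `EvalCallback` via its `callback_after_eval` argument:

```python
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnNoModelImprovement

eval_env = gym.make("CartPole-v1")

# Stop training if there is no improvement after 3 consecutive evaluations
# (and only after at least 5 evaluations have been performed).
stop_callback = StopTrainingOnNoModelImprovement(max_no_improvement_evals=3, min_evals=5, verbose=1)
eval_callback = EvalCallback(eval_env, eval_freq=1_000, callback_after_eval=stop_callback, verbose=1)

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000, callback=eval_callback)
```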
SB3-Contrib
- coming soon: Cross Entropy Method, see Stable-Baselines-Team/stable-baselines3-contrib#62
Bug Fixes:
- Fixed a bug in `VecMonitor`: the monitor did not consider the `info_keywords` during stepping (@ScheiklP)
- Fixed a bug in `HumanOutputFormat`: distinct keys truncated to the same prefix would overwrite each other's value, resulting in only one being output. This now raises an error (this should only affect a small fraction of use cases with very long keys)
- Routing all the `nn.Module` calls through implicit rather than explicit forward as per PyTorch guidelines (@manuel-delverme)
- Fixed a bug in `VecNormalize` where an error occurs when `norm_obs` is set to False for environments with dictionary observations (@buoyancy99)
- Set default `env` argument to `None` in `HerReplayBuffer.sample` (@qgallouedec)
- Fix `batch_size` typing in `DQN` (@qgallouedec)
- Fixed sample normalization in `DictReplayBuffer` (@qgallouedec)
Others:
- Fixed pytest warnings
- Removed parameter `remove_time_limit_termination` in off-policy algorithms since it was dead code (@Gregwar)
Documentation:
- Added doc on Hugging Face integration (@simoninithomas)
- Added furuta pendulum project to project list (@Armandpl)
- Fix indentation 2 spaces to 4 spaces in custom env documentation example (@Gautam-J)
- Update MlpExtractor docstring (@gianlucadecola)
- Added explanation of the logger output
- Update `Directly Accessing The Summary Writer` in tensorboard integration (@xy9485)
Full Changelog: v1.4.0...v1.5.0
SB3 v1.4.0: TRPO, ARS and multi env training for off-policy algorithms
SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
Breaking Changes:
- Dropped Python 3.6 support (as announced in previous release)
- Renamed `mask` argument of the `predict()` method to `episode_start` (used with RNN policies only)
- Local variables `action`, `done` and `reward` were renamed to their plural form for off-policy algorithms (`actions`, `dones`, `rewards`); this may affect custom callbacks
- Removed `episode_reward` field from `RolloutReturn()` type
Warning:
An update to the `HER` algorithm is planned to support multi-env training and remove the max episode length constraint (see PR #704).
This will be a backward incompatible change (models trained with a previous version of `HER` won't work with the new version).
New Features:
- Added `norm_obs_keys` param for the `VecNormalize` wrapper to configure which observation keys to normalize (@kachayev) (see the sketch after this list)
- Added experimental support to train off-policy algorithms with multiple envs (note: `HerReplayBuffer` currently not supported)
- Handle timeout termination properly for on-policy algorithms (when using `TimeLimit`)
- Added `skip` option to `VecTransposeImage` to skip transforming the channel order when the heuristic is wrong
- Added `copy()` and `combine()` methods to `RunningMeanStd`
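A minimal sketch of the new `norm_obs_keys` option for environments with Dict observations (the environment id and key names below are illustrative):

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Illustrative goal-conditioned environment whose Dict observation space
# contains "observation", "achieved_goal" and "desired_goal" keys.
vec_env = make_vec_env("FetchReach-v1", n_envs=4)

# Only normalize the selected observation keys (e.g. leave the goals untouched).
vec_env = VecNormalize(vec_env, norm_obs=True, norm_obs_keys=["observation", "desired_goal"])
```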
SB3-Contrib
- Added Trust Region Policy Optimization (TRPO) (@cyprienc)
- Added Augmented Random Search (ARS) (@sgillen)
- Coming soon: PPO LSTM, see Stable-Baselines-Team/stable-baselines3-contrib#53
Bug Fixes:
- Fixed a bug where `set_env()` with `VecNormalize` would result in an error with off-policy algorithms (thanks @cleversonahum)
- FPS calculation is now performed based on number of steps performed during last `learn` call, even when `reset_num_timesteps` is set to `False` (@kachayev)
- Fixed evaluation script for recurrent policies (experimental feature in SB3 contrib)
- Fixed a bug where the observation would be incorrectly detected as non-vectorized instead of throwing an error
- The env checker now properly checks and warns about potential issues for continuous action spaces when the boundaries are too small or when the dtype is not float32
- Fixed a bug in `VecFrameStack` with channel-first image envs, where the terminal observation would be wrongly created
Others:
- Added a warning in the env checker when not using `np.float32` for continuous actions
- Improved test coverage and error message when checking shape of observation
- Added `newline="\n"` when opening CSV monitor files so that each line ends with `\r\n` instead of `\r\r\n` on Windows while Linux environments are not affected (@hsuehch)
- Fixed `device` argument inconsistency (@qgallouedec)
Documentation:
- Add drivergym to projects page (@theDebugger811)
- Add highway-env to projects page (@eleurent)
- Add tactile-gym to projects page (@ac-93)
- Fix indentation in the RL tips page (@cove9988)
- Update GAE computation docstring
- Add documentation on exporting to TFLite/Coral
- Added JMLR paper and updated citation
- Added link to RL Tips and Tricks video
- Updated `BaseAlgorithm.load` docstring (@Demetrio92)
- Added a note on `load` behavior in the examples (@Demetrio92)
- Updated SB3 Contrib doc
- Fixed A2C and migration guide guidance on how to set epsilon with RMSpropTFLike (@thomasgubler)
- Fixed custom policy documentation (@IperGiove)
- Added doc on Weights & Biases integration
SB3 v1.3.0: Bug fixes and improvements for the user
WARNING: This version will be the last one supporting Python 3.6 (end of life in Dec 2021).
We highly recommend upgrading to Python >= 3.7.
SB3-Contrib changelog: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/releases/tag/v1.3.0
Breaking Changes:
- `sde_net_arch` argument in policies is deprecated and will be removed in a future version
- `_get_latent` (`ActorCriticPolicy`) was removed
- All logging keys now use underscores instead of spaces (@timokau). Concretely this changes `time/total timesteps` to `time/total_timesteps` for off-policy algorithms and the eval callback (on-policy algorithms such as PPO and A2C already used the underscored version), `rollout/exploration rate` to `rollout/exploration_rate` and `rollout/success rate` to `rollout/success_rate`
New Features:
- Added methods `get_distribution` and `predict_values` for `ActorCriticPolicy` for A2C/PPO/TRPO (@cyprienc)
- Added methods `forward_actor` and `forward_critic` for `MlpExtractor`
- Added `sb3.get_system_info()` helper function to gather version information relevant to SB3 (e.g., Python and PyTorch version) (see the sketch after this list)
- Saved models now store system information where the agent was trained, and load functions have a `print_system_info` parameter to help debugging load issues
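A minimal sketch of the new debugging helpers (the saved model path is illustrative):

```python
import stable_baselines3 as sb3
from stable_baselines3 import PPO

# Print and return version information about the current setup
# (OS, Python, PyTorch, SB3, ...).
env_info, env_info_str = sb3.get_system_info()

# When loading a model, also print the system information stored at save time
# to help diagnose version-mismatch issues.
model = PPO.load("ppo_cartpole.zip", print_system_info=True)
```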
Bug Fixes:
- Fixed `dtype` of observations for `SimpleMultiObsEnv`
- Allow `VecNormalize` to wrap discrete-observation environments to normalize reward when observation normalization is disabled
- Fixed a bug where `DQN` would throw an error when using `Discrete` observation and stochastic actions
- Fixed a bug where sub-classed observation spaces could not be used
- Added `force_reset` argument to `load()` and `set_env()` in order to be able to call `learn(reset_num_timesteps=False)` with a new environment
Others:
- Cap gym max version to 0.19 to avoid issues with atari-py and other breaking changes
- Improved error message when using dict observation with the wrong policy
- Improved error message when using `EvalCallback` with two envs not wrapped the same way
- Added additional info about supported Python versions for PyPI in `setup.py`
Documentation:
- Add Rocket League Gym to list of supported projects (@AechPro)
- Added gym-electric-motor to project page (@wkirgsn)
- Added policy-distillation-baselines to project page (@CUN-bjy)
- Added ONNX export instructions (@batu)
- Update read the doc env (fixed `docutils` issue)
- Fix PPO environment name (@IljaAvadiev)
- Fix custom env doc and add env registration example
- Update algorithms from SB3 Contrib
- Use underscores for numeric literals in examples to improve clarity
SB3 v1.2.0: Hotfix for VecNormalize, training/eval mode support
Breaking Changes:
- SB3 now requires PyTorch >= 1.8.1
- `VecNormalize` `ret` attribute was renamed to `returns`
Bug Fixes:
- Hotfix for `VecNormalize` where the observation filter was not updated at reset (thanks @vwxyzjn)
- Fixed model predictions when using batch normalization and dropout layers by calling `train()` and `eval()` (@davidblom603)
- Fixed model training for DQN, TD3 and SAC so that their target nets always remain in evaluation mode (@ayeright)
- Passing `gradient_steps=0` to an off-policy algorithm will result in no gradient steps being taken (vs as many gradient steps as steps done in the environment during the rollout in previous versions)
Others:
- Enabled Python 3.9 in GitHub CI
- Fixed type annotations
- Refactored `predict()` by moving the preprocessing to the `obs_to_tensor()` method
Documentation:
- Updated multiprocessing example
- Added example of `VecEnvWrapper`
- Added a note about logging to tensorboard more often
- Added warning about simplicity of examples and link to RL zoo (@MihaiAnca13)
SB3 v1.1.0: Dictionary observation support, timeout handling and refactored HER buffer
Breaking Changes
- All custom environments (e.g. the `BitFlippingEnv` or `IdentityEnv`) were moved to the `stable_baselines3.common.envs` folder
- Refactored `HER`, which is now the `HerReplayBuffer` class that can be passed to any off-policy algorithm (see the example below)
- Handle timeout termination properly for off-policy algorithms (when using `TimeLimit`)
- Renamed `_last_dones` and `dones` to `_last_episode_starts` and `episode_starts` in `RolloutBuffer`
- Removed `ObsDictWrapper` as `Dict` observation spaces are now supported
from stable_baselines3 import HerReplayBuffer, SAC
from stable_baselines3.common.envs import BitFlippingEnv

# Goal-conditioned toy env with continuous actions (suitable for SAC + HER)
env = BitFlippingEnv(n_bits=10, continuous=True)
her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
# SB3 < 1.1.0
# model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
# SB3 >= 1.1.0:
model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)
- Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (@09tangriro)
- Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after end of epoch (@09tangriro)
- Removed parameter `channels_last` from `is_image_space` as it can be inferred
- The logger object is now an attribute `model.logger` that can be set by the user using `model.set_logger()` (see the sketch after this list)
- Changed the signature of `logger.configure` and `utils.configure_logger`; they now return a `Logger` object
- Removed `Logger.CURRENT` and `Logger.DEFAULT`
- Moved `warn(), debug(), log(), info(), dump()` methods to the `Logger` class
- `.learn()` now throws an import error when the user tries to log to tensorboard but the package is not installed
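A minimal sketch of the new logger API (the log folder and output formats below are illustrative):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.logger import configure

# configure() now returns a Logger object that can be attached to a model.
new_logger = configure("/tmp/sb3_log/", ["stdout", "csv", "tensorboard"])

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.set_logger(new_logger)
model.learn(total_timesteps=10_000)
```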
New Features
- Added support for single-level `Dict` observation space (@JadenTravnik)
- Added `DictRolloutBuffer` and `DictReplayBuffer` to support dictionary observations (@JadenTravnik)
- Added `StackedObservations` and `StackedDictObservations` that are used within `VecFrameStack`
- Added simple 4x4 room Dict test environments
- `HerReplayBuffer` now supports `VecNormalize` when `online_sampling=False`
- Added `VecMonitor` and `VecExtractDictObs` wrappers to handle gym3-style vectorized environments (@vwxyzjn)
- Ignored the terminal observation if it is not provided by the environment, such as the gym3-style vectorized environments (@vwxyzjn)
- Added `policy_base` as input to the `OnPolicyAlgorithm` for more flexibility (@09tangriro)
- Added support for image observations when using `HER`
- Added `replay_buffer_class` and `replay_buffer_kwargs` arguments to off-policy algorithms
- Added `kl_divergence` helper for `Distribution` classes (@09tangriro)
- Added support for vector environments with `num_envs > 1` (@benblack769)
- Added `wrapper_kwargs` argument to `make_vec_env` (@amy12xx)
Bug Fixes
- Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
- Fixed loading of `ent_coef` for `SAC` and `TQC`; it was not optimized anymore (thanks @Atlis)
- Fixed saving of `A2C` and `PPO` policy when using gSDE (thanks @liusida)
- Fixed a bug where no output would be shown even if `verbose>=1` after passing `verbose=0` once
- Fixed observation buffers dtype in `DictReplayBuffer` (@c-rizz)
- Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (@skandermoalla)
Others
- Added `flake8-bugbear` to tests dependencies to find likely bugs
- Updated `env_checker` to reflect support of dict observation spaces
- Added Code of Conduct
- Added tests for GAE and lambda return computation
- Updated distribution entropy test (thanks @09tangriro)
- Added sanity check `batch_size > 1` in PPO to avoid NaN in advantage normalization
Documentation:
- Added gym pybullet drones project (@JacopoPan)
- Added link to SuperSuit in projects (@justinkterry)
- Fixed DQN example (thanks @ltbd78)
- Clarified channel-first/channel-last recommendation
- Update sphinx environment installation instructions (@tom-doerr)
- Clarified pip installation in Zsh (@tom-doerr)
- Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
- Added example for using `ProcgenEnv`
- Added note about advanced custom policy example for off-policy algorithms
- Fixed DQN unicode checkmarks
- Updated migration guide (@juancroldan)
- Pinned `docutils==0.16` to avoid issue with rtd theme
- Clarified callback `save_freq` definition
- Added doc on how to pass a custom logger
- Removed recurrent policies from `A2C` docs (@bstee615)
Stable-Baselines3 v1.0
First Major Version
Blog post: https://araffin.github.io/post/sb3/
100+ pre-trained models in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo
Breaking Changes:
- Removed `stable_baselines3.common.cmd_util` (already deprecated), please use `env_util` instead
Warning
A refactoring of the `HER` algorithm is planned together with support for dictionary observations (see PR #243 and #351).
This will be a backward incompatible change (models trained with a previous version of `HER` won't work with the new version).
New Features:
- Added support for `custom_objects` when loading models (see the sketch after this list)
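A minimal sketch of loading with `custom_objects` to override objects that cannot be deserialized (the model path and overridden values below are illustrative):

```python
from stable_baselines3 import PPO

# Replace objects stored in the saved model that may fail to load
# (e.g. schedules pickled with a different Python/cloudpickle version).
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
model = PPO.load("ppo_cartpole.zip", custom_objects=custom_objects)
```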
Bug Fixes:
- Fixed a bug with `DQN` predict method when using `deterministic=False` with image space
Documentation:
- Fixed examples
- Added new project using SB3: rl_reach (@PierreExeter)
- Added note about slow-down when switching to PyTorch
- Add a note on continual learning and resetting environment
- Updated RL-Zoo to reflect the fact that it is more than a collection of trained agents
- Added images to illustrate the training loop and custom policies (created with https://excalidraw.com/)
- Updated the custom policy section