I've read your paper and have been using MuJoCo Playground to test my algorithms. Thank you for your great work. From the report, I see that the Brax framework was used for training and evaluation, with results reported across environments. I have two questions:

1. Were the Brax hyperparameters tuned separately for each environment? I noticed variations in the hyperparameters across environments, yet for the dm_control environments only two sets were shared, one for PPO and one for SAC (see the sketch after this list).
2. Regarding the PPO agents, do the maximum achievable returns per episode vary significantly across environments? In some cases returns reach around 900–1000, but in others Brax seems to struggle: for example, HopperHop was not solved, FingerSpin reaches ~600, and PendulumSwingup only ~50. Is there a standard expected return for each environment (e.g. around 1000 for each dm_control environment)? If so, could these differences be due to insufficient hyperparameter tuning, or are they expected?
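For question 1, a minimal sketch of reading out the shared configs, assuming the `brax_ppo_config` helpers shown in the Playground tutorials (the `mujoco_playground.config.dm_control_suite_params` module path and function name are assumptions and may differ in your installed version):

```python
# Minimal sketch: print the shared Brax PPO hyperparameters for a few
# dm_control environments to see which fields, if any, vary per environment.
# The module and function names are assumed from the Playground tutorials
# and may differ in your installed version.
from mujoco_playground.config import dm_control_suite_params

for env_name in ("HopperHop", "FingerSpin", "PendulumSwingup"):
    ppo_params = dm_control_suite_params.brax_ppo_config(env_name)
    print(env_name, ppo_params)
```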
Replies: 1 comment

Hello @m-erdemm, we did tune the RL hyperparameters per environment. The DM Control environments have standardized rewards (a maximum of 1000 per episode, from what I recall), but the other environments do not standardize their rewards at all. Having PPO struggle on some of these DM Control environments is expected; you can take a look at the original paper.
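To make the standardized-rewards point concrete: DM Control Suite per-step rewards lie in [0, 1] and episodes run for 1000 control steps by default, so the best achievable undiscounted return is roughly 1000 and a raw return can be read as a fraction of that maximum. A minimal sketch (the helper name is hypothetical, and the 1000-step default is an assumption about the standard setup):

```python
# Hypothetical helper: map a DM Control Suite episode return to a rough
# [0, 1] score. This only makes sense because per-step rewards are bounded
# in [0, 1]; it would be meaningless for environments with unstandardized rewards.
DM_CONTROL_EPISODE_LENGTH = 1000  # assumed default number of control steps

def dm_control_normalized_score(episode_return: float,
                                episode_length: int = DM_CONTROL_EPISODE_LENGTH) -> float:
    return episode_return / episode_length

# The returns quoted in the question, read as fractions of the maximum.
for env, ret in [("FingerSpin", 600.0), ("PendulumSwingup", 50.0)]:
    print(f"{env}: {dm_control_normalized_score(ret):.2f} of the maximum return")
```

Under that reading, FingerSpin at ~600 is a 0.6 score and PendulumSwingup at ~50 is 0.05, which lines up with the note above that PPO is simply expected to struggle on some of these tasks.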