I've read your paper and have been using MuJoCo Playground to test my algorithms. Thank you for your great work. From the report, I see that the Brax framework was used for training and evaluation, with results reported across environments. I have two questions:

1. Were the Brax hyperparameters tuned separately for each environment? I noticed variations in the hyperparameters across environments, yet for the dm_control environments only two sets were shared, one for PPO and one for SAC (see the sketch after this list).
2. Regarding the PPO agents, do the maximum achievable returns per episode vary significantly across environments? In some cases returns reach around 900–1000, but in others Brax seems to struggle: for example, HopperHop was not solved, FingerSpin reaches ~600, and PendulumSwingup only ~50. Is there a standard expected return for each environment (e.g. around 1000 for each dm_control environment)? If so, could these differences be due to insufficient hyperparameter tuning, or are they expected?
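For question 1, a minimal sketch of reading out the shared configs, assuming the `brax_ppo_config` helpers shown in the Playground tutorials (the `mujoco_playground.config.dm_control_suite_params` module path and function name are assumptions and may differ in your installed version):

```python
# Minimal sketch: print the shared Brax PPO hyperparameters for a few
# dm_control environments to see which fields, if any, vary per environment.
# The module and function names are assumed from the Playground tutorials
# and may differ in your installed version.
from mujoco_playground.config import dm_control_suite_params

for env_name in ("HopperHop", "FingerSpin", "PendulumSwingup"):
    ppo_params = dm_control_suite_params.brax_ppo_config(env_name)
    print(env_name, ppo_params)
```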
Replies: 1 comment

Hello @m-erdemm, we did tune the RL hyperparameters per environment. The DM Control environments have standardized rewards (a maximum of 1000 per episode, from what I recall), but the other environments do not standardize their rewards at all. Having PPO struggle on some of these DM Control environments is expected; you can take a look at the original paper.
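To make the standardized-rewards point concrete: DM Control Suite per-step rewards lie in [0, 1] and episodes run for 1000 control steps by default, so the best achievable undiscounted return is roughly 1000 and a raw return can be read as a fraction of that maximum. A minimal sketch (the helper name is hypothetical, and the 1000-step default is an assumption about the standard setup):

```python
# Hypothetical helper: map a DM Control Suite episode return to a rough
# [0, 1] score. This only makes sense because per-step rewards are bounded
# in [0, 1]; it would be meaningless for environments with unstandardized rewards.
DM_CONTROL_EPISODE_LENGTH = 1000  # assumed default number of control steps

def dm_control_normalized_score(episode_return: float,
                                episode_length: int = DM_CONTROL_EPISODE_LENGTH) -> float:
    return episode_return / episode_length

# The returns quoted in the question, read as fractions of the maximum.
for env, ret in [("FingerSpin", 600.0), ("PendulumSwingup", 50.0)]:
    print(f"{env}: {dm_control_normalized_score(ret):.2f} of the maximum return")
```

Under that reading, FingerSpin at ~600 is a 0.6 score and PendulumSwingup at ~50 is 0.05, which lines up with the note above that PPO is simply expected to struggle on some of these tasks.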