What are the recommended strategies for configuring UniZero in long-horizon planning scenarios? #377

ryanon4 · 2025-06-25T09:14:27Z


I'm interested in configuring UniZero for longer-horizon planning tasks.

In particular, I'd like to know whether there are any rules of thumb, practical guidelines, or prior experience for adjusting hyperparameters such as num_unroll_steps for extended planning horizons.

For example, in the default CartPole setup:

num_unroll_steps is set to 5
context_length is 4
However, max_blocks is set to 10. I'm curious why this is larger than the planning horizon.

When increasing num_unroll_steps (e.g., for deeper planning or longer credit assignment), should we also adjust other parameters alongside it, such as td_steps, which affects how far into the future value targets are bootstrapped?

From my own experiments:

Increasing num_simulations tends to help when planning with longer horizons.
Setting td_steps = num_unroll_steps has made sense so far, though I wonder if there are other configurations where a different relationship between the two is preferred.

Open Questions:

Should td_steps always match num_unroll_steps, or are there cases where it makes sense to keep them decoupled?
Should context_length scale with num_unroll_steps?
Any known interactions between longer horizons and other factors like value reanalysis, replay buffer size, or prioritization?

Admittedly, CartPole isn't an ideal environment for testing the real benefits of long-horizon credit assignment, since episodes are short and the reward is immediate, but it serves as a clean and simple testbed for experimenting with config changes before moving on to more complex environments.

I would love to hear from anyone who has experimented with:

Long-horizon planning in sparse or delayed-reward environments
Scaling num_unroll_steps in UniZero
Any issues around training stability when increasing these values

This was written by a human but was modified using an LLM to improve English readability.

Answered by puyuan1996

Jul 2, 2025


puyuan1996 · 2025-07-02T16:18:48Z

Maintainer

Thank you very much for your insightful question.

Variable Definitions

max_blocks
- Refers to the range of timesteps considered during all phases of training and testing.
- Note: Each timestep comprises two tokens (one for the state and one for the action). Therefore, max_tokens = max_blocks × 2.
num_unroll_steps
- Represents the length of the sequence unrolled during training of the value-equivalent model.
- It should be less than or equal to max_blocks.
- If num_unroll_steps is less than max_blocks, the part of the horizon beyond num_unroll_steps is never trained, so the effective horizon is the same as if the two were equal. In most cases, we simply set num_unroll_steps = max_blocks.
context_length
- Indicates the length of the key-value (kv) cache retained during the testing phase after training.
- Since predicting over longer horizons tends to accumulate errors, context_length is typically set to be less than or equal to num_unroll_steps.
td_steps
- Denotes the horizon used for bootstrapping in the value function (or Q function) estimation.
- This variable is relatively independent of the others.
- A larger td_steps value brings the estimate closer to a Monte Carlo return, which lowers bias but raises variance, so choosing it is a trade-off between the two.
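
The relationships in these definitions can be written down as a small sanity check. The following is only a sketch under stated assumptions: `HorizonConfig` and `check_horizon_config` are hypothetical names and not part of the LightZero API, the fields simply mirror the config keys discussed above, and the td_steps value in the example is illustrative rather than the actual CartPole default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HorizonConfig:
    """Hypothetical container; fields mirror the config keys discussed above."""
    max_blocks: int
    num_unroll_steps: int
    context_length: int
    td_steps: int

    @property
    def max_tokens(self) -> int:
        # Each timestep contributes one state token and one action token.
        return 2 * self.max_blocks


def check_horizon_config(cfg: HorizonConfig) -> List[str]:
    """Return a warning for each rule of thumb from the definitions that is violated."""
    warnings = []
    if cfg.num_unroll_steps > cfg.max_blocks:
        warnings.append("num_unroll_steps should be <= max_blocks")
    if cfg.context_length > cfg.num_unroll_steps:
        warnings.append("context_length is typically <= num_unroll_steps")
    # td_steps is intentionally not compared with the other fields,
    # since it is described as relatively independent of them.
    return warnings


# Values quoted in the question for CartPole; td_steps=5 is an illustrative assumption.
cartpole = HorizonConfig(max_blocks=10, num_unroll_steps=5, context_length=4, td_steps=5)
print(cartpole.max_tokens)
print(check_horizon_config(cartpole))
```

With the CartPole values quoted in the question, no warning is raised: num_unroll_steps = 5 is allowed to be smaller than max_blocks = 10, it just leaves part of the block range untrained.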

Responses to Your Questions (For Discussion Only)

Independence of td_steps:
- td_steps should remain configured independently of num_unroll_steps.
Relationship Between context_length and num_unroll_steps:
- Generally, context_length should be expanded proportionally with num_unroll_steps, because a longer training prediction horizon implies that the accumulated prediction error also extends further.
Further Research on the Third Question:
- The third question requires additional in-depth research; you might refer to papers such as MuZero Unplugged for further insights.
Experiment Observations:
- In our previous experiments with UniZero on Pong, we found that training remains stable and converges even when num_unroll_steps is increased to around 40.
Sparse-reward Environments (visual-match, MiniGrid):
- In environments with sparse rewards and potential POMDP characteristics, increasing num_unroll_steps (i.e., extending the agent's “memory length”) is necessary for achieving optimal policies.
- However, this adjustment alone may not be sufficient, as efficient exploration mechanisms (e.g., curiosity-driven rewards) are also required to enhance sample efficiency.
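
To make the first response more concrete, here is a minimal sketch of a generic n-step bootstrapped value target. It is not LightZero's implementation: the function name `n_step_target`, its signature, and the default discount are assumptions chosen for illustration. The point is that td_steps only controls how many real rewards are summed before bootstrapping from a value estimate, so it can be tuned without reference to num_unroll_steps.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    values: Sequence[float],
    t: int,
    td_steps: int,
    discount: float = 0.997,  # illustrative default, not taken from the thread
) -> float:
    """n-step bootstrapped value target for timestep t.

    rewards[i] is the reward received after acting at step i, and values[i]
    is the current value estimate for the state at step i. Past the end of
    the episode nothing is added, i.e. the terminal value is treated as 0.
    """
    # Sum up to td_steps discounted real rewards, truncated at episode end.
    horizon = min(td_steps, len(rewards) - t)
    target = sum(discount ** k * rewards[t + k] for k in range(horizon))
    # Bootstrap from the value estimate td_steps ahead, if that state exists.
    bootstrap_index = t + td_steps
    if bootstrap_index < len(values):
        target += discount ** td_steps * values[bootstrap_index]
    return target


rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]

# Small td_steps: two real rewards plus a bootstrapped estimate (more bias).
short = n_step_target(rewards, values, t=0, td_steps=2, discount=1.0)
# td_steps beyond the episode length: a pure Monte Carlo return (more variance).
long = n_step_target(rewards, values, t=0, td_steps=10, discount=1.0)
print(short, long)
```

In this toy episode the short-horizon target leans on the value estimate, while the long-horizon target depends only on observed rewards, which is the bias and variance trade-off mentioned in the td_steps definition.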

I hope these clarifications and responses help address your questions.

0 replies
