Hi!
Are there any examples of realistic RL-tuning of a 32B model with a large model length (~32k, with ignore_eos=True
to ensure it can generate near-maximum-length responses)?
I found

model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-32B-Instruct}

but it suspiciously uses tp=1 and sp=1. Does 32B training fit on a single node in this experiment?
Or what is the actual number of response tokens in this experiment?
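
For context on why I'm unsure it fits, here is my rough back-of-the-envelope estimate (just a sketch: it assumes full-parameter training on an 8x80GB node with bf16 weights and fp32 Adam states fully sharded FSDP/ZeRO-3 style, and ignores activations, the rollout engine's KV cache, and fragmentation):

```python
# Rough per-GPU memory estimate for full-parameter training of a 32B model
# on one assumed 8x80GB node, with weights/grads/optimizer states fully sharded.

PARAMS = 32e9      # 32B parameters
GPUS = 8           # one node (assumed)
GPU_MEM_GB = 80    # assumed per-GPU memory

bytes_per_param = (
    2      # bf16 weights
    + 2    # bf16 gradients
    + 4    # fp32 master weights
    + 4    # Adam first moment (fp32)
    + 4    # Adam second moment (fp32)
)  # = 16 bytes/param for standard mixed-precision Adam

total_state_gb = PARAMS * bytes_per_param / 1e9
per_gpu_gb = total_state_gb / GPUS

print(f"weights + grads + optimizer states: ~{total_state_gb:.0f} GB total")
print(f"per GPU when fully sharded:         ~{per_gpu_gb:.0f} GB of {GPU_MEM_GB} GB")
# -> ~512 GB total, ~64 GB per GPU: it "fits" on paper, but leaves little
#    headroom for the activations of 32k-token sequences, which is why I'm
#    asking about sp even when tp=1 is feasible.
```
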
What would be your advice on setting tp and sp for 32B models (both for single-node and for multi-node)? Should we try sp=2, tp=2, or tp=2 x sp=2 configs?
Or should we just try sp=8 first?
Thanks!
Sorry, I got the label wrong. This should not be marked as a bug.