Hello! When I serve ArmoRM-Llama3-8B-v0.1 with OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some screenshots of how I serve the reward model. Is the output of this RM naturally around -2.0, or is the way I serve the RM wrong? (The prompt dataset is also from RLHFlow, e.g. "RLHFlow/iterative-prompt-v1-iter7-20K", and the responses are generated with "RLHFlow/LLaMA3-iterative-DPO-final". We also apply the chat template when building the prompt-response dataset.)
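For reference, here is a minimal standalone scoring sketch, independent of OpenRLHF, that I would expect to follow the RLHFlow model card usage (assuming the model is loaded with `AutoModelForSequenceClassification` and `trust_remote_code=True`, and that the custom head exposes the preference score as `output.score`; the prompt and response strings below are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

# Load the reward model with its custom scoring head (assumes trust_remote_code is required).
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path)

# Placeholder prompt/response pair; in my setup these come from
# RLHFlow/iterative-prompt-v1-iter7-20K and RLHFlow/LLaMA3-iterative-DPO-final.
messages = [
    {"role": "user", "content": "What are some synonyms for the word 'beautiful'?"},
    {"role": "assistant", "content": "Gorgeous, stunning, lovely, and elegant."},
]

# Apply the model's chat template, as we do when building the prompt-response dataset.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    # Assumption: the custom ArmoRM head returns a scalar preference score here.
    preference_score = output.score.float().cpu().item()

print(preference_score)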