I have a question: based on how the reward is computed in the code, it appears to be unrelated to the model updates. While training OPT-125M, the reward oscillates around a fixed value and shows no clear trend of converging in either direction.
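
To make the concern concrete, here is a minimal, self-contained sketch (all names are stubs I made up, not from the actual codebase) of what I would expect to distinguish the two cases: if the reward is scored on responses freshly sampled from the current policy, its mean should drift as the policy updates; if it is scored on fixed dataset responses, it will oscillate around a constant by construction.

```python
import random

def reward_model(response: str) -> float:
    # Stub reward model: longer responses score higher (placeholder only).
    return len(response) / 10.0 + random.gauss(0.0, 0.1)

def sample_from_policy(prompt: str, step: int) -> str:
    # Stub policy whose outputs change as training progresses (placeholder).
    return prompt + " " + "token " * (1 + step // 10)

prompts = ["prompt"] * 4
fixed_responses = ["a fixed dataset reply"] * 4

for step in range(50):
    # Reward on fresh rollouts: should reflect policy updates over time.
    rollout = [reward_model(sample_from_policy(p, step)) for p in prompts]
    # Reward on fixed data: unrelated to updates, flat apart from noise.
    fixed = [reward_model(r) for r in fixed_responses]
    if step % 10 == 0:
        print(f"step {step:3d} | rollout {sum(rollout)/len(rollout):6.2f} "
              f"| fixed {sum(fixed)/len(fixed):6.2f}")
```

In this toy run the rollout reward climbs while the fixed-data reward just oscillates, which matches the flat curve I am seeing. Is the reward in the code computed on fresh rollouts, or on fixed responses?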
