I have a question: based on how the reward is computed in the code, it appears to be unrelated to the model updates. While training OPT-125M, the reward oscillates around a fixed value and shows no clear trend of converging in either direction.
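
To make the concern concrete, here is a minimal, self-contained sketch (all names are stubs I made up, not from the actual codebase) of what I would expect to distinguish the two cases: if the reward is scored on responses freshly sampled from the current policy, its mean should drift as the policy updates; if it is scored on fixed dataset responses, it will oscillate around a constant by construction.

```python
import random

def reward_model(response: str) -> float:
    # Stub reward model: longer responses score higher (placeholder only).
    return len(response) / 10.0 + random.gauss(0.0, 0.1)

def sample_from_policy(prompt: str, step: int) -> str:
    # Stub policy whose outputs change as training progresses (placeholder).
    return prompt + " " + "token " * (1 + step // 10)

prompts = ["prompt"] * 4
fixed_responses = ["a fixed dataset reply"] * 4

for step in range(50):
    # Reward on fresh rollouts: should reflect policy updates over time.
    rollout = [reward_model(sample_from_policy(p, step)) for p in prompts]
    # Reward on fixed data: unrelated to updates, flat apart from noise.
    fixed = [reward_model(r) for r in fixed_responses]
    if step % 10 == 0:
        print(f"step {step:3d} | rollout {sum(rollout)/len(rollout):6.2f} "
              f"| fixed {sum(fixed)/len(fixed):6.2f}")
```

In this toy run the rollout reward climbs while the fixed-data reward just oscillates, which matches the flat curve I am seeing. Is the reward in the code computed on fresh rollouts, or on fixed responses?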
