RLHF_MT_Reward

This code uses COMET as a reward model to capture human preferences in machine translation (MT), and applies that reward in Proximal Policy Optimization (PPO) training. The project builds on the work of Miraclemarvel55/ChatGLM-RLHF and aims to further extend and improve that project's functionality.
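
As a rough illustration of how a COMET score could serve as a per-sample reward for PPO, here is a minimal sketch. It is not taken from this repository: the function names (`build_comet_samples`, `translation_rewards`, `make_comet_scorer`) are hypothetical, and the checkpoint name `Unbabel/wmt22-comet-da` is an assumption, since the model actually used here may differ. The scorer is passed in as a plain callable so the reward logic can be tried without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: any callable that maps COMET-style samples to one score each.
Scorer = Callable[[List[Dict[str, str]]], List[float]]


def build_comet_samples(
    sources: Sequence[str],
    hypotheses: Sequence[str],
    references: Sequence[str],
) -> List[Dict[str, str]]:
    """Pack parallel lists into the {"src", "mt", "ref"} dicts that COMET expects."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("sources, hypotheses and references must have equal length")
    return [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, hypotheses, references)
    ]


def translation_rewards(
    scorer: Scorer,
    sources: Sequence[str],
    hypotheses: Sequence[str],
    references: Sequence[str],
) -> List[float]:
    """Return one scalar reward per generated translation (higher is better)."""
    samples = build_comet_samples(sources, hypotheses, references)
    return [float(score) for score in scorer(samples)]


def make_comet_scorer(
    model_name: str = "Unbabel/wmt22-comet-da",  # assumed checkpoint, not confirmed by this repo
    batch_size: int = 8,
    gpus: int = 0,
) -> Scorer:
    """Wrap a COMET checkpoint as a Scorer (requires the `unbabel-comet` package)."""
    # Imported lazily so the rest of the sketch runs without COMET installed.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))

    def scorer(samples: List[Dict[str, str]]) -> List[float]:
        # predict() returns an object whose .scores holds one score per sample.
        return list(model.predict(samples, batch_size=batch_size, gpus=gpus).scores)

    return scorer
```

In a PPO loop, the list returned by `translation_rewards(make_comet_scorer(), srcs, policy_outputs, refs)` would typically be assigned as the reward for each sampled translation before the policy update.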