Fix grpo nan #3278
unsloth/models/rl.py (outdated)

    # Selective log softmax
    selective_log_softmax_code = inspect.getsource(selective_log_softmax)

    #GRPO masking code
Suggested change:

    - #GRPO masking code
    + # GRPO masking code

unsloth/models/rl_replacements.py (outdated)

    # The new lines you want to insert
    replacement_lines = """batch_size = self.args.per_device_train_batch_size if mode == "train" else self.args.per_device_eval_batch_size
    prompt_completion_ids = left_pack_padding(prompt_completion_ids, self.processing_class.pad_token_id)"""
Maybe newline?
To make merge conflicts easier to resolve when testing with the Fast VLM inference branch, I moved everything in this PR to #3132.
Nice work
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes the GRPO NaN issues we have been having with gradient accumulation (GA) steps > 1. Tested on an H100 and on Colab with a T4. This PR was created mainly to avoid passing an SDPA attention mask, which would eat up a lot of memory. Relies on unslothai/unsloth-zoo#265.
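
For readers unfamiliar with the `left_pack_padding(prompt_completion_ids, self.processing_class.pad_token_id)` call quoted in the `rl_replacements.py` review thread, here is a minimal sketch of what such a helper might do. It assumes, from the name and arguments only, that the helper moves the non-pad tokens of each row to the left so that all padding ends up on the right. The name `left_pack_padding_sketch` and the plain-list representation are hypothetical; the real helper presumably operates on tensors and may behave differently.

```python
from typing import List


def left_pack_padding_sketch(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Hypothetical illustration: shift non-pad tokens to the left of each row.

    Assumption: pad_id never appears as a real token inside a sequence.
    The relative order of the real tokens is preserved.
    """
    packed = []
    for row in batch:
        # Keep the real tokens in their original order.
        tokens = [t for t in row if t != pad_id]
        # Refill the remainder of the row with padding on the right.
        packed.append(tokens + [pad_id] * (len(row) - len(tokens)))
    return packed


# Padding is scattered on the left and in the middle of each row.
batch = [
    [0, 0, 5, 6, 0, 7],
    [1, 2, 0, 0, 3, 0],
]
packed = left_pack_padding_sketch(batch, pad_id=0)
print(packed)
```

After packing, every row keeps its original length and has its padding in a single contiguous run on the right.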