
Liger-Kernel fails to compute loss when using train+eval and SP>1 #266

@sfc-gh-sbekman

Description


Note: at the moment this issue is an FYI to record my observations and what I have tried to make it work. I don't personally need to run eval while training, so unless someone asks this may not affect anybody; and since it is not our code failing, there is little reason to pursue it for now.


Liger-Kernel fails to compute the loss when using train+eval and returns None; removing eval (leaving pure training) removes the problem. Tested with the SFT trainer and SP>1.

Last I checked, its fused_linear_cross_entropy feature computes the backward pass inside forward without checking whether the model is running under no_grad. So I tried to disable just that kernel with:

        return AutoLigerKernelForCausalLM.from_pretrained(
            self.config.name_or_path,
[...]
            fused_linear_cross_entropy=False,
        )

but it still returned None.
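To illustrate the failure pattern, here is a minimal sketch (hypothetical function, not Liger's actual code) of what a grad-mode-aware fused linear + cross-entropy would look like: a kernel that eagerly does gradient work in forward needs to check `torch.is_grad_enabled()` and take a plain inference path during eval, otherwise the loss can come back as None:

```python
import torch
import torch.nn.functional as F

def fused_linear_ce(hidden, weight, labels):
    """Hypothetical fused linear + cross-entropy sketch (not Liger's code).

    A fused kernel that computes gradients during forward must skip that
    work when autograd is off (eval / no_grad), otherwise the eval path
    can end up with no loss to return.
    """
    logits = hidden @ weight.t()
    if torch.is_grad_enabled():
        # training path: a real fused kernel would also compute the
        # gradients w.r.t. hidden and weight here
        return F.cross_entropy(logits, labels)
    # eval path: plain loss, no gradient bookkeeping
    with torch.no_grad():
        return F.cross_entropy(logits, labels)

hidden = torch.randn(4, 8)
weight = torch.randn(10, 8)
labels = torch.randint(0, 10, (4,))
with torch.no_grad():  # simulates an eval step
    loss = fused_linear_ce(hidden, weight, labels)
assert loss is not None
```

The point is only the branch on grad mode; the real fix would have to live inside the fused Triton kernel wrapper.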

So for now I have added an exception:

        if self.config.model.type == "liger":
            # letting liger do fused logits+loss calculation
            outputs = self.model(**batch, use_cache=False)
            loss = outputs.loss
            if loss is None:
                raise ValueError("Liger-Kernel failed to compute loss (returned None) - it's known to fail with eval enabled along train steps.")

I also checked that shift_labels contains valid (non -100) tokens.
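That check can be sketched as follows (shift_labels is assumed here to be a flat sequence of token ids; -100 is the conventional ignore_index of PyTorch's cross-entropy):

```python
# -100 is the ignore_index convention used by PyTorch's cross-entropy
IGNORE_INDEX = -100

def has_valid_labels(shift_labels):
    """Return True if at least one label is not the ignore index,
    i.e. at least one token actually contributes to the loss."""
    return any(int(t) != IGNORE_INDEX for t in shift_labels)

# example: only the last token contributes to the loss
assert has_valid_labels([-100, -100, 42])
# a batch of all-ignored labels would produce no loss signal
assert not has_valid_labels([-100, -100])
```

If this returned False the None loss would be expected, so this rules out a data-collation problem.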

If users want this feature, we would need to escalate it with the Liger-Kernel devs.
