
Liger-Kernel fails to compute loss when using train+eval and SP>1 #266

@sfc-gh-sbekman

Description


Note: at the moment this issue is an FYI to record my observations and what I have tried to make it work. I don't personally need to run eval while training, so unless someone asks this may not affect anybody; and since it is not our code failing, there is little reason to pursue it for now.


Liger-Kernel fails to compute the loss when using train+eval and returns None; removing eval (leaving pure training) removes the problem. Tested with the SFT trainer and SP>1.

Last I checked, its fused_linear_cross_entropy feature computes the backward pass inside forward without checking whether the model is running under no_grad. So I tried to disable just that kernel with:

        return AutoLigerKernelForCausalLM.from_pretrained(
            self.config.name_or_path,
[...]
            fused_linear_cross_entropy=False,
        )

but it still returned None.
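To illustrate the failure pattern, here is a minimal sketch (hypothetical function, not Liger's actual code) of what a grad-mode-aware fused linear + cross-entropy would look like: a kernel that eagerly does gradient work in forward needs to check `torch.is_grad_enabled()` and take a plain inference path during eval, otherwise the loss can come back as None:

```python
import torch
import torch.nn.functional as F

def fused_linear_ce(hidden, weight, labels):
    """Hypothetical fused linear + cross-entropy sketch (not Liger's code).

    A fused kernel that computes gradients during forward must skip that
    work when autograd is off (eval / no_grad), otherwise the eval path
    can end up with no loss to return.
    """
    logits = hidden @ weight.t()
    if torch.is_grad_enabled():
        # training path: a real fused kernel would also compute the
        # gradients w.r.t. hidden and weight here
        return F.cross_entropy(logits, labels)
    # eval path: plain loss, no gradient bookkeeping
    with torch.no_grad():
        return F.cross_entropy(logits, labels)

hidden = torch.randn(4, 8)
weight = torch.randn(10, 8)
labels = torch.randint(0, 10, (4,))
with torch.no_grad():  # simulates an eval step
    loss = fused_linear_ce(hidden, weight, labels)
assert loss is not None
```

The point is only the branch on grad mode; the real fix would have to live inside the fused Triton kernel wrapper.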

So for now I have added an exception:

        if self.config.model.type == "liger":
            # letting liger do fused logits+loss calculation
            outputs = self.model(**batch, use_cache=False)
            loss = outputs.loss
            if loss is None:
                raise ValueError("Liger-Kernel failed to compute loss (returned None) - it's known to fail with eval enabled along train steps.")

I also checked that shift_labels contains valid (non -100) tokens.
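That check can be sketched as follows (shift_labels is assumed here to be a flat sequence of token ids; -100 is the conventional ignore_index of PyTorch's cross-entropy):

```python
# -100 is the ignore_index convention used by PyTorch's cross-entropy
IGNORE_INDEX = -100

def has_valid_labels(shift_labels):
    """Return True if at least one label is not the ignore index,
    i.e. at least one token actually contributes to the loss."""
    return any(int(t) != IGNORE_INDEX for t in shift_labels)

# example: only the last token contributes to the loss
assert has_valid_labels([-100, -100, 42])
# a batch of all-ignored labels would produce no loss signal
assert not has_valid_labels([-100, -100])
```

If this returned False the None loss would be expected, so this rules out a data-collation problem.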

If users want this feature, we would need to escalate it with the Liger-Kernel devs.
