Note: for now this issue is FYI, to compile my observations and what I have tried to make it work. I don't personally need to run eval while training, so unless someone asks this may not be an issue for anybody, and it isn't our code that's failing.
Liger-Kernel fails to compute the loss when training with eval enabled and returns None; removing eval (pure training) removes the problem. Tested with the SFT trainer and SP>1.
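Roughly, the failing pattern looks like this (an untested minimal sketch; the model name is a placeholder, and the key detail is the no_grad context that HF eval loops wrap forward calls in):

```python
import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# placeholder model; any causal LM that Liger patches should behave the same
model = AutoLigerKernelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
batch = {"input_ids": input_ids, "labels": input_ids.clone()}

# training-style forward: outputs.loss is a scalar tensor
print(model(**batch, use_cache=False).loss)

# eval-style forward: HF trainers run eval steps under no_grad,
# and this is where loss comes back as None
with torch.no_grad():
    print(model(**batch, use_cache=False).loss)
```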
Last I checked, its fused_linear_cross_entropy feature computes the backward pass inside the forward pass without checking whether the model is running under no_grad. So I tried to disable just that kernel with:
```python
return AutoLigerKernelForCausalLM.from_pretrained(
    self.config.name_or_path,
    [...]
    fused_linear_cross_entropy=False,
)
```
but it still returned None.
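Another knob that might be worth trying (I haven't verified it changes anything here) is Liger's model-specific patching API, which exposes the same per-kernel flags; a sketch for a llama model, called before from_pretrained:

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama

# patch the HF llama implementation in place, keeping everything
# except the fused linear cross entropy kernel
apply_liger_kernel_to_llama(
    rope=True,
    rms_norm=True,
    swiglu=True,
    cross_entropy=True,                # plain Liger CE on materialized logits
    fused_linear_cross_entropy=False,  # skip the fused logits+loss path
)
```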
So for now I have added an exception:
```python
if self.config.model.type == "liger":
    # letting liger do the fused logits+loss calculation
    outputs = self.model(**batch, use_cache=False)
    loss = outputs.loss
    if loss is None:
        raise ValueError(
            "Liger-Kernel failed to compute loss (returned None) - "
            "it's known to fail when eval is enabled alongside train steps."
        )
```
I also checked that shift_labels contains valid (non -100) tokens.
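For the record, the check was along these lines (a sketch; the batch key follows my setup):

```python
# sanity check that the batch isn't fully masked,
# which would also make the loss undefined
n_valid = (batch["shift_labels"] != -100).sum().item()
assert n_valid > 0, "all label tokens are masked (-100)"
```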
If users want this feature, we would need to escalate it with the Liger-Kernel devs.
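If someone does pick this up, the upstream fix presumably amounts to a grad-mode guard around the fused path; a hypothetical sketch (the fused helper here is a stand-in, not Liger's actual code):

```python
import torch
import torch.nn.functional as F

def _fused_logits_loss(hidden, weight, labels):
    # stand-in for Liger's fused kernel, which computes backward
    # inside forward and therefore needs grad mode to be enabled
    logits = hidden @ weight.T
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )

def linear_cross_entropy(hidden, weight, labels):
    if torch.is_grad_enabled():
        return _fused_logits_loss(hidden, weight, labels)
    # eval / no_grad: materialize logits and compute a plain loss
    # instead of returning None
    logits = hidden @ weight.T
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```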