When saving a checkpoint in _save_checkpoint, I get metric_value = metrics[metric_to_check] KeyError: 'eval_loss'. How can this be fixed? #7816


Closed
1 task done
Kb519 opened this issue Apr 22, 2025 · 3 comments · Fixed by #7912
Labels
solved This problem has been already solved

Comments


Kb519 commented Apr 22, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • PyTorch version: 2.1.0 (GPU)
  • Transformers version: 4.41.2

Reproduction

My error is: in _save_checkpoint, metric_value = metrics[metric_to_check] KeyError: 'eval_loss'.
My yaml file is:

### model
model_name_or_path: /llama/LLaMA-Factory-main/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

### method

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 4
lora_alpha: 8
lora_dropout: 0.05
lora_target: "q_proj,v_proj"
gradient_checkpointing: true

### dataset

dataset: identityz
template: deepseek
cutoff_len: 5120
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 8
dataloader_num_workers: 2

### output

output_dir: /llama/LLaMA-Factory-main/saves
logging_steps: 500
logging_strategy: "steps"
save_steps: 500
plot_loss: true
save_only_model: false
load_best_model_at_end: true
ddp_find_unused_parameters: false

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 3.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000
resume_from_checkpoint: /llama/LLaMA-Factory-main/saves/checkpoint-500
export_device: cpu
dataloader_pin_memory: true
auto_find_batch_size: true

### eval

do_eval: true
eval_dataset: identityzz
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

I am using instruction-format data. When a checkpoint is saved, I get: in _save_checkpoint, metric_value = metrics[metric_to_check] KeyError: 'eval_loss'. How can this be resolved?
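A likely trigger, assuming stock transformers behaviour: with load_best_model_at_end: true and no metric_for_best_model set, the Trainer defaults the metric to "loss" and looks up eval_loss in the evaluation metrics inside _save_checkpoint, so the KeyError means that key was missing at that save step (for example because evaluation did not run there, or its loss was logged under a different key). A minimal sketch of a possible workaround against the config above; metric_for_best_model is a standard Seq2SeqTrainingArguments field, not something taken from this thread:

### output
output_dir: /llama/LLaMA-Factory-main/saves
save_steps: 500
# simplest workaround: drop the next line so _save_checkpoint never looks up eval_loss
# load_best_model_at_end: true
# alternatively keep it, but point the Trainer at the metric key that is actually logged,
# e.g. (hypothetical key name, check the logged eval metrics first):
# metric_for_best_model: eval_identityzz_loss

### eval
do_eval: true
eval_dataset: identityzz
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500    # keep aligned with save_steps so every save step has fresh eval metrics

If the error persists even with evaluation running at every save step, removing load_best_model_at_end: true avoids the metric lookup altogether and unblocks training.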

Others

No response

Kb519 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Apr 22, 2025

Kb519 commented Apr 23, 2025

Does anyone know how to solve this?

@Yu-Yuqing

Same here. Has this been resolved?

hiyouga (Owner) commented Apr 29, 2025

fixed in #7912

hiyouga closed this as completed on Apr 29, 2025
hiyouga added the solved (This problem has been already solved) label and removed the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Apr 29, 2025