When saving a checkpoint in _save_checkpoint, I get metric_value = metrics[metric_to_check] KeyError: 'eval_loss'. How can this be fixed? #7816


Closed
1 task done
Kb519 opened this issue Apr 22, 2025 · 3 comments · Fixed by #7912
Labels
solved This problem has been already solved

Comments


Kb519 commented Apr 22, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • PyTorch version: 2.1.0 (GPU)
  • Transformers version: 4.41.2

Reproduction

My error is: in _save_checkpoint, metric_value = metrics[metric_to_check] KeyError: 'eval_loss'.
My yaml file is:

### model
model_name_or_path: /llama/LLaMA-Factory-main/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

### method

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 4
lora_alpha: 8
lora_dropout: 0.05
lora_target: "q_proj,v_proj"
gradient_checkpointing: true

### dataset

dataset: identityz
template: deepseek
cutoff_len: 5120
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 8
dataloader_num_workers: 2

### output

output_dir: /llama/LLaMA-Factory-main/saves
logging_steps: 500
logging_strategy: "steps"
save_steps: 500
plot_loss: true
save_only_model: false
load_best_model_at_end: true
ddp_find_unused_parameters: false

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 3.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000
resume_from_checkpoint: /llama/LLaMA-Factory-main/saves/checkpoint-500
export_device: cpu
dataloader_pin_memory: true
auto_find_batch_size: true

### eval

do_eval: true
eval_dataset: identityzz
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

I am using instruction-format data. When a checkpoint is saved, I get: in _save_checkpoint, metric_value = metrics[metric_to_check] KeyError: 'eval_loss'. How can this be resolved?
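A likely trigger, assuming stock transformers behaviour: with load_best_model_at_end: true and no metric_for_best_model set, the Trainer defaults the metric to "loss" and looks up eval_loss in the evaluation metrics inside _save_checkpoint, so the KeyError means that key was missing at that save step (for example because evaluation did not run there, or its loss was logged under a different key). A minimal sketch of a possible workaround against the config above; metric_for_best_model is a standard Seq2SeqTrainingArguments field, not something taken from this thread:

### output
output_dir: /llama/LLaMA-Factory-main/saves
save_steps: 500
# simplest workaround: drop the next line so _save_checkpoint never looks up eval_loss
# load_best_model_at_end: true
# alternatively keep it, but point the Trainer at the metric key that is actually logged,
# e.g. (hypothetical key name, check the logged eval metrics first):
# metric_for_best_model: eval_identityzz_loss

### eval
do_eval: true
eval_dataset: identityzz
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500    # keep aligned with save_steps so every save step has fresh eval metrics

If the error persists even with evaluation running at every save step, removing load_best_model_at_end: true avoids the metric lookup altogether and unblocks training.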

Others

No response

Kb519 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Apr 22, 2025

Kb519 commented Apr 23, 2025

Does anyone know how to solve this?

@Yu-Yuqing

Same here. Has this been resolved?

hiyouga (Owner) commented Apr 29, 2025

fixed in #7912

hiyouga closed this as completed on Apr 29, 2025
hiyouga added the solved (This problem has been already solved) label and removed the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Apr 29, 2025