Reminder
I have read the above rules and searched the existing issues.
System Info
I am trying to use the "sharegpt" format to do LoRA fine-tuning with my custom dataset.
Bug:
Traceback (most recent call last):
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 267, in _get_preprocessed_dataset
    dataset_processor.print_data_example(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/llamaf/llamaf/bin/llamafactory-cli", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/llamaf/src/llamafactory/cli.py", line 117, in main
    run_exp()
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 107, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 69, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/ubuntu/llamaf/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 310, in get_dataset
    dataset = _get_preprocessed_dataset(
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 272, in _get_preprocessed_dataset
    raise RuntimeError("Cannot find valid samples, check data/README.md for the data format.")
RuntimeError: Cannot find valid samples, check data/README.md for the data format.
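To rule out malformed JSON on my side, a quick check like the following loads the file directly with the datasets library (a minimal sketch; the data/traces.json path is just an assumption about my local layout):

from datasets import load_dataset

# Load the raw file directly to confirm it is well-formed JSON.
# The path is an assumption; adjust to wherever traces.json actually lives.
raw = load_dataset("json", data_files="data/traces.json", split="train")
print(raw)                                  # row count and column names
print(raw[0]["conversations"][0]["from"])   # role of the first turn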
Important information (some content omitted for privacy):
It's my custom dataset, following this example format:
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "You are an expert software engineering manager working on the Expensify repository. You have tasked your team with addressing the following issue:\n\n[HOLD for payment 2023-04-18] ..."
      },
      {
        "from": "function_call",
        "value": "import os\n\n# Write the decision to the required file\ndecision = {\n \"selected_proposal_id\": 0\n}\nwith open('/app/expensify/manager_decisions.json', 'w') as f:\n import json\n json.dump(decision, f)\nprint(\"Decision written successfully.\")"
      },
      {
        "from": "observation",
        "value": "Decision written successfully.\n"
      },
      {
        "from": "gpt",
        "value": "model response ..."
      }
    ]
  }
]
I previously read similar issues. I've used a regex-based cleanup script to remove any character that my tokenizer doesn't recognize.
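Conceptually, the cleanup pass was along these lines (a simplified sketch, not the exact script; the model name is only a placeholder):

from transformers import AutoTokenizer

# Simplified sketch of the cleanup idea: drop characters the tokenizer
# cannot encode, i.e. ones that map to the unknown token.
# The model name is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def clean(text: str) -> str:
    unk = tokenizer.unk_token_id
    if unk is None:  # byte-level tokenizers can encode everything
        return text
    return "".join(
        ch for ch in text
        if unk not in tokenizer.encode(ch, add_special_tokens=False)
    )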
I've also checked data/README.md thoroughly and configured my dataset_info.json according to the sharegpt requirements, like this:
"traces": {
"file_name": "traces.json",
"formatting": "sharegpt",
"split": "train",
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
},
Check which chat template you are using (--template argument). If the template doesn't know how to format the "function_call" or "observation" roles, conversations containing them might be filtered out or cause errors during processing. This is a very likely cause.
What value are you passing for the --template argument in your command line?
Temporarily use a basic template known to work well with human/gpt roles to see if any data gets processed. If it does, the issue is template compatibility with your custom roles.
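Also worth double-checking: for sharegpt data with tool turns, dataset_info.json can map the role names explicitly via a tags block. From memory of data/README.md it looks roughly like this, but please verify against the README rather than taking it verbatim:

"traces": {
  "file_name": "traces.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  },
  "tags": {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
    "observation_tag": "observation",
    "function_tag": "function_call",
    "system_tag": "system"
  }
}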
Yes, that is the cause, thanks!
However, template is not a command-line argument here; instead, it's set in the YAML config file.
In my case I am using "template: qwen", since that's the model I want to fine-tune.
I will try to modify the template script so it covers the two additional sharegpt roles (function_call and observation). Meanwhile, is there any suggested fix? Please let me know!
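As a stopgap, I'm also considering flattening the tool turns into plain human/gpt turns before training, so that a template without tool support still sees valid alternating roles. Just a rough sketch of that idea:

import json

# Map the tool-related roles onto plain roles: the function_call turn was
# produced by the model (-> gpt) and the observation is fed back to it (-> human).
ROLE_MAP = {"function_call": "gpt", "observation": "human"}

with open("traces.json") as f:
    data = json.load(f)

for sample in data:
    for turn in sample["conversations"]:
        turn["from"] = ROLE_MAP.get(turn["from"], turn["from"])

with open("traces_flat.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)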