Skip to content

Runtime error, invalid examples #7777

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task done
berkeleyljj opened this issue Apr 19, 2025 · 2 comments
Open
1 task done

Runtime error, invalid examples #7777

berkeleyljj opened this issue Apr 19, 2025 · 2 comments
Labels
bug Something isn't working pending This problem is yet to be addressed

Comments

@berkeleyljj
Copy link

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I am trying to use "sharegpt" format to do LoRA ft with my custom dataset.

Bug:
Traceback (most recent call last):
File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 267, in _get_preprocessed_dataset
dataset_processor.print_data_example(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/llamaf/llamaf/bin/llamafactory-cli", line 10, in
sys.exit(main())
File "/home/ubuntu/llamaf/src/llamafactory/cli.py", line 117, in main
run_exp()
File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 107, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 69, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/home/ubuntu/llamaf/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 310, in get_dataset
dataset = _get_preprocessed_dataset(
File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 272, in _get_preprocessed_dataset
raise RuntimeError("Cannot find valid samples, check data/README.md for the data format.")
RuntimeError: Cannot find valid samples, check data/README.md for the data format.

Important information (some content omitted for privacy):

  1. It's my custom data set following this example format:
    [
    {
    "conversations": [
    {
    "from": "human",
    "value": ""You are an expert software engineering manager working on the Expensify repository. You have tasked your team with addressing the following issue:\n\n[HOLD for payment 2023-04-18] ..."
    },
    {
    "from": "function_call",
    "value": "import os\n\n# Write the decision to the required file\ndecision = {\n "selected_proposal_id": 0\n}\nwith open('/app/expensify/manager_decisions.json', 'w') as f:\n import json\n json.dump(decision, f)\nprint("Decision written successfully.")"
    },
    {
    "from": "observation",
    "value": "Decision written successfully.\n"
    },
    {
    "from": "gpt",
    "value": "model response ..."
    }
    ]

  2. I read previously similar issues. I've used regex and cleanup script to remove any character that is not recognizable by my tokenizer.

  3. I've also checked data/README.md thoroughly and configured my dataset_info.json according to sharegpt requirements like this:
    "traces": {
    "file_name": "traces.json",
    "formatting": "sharegpt",
    "split": "train",
    "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
    },

Reproduction

llamafactory-cli train examples/train_lora/{your_file_name}.yaml

Others

No response

@berkeleyljj berkeleyljj added bug Something isn't working pending This problem is yet to be addressed labels Apr 19, 2025
@rzgarespo
Copy link

you are using (--template argument). If the template doesn't know how to format "function_call" or "observation", conversations containing them might be filtered out or cause errors during processing. This is a very likely cause.
What value are you passing for the --template argument in your command line?
Temporarily use a basic template known to work well with human/gpt roles to see if any data gets processed. If it does, the issue is template compatibility with your custom roles.

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hello, World!"
      },
      {
        "from": "gpt",
        "value": "42"
      }
    ]
  }
]

Update dataset_info.json to point to this file (or rename it temporarily) and test.
and just to be sure run jq . traces.json > /dev/null

@berkeleyljj
Copy link
Author

berkeleyljj commented Apr 20, 2025

Yes that is the cause thanks!
However template is not a command line argument; instead, it's set in the yaml file.
In my case I am using "template: qwen" since that's the model I want to fine-tune.
I will try to modify the template script so it covers the two additional fields of tool and observation in sharegpt format. Meanwhile, is there any suggested fix? Please let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pending This problem is yet to be addressed
Projects
None yet
Development

No branches or pull requests

2 participants