Reminder
I have read the above rules and searched the existing issues.
System Info
I am trying to use the "sharegpt" format to do LoRA fine-tuning with my custom dataset.
Bug:
Traceback (most recent call last):
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 267, in _get_preprocessed_dataset
    dataset_processor.print_data_example(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/llamaf/llamaf/bin/llamafactory-cli", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/llamaf/src/llamafactory/cli.py", line 117, in main
    run_exp()
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 107, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 69, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/ubuntu/llamaf/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 310, in get_dataset
    dataset = _get_preprocessed_dataset(
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 272, in _get_preprocessed_dataset
    raise RuntimeError("Cannot find valid samples, check data/README.md for the data format.")
RuntimeError: Cannot find valid samples, check data/README.md for the data format.
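To rule out malformed JSON on my side, a quick check like the following loads the file directly with the datasets library (a minimal sketch; the data/traces.json path is just an assumption about my local layout):

from datasets import load_dataset

# Load the raw file directly to confirm it is well-formed JSON.
# The path is an assumption; adjust to wherever traces.json actually lives.
raw = load_dataset("json", data_files="data/traces.json", split="train")
print(raw)                                  # row count and column names
print(raw[0]["conversations"][0]["from"])   # role of the first turn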
Important information (some content omitted for privacy):
It's my custom dataset, following this example format:
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "You are an expert software engineering manager working on the Expensify repository. You have tasked your team with addressing the following issue:\n\n[HOLD for payment 2023-04-18] ..."
      },
      {
        "from": "function_call",
        "value": "import os\n\n# Write the decision to the required file\ndecision = {\n \"selected_proposal_id\": 0\n}\nwith open('/app/expensify/manager_decisions.json', 'w') as f:\n import json\n json.dump(decision, f)\nprint(\"Decision written successfully.\")"
      },
      {
        "from": "observation",
        "value": "Decision written successfully.\n"
      },
      {
        "from": "gpt",
        "value": "model response ..."
      }
    ]
  }
]
I previously read similar issues. I've used a regex-based cleanup script to remove any character that my tokenizer doesn't recognize.
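Conceptually, the cleanup pass was along these lines (a simplified sketch, not the exact script; the model name is only a placeholder):

from transformers import AutoTokenizer

# Simplified sketch of the cleanup idea: drop characters the tokenizer
# cannot encode, i.e. ones that map to the unknown token.
# The model name is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def clean(text: str) -> str:
    unk = tokenizer.unk_token_id
    if unk is None:  # byte-level tokenizers can encode everything
        return text
    return "".join(
        ch for ch in text
        if unk not in tokenizer.encode(ch, add_special_tokens=False)
    )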
I've also checked data/README.md thoroughly and configured my dataset_info.json according to the sharegpt requirements, like this:
"traces": {
"file_name": "traces.json",
"formatting": "sharegpt",
"split": "train",
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
},
Check which chat template you are using (--template argument). If the template doesn't know how to format the "function_call" or "observation" roles, conversations containing them might be filtered out or cause errors during processing. This is a very likely cause.
What value are you passing for the --template argument in your command line?
Temporarily use a basic template known to work well with human/gpt roles to see if any data gets processed. If it does, the issue is template compatibility with your custom roles.
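Also worth double-checking: for sharegpt data with tool turns, dataset_info.json can map the role names explicitly via a tags block. From memory of data/README.md it looks roughly like this, but please verify against the README rather than taking it verbatim:

"traces": {
  "file_name": "traces.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  },
  "tags": {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
    "observation_tag": "observation",
    "function_tag": "function_call",
    "system_tag": "system"
  }
}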
Yes, that is the cause, thanks!
However, template is not a command-line argument here; instead, it's set in the YAML config file.
In my case I am using "template: qwen", since that's the model I want to fine-tune.
I will try to modify the template script so it covers the two additional sharegpt roles (function_call and observation). Meanwhile, is there any suggested fix? Please let me know!
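As a stopgap, I'm also considering flattening the tool turns into plain human/gpt turns before training, so that a template without tool support still sees valid alternating roles. Just a rough sketch of that idea:

import json

# Map the tool-related roles onto plain roles: the function_call turn was
# produced by the model (-> gpt) and the observation is fed back to it (-> human).
ROLE_MAP = {"function_call": "gpt", "observation": "human"}

with open("traces.json") as f:
    data = json.load(f)

for sample in data:
    for turn in sample["conversations"]:
        turn["from"] = ROLE_MAP.get(turn["from"], turn["from"])

with open("traces_flat.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)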