Hello, have you solved this problem yet? I am running into the same issue.
2023-09-26 11:57:18,486 - INFO : Find old submission; recover submission from json file;submission.submission_hash:c40cbdcc7d969ad3ae8930771348a761b6d46f47; machine.context.remote_root:/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47; submission.work_base:iter.000000/00.train;
2023-09-26 11:57:18,536 - INFO : info:check_all_finished: False
2023-09-26 11:57:18,539 - INFO : job: 35225006374c02dda988d09b9556589231a34548 6329 terminated;fail_cout is 10; resubmitting job
2023-09-26 11:57:18,547 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13634
2023-09-26 11:57:18,794 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13634 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:18,794 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13634 terminated;fail_cout is 11; resubmitting job
2023-09-26 11:57:18,799 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13656
2023-09-26 11:57:19,046 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13656 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:19,047 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13656 terminated;fail_cout is 12; resubmitting job
Traceback (most recent call last):
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 352, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 846, in handle_unexpected_job_state
raise RuntimeError(
RuntimeError: job:35225006374c02dda988d09b9556589231a34548 13656 failed 12 times.job_detail:{'35225006374c02dda988d09b9556589231a34548': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '001', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '003', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! 
-f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '000', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 0, 'queue_name': '', 'group_size': 4, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': [], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 13656, 'fail_count': 12}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jiang/.local/bin/dpgen", line 8, in
sys.exit(main())
^^^^^^
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/main.py", line 233, in main
args.func(args)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 5109, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 4440, in run_iter
run_train(ii, jdata, mdata)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 776, in run_train
submission.run_submission()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 229, in run_submission
self.handle_unexpected_submission_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 355, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Debug information: submission_hash==c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
Following the information above, I found that the train.log file shows ‘dp: nov vocab file specified’, but I don't know how to solve this problem. Thanks!
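For anyone checking the same thing, here is a minimal sketch for collecting the last lines of each task's train.log under remote_root. It assumes the task directories (000, 001, 002, 003 in the job detail above) sit directly under remote_root and that the log is named train.log, as the 'errlog' entries suggest. The function name `tail_task_logs` and its parameters are made up for illustration and are not part of dpgen or dpdispatcher.

```python
import sys
from pathlib import Path


def tail_task_logs(remote_root, log_name="train.log", n_lines=20):
    """Return {task_dir_name: last n_lines of its log} for each task subdirectory.

    Assumption: task directories live directly under remote_root.
    Directories without the log file are skipped.
    """
    tails = {}
    for task_dir in sorted(Path(remote_root).iterdir()):
        log_path = task_dir / log_name
        if task_dir.is_dir() and log_path.is_file():
            # errors="replace" avoids crashing on odd bytes in the log
            lines = log_path.read_text(errors="replace").splitlines()
            tails[task_dir.name] = lines[-n_lines:]
    return tails


if __name__ == "__main__":
    # Usage: python tail_logs.py <remote_root>
    if len(sys.argv) > 1:
        for name, lines in tail_task_logs(sys.argv[1]).items():
            print(f"== {name} ==")
            print("\n".join(lines))
```

Running it against the remote_root printed in the debug information shows whether all four tasks fail with the same message or only some of them.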