Questions in 1 0 #1251
Unanswered
sooyaaa233 asked this question in Q&A
Did you change the model parameters compared to the previous model?
Traceback (most recent call last):
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpdispatcher/submission.py", line 287, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpdispatcher/submission.py", line 732, in handle_unexpected_job_state
    raise RuntimeError(
RuntimeError: job:d0a3de567b90e608340e6b0e68ef2fbba4468aae 17624 failed 3 times.job_detail:{'d0a3de567b90e608340e6b0e68ef2fbba4468aae': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json --init-model old/model.ckpt; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json', 'old/model.ckpt.meta', 'old/model.ckpt.index', 'old/model.ckpt.data-00000-of-00001'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 1, 'queue_name': '', 'group_size': 1, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': ['/home/hsh/learnmd/study/ntodpgen/run/train.env'], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 17624, 'fail_count': 3}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/hsh/anaconda3/envs/dpgen/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpgen/main.py", line 233, in main
    args.func(args)
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpgen/generator/run.py", line 5109, in gen_run
    run_iter(args.PARAM, args.MACHINE)
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpgen/generator/run.py", line 4440, in run_iter
    run_train(ii, jdata, mdata)
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpgen/generator/run.py", line 776, in run_train
    submission.run_submission()
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpdispatcher/submission.py", line 252, in run_submission
    self.handle_unexpected_submission_state()
  File "/home/hsh/anaconda3/envs/dpgen/lib/python3.9/site-packages/dpdispatcher/submission.py", line 290, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/hsh/learnmd/study/ntodpgen/remote/train/2c78e9718521b0c2b6713d78e993f4c9d2bf16cf.
Debug information: submission_hash==2c78e9718521b0c2b6713d78e993f4c9d2bf16cf.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
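
The last line of the error asks to check the directories under remote_root, and the job detail lists 'train.log' as both outlog and errlog. A minimal sketch for pulling the tail of every such log is given here. It assumes remote_root is readable from the machine where the script runs. The helper name `tail_logs` and the 30-line default are hypothetical choices for illustration and are not part of dpgen or dpdispatcher.

```python
from collections import deque
from pathlib import Path


def tail_logs(remote_root, log_name="train.log", n_lines=30):
    """Return {log path: last n_lines lines} for every log_name under remote_root."""
    tails = {}
    # rglob walks all task subdirectories (e.g. 002/) below remote_root.
    for log_path in sorted(Path(remote_root).rglob(log_name)):
        with log_path.open(errors="replace") as fh:
            # deque with maxlen keeps only the final n_lines lines of the file.
            tails[str(log_path)] = list(deque(fh, maxlen=n_lines))
    return tails


if __name__ == "__main__":
    # remote_root copied from the "Debug information" line of the error message.
    root = "/home/hsh/learnmd/study/ntodpgen/remote/train/2c78e9718521b0c2b6713d78e993f4c9d2bf16cf"
    for path, lines in tail_logs(root).items():
        print(f"==> {path} <==")
        print("".join(lines), end="")
```

The end of train.log in the failing task directory usually carries the real cause of the failure, which the dpdispatcher traceback itself does not show.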