An error occurs when running dpgen #1175
Unanswered
maoxinxina asked this question in Q&A
Replies: 2 comments
- Did you find a solution, please? I'm having the same problem.
- It's an error reported by Slurm, saying that the node configuration you requested is not available. You might ask your cluster administrator what is available.
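  To act on that advice, you can also query Slurm directly before re-running dpgen. A minimal sketch, assuming the partition name `gpu` from the machine.json below; replace the node name with one that `sinfo` actually lists:

  ```bash
  # Show the nodes, CPU/memory/GRES layout, and state of the "gpu" partition
  sinfo -p gpu -o "%P %D %c %m %G %t"

  # Inspect a single node in detail to see how many CPUs/GPUs it really offers
  scontrol show node <node_name_from_sinfo>
  ```

  Compare that output with number_node, gpu_per_node, and any memory flags in machine.json; sbatch typically rejects the job with "Requested node configuration is not available" as soon as one of them cannot be satisfied by any node in the partition.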
- I am running dpgen on a Slurm cluster, and the following error occurred.
Description
```
2023-04-04 15:56:02,277 - INFO : info:check_all_finished: False
Traceback (most recent call last):
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 285, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 751, in handle_unexpected_job_state
    self.submit_job()
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 798, in submit_job
    job_id = self.machine.do_submit(self)
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/utils.py", line 179, in wrapper
    return func(*args, **kwargs)
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/slurm.py", line 84, in do_submit
    raise RuntimeError(
RuntimeError: status command squeue fails to execute
error message:sbatch: error: Batch job submission failed: Requested node configuration is not available
return code 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/HOME/scz0aai/run/deepmd-kit/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/main.py", line 233, in main
    args.func(args)
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 5109, in gen_run
    run_iter(args.PARAM, args.MACHINE)
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 4440, in run_iter
    run_train(ii, jdata, mdata)
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 776, in run_train
    submission.run_submission()
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 222, in run_submission
    self.handle_unexpected_submission_state()
  File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 288, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work/1f2a3a2a757b38d4b506119950b64ccf1c5c9d04.
Debug information: submission_hash==1f2a3a2a757b38d4b506119950b64ccf1c5c9d04.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
```
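The last lines of the traceback point at remote_root; inspecting the job script dpdispatcher generated there shows exactly which #SBATCH directives sbatch refused. A sketch of that check (the *.sub name is an assumption about dpdispatcher's usual script naming; list the directory if nothing matches):

```bash
# Go to the remote_root reported in the debug information above
cd /HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work/1f2a3a2a757b38d4b506119950b64ccf1c5c9d04

# Print the resource directives of the generated job script(s);
# the *.sub pattern is assumed, adjust it to the files actually present
grep -H "#SBATCH" *.sub
```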
The machine.json is set as:

```json
{
  "api_version": "1.0",
  "deepmd_version": "2.0.1",
  "train": [
    {
      "command": "dp",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "_cpu_per_node": 4,
        "gpu_per_node": 1,
        "group_size": 1,
        "queue_name": "gpu",
        "_custom_flags": ["#SBATCH --mem=20G"],
        "source_list": ["/HOME/scz0aai/run/deepmd-kit"],
        "module_list": ["cuda/11.6"]
      }
    }
  ],
  "model_devi": [
    {
      "command": "lmp",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "_cpu_per_node": 4,
        "gpu_per_node": 1,
        "group_size": 10,
        "queue_name": "gpu",
        "_custom_flags": ["#SBATCH --mem=20G"],
        "exlued_list": [],
        "source_list": ["source activate /HOME/scz0aai/run/deepmd-kit; module load cuda/11.6"],
        "module_list": []
      }
    }
  ],
  "fp": [
    {
      "command": "mpirun -np 4 vasp_std",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 1,
        "_group_size": 125,
        "source_list": ["module load intel/parallelstudio/2017.1.5; export PATH=/HOME/scz0aai/run/vasp.5.4.4/bin:$PATH"]
      }
    }
  ]
}
```
How can I tackle this issue? Thanks a lot.
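One way to narrow this down (a sketch, not a confirmed fix) is to submit an equivalent throw-away job by hand, outside dpgen. The partition, node count, and GPU count below are copied from the train resources above; the --gres syntax and the script name test.slurm are assumptions, since clusters expose GPUs differently:

```bash
#!/bin/bash
#SBATCH --partition=gpu      # queue_name in machine.json
#SBATCH --nodes=1            # number_node
#SBATCH --gres=gpu:1         # gpu_per_node; GRES syntax is an assumption, confirm with your admin
#SBATCH --time=00:05:00

# Trivial payload: report where the job landed and which GPU is visible
hostname
nvidia-smi
```

If `sbatch test.slurm` fails with the same "Requested node configuration is not available" message, the combination of partition and resources is simply not offered by the cluster, and queue_name or the resources section in machine.json needs to be changed to something sinfo reports as available.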