An error when the job is submitted to a remote cluster #757
Unanswered
phyoung123
Problem fixed.
Hello, I have completed the first two steps (training and exploration) on a remote GPU cluster. I now want to send the fp task to a different remote CPU cluster, but I get an error.
Error message:
Traceback (most recent call last):
File "/data/home/scv6293/.conda/envs/deepmd/bin/dpgen", line 8, in <module>
sys.exit(main())
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
args.func(args)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3236, in gen_run
run_iter (args.PARAM, args.MACHINE)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3222, in run_iter
run_fp (ii, jdata, mdata)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2669, in run_fp
run_fp_inner(iter_index, jdata, mdata, forward_files, backward_files, _vasp_check_fin,
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2637, in run_fp_inner
submission = make_submission(
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 358, in make_submission
machine = Machine.load_from_dict(abs_mdata_machine)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/machine.py", line 128, in load_from_dict
context = BaseContext.load_from_dict(machine_dict)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/base_context.py", line 34, in load_from_dict
context = context_class.load_from_dict(context_dict)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 253, in load_from_dict
ssh_context = cls(
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 227, in __init__
self.ssh_session = SSHSession(**remote_profile)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 38, in __init__
self._setup_ssh()
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 111, in _setup_ssh
self.ssh.connect(hostname=self.hostname, port=self.port,
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in connect
retry_on_signal(lambda: sock.connect(addr))
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/util.py", line 279, in retry_on_signal
return function()
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in <lambda>
retry_on_signal(lambda: sock.connect(addr))
socket.timeout: timed out
The fp section of my machine.json:
"fp":[
{
"machine":{
"batch_type":"Slurm",
"context_type":"SSHContext",
"local_root":"./",
"remote_profile":{
"hostname":"ssh.cn-zhongwei-1.paracloud.com",
"port":22,
"password":"111111111",
"username":"scfa0089@NC-E"
},
"remote_root":"/public1/home/scfa0089/lzg/xufy/z/run/work"
},
"resources":{
"cpu_per_node":48,
"_node_cpu":24,
"number_node":1,
"gpu_per_node":0,
"queue_name":"v5_192",
"_exclude_list":[],
"_with_mpi":false,
"group_size":100,
"_source_list":["/public1/soft/other/module.sh"],
"module_list":[
"mpi/intel/19.3.0"],
"_partition":"large",
"_comment":"that's all"
},
"command":"mpirun -np 48 vasp_gam"
}
]
}
However, I can log in to this host manually with: ssh scfa0089@NC-E@ssh.cn-zhongwei-1.paracloud.com
How can I solve this problem and dispatch the fp task to a different cluster? Thanks in advance.
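One way to narrow this down: the traceback ends in sock.connect inside paramiko with socket.timeout, so the failure happens at the TCP connection stage, before any username or password is used. The sketch below repeats only that step with the Python standard library. The function name check_tcp and its default timeout are hypothetical choices for illustration and are not part of dpgen, dpdispatcher, or paramiko.

```python
import socket


def check_tcp(host: str, port: int = 22, timeout: float = 10.0) -> tuple[bool, str]:
    """Attempt a plain TCP connection and report whether it succeeded.

    This mirrors the step that timed out in the traceback (sock.connect),
    without involving SSH authentication at all.
    """
    try:
        # create_connection resolves the host name and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # Same exception type as the last line of the traceback.
        return False, "timed out"
    except OSError as exc:
        # DNS failures, refused connections, unreachable networks, etc.
        return False, f"failed: {exc}"
```

Running check_tcp("ssh.cn-zhongwei-1.paracloud.com", 22) on the same node where dpgen is started (the paths in the traceback suggest this is the first cluster) shows whether that node can reach the CPU cluster at all. If it reports "timed out" there while the manual ssh command works from another machine, the cause is probably a network or firewall restriction between the two clusters rather than the machine.json settings, although that is only an inference from the traceback.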