Status squeue fails to execute, no error message attached #1628
Unanswered
Sanderson1887 asked this question in Q&A
Replies: 1 comment, 1 reply
The error message is wrong, which I have fixed in deepmodeling/dpdispatcher#483. The actual executed command is |
Description
I'm trying to replicate the dpgen tutorial on an HPC cluster to familiarize myself with the program. I followed the Hands-On DPGen tutorial. The only major changes I made were to the machine.json file, where I configured the train, model_deviation, and first-principles calculations to all run on a SLURM HPC cluster.
Issue description
The job terminates within about 20 seconds of submission. The job hash folder is created correctly at the remote root and contains [hash].sub, [hash].sub.run, and [hash].json, along with the '000' folder and the 'data.init' folder. The submission folder contains nothing else, however. Attached here are my machine.json file and the error output.
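To make the folder contents easier to share, the listing above can be collected with a short script. This is only a minimal sketch: `summarize_job_dir` is a hypothetical helper written for illustration, not part of dpgen or dpdispatcher, and it assumes the job hash folder is readable from wherever the script runs.

```python
from pathlib import Path


def summarize_job_dir(job_dir, tail_lines=5):
    """Map each file under job_dir to its last few lines of text.

    Hypothetical helper for illustration; not a dpgen or dpdispatcher API.
    """
    root = Path(job_dir)
    summary = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps binary or oddly encoded files from raising.
            lines = path.read_text(errors="replace").splitlines()
        except OSError:
            lines = ["<unreadable>"]
        # Key by path relative to the job folder, using forward slashes.
        summary[path.relative_to(root).as_posix()] = lines[-tail_lines:]
    return summary
```

Calling `summarize_job_dir("/path/to/remote_root/[hash]")` (with the real folder substituted) returns a dictionary that can be printed and pasted into a report, which shows at a glance whether any log or output file was written before the job stopped.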
machine.json
error.txt
Further, running 'dpdisp submission --download-terminated-log [hash]' reported that the error log was downloaded to a file, but did not say which file it was downloaded to (if any).
Attempted Fixes
Because the error occurs so quickly after submission, and because I have trouble judging which of the reported errors are relevant, I haven't had much luck troubleshooting it. I found 4 similar errors in the discussion forum, but reading them did not shed light on my case.
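Since the reported failure has no message attached, one way to narrow it down is to run the scheduler command by hand and capture its exit code and stderr. The sketch below is an assumption-laden illustration: `probe_command` is a hypothetical helper, it does not reproduce the exact command dpdispatcher executes, and it only tells you whether a command is on the PATH and what it prints in the environment where the script runs.

```python
import shutil
import subprocess


def probe_command(argv, timeout=30):
    """Run argv and report whether it was found, its exit code, and its output.

    Hypothetical helper for illustration; not a dpgen or dpdispatcher API.
    """
    executable = shutil.which(argv[0])
    if executable is None:
        # The command is not on PATH (or not executable) in this environment.
        return {"found": False, "returncode": None, "stdout": "", "stderr": ""}
    result = subprocess.run(
        [executable, *argv[1:]],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "found": True,
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```

For example, `probe_command(["squeue", "--version"])` run on the same host and in the same shell environment that submits the jobs shows whether `squeue` is reachable there at all. A result with `found` set to `False`, or a nonzero `returncode` with text in `stderr`, would point at the environment rather than at the dpgen input files.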
Versions
Python 3.11.9
dpgen 0.12.1
DeePMD-kit 2.2.10