Description
We have had pretty good success with the batch job feature, but we have not been able to carry out parallel batch jobs successfully.

We are running VASP with Custodian, where each VASP job uses one full CPU node and we are requesting multiple CPU nodes per Slurm allocation. Unfortunately, some VASP jobs are killed with `SIGTERM` calls even though there is no indication anything is going wrong with them, perhaps because Custodian is sending the `SIGTERM` to the wrong node? I should note that the jobs are not all `SIGTERM`-ing at the same time; a rare one or two might complete, and others `SIGTERM` later on, so it is not that Custodian is sending the `SIGTERM` to all the processes. We have the same issue across multiple machines, so it is not related to the VASP build or node architecture.
The following is a typical worker configuration that @blaked8619 is using. We tried adding the `--exclusive` flag, but it did not help. Note that even though we are using quacc here, it calls Custodian in exactly the same way as Atomate2, so I do not think that is related to the issue. Of course, we could generate an Atomate2 example if absolutely needed.
```yaml
stellar_vasp:
  type: remote
  host: <REDACTED>
  user: <REDACTED>
  scheduler_type: slurm
  work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff
  max_jobs: 5
  pre_run: |
    source ~/.bashrc
    module load anaconda3/2024.10
    conda activate jobflow
    module load vasp/6.5.1
    export QUACC_VASP_PARALLEL_CMD="srun -N 1 --ntasks-per-node 96 --exclusive"
    export QUACC_WORKFLOW_ENGINE=jobflow
    export QUACC_CREATE_UNIQUE_DIR=False
    export QUACC_GZIP_FILES=True
  timeout_execute: 60
  resources:
    nodes: 5
    ntasks_per_node: 96
    cpus_per_task: 1
    mem: 700G
    time: 48:00:00
    account: cbe
  batch:
    jobs_handle_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jfr_handle_dir
    work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jf_batch_jobs
    parallel_jobs: 5
    max_time: 144000
```
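If it would help with debugging, one way to test the wrong-node hypothesis would be to log the node each batch job actually runs on (e.g. by echoing `$SLURMD_NODENAME` in the job script) and then check for overlaps across the supposedly parallel jobs. A tiny post-processing sketch (hypothetical helper, not part of jobflow-remote or quacc):

```python
from collections import defaultdict


def find_node_overlaps(assignments):
    """Given (job_id, node) pairs scraped from job logs, return the nodes
    that hosted more than one of the supposedly parallel jobs."""
    by_node = defaultdict(set)
    for job_id, node in assignments:
        by_node[node].add(job_id)
    return {node: sorted(jobs) for node, jobs in by_node.items() if len(jobs) > 1}


# Example: two jobs landed on the same node, a candidate for a
# misdirected SIGTERM.
print(find_node_overlaps([("job-1", "node-a"), ("job-2", "node-a"), ("job-3", "node-b")]))
# → {'node-a': ['job-1', 'job-2']}
```

If this ever returns a non-empty dict while all five jobs are running, the `srun -N 1` steps are sharing nodes rather than each getting a full one.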
I am also attaching the failed VASP run (failed.zip), minus the POTCAR, which is proprietary. The job is going along fine and then the stderr hits us with the following `SIGTERM`. It is not related to this particular system.
```
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
libpthread-2.28.s  0000152CCC12A990  Unknown            Unknown  Unknown
libmkl_avx512.so.  00001529FA5B234B  mkl_blas_avx512_d  Unknown  Unknown
```
Please let us know if you have any suggestions or what other information would be helpful for debugging. This is, of course, a challenging one to debug given how many components are involved.