Running parallel batch jobs is not performant due to all Python subprocesses running on the same master node #323

@Andrew-S-Rosen

Description

We have had pretty good success with the batch job feature, but we have not been able to carry out parallel batch jobs successfully.

We are running VASP with Custodian, where each VASP job uses one full CPU node and we are requesting multiple CPU nodes per Slurm allocation. Unfortunately, some VASP jobs are killed by SIGTERM even though there is no indication that anything is going wrong with them; perhaps Custodian is sending the SIGTERM to the wrong node? Note that the jobs are not all SIGTERM-ing at the same time: a rare one or two complete, and others SIGTERM later on, so Custodian is not sending the SIGTERM to all the processes at once. We see the same issue across multiple machines, so it is not related to the VASP build or node architecture.
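To help narrow down which process is actually receiving the signal, one option (a debugging sketch of ours, not part of quacc or Custodian) is to install a SIGTERM handler in the Python entry point of each job that logs the hostname and PID before exiting. If the logged hostname is always the master node, that would support the hypothesis in the title.

```python
import os
import signal
import socket
import sys

def log_sigterm(signum, frame):
    # Record which host/PID received the signal before the process dies,
    # so we can tell whether the SIGTERM lands on the master node or on
    # the compute node actually running VASP.
    msg = f"SIGTERM received on host={socket.gethostname()} pid={os.getpid()}\n"
    with open("sigterm_trace.log", "a") as fh:
        fh.write(msg)
    sys.exit(128 + signum)

# Register the handler for SIGTERM in the job's Python process.
signal.signal(signal.SIGTERM, log_sigterm)
```

This would not fix anything on its own, but the resulting log files would tell us where the signals are being delivered.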

The following is a typical worker configuration that @blaked8619 is using. We tried adding the --exclusive flag, but it did not help. Note that although we are using quacc here, it calls Custodian in exactly the same way as Atomate2 does, so I do not think that is related to the issue. We could, of course, put together an Atomate2 example if absolutely needed.

stellar_vasp:
    type: remote
    host: <REDACTED>
    user: <REDACTED>
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff
    max_jobs: 5
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2024.10
      conda activate jobflow
      module load vasp/6.5.1
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 --ntasks-per-node 96 --exclusive"
      export QUACC_WORKFLOW_ENGINE=jobflow
      export QUACC_CREATE_UNIQUE_DIR=False
      export QUACC_GZIP_FILES=True
    timeout_execute: 60
    resources:
      nodes: 5
      ntasks_per_node: 96
      cpus_per_task: 1
      mem: 700G
      time: 48:00:00
      account: cbe
    batch:
      jobs_handle_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jfr_handle_dir
      work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jf_batch_jobs
      parallel_jobs: 5
      max_time: 144000
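For context, the behavior we expected from this configuration is that each of the `parallel_jobs` workers launches its own `srun -N 1 --exclusive` job step, so Slurm places each VASP run on a distinct node of the 5-node allocation. A minimal stand-in sketch of that intent (using `echo` in place of the real `srun -N 1 --ntasks-per-node 96 --exclusive` command, purely for illustration):

```python
import subprocess

# Hypothetical illustration: each parallel batch job launches its own
# single-node job step concurrently. "echo" stands in for the real
# srun/VASP invocation here.
parallel_jobs = 5
procs = [
    subprocess.Popen(["echo", f"job step {i}: srun -N 1 --exclusive vasp_std"])
    for i in range(parallel_jobs)
]
exit_codes = [p.wait() for p in procs]
```

If instead all five of these Python parent processes (and their signal handling) live on the same master node, that could explain the misdirected SIGTERMs.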

I am also attaching a failed VASP run (failed.zip), minus the proprietary POTCAR. The job proceeds normally until stderr reports the following SIGTERM; the failure is not specific to this particular system.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  0000152CCC12A990  Unknown               Unknown  Unknown
libmkl_avx512.so.  00001529FA5B234B  mkl_blas_avx512_d     Unknown  Unknown

Please let us know if you have any suggestions or what other information would be helpful for debugging. This is, of course, a challenging one to debug given how many components are involved.
