Running parallel batch jobs is not performant due to all Python subprocesses running on the same master node #323

@Andrew-S-Rosen

Description

We have had pretty good success with the batch job feature, but we have not been able to carry out parallel batch jobs successfully.

We are running VASP with Custodian, where each VASP job uses one full CPU node and we are requesting multiple CPU nodes per Slurm allocation. Unfortunately, some VASP jobs are killed by SIGTERM even though there is no indication that anything is going wrong with them; perhaps Custodian is sending the SIGTERM to the wrong node? Note that the jobs are not all SIGTERM-ing at the same time: a rare one or two complete, and others SIGTERM later on, so Custodian is not sending the SIGTERM to all the processes at once. We see the same issue across multiple machines, so it is not related to the VASP build or node architecture.
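To help narrow down which process is actually receiving the signal, one option (a debugging sketch of ours, not part of quacc or Custodian) is to install a SIGTERM handler in the Python entry point of each job that logs the hostname and PID before exiting. If the logged hostname is always the master node, that would support the hypothesis in the title.

```python
import os
import signal
import socket
import sys

def log_sigterm(signum, frame):
    # Record which host/PID received the signal before the process dies,
    # so we can tell whether the SIGTERM lands on the master node or on
    # the compute node actually running VASP.
    msg = f"SIGTERM received on host={socket.gethostname()} pid={os.getpid()}\n"
    with open("sigterm_trace.log", "a") as fh:
        fh.write(msg)
    sys.exit(128 + signum)

# Register the handler for SIGTERM in the job's Python process.
signal.signal(signal.SIGTERM, log_sigterm)
```

This would not fix anything on its own, but the resulting log files would tell us where the signals are being delivered.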

The following is a typical worker configuration that @blaked8619 is using. We tried adding the --exclusive flag, but it did not help. Note that although we are using quacc here, it calls Custodian in exactly the same way as Atomate2 does, so I do not think that is related to the issue. We could, of course, put together an Atomate2 example if absolutely needed.

stellar_vasp:
    type: remote
    host: <REDACTED>
    user: <REDACTED>
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff
    max_jobs: 5
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2024.10
      conda activate jobflow
      module load vasp/6.5.1
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 --ntasks-per-node 96 --exclusive"
      export QUACC_WORKFLOW_ENGINE=jobflow
      export QUACC_CREATE_UNIQUE_DIR=False
      export QUACC_GZIP_FILES=True
    timeout_execute: 60
    resources:
      nodes: 5
      ntasks_per_node: 96
      cpus_per_task: 1
      mem: 700G
      time: 48:00:00
      account: cbe
    batch:
      jobs_handle_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jfr_handle_dir
      work_dir: /scratch/gpfs/ROSENGROUP/bd8619/path/to/my/stuff/jf_batch_jobs
      parallel_jobs: 5
      max_time: 144000
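For context, the behavior we expected from this configuration is that each of the `parallel_jobs` workers launches its own `srun -N 1 --exclusive` job step, so Slurm places each VASP run on a distinct node of the 5-node allocation. A minimal stand-in sketch of that intent (using `echo` in place of the real `srun -N 1 --ntasks-per-node 96 --exclusive` command, purely for illustration):

```python
import subprocess

# Hypothetical illustration: each parallel batch job launches its own
# single-node job step concurrently. "echo" stands in for the real
# srun/VASP invocation here.
parallel_jobs = 5
procs = [
    subprocess.Popen(["echo", f"job step {i}: srun -N 1 --exclusive vasp_std"])
    for i in range(parallel_jobs)
]
exit_codes = [p.wait() for p in procs]
```

If instead all five of these Python parent processes (and their signal handling) live on the same master node, that could explain the misdirected SIGTERMs.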

I am also attaching a failed VASP run (failed.zip), minus the proprietary POTCAR. The job proceeds normally until stderr reports the following SIGTERM; the failure is not specific to this particular system.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  0000152CCC12A990  Unknown               Unknown  Unknown
libmkl_avx512.so.  00001529FA5B234B  mkl_blas_avx512_d     Unknown  Unknown

Please let us know if you have any suggestions or what other information would be helpful for debugging. This is, of course, a challenging one to debug given how many components are involved.
