Open
Description
When a batch job runs on SLURM, SLURM sends a SIGTERM to the process 1 minute before terminating the job. We can catch this signal and save a checkpoint so that the latest weights are not lost.
Relevant material: https://services.criann.fr/en/services/hpc/cluster-myria/guide/signals-sent-by-slurm/#sigterm-signal
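
A rough sketch of what this could look like, assuming a Python training loop. `TerminationFlag`, `save_checkpoint`, and `train` are hypothetical names, not existing code in this repo, and the JSON state is a stand-in for the real model weights. The handler only sets a flag; the loop checks it after each step and writes the checkpoint from normal code, which avoids doing I/O inside the signal handler.

```python
import json
import os
import signal
import tempfile


class TerminationFlag:
    """Hypothetical helper: records that SIGTERM arrived, does no I/O itself."""

    def __init__(self, signum=signal.SIGTERM):
        self.requested = False
        # signal.signal must be called from the main thread.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        self.requested = True


def save_checkpoint(state, path):
    """Write to a temp file, then rename, so an interrupted write
    cannot leave a truncated checkpoint at `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(num_steps, checkpoint_path, flag, on_step=None):
    """Placeholder loop: stops and checkpoints once the flag is set."""
    state = {"step": 0, "weights": [0.0]}
    for step in range(1, num_steps + 1):
        state["weights"][0] += 0.1  # stand-in for a real optimizer step
        state["step"] = step
        if on_step is not None:
            on_step(step)
        if flag.requested:
            save_checkpoint(state, checkpoint_path)
            break
    return state
```

In a real job the flag would be created once at startup, before the training loop begins. The checkpoint write has to finish within the grace period between SIGTERM and the job being killed, so the check should run often enough (here, once per step) for that to be realistic.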