This is not a key priority and I am still trying to understand exactly how it happens.
I am currently running toil-cwl-runner to submit jobs on the spider cluster. This works very well for the pipeline and can greatly speed things up. However, aoflagging jobs on the concatenated MSs fail quite often. I have --retryCount=2 set, and they often complete on a retry when more memory is provided. If they fail completely, then luckily with toil I can use the --restart flag and it will repair and continue running.
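For anyone who wants to reproduce this, here is a minimal sketch of how the two invocations could be assembled. The helper `build_toil_command`, the file names `pipeline.cwl` and `inputs.json`, and the job store path `./jobstore` are placeholders I made up for illustration; only `toil-cwl-runner`, `--jobStore`, `--retryCount` and `--restart` are actual Toil options. A restart has to point at the same job store as the run that failed.

```python
import shlex


def build_toil_command(workflow, job_inputs, job_store, retry_count=2, restart=False):
    """Assemble a toil-cwl-runner argument list (hypothetical helper)."""
    cmd = [
        "toil-cwl-runner",
        "--jobStore", job_store,          # must be the same store when restarting
        f"--retryCount={retry_count}",    # how many times a failed job is retried
    ]
    if restart:
        # Resume the existing workflow from the job store instead of starting over.
        cmd.append("--restart")
    cmd += [workflow, job_inputs]
    return cmd


# Placeholder paths; substitute the real workflow and input files.
first_run = build_toil_command("pipeline.cwl", "inputs.json", "./jobstore")
resumed = build_toil_command("pipeline.cwl", "inputs.json", "./jobstore", restart=True)

print(shlex.join(first_run))
print(shlex.join(resumed))
```

The resulting lists can be passed to `subprocess.run` directly, or the printed strings can be pasted into a submission script.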
I thought I would make this an issue now so others can see it and comment with their experience. It could be important for updating the job CPU and memory requirements as we move forward.
I would say that on a typical run about 5 of 24 subbands fail and need to be restarted. Maybe @jurjen93 has some more experience from his ELAIS-N1 runs.