Understanding Job Failure - Memory Issues? #11

This is not a key priority and I am still trying to understand exactly how it happens. 

I am currently running toil-cwl-runner to submit jobs on the spider cluster. This works very well for the pipeline and can greatly speed things up. However, aoflagging jobs on the concatenated MSs fail quite often. I have `--retryCount=2` set, and they often complete on a retry when more memory is provided. If they fail completely, then luckily with toil I can use the `--restart` flag and it will recover and continue running.
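For anyone who wants to reproduce this, here is a rough sketch of how the two invocations (first run, then restart) could be assembled. Only `--retryCount=2` and `--restart` come from my actual setup; the file names `pipeline.cwl` and `inputs.json`, the job store path, and the `slurm` batch system are placeholders I am assuming for illustration, so adjust them to your own run.

```python
import shlex


def build_toil_command(workflow, inputs, job_store,
                       retry_count=2, restart=False, batch_system="slurm"):
    """Assemble a toil-cwl-runner command line as a list of arguments.

    All paths and the batch system are illustrative assumptions;
    only --retryCount and --restart are taken from the issue text.
    """
    cmd = [
        "toil-cwl-runner",
        "--jobStore", job_store,          # must be the same path on restart
        "--batchSystem", batch_system,    # assumed scheduler, change as needed
        f"--retryCount={retry_count}",    # retries per job before giving up
    ]
    if restart:
        # Resume from the existing job store instead of starting over.
        cmd.append("--restart")
    cmd += [workflow, inputs]
    return cmd


# First attempt, then a resume after some jobs have failed completely.
first_run = build_toil_command("pipeline.cwl", "inputs.json", "./jobstore")
resume_run = build_toil_command("pipeline.cwl", "inputs.json", "./jobstore",
                                restart=True)

print(shlex.join(first_run))
print(shlex.join(resume_run))
```

The resulting lists can be passed to `subprocess.run` directly. The restart only works if it points at the same job store as the failed run.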

I thought I would open this issue now so others can see it and comment on their experience. It could be important for updating the job CPU and memory requirements as we move forward.
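To make the discussion about requirements concrete, this is roughly what a per-step request looks like in CWL terms. `ResourceRequirement`, `coresMin` and `ramMin` (in mebibytes) are standard CWL fields; the numbers below are made-up placeholders, not measured values for the flagging step.

```python
import json


def resource_requirement(cores_min, ram_min_mib):
    """Return a CWL ResourceRequirement entry as a plain dict.

    ramMin is expressed in mebibytes, as in the CWL specification.
    """
    return {
        "class": "ResourceRequirement",
        "coresMin": cores_min,
        "ramMin": ram_min_mib,
    }


# Hypothetical values for a flagging step: 4 cores and 32 GiB of RAM.
flagging_requirements = [resource_requirement(cores_min=4,
                                              ram_min_mib=32 * 1024)]

# CWL documents may be written as JSON, so this can be pasted into a step.
print(json.dumps({"requirements": flagging_requirements}, indent=2))
```

Once we know how much memory the failing jobs actually need, the placeholder values can be replaced with real ones in the relevant step definition.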

I would say that on a typical run about 5 of 24 subbands fail and need to be restarted. Maybe @jurjen93 has some more experience from his ELAIS-N1 runs.
