Understanding Job Failure - Memory Issues? #11

@jwpetley

This is not a key priority and I am still trying to understand exactly how it happens.

I am currently running toil-cwl-runner to submit jobs on the spider cluster. This works very well for the pipeline and can greatly speed things up. However, aoflagging jobs on the concatenated MSs fail quite often. I have --retryCount=2 set, and they often complete on a retry when more memory is provided. If they fail completely, then luckily with toil I can use the --restart flag and it will repair and continue running.
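For anyone wanting to reproduce the setup, here is a minimal sketch of how the two invocations (first run, then resume after a failure) could be assembled. Only `--retryCount=2` and `--restart` come from the description above. The helper name `build_toil_command`, the file names `pipeline.cwl` and `inputs.yml`, the job store path, and the choice of `slurm` as the batch system are all assumptions for illustration, not the actual commands used on spider.

```python
import shlex


def build_toil_command(workflow, inputs, job_store,
                       retry_count=2, restart=False, batch_system="slurm"):
    """Return the argument list for a toil-cwl-runner invocation.

    Hypothetical helper: the paths and batch system are placeholders.
    """
    cmd = [
        "toil-cwl-runner",
        "--jobStore", job_store,          # where Toil keeps workflow state
        "--batchSystem", batch_system,    # assumed scheduler, adjust as needed
        f"--retryCount={retry_count}",    # retries per failed job
    ]
    if restart:
        # Resume from the existing job store instead of starting over.
        cmd.append("--restart")
    cmd += [workflow, inputs]
    return cmd


# First attempt, then the same command with --restart after a failure.
first_run = build_toil_command("pipeline.cwl", "inputs.yml", "./jobstore")
resume = build_toil_command("pipeline.cwl", "inputs.yml", "./jobstore",
                            restart=True)

print(shlex.join(first_run))
print(shlex.join(resume))
```

The resume command has to point at the same job store as the failed run, since that is where Toil records which jobs already finished.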

I thought I would make this an issue now so others can see it and comment with their experience. It could be important for updating the job CPU and memory requirements as we move forwards.

I would say that on a typical run about 5 of the 24 subbands fail and need to be restarted. Maybe @jurjen93 has some more experience with his ELAIS-N1 runs.

Labels: question (Further information is requested)
