Description
Currently we have several places where the amount of memory required by a job depends on the size of the input files, which in turn depend on the project. Whenever somebody runs a pipeline and the currently assigned memory is insufficient, they increase the memory requirement for that job in the pipeline.
The result is that the resource requirements for any given task are set for the largest input files that have ever been processed, even if those are much larger than the norm. Furthermore, this is a ratchet-type process: the resource requirements of our tasks will only ever go up.
I can think of two solutions to this. The first is to encode more of the resource requirements in the ini file. The second would be to implement some sort of dynamic resource-requirement determination for certain tasks.
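To make the first option concrete, here is a minimal sketch of a task reading its memory requirement from the ini file rather than hard-coding it. The section and option names (`reconcile`, `memory`) are made up for illustration, and I'm assuming a plain configparser-style read; the real pipeline would go through its own configuration machinery:

```python
import configparser

# Sketch of option 1: take the per-task memory requirement from the
# pipeline ini instead of hard-coding it in the task body.
# Section/option names here are hypothetical.
config = configparser.ConfigParser()
config.read("pipeline.ini")

# fall back to a modest default if the project doesn't override it
job_memory = config.get("reconcile", "memory", fallback="4G")
```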
As an example of the second, dynamic approach:
The memory requirement of fastqs2fastqs when reconciling pairs is highly dependent on the size of the input fastq files. It can be run by pipeline_readqc, but the task would need a very high memory requirement so that large files could be handled (for files of 30M reads the requirement is well in excess of 10GB). Instead, I propose that the pipeline task measures the size of the input file and sets the memory requirement appropriately. This would require some experiments to determine the shape of the memory-vs-size curve, but shouldn't be too difficult.
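As a rough sketch of what that could look like, assuming a roughly linear memory-vs-size relationship: the helper name and the scaling constants below are placeholders that would have to come from those calibration experiments, and I'm assuming memory is requested in the usual "<n>G" string form.

```python
import os


def estimate_job_memory(fastq_path, gb_per_gb_input=1.5, minimum_gb=4):
    """Scale the requested memory with the size of the input fastq.

    The linear factor and the floor are placeholders: the real values
    would come from measuring the memory-vs-size curve for the
    reconcile step of fastqs2fastqs.
    """
    size_gb = os.path.getsize(fastq_path) / 1e9
    required_gb = max(minimum_gb, size_gb * gb_per_gb_input)
    # return something like "12G", assuming that is the format the
    # job submission layer expects
    return "%dG" % int(round(required_gb))


# Inside the pipeline task, before building the command statement,
# the task would then set its memory from the input file, e.g.:
# job_memory = estimate_job_memory(infile)
```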
Is this something we should do more of? Thoughts?