Fix race-condition between collect_artifacts.sh instances #92

@theurich

Description

test_esmf.py currently spawns a separate child process to run collect_artifacts.sh after the build and test phases of each combo it executes. Each collect_artifacts.sh script executes collect_artifacts.py, which runs Git commands from within the local esmf-test-artifacts clone directory. Because all of these collect_artifacts.py instances operate on the same clone concurrently, they race with each other.

collect_artifacts.py currently contains code that is supposed to act as a lock to prevent this race condition. The locking implementation is file-based, so it is not guaranteed to work when the file system (FS) has issues. In fact, on a Lustre FS it does not work reliably at all!

One solution might be to avoid multiple collect_artifacts.py instances in the first place. Instead, there could be a single instance that is responsible for processing all of the running build and test jobs. This could be managed by a simple file that lists all of the job ids to wait on. The single collect_artifacts.py instance then just loops over those ids, checks whether any of the jobs is done, and if so handles the collection. With this approach there is virtually no potential for conflict between the collection steps of the different combos. At the same time it should be just as flexible: whatever finishes gets processed as soon as possible, with no serialization of the order, because the single instance keeps looping over all ids and checking which ones are ready for collection. The process finishes once all jobs have finished and have been processed.
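
The loop described above could look roughly like the following. This is only a minimal sketch of the idea, not the existing collect_artifacts.py code: the function names `read_job_ids` and `collect_all`, the one-id-per-line file layout, and the `is_done` / `collect` callbacks are all assumptions made for illustration. How a job is actually detected as finished (for example by querying the batch scheduler) and how artifacts are collected are left to those callbacks.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def read_job_ids(path: Path) -> List[str]:
    """Return the job ids listed in the file, one per line (assumed layout)."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def collect_all(
    job_ids: Iterable[str],
    is_done: Callable[[str], bool],   # hypothetical: asks the scheduler about a job
    collect: Callable[[str], None],   # hypothetical: runs the Git-based collection
    poll_seconds: float = 30.0,
) -> List[str]:
    """Collect each job as soon as it is done; return ids in processing order."""
    pending = list(dict.fromkeys(job_ids))  # drop duplicate ids, keep order
    processed: List[str] = []
    while pending:
        finished = [job_id for job_id in pending if is_done(job_id)]
        for job_id in finished:
            # Only this single process touches the esmf-test-artifacts clone,
            # so the Git commands never run concurrently.
            collect(job_id)
            processed.append(job_id)
            pending.remove(job_id)
        if pending and not finished:
            # Nothing was ready in this pass; wait before polling again.
            time.sleep(poll_seconds)
    return processed
```

Because collection happens in one process, no file-based lock is needed, and jobs are still handled in whatever order they complete rather than in the order they were submitted.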

Labels: bug
