Server: validator - assimilator race condition - should validator set assimilate_state? #5706

@bema-aei

Occasionally Einstein@Home encounters the following situation:

  • The quorum of a workunit is completed, the workunit is validated, a canonical result is found, and the validator sets assimilate_state=1.
  • While the assimilator is working on the workunit (and a chunk of others), another result is reported by the scheduler, need_validate is set, and the validator reads the workunit record.
  • The assimilator finishes assimilation and sets assimilate_state=2. File deletion is triggered.
  • The validator finishes validation and sets assimilate_state=1, overwriting the value the assimilator just set.
  • The assimilator tries to assimilate the same workunit again.
    Depending on whether the file deleter has already finished, the assimilator either tries to assimilate the same canonical result files a second time or doesn't find them at all. Both conditions cause our assimilator to halt (for good reason) and require manual intervention to analyze and resolve the situation.
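
To make the interleaving above concrete, here is a minimal single-threaded sketch that replays the steps in order. It is not BOINC code: `WorkunitRow` and `replay_race` are hypothetical names, the enum only mirrors the values 1 and 2 quoted above, and the whole-row write-back is an assumption about how the stale value gets written.

```cpp
// Illustrative values only; 1 and 2 match the states quoted in this issue.
enum AssimilateState {
    ASSIMILATE_INIT = 0,
    ASSIMILATE_READY = 1,
    ASSIMILATE_DONE = 2
};

// Hypothetical stand-in for a workunit record (not the real BOINC type).
struct WorkunitRow {
    int assimilate_state = ASSIMILATE_INIT;
    bool need_validate = false;
};

// Replays the interleaving in one thread and returns the
// assimilate_state that ends up in the "database" row.
int replay_race() {
    WorkunitRow db;

    // 1. Quorum complete: the validator finds a canonical result.
    db.assimilate_state = ASSIMILATE_READY;

    // 2. A late result is reported while the assimilator is busy;
    //    the validator reads the record and keeps a private copy.
    db.need_validate = true;
    WorkunitRow validator_copy = db;  // snapshot with state == READY

    // 3. The assimilator finishes and marks the workunit as done.
    db.assimilate_state = ASSIMILATE_DONE;

    // 4. The validator finishes and writes its stale copy back,
    //    silently discarding the assimilator's update.
    validator_copy.need_validate = false;
    validator_copy.assimilate_state = ASSIMILATE_READY;
    db = validator_copy;

    // 5. The assimilator will pick this workunit up a second time.
    return db.assimilate_state;
}
```

In this sketch `replay_race()` returns `ASSIMILATE_READY`, i.e. the workunit looks unassimilated again even though assimilation (and possibly file deletion) has already happened.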

Also, when setting assimilate_state, the validator doesn't care about results of the workunit that are still "in progress". Triggering assimilation (and subsequently file deletion) of a workunit immediately after finding a canonical result means that these "late" results cannot be (successfully) validated, assimilated and credited, even if they are valid and arrive on time, i.e. before their deadline.

I was about to modify the validator so that it does not set assimilate_state directly, but instead just lets the transitioner know to take a look at that workunit (immediately). The transitioner handles such late results correctly, and it doesn't have the delay between reading the workunits and writing modifications that the validator has (at least occasionally).
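
A rough sketch of what I have in mind, under simplifying assumptions: the struct, its fields, and both function names are hypothetical and only illustrate the division of labour. This is not how the current validator or transitioner is written.

```cpp
#include <ctime>

// Illustrative values only; 1 and 2 match the states quoted in this issue.
enum AssimilateState {
    ASSIMILATE_INIT = 0,
    ASSIMILATE_READY = 1,
    ASSIMILATE_DONE = 2
};

// Hypothetical, simplified workunit record.
struct WorkunitRow {
    int assimilate_state = ASSIMILATE_INIT;
    bool has_canonical_result = false;
    int results_in_progress = 0;       // results sent but not yet reported
    std::time_t transition_time = 0;   // when the transitioner should look next
};

// Validator side: record the canonical result and ask the transitioner
// to look at the workunit right away. assimilate_state is NOT touched.
void validator_found_canonical(WorkunitRow& wu, std::time_t now) {
    wu.has_canonical_result = true;
    wu.transition_time = now;
}

// Transitioner side: the only place that moves INIT -> READY.
// Returns true if assimilation was triggered by this call.
bool transitioner_maybe_trigger_assimilation(WorkunitRow& wu) {
    if (wu.assimilate_state != ASSIMILATE_INIT) return false;  // never re-trigger
    if (!wu.has_canonical_result) return false;                // nothing to assimilate
    if (wu.results_in_progress > 0) return false;              // wait for late results
    wu.assimilate_state = ASSIMILATE_READY;
    return true;
}
```

With a single writer for the INIT to READY transition, a validator pass over a late result can no longer reset a workunit that the assimilator has already marked as done, and assimilation is held back until no results are in progress.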

But then I came across this comment in the validator:

https://github.com/BOINC/boinc/blob/master/sched/validator.cpp#L677-L679

                // if we found a canonical result,
                // trigger the assimilator, but do NOT trigger
                // the transitioner - doing so creates a race condition

What race condition is this comment referring to, and how would we resolve the problems above then?
