Skip to content

Sidejob races and eqc failures (related to #10?) #17

@russelldb

Description

@russelldb

Supervisor_eqc:prop_par/0 and supervisor_eqc:prop_pulse/0 have shown a couple of races in sidejob.

Caveats

It may well be that the original author was content with these races,
as the code is supposed to be performance critical, they may have felt
the windows open for incorrectness were acceptable, but there is no
remaining documentation of such a decision. It could be that the
author of the code and the author of the tests did not communicate and
the tests make wrong assumptions.

Which children?

The property tests show a problem with
sidejob_supervisor:which_children/0 there are checked in
counter_examples in the nhs-riak repo (where this issue is duplicated
for now) that show this issue, though I don't yet have a fix.

https://github.com/nhs-riak/sidejob/blob/rdb/2.0.1-nhs-2.2.5/test/which_children_ce.eqc

Limit not a limit

There are two races around the process limit for sidejob_supervisor,
one is a check then act in `sidejob:is_availble/3'
https://github.com/basho/sidejob/blob/rdb/2.0.1-nhs-2.2.5/src/sidejob.erl#L144
here the ets lookup (check) counter update, and set of full (act) are
all capable of being interleaved. In one attempt to fix I added a
gen_server per worker-ets table and serialised access to the worker
ets with the gen_server, thus turning this group of commands into an
atomic call. This fixed this one race only

The larger issue is a larger window for a
race. sidejob:is_available/3 will increment the counter for a
worker-ets table, when LIMIT is met no more workers should be
created. However, after a worker is created be sidejob_supervisor
there is a call to sidejob_worker:update_usage/1 which sets the
usage counter in ets to the number of actually created processes
regardless of the number requested and set by is_available.

Two issues around this race:

  1. There are acquire (is_available) and use (start_child) phases
    that a process can crash between, leaving riak with no more
    capacity and no fsms running at all. This is imo serious and should
    be addressed. This is essentially a lock with no timeout or unlock.

  2. The limit on the number of processes in the worse case is
    N(N+1)/2 since LIMIT processes could set the ETS counter, then
    1 gets created and set the counter to 1, and then LIMIT-1 update
    the ETS counter, and then 1 get created and set the counter to 2,
    and LIMIT-2 update the counter, etc etc etc.

I said that I'd tag @paegun for this issue. You done been tagged.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions