You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Supervisor_eqc:prop_par/0 and supervisor_eqc:prop_pulse/0 have shown a couple of races in sidejob.
Caveats
It may well be that the original author was content with these races,
as the code is supposed to be performance critical, they may have felt
the windows open for incorrectness were acceptable, but there is no
remaining documentation of such a decision. It could be that the
author of the code and the author of the tests did not communicate and
the tests make wrong assumptions.
Which children?
The property tests show a problem with sidejob_supervisor:which_children/0 there are checked in
counter_examples in the nhs-riak repo (where this issue is duplicated
for now) that show this issue, though I don't yet have a fix.
There are two races around the process limit for sidejob_supervisor,
one is a check then act in `sidejob:is_availble/3' https://github.com/basho/sidejob/blob/rdb/2.0.1-nhs-2.2.5/src/sidejob.erl#L144
here the ets lookup (check) counter update, and set of full (act) are
all capable of being interleaved. In one attempt to fix I added a
gen_server per worker-ets table and serialised access to the worker
ets with the gen_server, thus turning this group of commands into an
atomic call. This fixed this one race only
The larger issue is a larger window for a
race. sidejob:is_available/3 will increment the counter for a
worker-ets table, when LIMIT is met no more workers should be
created. However, after a worker is created be sidejob_supervisor
there is a call to sidejob_worker:update_usage/1 which sets the
usage counter in ets to the number of actually created processes regardless of the number requested and set by is_available.
Two issues around this race:
There are acquire (is_available) and use (start_child) phases
that a process can crash between, leaving riak with no more
capacity and no fsms running at all. This is imo serious and should
be addressed. This is essentially a lock with no timeout or unlock.
The limit on the number of processes in the worse case is N(N+1)/2 since LIMIT processes could set the ETS counter, then
1 gets created and set the counter to 1, and then LIMIT-1 update
the ETS counter, and then 1 get created and set the counter to 2,
and LIMIT-2 update the counter, etc etc etc.
I said that I'd tag @paegun for this issue. You done been tagged.
The text was updated successfully, but these errors were encountered:
Supervisor_eqc:prop_par/0
andsupervisor_eqc:prop_pulse/0
have shown a couple of races in sidejob.Caveats
It may well be that the original author was content with these races,
as the code is supposed to be performance critical, they may have felt
the windows open for incorrectness were acceptable, but there is no
remaining documentation of such a decision. It could be that the
author of the code and the author of the tests did not communicate and
the tests make wrong assumptions.
Which children?
The property tests show a problem with
sidejob_supervisor:which_children/0
there are checked incounter_examples in the nhs-riak repo (where this issue is duplicated
for now) that show this issue, though I don't yet have a fix.
https://github.com/nhs-riak/sidejob/blob/rdb/2.0.1-nhs-2.2.5/test/which_children_ce.eqc
Limit not a limit
There are two races around the process limit for sidejob_supervisor,
one is a check then act in `sidejob:is_availble/3'
https://github.com/basho/sidejob/blob/rdb/2.0.1-nhs-2.2.5/src/sidejob.erl#L144
here the ets lookup (check) counter update, and set of full (act) are
all capable of being interleaved. In one attempt to fix I added a
gen_server per worker-ets table and serialised access to the worker
ets with the gen_server, thus turning this group of commands into an
atomic call. This fixed this one race only
The larger issue is a larger window for a
race.
sidejob:is_available/3
will increment the counter for aworker-ets table, when LIMIT is met no more workers should be
created. However, after a worker is created be
sidejob_supervisor
there is a call to
sidejob_worker:update_usage/1
which sets theusage counter in ets to the number of actually created processes
regardless of the number requested and set by
is_available
.Two issues around this race:
There are acquire (
is_available
) and use (start_child
) phasesthat a process can crash between, leaving riak with no more
capacity and no fsms running at all. This is imo serious and should
be addressed. This is essentially a lock with no timeout or unlock.
The limit on the number of processes in the worse case is
N(N+1)/2
sinceLIMIT
processes could set the ETS counter, then1 gets created and set the counter to 1, and then LIMIT-1 update
the ETS counter, and then 1 get created and set the counter to 2,
and LIMIT-2 update the counter, etc etc etc.
I said that I'd tag @paegun for this issue. You done been tagged.
The text was updated successfully, but these errors were encountered: