What are the use cases for `Scheduler.handle_missing_data`?
#6445
Labels:
- deadlock: The cluster appears to not make any progress
- discussion: Discussing a topic with no specific actions yet
- stability: Issue or feature related to cluster stability (e.g. deadlock)
Introduction
The scheduler offers a handler `handle_missing_data` that is called in one place on the worker, during the response handling of `gather_dep`.

This handling has been known to cause stability and deadlocking issues. In recent refactoring attempts the `Worker.find_missing` functionality has been flagged as a source of complexity and I'm wondering if we should remove this entire system in favor of a much simpler mechanism.
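To make the call site concrete, here is a hypothetical, heavily simplified sketch of the worker-side flow described above. The names `handle_gather_dep_response` and `report_missing` are illustrative stand-ins, not the actual distributed API; the point is only that the handler is triggered when a peer answers `gather_dep` without some of the requested keys.

```python
def handle_gather_dep_response(response: dict, requested_keys: set,
                               errant_worker: str, report_missing) -> set:
    """Return the keys the peer did not send and report each as missing.

    In the real worker this is roughly where `handle_missing_data` ends up
    being called: the peer responded, but some requested keys were absent.
    """
    received = set(response.get("data", {}))
    missing = requested_keys - received
    for key in missing:
        # The worker tells the scheduler that `errant_worker` does not
        # hold `key`, even though who_has claimed it did.
        report_missing(key=key, errant_worker=errant_worker)
    return missing

# Usage: collect the reports in a list instead of sending them over the wire.
reports = []
missing = handle_gather_dep_response(
    response={"data": {"x": 1}},           # peer only returned "x"
    requested_keys={"x", "y"},
    errant_worker="tcp://10.0.0.2:40000",
    report_missing=lambda **kw: reports.append(kw),
)
```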
Looking at the scheduler side, reading code and docstring, there are a couple of different use cases where this handler is supposed to be called. It always assumes that the signal originates from a worker, called `worker`, which reports that another worker `errant_worker` is not in possession of a key `key`. The docstring explains that this can occur in two situations:
1. `worker` is operating on stale data, i.e. `worker.TaskState.who_has` / `Worker.has_what` includes stale information, such that after requesting `Worker.get_data` on `errant_worker` it returns a response without data for `key`.
2. `errant_worker` is indeed dead, which is likely inferred by `worker` because it encountered a network error.
If the scheduler already knows about the death of `errant_worker`, this request is ignored. Otherwise, the scheduler accepts this signal as the ultimate truth and removes all state indicating that `key` was on `errant_worker`. There is some transition logic in place to heal and/or raise the problem as is appropriate.
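A toy model of this scheduler-side behavior may help; this is an assumption-laden sketch, not the real `Scheduler.handle_missing_data`, which carries much more transition and logging machinery.

```python
class ToyScheduler:
    """Minimal stand-in for the scheduler state relevant to this handler."""

    def __init__(self):
        self.workers = set()        # workers the scheduler believes are alive
        self.who_has = {}           # key -> set of workers holding a replica
        self.to_recompute = set()   # keys left without any replica

    def handle_missing_data(self, key, errant_worker):
        # If the scheduler already knows the worker is gone, ignore the report.
        if errant_worker not in self.workers:
            return
        # Otherwise, trust the signal and drop the replica record.
        holders = self.who_has.get(key, set())
        holders.discard(errant_worker)
        if not holders:
            # No replica left: the healing/transition logic would
            # reschedule the task from here.
            self.who_has.pop(key, None)
            self.to_recompute.add(key)

s = ToyScheduler()
s.workers = {"a", "b"}
s.who_has = {"x": {"a", "b"}, "y": {"b"}}
s.handle_missing_data("x", "a")      # replica on "b" remains
s.handle_missing_data("y", "b")      # last replica gone -> recompute
s.handle_missing_data("y", "dead")   # unknown worker -> report ignored
```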
The circumstances under which a worker actually calls this handler are a bit harder to describe. I am much more interested in whether or not this handler on the scheduler side makes sense.
Investigating use case 1.) - The worker is operating on stale data. There can be two reasons for this:
1.a) The data is indeed stale. Something happened that caused us to request `key` to be forgotten on `errant_worker` but we didn't tell `worker`. I believe the only valid reason for this is AMM requesting `worker` to forget the data. In all other instances where we would properly release keys due to transitions, `worker` would eventually be asked to release the key as well.

I would argue that this case is not something a worker should escalate to the scheduler. If anything, the worker should ask the scheduler for the most recent, accurate information and try again. Particularly with the changes proposed in #6435, where the worker indeed removes no longer accurate entries, this use case feels obsolete.
1.b) The scheduler has faulty information.

I am struggling to come up with a situation where the scheduler is indeed operating on false information, since workers are never allowed to forget data without the scheduler telling them so. It is possible for workers to have more data than the scheduler knows about (e.g. because messages were not yet delivered), but it's hard to come up with a scenario where the worker lost data.

This could obviously be possible if a worker were to lose its data due to external factors, e.g. the disk is lost or a user deliberately removes keys from `Worker.data`. I would argue that in this situation `errant_worker` should notice that an external party is messing with its state and it should escalate to the scheduler, if we even want to deal with this edge case.
2.) `errant_worker` is indeed dead. The scheduler would eventually notice because `Scheduler.stream_comms[errant_worker]` would be aborted (at the latest after `distributed.comm.timeouts.tcp` seconds).

The only reason why `worker` should ping the scheduler on this event is if we want to accelerate recovery. From a consistency perspective, this should not be required.
Conclusion
I am starting to believe that `worker` should simply remove `errant_worker` from its internal bookkeeping and rely on the scheduler to eventually clean up. We would accept that a `worker.TaskState` in state `fetch` is allowed to have an empty `who_has`, and we'd rely on the scheduler to clean this up eventually. The only functionality we'd need on `Worker` is a mechanism to eventually refresh the worker's `who_has` / `has_what` information.

This could be a system that is not connected to the state machine directly and that ensures that all, or subsets of, data locality information (`who_has` / `has_what`) is updated eventually. This generic system would also benefit `busy` tasks.

I believe this reduced coupling would significantly reduce complexity. I'm wondering if I missed a valid use case.
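Such a decoupled refresh mechanism could look roughly like the sketch below. Everything here is hypothetical: `query_scheduler` stands in for the actual RPC, and tasks are modeled as plain dicts rather than the real `TaskState` objects; the point is only that a periodic helper, outside the state machine, overwrites stale `who_has` entries and tolerates empty ones.

```python
def refresh_who_has(fetch_tasks: dict, query_scheduler) -> list:
    """Update stale who_has entries; return keys that still have no holder.

    `fetch_tasks` maps key -> task dict for tasks in `fetch` state.
    `query_scheduler` is a stand-in for an RPC returning key -> holders.
    """
    fresh = query_scheduler(keys=list(fetch_tasks))
    still_unknown = []
    for key, task in fetch_tasks.items():
        task["who_has"] = set(fresh.get(key, ()))
        if not task["who_has"]:
            # An empty who_has is tolerated; the scheduler will eventually
            # reschedule the task or hand us updated locality information.
            still_unknown.append(key)
    return still_unknown

# Usage: "x" had a stale holder, "y" has no holder anywhere yet.
tasks = {"x": {"who_has": {"stale-worker"}}, "y": {"who_has": set()}}
unknown = refresh_who_has(
    tasks,
    query_scheduler=lambda keys: {"x": ["w1", "w2"]},
)
```

The same periodic pass could also retry `busy` tasks, since it already touches every task waiting on remote data.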
cc @gjoseph92 @crusaderky