Improve handling of tasks failing/succeeding on unexpected workers #6956

hendrikmakait · 2022-08-25T13:49:35Z

This issue is a follow-up on an offline discussion centering around #6884 (comment) and #6939 with @crusaderky and @gjoseph92 where we encountered some issues with the way we currently handle tasks failing/succeeding on unexpected workers.

In transition_processing_memory, when encountering an unexpected worker that has successfully completed a task, we send a cancel-compute message to the worker that was supposed to process this task assuming that the unexpected worker would hold onto our result for us.

distributed/distributed/scheduler.py

Lines 1987 to 2001 in c15a10e

    
           if ws != ts.processing_on:  # someone else has this task 
        
               logger.info( 
        
                   "Unexpected worker completed task. Expected: %s, Got: %s, Key: %s", 
        
                   ts.processing_on, 
        
                   ws, 
        
                   key, 
        
               ) 
        
               assert ts.processing_on 
        
               worker_msgs[ts.processing_on.address] = [ 
        
                   { 
        
                       "op": "cancel-compute", 
        
                       "key": key, 
        
                       "stimulus_id": stimulus_id, 
        
                   } 
        
               ]

However, this series of events should only be triggered in a chain of events where free-keys message for that task should already be on its way to the unexpected worker, causing it to remove the result. This leaves us with no worker holding on to the task/result.

While we do suspect this chain of events to unfold, we need a test verifying it. If this is indeed the case, we should not send the cancel-compute message to ts.processing_on, but rather make sure that we clean up anything related to the work happening on the unexpected worker.

In stimulus_task_erred, we don't check whether the task erred on an unexpected worker at all and run through the retry-logic. In #6884 (comment), @fjetter suggested instead ignoring that the task erred on an unexpected worker and continuing on as if nothing happened. This general approach is reasonable as the unexpected worker suggests something faulty going on. We might need to do additional work to ensure that the unexpected worker is reset, e.g., sending a free-keys message to the unexpected workers and we want to log a warning. To ensure everything is reset correctly, we need a test that verifies the intended behavior.

The text was updated successfully, but these errors were encountered:

gjoseph92 · 2022-08-25T15:31:33Z

Especially since we know #6390 is a problem, I'd like to see this happen (ignore task-finished or task-erred from unexpected workers, and instead tell them to free-keys).

It's not quite the same case as what's mentioned here, but we should think about it as well. This behavior may not be good either:

distributed/distributed/scheduler.py

Lines 1982 to 1985 in c15a10e

    
           ws = self.workers.get(worker) 
        
           if ws is None: 
        
               recommendations[key] = "released" 
        
               return recommendations, client_msgs, worker_msgs

As things stand right now, it's entirely possible to get messages from removed workers, so rather than releasing the entire task, we should probably just ignore the spurious message.

Basically, whether the worker exists or not, let's just ignore its message if it's not the one we expected to be talking to?

hendrikmakait mentioned this issue Aug 25, 2022

stimulus_task_erred is no-op when task-erred is not coming from the worker the task is actually processing on #6941

Closed

2 tasks

This was referenced Aug 25, 2022

Remove transition_no_worker_memory #6960

Open

cancelled/resumed->long-running transitions #6916

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of tasks failing/succeeding on unexpected workers #6956

Improve handling of tasks failing/succeeding on unexpected workers #6956

hendrikmakait commented Aug 25, 2022 •

edited

Loading

gjoseph92 commented Aug 25, 2022

Improve handling of tasks failing/succeeding on unexpected workers #6956

Improve handling of tasks failing/succeeding on unexpected workers #6956

Comments

hendrikmakait commented Aug 25, 2022 • edited Loading

gjoseph92 commented Aug 25, 2022

hendrikmakait commented Aug 25, 2022 •

edited

Loading