Task state validation failure for fetch with who_has #6147
This failed again yesterday in CI. Copying the story here for easier reading:
[("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'compute-task',
'compute-task-1650782205.654249',
1650782205.699733),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'released',
'waiting',
'waiting',
{"('rechunk-split-06fee79f9945080fdc867fb46044dc51', 664)": 'fetch'},
'compute-task-1650782205.654249',
1650782205.6998458),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'waiting',
'ready',
'ready',
{},
'ensure-communicating-1650782205.700746',
1650782205.7857978),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'ready',
'executing',
'executing',
{},
'compute-task-1650782202.6983402',
1650782205.806335),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'put-in-memory',
'compute-task-1650782202.6983402',
1650782205.836979),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'executing',
'memory',
'memory',
{"('rechunk-split-06fee79f9945080fdc867fb46044dc51', 819)": 'executing'},
'compute-task-1650782202.6983402',
1650782205.8370361),
('free-keys',
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",),
'processing-released-1650782206.635477',
1650782206.9505332),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'release-key',
'processing-released-1650782206.635477',
1650782206.950552),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'memory',
'released',
'released',
{"('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)": 'forgotten'},
'processing-released-1650782206.635477',
1650782206.950907),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'released',
'forgotten',
'forgotten',
{},
'processing-released-1650782206.635477',
1650782206.9509299),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'ensure-task-exists',
'released',
'compute-task-1650782206.894887',
1650782206.959998),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'released',
'fetch',
'fetch',
{},
'compute-task-1650782206.894887',
1650782206.9600902)]
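For anyone reading along: each story entry is a tuple whose first element is either a task key or an event name (like 'free-keys'), and whose last two elements are the stimulus ID and a timestamp. A minimal sketch (an illustration only, not the distributed API) of pulling out all entries that mention one key from a pasted story like the one above:

```python
# Illustrative helper (not part of distributed): filter a pasted story
# for entries that mention a given key anywhere in the tuple, including
# inside nested tuples (free-keys) and dicts (transition recommendations).
def story_for_key(story, key):
    def mentions(entry, key):
        for field in entry:
            if field == key:
                return True
            if isinstance(field, (tuple, list)) and key in field:
                return True
            if isinstance(field, dict) and key in field:
                return True
        return False
    return [entry for entry in story if mentions(entry, key)]


# Shortened stand-in keys for illustration:
story = [
    ("('rechunk-merge', 0, 166)", 'compute-task',
     'compute-task-1650782205.654249', 1650782205.699733),
    ('free-keys', ("('rechunk-merge', 0, 166)",),
     'processing-released-1650782206.635477', 1650782206.9505332),
    ("('other-key', 1)", 'release-key',
     'processing-released-1650782206.6', 1650782206.95),
]
print(len(story_for_key(story, "('rechunk-merge', 0, 166)")))  # 2
```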
This also occurred in the same test run:
2022-04-24 06:36:48,925 - distributed.worker - INFO - Starting Worker plugin kill
2022-04-24 06:36:48,926 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:62207
2022-04-24 06:36:48,927 - distributed.worker - INFO - -------------------------------------------------
2022-04-24 06:36:48,931 - distributed.core - INFO - Starting established connection
2022-04-24 06:36:48,936 - distributed.nanny - WARNING - Restarting worker
2022-04-24 06:36:49,081 - distributed.stealing - ERROR - 'tcp://127.0.0.1:62227'
Traceback (most recent call last):
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 242, in move_task_request
self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://127.0.0.1:62227'
2022-04-24 06:36:49,084 - distributed.utils - ERROR - 'tcp://127.0.0.1:62227'
Traceback (most recent call last):
File "/Users/runner/work/distributed/distributed/distributed/utils.py", line 693, in log_errors
yield
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 456, in balance
maybe_move_task(
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 362, in maybe_move_task
self.move_task_request(ts, sat, idl)
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 242, in move_task_request
self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://127.0.0.1:62227'
2022-04-24 06:36:49,085 - tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x13dd740d0>>
Traceback (most recent call last):
File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 456, in balance
maybe_move_task(
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 362, in maybe_move_task
self.move_task_request(ts, sat, idl)
File "/Users/runner/work/distributed/distributed/distributed/stealing.py", line 242, in move_task_request
self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://127.0.0.1:62227'
ensure-task-exists is new to me. @crusaderky it looks like you might also have some familiarity here. The thing that confuses me is this shift:
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'released',
'forgotten',
'forgotten',
{},
'processing-released-1650782206.635477',
1650782206.9509299),
("('rechunk-merge-06fee79f9945080fdc867fb46044dc51', 0, 166)",
'ensure-task-exists',
'released',
'compute-task-1650782206.894887',
1650782206.959998),
This is basically a debug log that prints the state of a task whenever it is encountered as a dependency. The log you posted tells us that the task was properly, cleanly forgotten, i.e. all state should have been purged during the forgotten transition. I suspect
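To illustrate what "cleanly forgotten" means here, a toy model (not the actual WorkerState code): once a task transitions to forgotten, every trace of it should be gone from the worker's bookkeeping, so a later compute-task message has to recreate it from scratch via ensure-task-exists:

```python
# Toy model of the invariant described above (not the real WorkerState):
# transitioning a task to "forgotten" must purge all worker-side state,
# so re-encountering the key later starts from a fresh "released" task.
class ToyWorkerState:
    def __init__(self):
        self.tasks = {}  # key -> state string
        self.data = {}   # key -> in-memory result

    def transition(self, key, state):
        if state == "forgotten":
            # Clean forget: purge every trace of the key.
            self.tasks.pop(key, None)
            self.data.pop(key, None)
        else:
            self.tasks[key] = state

    def ensure_task_exists(self, key):
        # Called when a key shows up as a dependency: recreate if purged.
        if key not in self.tasks:
            self.tasks[key] = "released"
        return self.tasks[key]


ws = ToyWorkerState()
ws.transition("x", "memory")
ws.data["x"] = 42
ws.transition("x", "forgotten")
print(ws.ensure_task_exists("x"))  # released
```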
I can easily reproduce errors running the
Indeed, test_chaos_rechunk can't fail. See my comment #6123 (comment)
I'll work on better exposing bad task validation throughout the test suite
Here is an attempt to make worker-side task validation errors more visible: #6192
Aside from the issue around raising errors (which I think may be fixed above), do folks have thoughts on the actual exception?
I could trace this to a race condition where a task is assigned and released frequently on a worker. The task is in memory for a time but is then released again. Ultimately, the worker receives a request like
"op": "compute-task",
"key": "foo",
"who_has": {"bar": [self.address]}
even though the worker removes self.address from who_has (distributed/worker.py, lines 3184 to 3192 in 198522b), such that it is empty then.
I think we should simply transition the task to missing in this case, which would then cause the scheduler to properly correct the state. I'm currently trying to narrow this behavior down a bit better by reducing a test case.
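A rough sketch of the proposed handling (hypothetical helper name, not the actual worker code): after stripping the worker's own address from who_has, an empty set of peers should yield a recommendation to transition the dependency to missing rather than fetch:

```python
# Hypothetical sketch of the proposed fix, not the actual worker code:
# strip our own address from who_has; if no peers remain, recommend
# "missing" so the scheduler can correct the state, else "fetch".
def recommend_dependency_state(who_has, self_address):
    recommendations = {}
    for key, workers in who_has.items():
        peers = [w for w in workers if w != self_address]
        recommendations[key] = "fetch" if peers else "missing"
    return recommendations


# The situation from the comment above: the only listed holder is ourselves.
who_has = {"bar": ["tcp://127.0.0.1:62207"]}
print(recommend_dependency_state(who_has, "tcp://127.0.0.1:62207"))
# {'bar': 'missing'}
```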
Punting problems like these back to the scheduler and letting it have another try sounds like a good strategy in general.
I dug a bit deeper since something in the logs appeared to be off; in particular, why the key was forgotten in the first place. Using #6161 I could track this down to a transition triggered by
I am a bit confused and worried about why the worker is added (and removed) twice. The chaos code should let the worker connect and, after it dies, disconnect. I don't understand why it reconnects. The reconnection seems to put us into the corrupt state. I think the missing-data events are not related.
https://github.com/dask/distributed/runs/6048720092?check_suite_focus=true
This was found by the new test_chaos_rechunk test. cc @fjetter @gjoseph92