Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_stress_scatter_death #6305

Closed
crusaderky opened this issue May 9, 2022 · 14 comments · Fixed by #6327, #6318 or #6404
Closed

test_stress_scatter_death #6305

crusaderky opened this issue May 9, 2022 · 14 comments · Fixed by #6327, #6318 or #6404
Assignees
Labels
flaky test Intermittent failures on CI. regression

Comments

@crusaderky
Copy link
Collaborator

test_stress_scatter_death has suddenly started hanging very frequently on Windows and MacOSX.
According to https://dask.org/distributed/test_report.html, the potential culprit is either #6217 or #6248.
CC @fjetter

Note: in the test report above, the failures on Windows are marked as a white box instead of red due to #6304.

@crusaderky
Copy link
Collaborator Author

Update: it hangs on Ubuntu too: https://github.com/dask/distributed/runs/6349119823?check_suite_focus=true

It looks like an infinite transition loop (but I may be wrong)

ERROR    asyncio:base_events.py:1744 Exception in callback IOLoop.add_future.<locals>.<lambda>(<Task finishe...meout >300.0s>) at /usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py:688
handle: <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...meout >300.0s>) at /usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py:688>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 688, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 175, in wrapper
    return await method(self, *args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1278, in handle_scheduler
    await self.handle_stream(comm, every_cycle=[self.ensure_communicating])
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
    handler(**merge(extra, msg))
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2013, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2764, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2651, in _transition
    func = self._transitions_table.get((start, cast(str, finish)))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/typing.py", line 1736, in cast
    def cast(typ, val):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pytest_timeout.py", line 241, in handler
    timeout_sigalrm(item, settings.timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pytest_timeout.py", line 409, in timeout_sigalrm
    pytest.fail("Timeout >%ss" % timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Timeout >300.0s

@crusaderky crusaderky assigned crusaderky and unassigned fjetter May 9, 2022
@jakirkham
Copy link
Member

Also seen in this CI run.

@crusaderky
Copy link
Collaborator Author

Replicated on high-powered desktop after 51 runs

@crusaderky
Copy link
Collaborator Author

Downstream of #6318, I found three separate issues:

  1. Infinite transition loop (failed 8 out of 1000 runs)

2022-05-11 11:37:20,905 - distributed.core - ERROR - TransitionCounterMaxExceeded: slowadd-1-17 :: released->missing
  Story:
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.70282', 1652265440.7033925)
    ('slowadd-1-17', 'released', 'fetch', 'fetch', {}, 'compute-task-1652265440.70282', 1652265440.703413)
    ('gather-dependencies', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7034585)
    ('slowadd-1-17', 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7034647)
    ('request-dep', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7036455)
    ('receive-dep', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7087567)
    ('slowadd-1-17', 'put-in-memory', 'ensure-communicating-1652265440.7034416', 1652265440.7087805)
    ('slowadd-1-17', 'flight', 'memory', 'memory', {'slowadd-2-16': 'ready'}, 'ensure-communicating-1652265440.7034416', 1652265440.708784)
    ('slowadd-1-17', 'compute-task', 'compute-task-1652265440.7051446', 1652265440.7092702)
    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)
    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145967)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145994)
    [repeat last line to infinity]
  1. failed has_what validaton (failed 5 out of 1000 runs)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4186, in validate_task
    self.validate_task_released(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4157, in validate_task_released
    assert not any(ts.key in has_what for has_what in self.has_what.values())
AssertionError
  1. Failed data_needed validation (failed 1 out of 1000 runs)

Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4184, in validate_task
    self.validate_task_fetch(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4125, in validate_task_fetch
    assert ts in self.data_needed
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4246, in validate_state
    self.validate_task(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4204, in validate_task
    raise AssertionError(
AssertionError: Invalid TaskState encountered for <TaskState 'ndarray-9e6174fa82b2a0609f4482bf0c7e4d80' fetch>.
Story:
[('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'ensure-task-exists', 'released', 'compute-task-1652266312.2669225', 1652266312.2685668), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'fetch', 'fetch', {}, 'compute-task-1652266312.2669225', 1652266312.2685966), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'release-key', 'worker-close-1652266312.265506', 1652266312.2686384), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'fetch', 'released', 'released', {'ndarray-9e6174fa82b2a0609f4482bf0c7e4d80': 'forgotten'}, 'worker-close-1652266312.265506', 1652266312.2686439), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'forgotten', 'forgotten', {}, 'worker-close-1652266312.265506', 1652266312.2686467), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'ensure-task-exists', 'released', 'compute-task-1652266312.2671275', 1652266312.2687173), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'fetch', 'fetch', {}, 'compute-task-1652266312.2671275', 1652266312.268733)]

I have no evidence to suggest that the last two are recent regressions; they may have been there for a long time but we were not invoking Worker.validate_state before.

@crusaderky
Copy link
Collaborator Author

@fjetter @jrbourbeau @jsignell I think this should be treated as a blocker for the next release

@crusaderky
Copy link
Collaborator Author

Dumping down my understanding so far of the story:

    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)
    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145967)

Let's break it down:

    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)

The key was in memory on the worker.
The scheduler asked the worker to drop it, either from Scheduler.add_keys or Scheduler.add_worker.

    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)

this is saying that the scheduler believes that slowadd-1-17 is in memory on another worker, so it just scheduled either slowadd-2-16 or slowadd-2-17 to run on this worker (the two dependents of slowadd-1-17).
The method call should look like this:

Worker.handle_compute_task(
   key="slowadd-2-17",
   who_has={"slowadd-1-17": [w1, w2, ...], "slowadd-1-18": [w1, w2, ...]},
    ...
)

The same method, at the end, calls self.update_who_has(who_has), which sets ts.who_has = {w1, w2, ...}.

The released bit next to ensure-task-exists is confirming that the task is nowhere to be found in Worker.tasks - e.g. the previous release-key command was completed cleanly.

    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)

This is what baffles me. This should not be possible when stimulus_id="compute-task-...".

Here we just executed Worker.transition_released_fetch(ts); however ts.who_has by now has become empty, so the function returned {ts: "missing"}, [].

However, this is in the same stimulus_id as in ensure-task-exists - meaning there shouldn't have been anything in between. Notably, there should not have been a different handle_compute_task which emptied the who_has.

So either

  • my understanding of the above - particularly the last paragraph - is wrong, or
  • Worker.handle_compute_task passed an empty who_has to begin with - which should never happen? At the time the Scheduler issues the compute command, it should be convinced that all dependencies are in memory.

@crusaderky
Copy link
Collaborator Author

Updates and corrections:

  • update_who_has can't remove elements from who_has, only add them
  • scheduler.py::_task_to_msg() is sending a compute-task command with empty who_has. Unsure if it's called from Scheduler.transition_waiting_processing or from stealing.py. To my understanding this should never happen; however, this should also not be fatal: the dependency should simply transition from fetch to missing.
  • Worker.ensure_task_exists occasionally finds a task in Worker.tasks with state="released". This seems a fairly rare occurrence and not related to this specific test/issue. However, it makes no sense to me, as released should just be a degenerate state in the middle of Worker.transitions, and ensure_task_exists is never called in the middle of one.

@crusaderky
Copy link
Collaborator Author

crusaderky commented May 11, 2022

Found it. The infinite transition loop is a one-liner typo introduced in #6248: https://github.com/dask/distributed/pull/6248/files#r870780929

Downstream of fixing the one-liner, I'm now getting:

7 failures out of 1000:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4281, in validate_task
    self.validate_task_released(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4252, in validate_task_released
    assert not any(ts.key in has_what for has_what in self.has_what.values())

This failure is (deliberately) introduced by #6318; I don't think it's a recent regression.
It's related to the fact I mentioned above that we have released tasks in Worker.tasks, outside of a transitions loop, and I can't think of any valid use case for it.

2 failures out of 1000:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2026, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2795, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2686, in _transition
    recs, instructions = func(ts, *args, stimulus_id=stimulus_id, **kwargs)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2045, in transition_missing_fetch
    assert ts.who_has
AssertionError

This again is caused by handle_compute_task with an empty who_has. We should either crash loudly in handle_compute_task or handle it gracefully and keep the task in missing state.

@crusaderky
Copy link
Collaborator Author

Downstream of #6327, I have only these failures left:

4 out of 1000

  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2044, in handle_compute_task
    assert dep_ts.state != "released", self.story(dep_ts)
AssertionError: [('ndarray-a373b7c1c84bcff2443e589a33117df5', 'ensure-task-exists', 'released', 'compute-task-1652371566.7855935', 1652371566.8192596)]

This is a new assertion, which detects earlier both of the issues listed in my post above.

@crusaderky
Copy link
Collaborator Author

crusaderky commented May 13, 2022

I now have a better understanding of the error above:

 ('scatter-0', 'ensure-task-exists', 'released', 'compute-task-1652440434.4929621', 1652440434.501996),
 ('scatter-0', 'released', 'fetch', 'fetch', {}, 'compute-task-1652440434.4929621', 1652440434.5020301),
 ('gather-dependencies', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5020645),
 ('scatter-0', 'fetch', 'flight', 'flight', {}, 'compute-task-1652440434.4929621', 1652440434.502074),
 ('request-dep', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5090077),
 ('receive-dep', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5544157),
 ('scatter-0', 'put-in-memory', 'compute-task-1652440434.4929621', 1652440434.5544426),
 ('scatter-0', 'flight', 'memory', 'memory', {'slowadd-1-2': 'ready'}, 'compute-task-1652440434.4929621', 1652440434.5544457),

 ('slowadd-1-38', 'ensure-task-exists', 'released', 'compute-task-1652440434.8344328', 1652440434.8419213),
 ('slowadd-1-38', 'released', 'fetch', 'fetch', {}, 'compute-task-1652440434.8344328', 1652440434.841943),
 ('gather-dependencies', 'tcp://127.0.0.1:42613', {'slowadd-1-38'}, 'compute-task-1652440434.8344328', 1652440434.8419714),
 ('slowadd-1-38', 'fetch', 'flight', 'flight', {}, 'compute-task-1652440434.8344328', 1652440434.8419788),
 ('slowadd-1-38', 'flight', 'released', 'cancelled', {}, 'worker-close-1652440434.833075', 1652440434.842025),

 ('slowadd-1-38', 'compute-task', 'cancelled', 'compute-task-1652440434.8352494', 1652440434.8420353),
 ('scatter-3', 'ensure-task-exists', 'released', 'compute-task-1652440434.8352494', 1652440434.8420422),
 ('scatter-0', 'ensure-task-exists', 'memory', 'compute-task-1652440434.8352494', 1652440434.8420436),
 ('slowadd-1-38', 'cancelled', 'waiting', 'resumed', {}, 'compute-task-1652440434.8352494', 1652440434.8420599)]

'slowadd-1-38' is a dependent of 'scatter-0' and 'scatter-3', and a dependency of 'slowadd-2-37' and 'slowadd-2-38'.
What's happening above is that the scheduler asks the worker to compute 'slowadd-1-38', which already exists in cancelled state from a previous attempt to fetch it from tcp://127.0.0.1:42613 (which died in the meantime).
One of its two dependencies, 'scatter-3', is not in memory and needs to be fetched from somewhere else.

The transitions system calls transition_cancelled_waiting, which moves the task to resumed.
The task is never moved OUT of resumed, and crucially transition_released_waiting is never called.
As a direct consequence, 'scatter-3' is never transitioned from released to fetch.

There has been a recent change on this specific mechanism: #6217 @fjetter

@jakirkham
Copy link
Member

Given PR ( #6327 ) says it partially closes this issue, should we reopen or did the PR end up fixing this completely?

@gjoseph92
Copy link
Collaborator

As I understand, #6305 (comment) is catching a real (but rarer) problem that still needs to be fixed. Reopening.

@fjetter
Copy link
Member

fjetter commented May 20, 2022

Should be fixed by #6370

@crusaderky
Copy link
Collaborator Author

Not fixed - it's still failing 0.4% of the time on my desktop host and much more frequently on CI.

Stack trace:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 1990, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2820, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2709, in _transition
    recs, instructions = func(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2014, in transition_missing_fetch
    assert ts.who_has

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment