
Fix regressions in #4651 #4719

Merged
merged 5 commits into dask:main from test_memory on Apr 21, 2021
Conversation

crusaderky
Collaborator

@crusaderky crusaderky commented Apr 20, 2021

Fix regressions introduced in #4651:

  • the failing test_memory test
  • if a worker sends a heartbeat before its SystemMonitor has had a chance to run, metrics["memory"] will be None, which crashes the scheduler (see the sketch below).
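A minimal, hypothetical sketch of that failure mode (not code from this PR); the metric names are the ones quoted in the diff further down:

    # Stand-in for the heartbeat metrics a worker would send before its
    # SystemMonitor has run for the first time.
    metrics = {"memory": None, "spilled_nbytes": 0}
    nbytes = 1024  # bytes the scheduler believes the worker is storing

    try:
        size = metrics["memory"] - nbytes + metrics["spilled_nbytes"]
    except TypeError as exc:
        print(f"scheduler-side crash: {exc}")  # 'NoneType' and 'int'

    # The guard adopted in this PR (quoted below) handles both the None and a
    # transiently negative result:
    size = max(0, (metrics["memory"] or 0) - nbytes + metrics["spilled_nbytes"])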

@crusaderky crusaderky closed this Apr 20, 2021
@crusaderky crusaderky reopened this Apr 20, 2021
crusaderky referenced this pull request Apr 20, 2021
* code cleanup

* code cleanup

* Early exit pattern

* refactor

* cleanup

* fix regression

* annotations

* revert

* backend prototype

* slightly faster both in CPython and Cython

* slightly faster both in CPython and Cython

* polish

* polish

* early exit

* polish

* polish

* backend review

* nonfunctional GUI prototype

* GUI prototype (unpolished)

* tooltip

* refactor

* GUI

* GUI

* GUI

* refactor

* polish

* simpler tooltip

* Reduce spilled size on delitem

* tweak cluster-wide nbytes gauge

* workers tab

* Self-review

* bokeh unit tests

* test SpillBuffer

* Code review

* cython optimizations

* test MemoryState

* test backend

* Remove unnecessary casts uint->sint

* Self-review

* Test edge cases

* fix test failure

* redesign test

* relax maximums

* fix test

* lint

* fix test

* fix test

* fix bar on small screens

* height in em

* larger

* fix flaky test
@pentschev
Member

I confirmed locally that this fixes the problem for me too, thanks so much for the quick fix @crusaderky!

@crusaderky
Collaborator Author

crusaderky commented Apr 21, 2021

Stress test outcome:

  • 77 out of 77 times, test_memory was successful
  • once, an unrelated test failed
  • 22 times, CI fell over very fast, with no log whatsoever, definitely before reaching the test suite

# ws._nbytes is updated at a different time and sizeof() may not be accurate,
# so size may be (temporarily) negative; floor it to zero.
size = max(0, (metrics["memory"] or 0) - ws._nbytes + metrics["spilled_nbytes"])

Collaborator Author
@crusaderky crusaderky Apr 21, 2021

I can't see a way to write a unit test for this short of monkey-patching SystemMonitor?

Member

It seems like that's going to be the only way; another test does exactly that:

@pytest.mark.asyncio
@pytest.mark.parametrize("reconnect", [True, False])
async def test_heartbeat_comm_closed(cleanup, monkeypatch, reconnect):
    with captured_logger("distributed.worker", level=logging.WARNING) as logger:
        async with await Scheduler() as s:

            def bad_heartbeat_worker(*args, **kwargs):
                raise CommClosedError()

            async with await Worker(s.address, reconnect=reconnect) as w:
                # Trigger CommClosedError during worker heartbeat
                monkeypatch.setattr(
                    w.scheduler, "heartbeat_worker", bad_heartbeat_worker
                )
                await w.heartbeat()
                if reconnect:
                    assert w.status == Status.running
                else:
                    assert w.status == Status.closed
    assert "Heartbeat to scheduler failed" in logger.getvalue()

Member

There is distributed.admin.system-monitor.interval, which controls how often the monitor runs. You could set it to an incredibly high value so that it never executes during the test run.
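A minimal sketch (not from this thread) of that config-based approach, assuming dask.config.set accepts the key exactly as written above and a time string for its value:

    import dask

    # Hypothetical: push the system-monitor interval so high that it should
    # never fire during the test; the Scheduler and Worker must be created
    # inside this context for the setting to apply.
    with dask.config.set({"distributed.admin.system-monitor.interval": "10000s"}):
        ...  # construct Scheduler/Worker and run the test body here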


Another approach, without using monkeypatch: you could remove the PeriodicCallback before it is even started. Something like:

@pytest.mark.asyncio
async def test_foo():
    s = Scheduler()
    s.periodic_callbacks["monitor"] = None
    w = Worker(s.address)
    w.periodic_callbacks["monitor"] = None
    await s
    await w
    async with Client(s.address, asynchronous=True):
        ...

Collaborator Author
@crusaderky crusaderky Apr 22, 2021

Setting distributed.admin.system-monitor.interval to a very high value before I create the Scheduler has no effect (I can see data from SystemMonitor.update arriving in the heartbeat).

Setting

s.periodic_callbacks["monitor"] = None
w.periodic_callbacks["monitor"] = None

fails with

        for pc in self.periodic_callbacks.values():
>           pc.stop()
E           AttributeError: 'NoneType' object has no attribute 'stop'

this has no effect:

del s.periodic_callbacks["monitor"]
del w.periodic_callbacks["monitor"]

this has no effect:

pc = PeriodicCallback(lambda: None, 999999999)
s.periodic_callbacks["monitor"] = pc
w.periodic_callbacks["monitor"] = pc
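For reference, a purely hypothetical sketch of the SystemMonitor monkey-patching idea raised earlier in the thread; this is not the test that ended up in the PR, and stubbing SystemMonitor.update this way is an untested assumption:

    import pytest

    from distributed import Scheduler, Worker
    from distributed.system_monitor import SystemMonitor


    @pytest.mark.asyncio
    async def test_heartbeat_before_monitor_runs(monkeypatch):
        # Stub out SystemMonitor.update so no measurements are recorded,
        # simulating a heartbeat sent before the first monitor tick.
        monkeypatch.setattr(SystemMonitor, "update", lambda self: {})
        async with await Scheduler() as s:
            async with await Worker(s.address) as w:
                await w.heartbeat()  # must not crash the scheduler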

@crusaderky crusaderky marked this pull request as ready for review April 21, 2021 10:55
@crusaderky crusaderky changed the title from "WIP fix test_memory" to "Fix regressions in #4651" on Apr 21, 2021
@pentschev
Copy link
Member

Would people be ok with merging this even without a test and leaving a test for a different PR? This would unblock CI in Dask-CUDA and UCX-Py (potentially other RAPIDS projects too).

@fjetter
Copy link
Member

fjetter commented Apr 21, 2021

I'm fine with postponing tests since our CI is also failing hard. @crusaderky your call

@crusaderky
Copy link
Collaborator Author

I need an extra 2-3 hours to cook up a unit test. I'm OK with merging without one.

Member
@jrbourbeau jrbourbeau left a comment

Thanks for your work here @crusaderky. I'm going to merge this as-is to unblock CI, but I left a comment below.

@@ -2379,34 +2383,45 @@ def test_memory():
    ]
    sleep(2)
    assert_memory(s, "managed_spilled", 1, 999)
    # Wait for the spilling to finish. Note that this does not make the test take
    # longer as we're waiting for recent_to_old_time anyway.
    sleep(10)
Member

Is there something more direct we can probe here instead of sleeping for 10 seconds?

Collaborator Author

Not really, because the unmanaged memory is very volatile, so we don't know exactly how many keys are going to be spilled. Also, as noted, it doesn't slow the test down.

@jrbourbeau jrbourbeau merged commit f492aa7 into dask:main Apr 21, 2021
@crusaderky crusaderky deleted the test_memory branch April 21, 2021 16:02