distributed.comm.tests.test_ucx_config.test_ucx_config_w_env_var
flaky
#5229
Comments
Thanks for reporting, @fjetter. I'll see if I can reproduce it and fix it on my end. |
It looks like |
It looks like there are some other things happening in that CI build. In particular
|
I would need more than five lines of traceback. My first guess would be that this is a version mismatch, but I need more context. |
You can take a look at the gpuCI build link that @jrbourbeau posted above, specifically lines 1699-1751. But indeed there's a mismatched version error starting at line 1820 of the same build. Maybe this was just a temporary failure. @jrbourbeau, could you confirm whether it has happened again since that day? |
I don't recall seeing this failure recently, but I also haven't been tracking too closely. I'll keep an eye out over the next few days. |
This time with a different exception, over in #5441:

    async def connect(self, address: str, deserialize=True, **connection_args) -> UCX:
        logger.debug("UCXConnector.connect: %s", address)
        ip, port = parse_host_port(address)
        init_once()
        try:
            ep = await ucx_create_endpoint(ip, port)
    >   except (ucp.exceptions.UCXCloseError, ucp.exceptions.UCXCanceled,) + (
            getattr(ucp.exceptions, "UCXConnectionReset", ()),
            getattr(ucp.exceptions, "UCXNotConnected", ()),
        ):
    E   TypeError: catching classes that do not inherit from BaseException is not allowed
|
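For context on the `TypeError` in the traceback above: in Python 3 every element of the tuple given to `except` must be an exception class, so the `()` default that `getattr` returns for a missing attribute ends up as an invalid element. Below is a minimal sketch of one way to build the tuple so that missing names are simply skipped. Note that `fake_exceptions` and `optional_exceptions` are hypothetical stand-ins made up for illustration; this is not the real `ucp.exceptions` module, nor necessarily the fix that was applied in Distributed.

```python
import types


class UCXCloseError(Exception):
    pass


class UCXCanceled(Exception):
    pass


# Hypothetical stand-in for a module that lacks the newer exception classes
# (assumption: this mimics an older install that does not define them).
fake_exceptions = types.SimpleNamespace(
    UCXCloseError=UCXCloseError,
    UCXCanceled=UCXCanceled,
)


def optional_exceptions(module, *names):
    """Return a flat tuple of the exception classes that exist on ``module``."""
    found = (getattr(module, name, None) for name in names)
    return tuple(exc for exc in found if exc is not None)


def broken_catch():
    # Same pattern as in the traceback: a missing attribute contributes ``()``
    # as an element of the tuple, which is not an exception class.
    try:
        raise UCXCanceled("cancelled")
    except (fake_exceptions.UCXCloseError, fake_exceptions.UCXCanceled) + (
        getattr(fake_exceptions, "UCXConnectionReset", ()),
    ):
        return "caught"


def safe_catch():
    # Missing names are dropped, so the tuple only holds exception classes.
    catchable = (
        fake_exceptions.UCXCloseError,
        fake_exceptions.UCXCanceled,
    ) + optional_exceptions(fake_exceptions, "UCXConnectionReset", "UCXNotConnected")
    try:
        raise UCXCanceled("cancelled")
    except catchable:
        return "caught"
```

With this stand-in, `safe_catch()` handles the exception, whereas `broken_catch()` fails with the same `TypeError` as in the traceback, because the `except` tuple is validated as soon as any exception reaches it.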
@fjetter have you seen this more recently as well? I can't reproduce this locally, and that build was from a time when UCX-Py may have been broken, which was already fixed in rapidsai/ucx-py#801. With that said, I believe this should not be an issue anymore, but please ping me again if it does show up. |
Bumping this issue because I ran into a timeout on one of my gpuCI runs which seems to have been caused by this test:
|
Also saw the same thing as Charles here |
This is failing nearly every time now. I'm xfailing it for the time being. |
rapidsai/ucx-py#994 will hopefully fix the root of this issue. |
After some debugging of Distributed tests with UCX, it was observed that `exchange_peer_info` sometimes hangs indefinitely, specifically when executing `stream_recv` on the client side. The cause is unknown, but it is believed to be messages getting lost, either because multiple stream messages are being transferred simultaneously among various endpoints or because the receiving end takes too long to launch `stream_recv`; see #509 for a similar issue related to the stream API. Adding a timeout doesn't allow recovery, but it at least allows a UCX-Py client to retry upon failure to establish the endpoint. This change seems to resolve dask/distributed#5229; at least it isn't reproducible locally with this change.

Additionally, do a roundtrip message transfer for `test_send_recv_am`, which should resolve #797. That failure seems to be caused by checking received messages too early, before they are actually received by the listener. A roundtrip ensures the client receives the reply and thus prevents us from checking for a transfer that hasn't completed yet.

Also ensure that the listener is closed before validating `test_close_callback` conditions, which was also flaky.

Finally, ensure we close the loop in the test fixture, thus preventing `DeprecationWarning`s from pytest-asyncio, which currently closes unclosed event loops but will stop doing that in future releases.

Closes #797

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Lawrence Mitchell (https://github.com/wence-)

URL: #994
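The timeout-and-retry idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `connect_with_retry` and `make_flaky_endpoint_factory` are hypothetical names, the hang is simulated with a long `asyncio.sleep`, and none of this reflects the actual UCX-Py implementation or API.

```python
import asyncio


async def connect_with_retry(create_endpoint, *, timeout=5.0, retries=3):
    """Await ``create_endpoint()`` with a timeout, retrying if it hangs.

    ``create_endpoint`` is a hypothetical zero-argument coroutine function.
    The timeout does not recover the hung attempt; it only lets us give up
    on that attempt and start a fresh one.
    """
    last_error = None
    for _ in range(retries):
        try:
            return await asyncio.wait_for(create_endpoint(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_error = exc  # this attempt was cancelled; try again
    raise ConnectionError(
        f"could not establish endpoint after {retries} attempts"
    ) from last_error


def make_flaky_endpoint_factory(hang_first_n):
    """Build a fake endpoint factory whose first ``hang_first_n`` calls hang."""
    calls = {"count": 0}

    async def create_endpoint():
        calls["count"] += 1
        if calls["count"] <= hang_first_n:
            await asyncio.sleep(3600)  # simulates the peer-info exchange hanging
        return f"endpoint-{calls['count']}"

    return create_endpoint, calls


if __name__ == "__main__":
    factory, calls = make_flaky_endpoint_factory(hang_first_n=1)
    endpoint = asyncio.run(connect_with_retry(factory, timeout=0.05, retries=3))
    print(endpoint, calls["count"])
```

In this sketch the first attempt times out and is cancelled, and the second attempt returns normally. If every attempt hangs, the caller gets a `ConnectionError` instead of blocking forever.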
I noticed that the UCX test
distributed.comm.tests.test_ucx_config.test_ucx_config_w_env_var
occasionally fails. E.g. https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/234/CUDA_VER=11.2,LINUX_VER=ubuntu18.04,PYTHON_VER=3.8,RAPIDS_VER=21.10/testReport/junit/distributed.comm.tests/test_ucx_config/test_ucx_config_w_env_var/
Traceback
cc @dask/gpu