Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpin dask and distributed for 23.12 development #5627

Merged
merged 7 commits into from
Oct 30, 2023

Conversation

galipremsagar
Copy link
Contributor

This PR relaxes dask and distributed versions pinning for 23.12 development.

xref: rapidsai/cudf#14320

@galipremsagar galipremsagar added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 23, 2023
@galipremsagar galipremsagar self-assigned this Oct 23, 2023
@galipremsagar galipremsagar requested review from a team as code owners October 23, 2023 23:31
@github-actions github-actions bot added conda conda issue Cython / Python Cython or Python issue ci labels Oct 23, 2023
@csadorf
Copy link
Contributor

csadorf commented Oct 26, 2023

/merge

@csadorf
Copy link
Contributor

csadorf commented Oct 27, 2023

@galipremsagar This now in conflict with dask-cuda: ImportError: cannot import name 'AnyKeyFile' from 'distributed.spill' Do we need to rebuild to build against cudf with Arrow 13?

@galipremsagar
Copy link
Contributor Author

@galipremsagar This now in conflict with dask-cuda: ImportError: cannot import name 'AnyKeyFile' from 'distributed.spill' Do we need to rebuild to build against cudf with Arrow 13?

Logs indicate an older version of dask is being picked when performing a pip install. I re-ordered the install command that should fix it.

@galipremsagar
Copy link
Contributor Author

@csadorf We have encountered an issue with dask-cuda wheels. We are actively working on resolving it so that we can unblock cuml CI. Until then cuml CI would be blocked.

@jakirkham jakirkham closed this Oct 27, 2023
@jakirkham jakirkham reopened this Oct 27, 2023
@jakirkham
Copy link
Member

Closing and reopening for CI

ci/test_wheel.sh Outdated Show resolved Hide resolved
@pentschev
Copy link
Member

I see Dask errors (e.g., here) as below:

==================================== ERRORS ====================================
________ ERROR at teardown of test_validate_dask_array[4-False-12-8-24] ________

fut = <Future cancelled>, timeout = 0.7971904277801514

    async def wait_for(fut, timeout):
        """Wait for the single Future or coroutine to complete, with timeout.
    
        Coroutine will be wrapped in Task.
    
        Returns result of the Future or coroutine.  When a timeout occurs,
        it cancels the task and raises TimeoutError.  To avoid the task
        cancellation, wrap it in shield().
    
        If the wait is cancelled, the task is also cancelled.
    
        This function is a coroutine.
        """
        loop = events.get_running_loop()
    
        if timeout is None:
            return await fut
    
        if timeout <= 0:
            fut = ensure_future(fut, loop=loop)
    
            if fut.done():
                return fut.result()
    
            await _cancel_and_wait(fut, loop=loop)
            try:
                return fut.result()
            except exceptions.CancelledError as exc:
                raise exceptions.TimeoutError() from exc
    
        waiter = loop.create_future()
        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
        cb = functools.partial(_release_waiter, waiter)
    
        fut = ensure_future(fut, loop=loop)
        fut.add_done_callback(cb)
    
        try:
            # wait until the future completes or the timeout
            try:
                await waiter
            except exceptions.CancelledError:
                if fut.done():
                    return fut.result()
                else:
                    fut.remove_done_callback(cb)
                    # We must ensure that the task is not running
                    # after wait_for() returns.
                    # See https://bugs.python.org/issue32751
                    await _cancel_and_wait(fut, loop=loop)
                    raise
    
            if fut.done():
                return fut.result()
            else:
                fut.remove_done_callback(cb)
                # We must ensure that the task is not running
                # after wait_for() returns.
                # See https://bugs.python.org/issue32751
                await _cancel_and_wait(fut, loop=loop)
                # In case task cancellation failed with some
                # exception, we should re-raise it
                # See https://bugs.python.org/issue40607
                try:
>                   return fut.result()
E                   asyncio.exceptions.CancelledError

/opt/conda/envs/test/lib/python3.10/asyncio/tasks.py:456: CancelledError

The above exception was the direct cause of the following exception:

    @pytest.fixture(scope="module")
    def cluster():
    
        cluster = LocalCUDACluster(protocol="tcp", scheduler_port=0)
        yield cluster
>       cluster.close()

It took me a lot of time to figure out what was happening but I did so recently (details in the description of rapidsai/dask-cuda#1260). However, after that PR I've changed that to be less intrusive to non-CI workloads, and to do so we pass worker_class=IncreasedCloseTimeoutNanny when launching clusters that often timeout in CI.

Those shouldn't be related to unpinning Dask as they were already happening for some time in Dask-CUDA (perhaps since August or even before that) was that the case in cuML as well @dantegd @csadorf ?

ci/test_wheel.sh Outdated Show resolved Hide resolved
ci/test_wheel.sh Outdated Show resolved Hide resolved
@dantegd
Copy link
Member

dantegd commented Oct 30, 2023

@pentschev you're correct, we've seen these erros for quite a while. They became way more prominent when we have errors like cuda context creation (for example #5598) but even then they didn't fully go away. We can work on a similar fix to what you did in dask-cuda in a follow up PR.

@pentschev
Copy link
Member

We can work on a similar fix to what you did in dask-cuda in a follow up PR.

That seems appropriate in cuML then. I opened #5636 to increase the timeout, I think I covered all relevant places.

@rapids-bot rapids-bot bot merged commit 79aa490 into rapidsai:branch-23.12 Oct 30, 2023
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci conda conda issue Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants