Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] [python] lightgbm.dask hangs indefinitely after an error #4771

Open
Tracked by #5153
jameslamb opened this issue Nov 5, 2021 · 9 comments
Open
Tracked by #5153

[dask] [python] lightgbm.dask hangs indefinitely after an error #4771

jameslamb opened this issue Nov 5, 2021 · 9 comments

Comments

@jameslamb
Copy link
Collaborator

jameslamb commented Nov 5, 2021

Description

Shortly after dask=2021.10.0 and distributed=2021.10.0 were uploaded to the Anaconda default channels (used in LightGBM's CI), LightGBM's CI job started failing with timeouts (#4769).

Logs suggested that one of the unit tests on lightgbm.dask was hanging indefinitely. Specifically, the test after test_errors.

Reproducible example

Run inside a container with conda available.

docker run \
    --rm \
    --workdir /usr/local/testing \
    -it \
    continuumio/miniconda3:4.10.3 \
    /bin/bash

apt-get update
apt-get install \
    --no-install-recommends \
    -y \
        build-essential

Using dask and distributed v2021.10.0, after a run which produces an error out on the cluster, the next test gets stuck indefinitely.

conda create \
    --name dask-test \
    --override-channels \
    -c conda-forge \
    --yes \
        cloudpickle=2.0.0 \
        cmake=3.21.3 \
        dask=2021.10.0 \
        distributed=2021.10.0 \
        joblib=1.1.0 \
        numpy=1.21.2 \
        pandas=1.3.4 \
        pytest=6.2.4 \
        scikit-learn=1.0.1 \
        scipy=1.7.1

source activate dask-test

git clone \
    --recursive \
    https://github.com/microsoft/LightGBM.git

cd LightGBM

pushd python-package/
    python setup.py install
popd

# this will succeed
pytest -v \
    tests/python_package_test/test_dask.py \
    -k 'test_errors'

# this will succeed
pytest -v \
    tests/python_package_test/test_dask.py \
    -k 'test_training_succeeds_even_if_some_workers_do_not_have_any_data'

# running both tests, the "training_succeeds" one hangs indefinitely
pytest -v \
    tests/python_package_test/test_dask.py \
    -k 'test_errors or test_training_succeeds_even_if_some_workers_do_not_have_any_data'

Downgrading to dask / distributed 2021.9.0 fixes the issue.

source deactivate
conda env remove --name test-dask
conda create \
    --name dask-test \
    --override-channels \
    -c conda-forge \
    --yes \
        cloudpickle=2.0.0 \
        cmake=3.21.3 \
        dask=2021.9.1 \
        distributed=2021.9.1 \
        joblib=1.1.0 \
        numpy=1.21.2 \
        pandas=1.3.4 \
        pytest=6.2.4 \
        scikit-learn=1.0.1 \
        scipy=1.7.1

source activate dask-test
pushd python-package/
    python setup.py install
popd

# running both tests, they succeed!
pytest -v \
    tests/python_package_test/test_dask.py \
    -k 'test_errors or test_training_succeeds_even_if_some_workers_do_not_have_any_data'

Environment info

See commands in reproducible example.

Additional Comments

I suspect that lightgbm's tests are being affected by the bug documented here: dask/distributed#5480.

I think this because timeout errors happen immediately after a test that produces an error, and as of #4159, all of LightGBM's Dask unit tests share the same cluster.

@jameslamb jameslamb changed the title [dask] [python] LightGBM hangs indefinitely after an error [dask] [python] lightgbm.dask hangs indefinitely after an error Nov 5, 2021
@jmoralez
Copy link
Collaborator

jmoralez commented Nov 5, 2021

Thanks for the example @jameslamb, I couldn't reproduce the hang on the workers test, didn't realize it was the consecutive runs that failed. I just ran this locally and looked at the worker logs and see this:

distributed.worker - WARNING - Compute Failed Function: subgraph_callable-f29839b8-aeb0-4760-859c-87277d2d args: ('make-timeseries-7d984faa6c361aaa55a14af47f1a69fe', ([Timestamp('2000-01-31 00:00:00', freq='M'), Timestamp('2000-02-29 00:00:00', freq='M')], 914658368)) kwargs: {} Exception: "Exception('foo')" 

I'd say that's strange to say the least.

@jameslamb
Copy link
Collaborator Author

Sure! I only had the idea of running the two consecutive tests after noticing @gjoseph92's writeup in dask/distributed#5480.

@jmoralez
Copy link
Collaborator

jmoralez commented Nov 5, 2021

Oh I just realized that test_errors raises that exception, I see now it's not strange at all seeing that in the worker logs haha.

@jmoralez
Copy link
Collaborator

jmoralez commented Nov 5, 2021

Adding a client.restart at the end of test_errors seems to fix the issue. Should I try in a PR?

@jameslamb
Copy link
Collaborator Author

Adding a client.restart at the end of test_errors seems to fix the issue. Should I try in a PR?

Yeah!

Since we have evidence that this is a bug in distributed and since Dask maintainers are talking about trying to fix it in the next distributed release (dask/community#197 (comment)), I think it's ok to client.restart() in LightGBM's tests, so that we continue to test against the newest Dask releases as they come out.

I'd support a PR that adds that .restart() and reverts #4770

@gjoseph92
Copy link

@jameslamb do you have any logs from the cluster you could share? Specifically, I'm interested in evidence of dask/distributed#5482, or some other error like it—if you see Reconnecting in logs, then indeed it's probably dask/distributed#5480. I'm not familiar with your tests, but the message @jmoralez posted looks to me like an error within user code in a task, which I don't think should cause this deadlock (if it does, this is a way bigger problem).

But if you're reproducibly showing a deadlock with 2021.10.0 but not 2021.09.0, that's very helpful information.

@jameslamb
Copy link
Collaborator Author

do you have any logs from the cluster you could share?

Let me try running the repro from this issue's description with more verbose logging!

To be honest, I just jumped to conclusions when I saw the combination of these factors:

  • tests started failing as soon as we started getting distributed 2021.10.0
  • the specific issue was "the first test run AFTER one that produces errors on a worker hangs indefinitely"
  • downgrading back to 2021.9.1, no issues in tests
  • lightgbm.dask hasn't been changed in a few months, and only uses battle-tested parts of the Dask API:
    from dask import delayed
    from dask.array import Array as dask_Array
    from dask.array import from_delayed as dask_array_from_delayed
    from dask.bag import from_delayed as dask_bag_from_delayed
    from dask.dataframe import DataFrame as dask_DataFrame
    from dask.dataframe import Series as dask_Series
    from dask.distributed import Client, default_client, wait

@jmoralez
Copy link
Collaborator

jmoralez commented Nov 5, 2021

@gjoseph92 the following reproduces what we're seeing in our tests:

from dask.datasets import timeseries
from dask.distributed import Client, LocalCluster


def raise_exc(part):
    raise Exception


if __name__ == '__main__':
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    with Client(cluster) as client:
        ts = timeseries()
        ts = ts.map_partitions(raise_exc, meta={})
        try:
            ts.compute()
        except:
            pass
    with Client(cluster) as client:
        X = timeseries().compute()
        print(X.shape)

@gjoseph92
Copy link

@jmoralez thanks for the excellent reproducer! Turns out this was a separate issue dask/distributed#5497. Your code is very helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants