Q28 fails in automated nightly runs #147

Open
beckernick opened this issue Dec 8, 2020 · 3 comments

Comments

@beckernick
Member

This is the same error we had in #140, which was in theory resolved by rapidsai/cuml#3152. cc @dantegd @VibhuJawa

28
Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "queries/q28/tpcx_bb_query_28.py", line 341, in main
    client=client, train_data=train_data, test_data=test_data
  File "queries/q28/tpcx_bb_query_28.py", line 285, in post_etl_processing
    model.fit(X_train, y_train)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/naive_bayes/naive_bayes.py", line 190, in fit
    client=self.client)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in reduce
    workers = [(first(who_has[m.key]), m) for m in futures]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in <listcomp>
    workers = [(first(who_has[m.key]), m) for m in futures]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/toolz/itertoolz.py", line 376, in first
    return next(iter(seq))
StopIteration
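
For readers following the traceback: the bare StopIteration comes from calling `first` on an empty sequence. The sketch below reproduces that mechanism without a cluster. The `who_has` contents, the key names, and the `locate` helper are hypothetical illustrations, and the idea that a key ended up with no holding worker is an assumption about the failure, not something confirmed in this thread.

```python
def first(seq):
    # Same body as toolz.itertoolz.first, as shown in the traceback.
    return next(iter(seq))


# Hypothetical mapping of result key -> workers holding that result.
who_has = {
    "fit-0": ("tcp://10.0.0.1:40001",),
    "fit-1": (),  # assumed: no worker holds this key (task errored or data lost)
}


def locate(who_has, keys):
    """Pair each key with its first holder, raising a descriptive error if none."""
    located = []
    for key in keys:
        holders = who_has.get(key, ())
        if not holders:
            raise RuntimeError(f"no worker holds {key!r}; check the worker logs")
        located.append((first(holders), key))
    return located


try:
    # Mirrors the shape of the list comprehension in the traceback.
    [(first(who_has[key]), key) for key in who_has]
except StopIteration:
    print("bare StopIteration from an empty holder list")
```

A guard like `locate` would not fix the underlying problem, but it would replace the opaque StopIteration with a message naming the key that went missing.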
@beckernick changed the title from "Q28 UCX fails in 2020-12-08 nightlies" to "Q28 fails in 2020-12-08 nightlies" on Dec 9, 2020
@beckernick
Member Author

beckernick commented Dec 9, 2020

I cannot consistently reproduce this (though others have seen it as well). There may be something subtle happening with the Naive Bayes classifier.

@VibhuJawa
Member

@beckernick, could you fetch the logs from the workers the next time you see this error? I suspect they might have more context.
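
One way to capture those logs before the cluster is torn down is sketched below. It relies on `Client.get_worker_logs()` from `dask.distributed`, which returns a mapping of worker address to log records; the records are assumed here to be `(level, message)` pairs. The `dump_worker_logs` helper and the output layout are hypothetical, not part of the benchmark harness.

```python
from pathlib import Path


def dump_worker_logs(client, out_dir):
    """Write each worker's recent log records to its own file under out_dir.

    `client` is expected to be a dask.distributed Client (or anything with a
    compatible get_worker_logs method).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for address, records in client.get_worker_logs().items():
        # Worker addresses look like tcp://host:port; make them filename-safe.
        safe = address.replace("://", "_").replace(":", "_").replace("/", "_")
        path = out / f"{safe}.log"
        with path.open("w") as fh:
            for level, message in records:  # assumed record shape
                fh.write(f"{level}: {message}\n")
        written.append(path)
    return written
```

Calling this from the `except` path that prints "Encountered Exception while running query" would preserve the worker-side context even when the nightly run moves on to the next query.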

@beckernick
Member Author

beckernick commented Dec 14, 2020

Lost the logs from the failure in the automated nightly run, unfortunately. Could not reproduce this with 100 consecutive runs of Q28. Will be triggering a few long-running tests to see if I can grab them.

I believe I saw a CUSPARSE_STATUS_NOT_INITIALIZED in the past causing the StopIteration, which on its own makes me wonder whether there's some odd behavior going on with the CUDA runtime.

However, this query clearly succeeds repeatedly on its own. Perhaps there's some unexpected interaction occurring somewhere during the full sweep.
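
Since the failure seems to appear only in the full sweep, a small harness that repeats the whole sweep in order and stops at the first failure may help pin it down. This is a minimal sketch: `run_sweep` and the `(name, callable)` query list are hypothetical and not part of the existing `xbb_tools` utilities.

```python
import traceback


def run_sweep(queries, repeats=1):
    """Run (name, callable) pairs in order, `repeats` times.

    Returns a dict describing the first failure, or None if every run passed.
    """
    for sweep in range(repeats):
        for name, run in queries:
            try:
                run()
            except Exception:  # StopIteration is an Exception subclass
                return {
                    "sweep": sweep,
                    "query": name,
                    "traceback": traceback.format_exc(),
                }
    return None
```

Recording which sweep and which query failed, together with the queries that ran before it, would show whether Q28 only breaks after a particular predecessor.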

@beckernick changed the title from "Q28 fails in 2020-12-08 nightlies" to "Q28 fails in automated nightly runs" on Dec 16, 2020