Q27 intermittent failure in nightly automation #159

Open
beckernick opened this issue Dec 16, 2020 · 12 comments

Comments

@beckernick
Member

This is the intermittent CUBLAS_STATUS_NOT_INITIALIZED error that I thought was in Q28, @VibhuJawa.

Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "queries/q27/tpcx_bb_query_27.py", line 130, in main
    ["review_idx_global_pos", "pr_item_sk", "word", "sentence"]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/core.py", line 255, in sor
t_values
    ignore_index=ignore_index,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 225, in
sort_values
    divisions = quantile_divisions(df, by, npartitions)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 177, in
quantile_divisions
    divisions = _approximate_quantile(df[by], qn).compute()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 279, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 2676, i
n get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1991, i
n gather
    asynchronous=asynchronous,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 832, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 340, in
sync
    raise exc.with_traceback(tb)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 324, in
f
    result[0] = yield future
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1850, i
n _gather
    raise exception.with_traceback(traceback)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/optimization.py", line 963, in
__call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 151, in get
    result = _execute_task(task, cache)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 121, in _execute
_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/utils.py", line 30, in apply
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/core.py", line 5401,
in apply_and_enforce
    df = func(*args, **kwargs)
  File "queries/q27/tpcx_bb_query_27.py", line 63, in ner_parser
    for doc in docs:
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 819, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 253, in pipe
  File "nn_parser.pyx", line 273, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 286, in spacy.syntax.nn_parser.Parser.greedy_parse
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
    y, _ = self.begin_update(X, drop=None)
  File "_parser_model.pyx", line 243, in spacy.syntax._parser_model.ParserModel.begin_update
  File "_parser_model.pyx", line 293, in spacy.syntax._parser_model.ParserStepModel.__init__
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 295, in begin_update
    X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 62, in begin_update
    X, backprop_child = self.child.begin_update(X, drop=0.0)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 76, in begin_update
    output__boc = self.ops.gemm(X__bi, W, trans2=True)
  File "ops.pyx", line 986, in thinc.neural.ops.CupyOps.gemm
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cupy/linalg/_product.py", line 34, i
n dot
    return a.dot(b, out)
  File "cupy/core/core.pyx", line 1412, in cupy.core.core.ndarray.dot
  File "cupy/core/_routines_linalg.pyx", line 418, in cupy.core._routines_linalg.dot
  File "cupy/core/_routines_linalg.pyx", line 503, in cupy.core._routines_linalg.tensordot_core
  File "cupy/cuda/device.pyx", line 47, in cupy.cuda.device.get_cublas_handle
  File "cupy/cuda/device.pyx", line 213, in cupy.cuda.device.Device.cublas_handle.__get__
  File "cupy/cuda/device.pyx", line 200, in cupy.cuda.device.Device._get_handle
  File "cupy/cuda/device.pyx", line 201, in cupy.cuda.device.Device._get_handle
  File "cupy_backends/cuda/libs/cublas.pyx", line 346, in cupy_backends.cuda.libs.cublas.create
  File "cupy_backends/cuda/libs/cublas.pyx", line 350, in cupy_backends.cuda.libs.cublas.create
  File "cupy_backends/cuda/libs/cublas.pyx", line 339, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_NOT_INITIALIZED
@beckernick
Member Author

CUDA version mismatches would be the normal culprit, but I'm not sure how this would only show up intermittently. @jakirkham do you have any thoughts on this?

@jakirkham
Member

cc @anaruse (in case you have thoughts here 🙂)

@anaruse

anaruse commented Dec 16, 2020

Hmm, it might have something to do with the issue below.
cupy/cupy#3935

@VibhuJawa
Member

Based on that comment, they recommend setting the CUPY_GPU_MEMORY_LIMIT="90%" environment variable, since there isn't enough memory left to create the cuBLAS handle.
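
For reference, a minimal sketch of setting that limit on every worker; this assumes client is an existing distributed.Client, and that the variable only takes effect if it is set before CuPy starts allocating on the device:

import os

def set_cupy_memory_limit():
    # Cap CuPy's own pool at 90% of the GPU, leaving headroom for the
    # cuBLAS handle. This must run before CuPy allocates on the device.
    os.environ["CUPY_GPU_MEMORY_LIMIT"] = "90%"

client.run(set_cupy_memory_limit)

In practice it may be safer to export the variable in the worker processes' environment at startup rather than via client.run, so it is guaranteed to be in place before any CuPy allocation.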

QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (If it isn't, that may explain the pool growth we see while using spaCy.)

cc: @beckernick

@jakirkham
Member

jakirkham commented Dec 16, 2020

QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (If it isn't, that may explain the pool growth we see while using spaCy.)

I don't think so (meaning it shouldn't use the pool).

Edit: To expand on this, we would need an API in cuBLAS (that CuPy would then use), which would allow us to specify a chunk of memory to use for the initialization.

@jakirkham
Member

Is there a way for us to trigger cuBLAS initialization early? If so, maybe we can do this as part of Dask-CUDA startup.
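
A rough sketch of how that could hook into worker startup, assuming a distributed WorkerPlugin registered from the client (the plugin class and its warmup body are illustrative, not existing Dask-CUDA behavior):

from distributed import Client
from distributed.diagnostics.plugin import WorkerPlugin

class CublasWarmupPlugin(WorkerPlugin):
    # Illustrative plugin: run a tiny GEMM on each worker at startup so CuPy
    # creates its cuBLAS handle before the RMM pool grows to fill the GPU.
    def setup(self, worker):
        import cupy
        a = cupy.ones((2, 2), dtype="float32")
        a.dot(a)  # goes through cuBLAS, forcing handle creation early

client = Client("scheduler-address:8786")  # placeholder address
client.register_worker_plugin(CublasWarmupPlugin())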

@beckernick
Member Author

Thanks for the pointer, @anaruse @jakirkham. Let's look into triggering this early and/or reserving memory.

@beckernick
Member Author

beckernick commented Feb 19, 2021

@anaruse @jakirkham , do you think there is any downside/risk of initializing a handle to the cuBLAS library context and then essentially "throwing it away" before we do anything else on the cluster?

I.e., running something like this on every Dask worker?

def init_cublas():
    from cupy.cuda import cublas
    cublas.create() # allocates 64MB of GPU memory and returns a handle
    return None

client.run(init_cublas)

@beckernick
Member Author

beckernick commented Feb 19, 2021

Initializing the cuBLAS context beforehand wouldn't necessarily change the allocation dynamics of the workload that causes RMM to grow the pool right up to the edge of total capacity. Without visibility into that chunk of memory, this might still be a risk, right?

Perhaps in combination with CUPY_GPU_MEMORY_LIMIT we may be able to avoid it, but it's not immediately clear to me how that would work.

@jakirkham
Member

I'm not sure about creating a cuBLAS handle like that. The library may expect us to do cleanup, and it's unclear what happens if we skip it.

That said, maybe we can do some warmup step (like matrix multiplication), which would get CuPy to initialize cuBLAS. Admittedly that's a bit hacky, but perhaps workable.
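
A minimal sketch of that warmup, assuming client is the existing distributed.Client for the cluster:

def cublas_warmup():
    import cupy
    # A small matrix multiplication goes through CuPy's cuBLAS path
    # (dot -> tensordot_core -> get_cublas_handle), so the handle gets
    # created while plenty of GPU memory is still free.
    a = cupy.ones((2, 2), dtype="float32")
    return float((a @ a).sum())

client.run(cublas_warmup)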

@anaruse may have a better suggestion

@jakirkham
Member

jakirkham commented Mar 10, 2021

I poked around the CuPy code and think something like this might work. I should add that CuPy takes care of cleaning up the handle in this case:

In [1]: import cupy

In [2]: cupy.cuda.device.get_cublas_handle()
Out[2]: 94517515719584

In [3]: cupy.cuda.device.get_cublas_handle()
Out[3]: 94517515719584

The first run is a bit slow (as it allocates the handle), but the second is faster (as the handle is cached). Note that the pointer, returned as a Python int, is the same in both cases.

We could call this with client.run or similar to make sure this is set up on all workers.
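
For example, something along these lines (client is assumed to be the existing distributed.Client for the cluster):

def init_cublas_handle():
    import cupy
    # CuPy caches the handle per device and owns its cleanup,
    # so we only need to trigger creation once per worker.
    return cupy.cuda.device.get_cublas_handle()

client.run(init_cublas_handle)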

@jakirkham
Member

As mentioned in an edit in the RMM thread, it seems spaCy just uses CuPy for cuBLAS rather than calling cuBLAS directly. So I think that initialization step should be sufficient.
