Q27 intermittent failure in nightly automation #159

Open
beckernick opened this issue Dec 16, 2020 · 12 comments

Comments

@beckernick
Member

This is the intermittent CUBLAS_STATUS_NOT_INITIALIZED error that I thought was in Q28, @VibhuJawa.

Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "queries/q27/tpcx_bb_query_27.py", line 130, in main
    ["review_idx_global_pos", "pr_item_sk", "word", "sentence"]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/core.py", line 255, in sor
t_values
    ignore_index=ignore_index,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 225, in
sort_values
    divisions = quantile_divisions(df, by, npartitions)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 177, in
quantile_divisions
    divisions = _approximate_quantile(df[by], qn).compute()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 279, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 2676, i
n get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1991, i
n gather
    asynchronous=asynchronous,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 832, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 340, in
sync
    raise exc.with_traceback(tb)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 324, in
f
    result[0] = yield future
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1850, i
n _gather
    raise exception.with_traceback(traceback)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/optimization.py", line 963, in
__call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 151, in get
    result = _execute_task(task, cache)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 121, in _execute
_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/utils.py", line 30, in apply
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/core.py", line 5401,
in apply_and_enforce
    df = func(*args, **kwargs)
  File "queries/q27/tpcx_bb_query_27.py", line 63, in ner_parser
    for doc in docs:
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 819, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 253, in pipe
  File "nn_parser.pyx", line 273, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 286, in spacy.syntax.nn_parser.Parser.greedy_parse
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
    y, _ = self.begin_update(X, drop=None)
  File "_parser_model.pyx", line 243, in spacy.syntax._parser_model.ParserModel.begin_update
  File "_parser_model.pyx", line 293, in spacy.syntax._parser_model.ParserStepModel.__init__
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 295, in begin_update
    X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 62, in begin_update
    X, backprop_child = self.child.begin_update(X, drop=0.0)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 76, in begin_update
    output__boc = self.ops.gemm(X__bi, W, trans2=True)
  File "ops.pyx", line 986, in thinc.neural.ops.CupyOps.gemm
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cupy/linalg/_product.py", line 34, i
n dot
    return a.dot(b, out)
  File "cupy/core/core.pyx", line 1412, in cupy.core.core.ndarray.dot
  File "cupy/core/_routines_linalg.pyx", line 418, in cupy.core._routines_linalg.dot
  File "cupy/core/_routines_linalg.pyx", line 503, in cupy.core._routines_linalg.tensordot_core
  File "cupy/cuda/device.pyx", line 47, in cupy.cuda.device.get_cublas_handle
  File "cupy/cuda/device.pyx", line 213, in cupy.cuda.device.Device.cublas_handle.__get__
  File "cupy/cuda/device.pyx", line 200, in cupy.cuda.device.Device._get_handle
  File "cupy/cuda/device.pyx", line 201, in cupy.cuda.device.Device._get_handle
  File "cupy_backends/cuda/libs/cublas.pyx", line 346, in cupy_backends.cuda.libs.cublas.create
  File "cupy_backends/cuda/libs/cublas.pyx", line 350, in cupy_backends.cuda.libs.cublas.create
  File "cupy_backends/cuda/libs/cublas.pyx", line 339, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_NOT_INITIALIZED
@beckernick
Member Author

CUDA version mismatches would be the normal culprit, but I'm not sure how this would only show up intermittently. @jakirkham do you have any thoughts on this?

@jakirkham
Member

cc @anaruse (in case you have thoughts here 🙂)

@anaruse

anaruse commented Dec 16, 2020

Hmm, it might have something to do with the issue below.
cupy/cupy#3935

@VibhuJawa
Member

Based on that comment, they recommend setting the CUPY_GPU_MEMORY_LIMIT="90%" environment variable, since there isn't enough memory left to create the cuBLAS handle.
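
For reference, a minimal sketch of setting that limit on every worker; this assumes client is an existing distributed.Client, and that the variable only takes effect if it is set before CuPy starts allocating on the device:

import os

def set_cupy_memory_limit():
    # Cap CuPy's own pool at 90% of the GPU, leaving headroom for the
    # cuBLAS handle. This must run before CuPy allocates on the device.
    os.environ["CUPY_GPU_MEMORY_LIMIT"] = "90%"

client.run(set_cupy_memory_limit)

In practice it may be safer to export the variable in the worker processes' environment at startup rather than via client.run, so it is guaranteed to be in place before any CuPy allocation.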

QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (If it isn't, that may explain the pool growth we see while using spaCy.)

cc: @beckernick

@jakirkham
Member

jakirkham commented Dec 16, 2020

QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (If it isn't, that may explain the pool growth we see while using spaCy.)

I don't think so (meaning it shouldn't use the pool).

Edit: To expand on this, we would need an API in cuBLAS (that CuPy would then use), which would allow us to specify a chunk of memory to use for the initialization.

@jakirkham
Member

Is there a way for us to trigger cuBLAS initialization early? If so, maybe we can do this as part of Dask-CUDA startup.
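
A rough sketch of how that could hook into worker startup, assuming a distributed WorkerPlugin registered from the client (the plugin class and its warmup body are illustrative, not existing Dask-CUDA behavior):

from distributed import Client
from distributed.diagnostics.plugin import WorkerPlugin

class CublasWarmupPlugin(WorkerPlugin):
    # Illustrative plugin: run a tiny GEMM on each worker at startup so CuPy
    # creates its cuBLAS handle before the RMM pool grows to fill the GPU.
    def setup(self, worker):
        import cupy
        a = cupy.ones((2, 2), dtype="float32")
        a.dot(a)  # goes through cuBLAS, forcing handle creation early

client = Client("scheduler-address:8786")  # placeholder address
client.register_worker_plugin(CublasWarmupPlugin())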

@beckernick
Member Author

Thanks for the pointer, @anaruse @jakirkham. Let's look into triggering this early and/or reserving memory.

@beckernick
Member Author

beckernick commented Feb 19, 2021

@anaruse @jakirkham , do you think there is any downside/risk of initializing a handle to the cuBLAS library context and then essentially "throwing it away" before we do anything else on the cluster?

I.e., running something like this on every Dask worker?

def init_cublas():
    from cupy.cuda import cublas
    cublas.create() # allocates 64MB of GPU memory and returns a handle
    return None

client.run(init_cublas)

@beckernick
Member Author

beckernick commented Feb 19, 2021

Initializing the cuBLAS context beforehand wouldn't necessarily change the allocation dynamics of the workload that causes RMM to grow the pool right up to the edge of total capacity. Without visibility into that chunk of memory, this might still be a risk, right?

Perhaps in combination with CUPY_GPU_MEMORY_LIMIT we may be able to avoid it, but it's not immediately clear to me how that would work.

@jakirkham
Member

I'm not sure about creating a cuBLAS handle like that. The library may expect us to do cleanup, and it's unclear what happens if we skip it.

That said, maybe we can do some warmup step (like matrix multiplication), which would get CuPy to initialize cuBLAS. Admittedly that's a bit hacky, but perhaps workable.
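
A minimal sketch of that warmup, assuming client is the existing distributed.Client for the cluster:

def cublas_warmup():
    import cupy
    # A small matrix multiplication goes through CuPy's cuBLAS path
    # (dot -> tensordot_core -> get_cublas_handle), so the handle gets
    # created while plenty of GPU memory is still free.
    a = cupy.ones((2, 2), dtype="float32")
    return float((a @ a).sum())

client.run(cublas_warmup)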

@anaruse may have a better suggestion

@jakirkham
Member

jakirkham commented Mar 10, 2021

I poked around the CuPy code and think something like this might work. I should add that CuPy takes care of cleaning up the handle in this case:

In [1]: import cupy

In [2]: cupy.cuda.device.get_cublas_handle()
Out[2]: 94517515719584

In [3]: cupy.cuda.device.get_cublas_handle()
Out[3]: 94517515719584

The first run is a bit slow (as it allocates the handle), but the second is faster (as the handle is cached). Note that the pointer, returned as a Python int, is the same in both cases.

We could call this with client.run or similar to make sure this is set up on all workers.
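
For example, something along these lines (client is assumed to be the existing distributed.Client for the cluster):

def init_cublas_handle():
    import cupy
    # CuPy caches the handle per device and owns its cleanup,
    # so we only need to trigger creation once per worker.
    return cupy.cuda.device.get_cublas_handle()

client.run(init_cublas_handle)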

@jakirkham
Member

As mentioned in an edit in the RMM thread, it seems spaCy just uses CuPy for cuBLAS rather than calling cuBLAS directly. So I think that initialization step should be sufficient.
