Q27 SF1K failing during full query sweep on 16 32GB GPUs in 2020-10-03 nightlies with UCX #126

Closed
beckernick opened this issue Nov 3, 2020 · 2 comments
beckernick commented Nov 3, 2020

In the nightlies on 2020-10-03, we're going out of memory running Q27 at SF1K during a full query sweep on 16 32GB GPUs with UCX NVLink. An RMM-wrapped CuPy allocation made by spaCy runs out of memory during this query.

This query operates on less than 10GB of data in total across the 16 GPUs, so it should not run out of memory with a 30GB pool on each GPU. It looks like the GPU hosting the client process spikes in memory during the initial stages of the spaCy/CuPy section.

The query succeeds when run on its own many times in a row. Each query is executed sequentially in a loop and there are no loop-carried dependencies, so running other queries beforehand should not cause any issues (in theory). While the OOM appears to happen only with UCX NVLink so far, the memory spike itself feels like it shouldn't be happening.
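For context, the spaCy vector load goes through RMM because the workers route CuPy allocations through the RMM pool. A minimal sketch of that wiring, assuming a 30GB pool (the actual size comes from benchmark_config.yaml):

```python
import cupy
import rmm

# Reserve a fixed RMM pool on this GPU and route CuPy allocations through it.
# The 30GB figure is an assumption; the runner reads the real size from its config.
rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * 2**30)
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

# From here on, cupy.array(...) inside spacy.load() allocates from the RMM pool,
# which is why the failure below surfaces in rmm_cupy_allocator.
```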

python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
...
26
27
Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
    yield
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "queries/q27/tpcx_bb_query_27.py", line 60, in ner_parser
    nlp = spacy.load("en_core_web_sm")
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 164, in load_model
    return load_model_from_package(name, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 185, in load_model_from_package
    return cls.load(**overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 228, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 211, in load_model_from_path
    return nlp.from_disk(model_path, exclude=disable)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 947, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
    reader(path / key)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 931, in <lambda>
    p
  File "vocab.pyx", line 470, in spacy.vocab.Vocab.from_disk
  File "vectors.pyx", line 447, in spacy.vectors.Vectors.from_disk
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
    reader(path / key)
  File "vectors.pyx", line 440, in spacy.vectors.Vectors.from_disk.load_vectors
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/io/npz.py", line 71, in load
    return cupy.array(obj)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/creation/from_data.py", line 41, in array
    return core.array(obj, dtype, copy, order, subok, ndmin)
  File "cupy/core/core.pyx", line 1785, in cupy.core.core.array
  File "cupy/core/core.pyx", line 1862, in cupy.core.core.array
  File "cupy/core/core.pyx", line 1938, in cupy.core.core._send_object_to_gpu
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/rmm/rmm.py", line 196, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
  File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
RuntimeError: CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/include/rmm/cuda_stream_view.hpp:96: cudaErrorMemoryAllocation out of memory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "queries/q27/tpcx_bb_query_27.py", line 101, in main
    ner_parsed = sentences.map_partitions(ner_parser, "sentence")
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 653, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5228, in map_partitions
    meta = _emulate(func, *args, udf=True, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
    raise ValueError(msg) from e
ValueError: Metadata inference failed in `ner_parser`.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
RuntimeError('CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/include/rmm/cuda_stream_view.hpp:96: cudaErrorMemoryAllocation out of memory')

Traceback:
---------
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
    yield
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "queries/q27/tpcx_bb_query_27.py", line 60, in ner_parser
    nlp = spacy.load("en_core_web_sm")
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 164, in load_model
    return load_model_from_package(name, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 185, in load_model_from_package
    return cls.load(**overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 228, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 211, in load_model_from_path
    return nlp.from_disk(model_path, exclude=disable)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 947, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
    reader(path / key)
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 931, in <lambda>
    p
  File "vocab.pyx", line 470, in spacy.vocab.Vocab.from_disk
  File "vectors.pyx", line 447, in spacy.vectors.Vectors.from_disk
  File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
    reader(path / key)
  File "vectors.pyx", line 440, in spacy.vectors.Vectors.from_disk.load_vectors
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/io/npz.py", line 71, in load
    return cupy.array(obj)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/creation/from_data.py", line 41, in array
    return core.array(obj, dtype, copy, order, subok, ndmin)
  File "cupy/core/core.pyx", line 1785, in cupy.core.core.array
  File "cupy/core/core.pyx", line 1862, in cupy.core.core.array
  File "cupy/core/core.pyx", line 1938, in cupy.core.core._send_object_to_gpu
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/rmm/rmm.py", line 196, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
  File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
@beckernick beckernick changed the title Q27 failing during full query sweep on 16 32GB GPUs in 2020-10-03 nightlies with UCX Q27 SF1K failing during full query sweep on 16 32GB GPUs in 2020-10-03 nightlies with UCX Nov 3, 2020
@VibhuJawa VibhuJawa self-assigned this Nov 6, 2020
@VibhuJawa (Member) commented:

The problem seems to be that the spaCy allocation can't be served from the existing pool, and we need room to grow the pool by about 1200 MiB. That growth ends up going OOM because, when using the runner, the RMM pool on each GPU is around 31114 MiB.

Why does it show up with UCX and not with TCP?
The peak RMM pool with UCX is 30708 MiB vs 31114 MiB, and even that 400-500 MiB difference is enough to cause the OOM.
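One quick way to confirm there isn't ~1200 MiB of headroom left for the pool to grow into is to look at the driver-reported free memory on the device. A small sketch using pynvml (device index 0 is an assumption; pick the GPU hosting the client):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the client GPU is device 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)

# With a ~31114 MiB pool already reserved, `free` is the headroom the pool can
# still grow into; if it is under the ~1200 MiB spaCy needs, the load will OOM.
print(f"free: {info.free / 2**20:.0f} MiB, used: {info.used / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```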

How to recreate it without the runner or UCX (a cluster-setup sketch follows this list):

  • Set up a pool of 30000 MiB with TCP
  • Run the query to reproduce the OOM errors
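A sketch of that reproduction setup, assuming dask-cuda's LocalCUDACluster is used the way the runner starts its workers (TCP is the default protocol, so no UCX flags are needed):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU, each with a 30000 MiB RMM pool, over plain TCP.
cluster = LocalCUDACluster(rmm_pool_size="30000MiB")
client = Client(cluster)

# Point the q27 query at this scheduler and run it; the spaCy load should fail
# the same way once the pool has no room left to grow.
```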

What should have happened:
It should have used the existing pool without needing to expand it.

How to check the peak memory used by the query itself (a measurement sketch follows this list):

  • Set up a pool of 1000 MiB
  • Memory before running: 1362 MiB
  • Memory after running (once or n times): 2562 MiB
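A sketch of that measurement on a single GPU, assuming the same allocator wiring as the query; the ~1200 MiB of growth past the 1000 MiB pool shows up in driver-reported used memory:

```python
import cupy
import rmm
import spacy


def used_mib():
    # Driver-level view of the device: total minus free, in MiB.
    free, total = cupy.cuda.runtime.memGetInfo()
    return (total - free) / 2**20


# Small fixed pool with CuPy routed through RMM, mirroring the query setup.
rmm.reinitialize(pool_allocator=True, initial_pool_size=1000 * 2**20)
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
spacy.require_gpu()  # so the vector load goes through CuPy, as in the traceback

print(f"before: {used_mib():.0f} MiB")  # ~1362 MiB in the numbers above
nlp = spacy.load("en_core_web_sm")      # loads the vectors through cupy/rmm
print(f"after:  {used_mib():.0f} MiB")  # ~2562 MiB, i.e. ~1200 MiB of growth
```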

@beckernick (Member, Author) commented Dec 2, 2020

I can no longer reproduce this failure, but we're still exploring the broader issue outside the context of this repository. I'm going to close this issue for now and will reopen it if the problem resurfaces here.
