In the 2020-10-03 nightlies, we're going out of memory running q27 at SF1K during a full sweep on 16 32GB GPUs with UCX NVLink. An RMM-wrapped CuPy allocation made via spaCy runs out of memory during this query.
This query operates on less than 10 GB of data in total across the 16 GPUs, so this should not happen with a 30 GB pool. What appears to be happening is that the GPU hosting the client process spikes in memory usage during the initial stages of the spaCy/CuPy section.
The query succeeds when run on its own many times in a row. Each query is executed sequentially in a loop and there are no loop-carried dependencies, so running other queries beforehand should not (in theory) cause any issues. While the OOM has so far appeared only with UCX NVLink, the spike itself feels like it shouldn't be happening.
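For context, the traceback below shows CuPy routing allocations through `rmm_cupy_allocator`, so spaCy's vector loading lands in the RMM pool. A minimal sketch of that wiring (the exact pool size and setup in the benchmark runner may differ):

```python
import cupy
import rmm

# Create a pooled RMM allocator and make CuPy route device allocations
# through it. With this in place, the cupy.array() call inside
# spacy.load() (see the traceback below) hits the RMM pool.
rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * 2**30)  # ~30 GiB pool (assumed size)
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```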
python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
...
26
27
Encountered Exception while running query
Traceback (most recent call last):
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
yield
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "queries/q27/tpcx_bb_query_27.py", line 60, in ner_parser
nlp = spacy.load("en_core_web_sm")
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 164, in load_model
return load_model_from_package(name, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 185, in load_model_from_package
return cls.load(**overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
return load_model_from_init_py(__file__, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 228, in load_model_from_init_py
return load_model_from_path(data_path, meta, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 211, in load_model_from_path
return nlp.from_disk(model_path, exclude=disable)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 947, in from_disk
util.from_disk(path, deserializers, exclude)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
reader(path / key)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 931, in <lambda>
p
File "vocab.pyx", line 470, in spacy.vocab.Vocab.from_disk
File "vectors.pyx", line 447, in spacy.vectors.Vectors.from_disk
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
reader(path / key)
File "vectors.pyx", line 440, in spacy.vectors.Vectors.from_disk.load_vectors
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/io/npz.py", line 71, in load
return cupy.array(obj)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/creation/from_data.py", line 41, in array
return core.array(obj, dtype, copy, order, subok, ndmin)
File "cupy/core/core.pyx", line 1785, in cupy.core.core.array
File "cupy/core/core.pyx", line 1862, in cupy.core.core.array
File "cupy/core/core.pyx", line 1938, in cupy.core.core._send_object_to_gpu
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/rmm/rmm.py", line 196, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
RuntimeError: CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/include/rmm/cuda_stream_view.hpp:96: cudaErrorMemoryAllocation out of memory
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
config=config,
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
result = func(*args, **kwargs)
File "queries/q27/tpcx_bb_query_27.py", line 101, in main
ner_parsed = sentences.map_partitions(ner_parser, "sentence")
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 653, in map_partitions
return map_partitions(func, self, *args, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5228, in map_partitions
meta = _emulate(func, *args, udf=True, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
raise ValueError(msg) from e
ValueError: Metadata inference failed in `ner_parser`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
RuntimeError('CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/include/rmm/cuda_stream_view.hpp:96: cudaErrorMemoryAllocation out of memory')
Traceback:
---------
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
yield
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/dask/dataframe/core.py", line 5178, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "queries/q27/tpcx_bb_query_27.py", line 60, in ner_parser
nlp = spacy.load("en_core_web_sm")
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 164, in load_model
return load_model_from_package(name, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 185, in load_model_from_package
return cls.load(**overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
return load_model_from_init_py(__file__, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 228, in load_model_from_init_py
return load_model_from_path(data_path, meta, **overrides)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 211, in load_model_from_path
return nlp.from_disk(model_path, exclude=disable)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 947, in from_disk
util.from_disk(path, deserializers, exclude)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
reader(path / key)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 931, in <lambda>
p
File "vocab.pyx", line 470, in spacy.vocab.Vocab.from_disk
File "vectors.pyx", line 447, in spacy.vectors.Vectors.from_disk
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/util.py", line 654, in from_disk
reader(path / key)
File "vectors.pyx", line 440, in spacy.vectors.Vectors.from_disk.load_vectors
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/io/npz.py", line 71, in load
return cupy.array(obj)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/cupy/creation/from_data.py", line 41, in array
return core.array(obj, dtype, copy, order, subok, ndmin)
File "cupy/core/core.pyx", line 1785, in cupy.core.core.array
File "cupy/core/core.pyx", line 1862, in cupy.core.core.array
File "cupy/core/core.pyx", line 1938, in cupy.core.core._send_object_to_gpu
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201103/lib/python3.7/site-packages/rmm/rmm.py", line 196, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
The problem seems to be that the spaCy allocation cannot be served from the existing pool, and RMM needs room to grow the pool by about 1200 MiB. This ends up going OOM because, when you are using the runner, the RMM pool on a GPU is already around 31114 MiB.
Why does it show up with UCX and not with TCP?
The peak RMM pool with UCX is 30708 MiB vs 31114 MiB, and even that 400-500 MiB difference is enough to cause the OOM.
How to recreate it without the runner or UCX?
1. Set up a pool of 30000 MiB with TCP (see the sketch below).
2. Run the query to see the OOM errors.
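One way to stand up such a cluster outside the runner (a sketch using dask-cuda; everything other than the protocol and pool size is an assumption):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start a TCP (non-UCX) cluster with an RMM pool of roughly 30000 MiB
# per GPU, mirroring the pool size the runner would configure.
cluster = LocalCUDACluster(
    protocol="tcp",
    rmm_pool_size="30000MiB",
)
client = Client(cluster)

# With the client attached, run q27 on its own (e.g. queries/q27/tpcx_bb_query_27.py)
# and watch for the same cudaErrorMemoryAllocation failure.
```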
What should have happened:
It should have used the existing pool without expansion.
How to check for peak memory used in the query itself:
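One option (a sketch using pynvml, not necessarily what was used in the nightlies) is to read GPU memory usage on each worker before and after the query. Because the RMM pool grows but does not shrink, the post-query reading approximates the peak pool size:

```python
import pynvml


def gpu_memory_used_mib(device_index=0):
    """Return the current GPU memory usage in MiB for one device."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    pynvml.nvmlShutdown()
    return info.used / (1024 ** 2)


# Sample on every Dask worker before and after the query, e.g.:
#   before = client.run(gpu_memory_used_mib)
#   ... run the query ...
#   after = client.run(gpu_memory_used_mib)
# The per-worker difference shows how much each GPU's pool grew.
```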
I can no longer reproduce this failure, but we're still exploring the broader issue outside the context of this repository. I'm going to close this issue for now, but will reopen it if this resurfaces here.