[BUG] Catch and propagate rmm out of memory errors during copy operations up to Python #13743

Closed
beckernick opened this issue Jul 24, 2023 · 5 comments · Fixed by #13746
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

Member

beckernick commented Jul 24, 2023

I'd like for out-of-memory errors caused by copying data to be caught and propagated up to Python like other out-of-memory errors.

If I run out of memory during a copy operation, the error is uncaught and my kernel crashes:

import cudf

df = cudf.datasets.randomdata(nrows=500000000) # 12 GB
df2 = cudf.concat([df, df])
df3 = df2.copy()
df4 = df2.copy()
terminate called after throwing an instance of 'rmm::out_of_memory'
  what():  std::bad_alloc: out_of_memory: CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-23.08/include/rmm/mr/device/cuda_memory_resource.hpp
Aborted (core dumped)

But, for example, when I try to concatenate and run out of memory, the error is caught and I can continue using my IPython kernel.

import cudf

df = cudf.datasets.randomdata(nrows=500000000) # 12 GB
out = cudf.concat([df, df, df, df])
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[7], line 1
----> 1 out = cudf.concat([df, df, df, df])

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/cudf/core/reshape.py:410, in concat(objs, axis, join, ignore_index, sort)
    407         if join == "inner" and len(old_objs) != len(objs):
    408             # don't filter out empty df's
    409             objs = old_objs
--> 410         result = cudf.DataFrame._concat(
    411             objs,
    412             axis=axis,
    413             join=join,
    414             ignore_index=ignore_index,
    415             # Explicitly cast rather than relying on None being falsy.
    416             sort=bool(sort),
    417         )
    418     return result
    420 elif typ is cudf.Series:

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/cudf/core/dataframe.py:1574, in DataFrame._concat(cls, objs, axis, join, ignore_index, sort)
   1572     out._index = cudf.RangeIndex(result_index_length)
   1573 elif are_all_range_index and not ignore_index:
-> 1574     out._index = cudf.core.index.GenericIndex._concat(
   1575         [o._index for o in objs]
   1576     )
   1578 # Reassign the categories for any categorical table cols
   1579 _reassign_categories(
   1580     categories, out._data, indices[first_data_column_position:]
   1581 )

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/cudf/core/index.py:1055, in GenericIndex._concat(cls, objs)
   1051 @classmethod
   1052 @_cudf_nvtx_annotate
   1053 def _concat(cls, objs):
   1054     if all(isinstance(obj, RangeIndex) for obj in objs):
-> 1055         result = _concat_range_index(objs)
   1056     else:
   1057         data = concat_columns([o._values for o in objs])

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/cudf/core/index.py:3360, in _concat_range_index(indexes)
   3356 non_consecutive = (step != obj.step and len(obj) > 1) or (
   3357     next_ is not None and obj.start != next_
   3358 )
   3359 if non_consecutive:
-> 3360     result = as_index(concat_columns([x._values for x in indexes]))
   3361     return result
   3362 if step is not None:

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/site-packages/cudf/core/column/column.py:2631, in concat_columns(objs)
   2628     return column_empty(0, head.dtype, masked=True)
   2630 # Filter out inputs that have 0 length, then concatenate.
-> 2631 return libcudf.concat.concat_columns([o for o in objs if len(o)])

File /raid/nicholasb/miniconda3/envs/rapids-23.08/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File concat.pyx:44, in cudf._lib.concat.concat_columns()

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /raid/nicholasb/miniconda3/envs/rapids-23.08/include/rmm/mr/device/cuda_memory_resource.hpp

Environment:

conda list | grep rapids
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-23.08:
cubinlinker               0.3.0           py310hfdf336d_0    rapidsai-nightly
cucim                     23.08.00a       cuda11_py310_230718_gfc33e56_16    rapidsai-nightly
cudf                      23.08.00a       cuda11_py310_230720_gd4c2d1ccee_200    rapidsai-nightly
cudf_kafka                23.08.00a       py310_230720_gd4c2d1ccee_200    rapidsai-nightly
cugraph                   23.08.00a       cuda11_py310_230720_g145a92bb_57    rapidsai-nightly
cuml                      23.08.00a       cuda11_py310_230720_ga47c73e04_37    rapidsai-nightly
cusignal                  23.08.00a       py39_230720_gc80a6eb_7    rapidsai-nightly
cuspatial                 23.08.00a       cuda11_py310_230720_gb0657dda_47    rapidsai-nightly
custreamz                 23.08.00a       py310_230720_gd4c2d1ccee_200    rapidsai-nightly
cuxfilter                 23.08.00a       cuda11_py310_230720_g8e8c78f_20    rapidsai-nightly
dask-cuda                 23.08.00a       py310_230720_g8ff7af9_31    rapidsai-nightly
dask-cudf                 23.08.00a       cuda11_py310_230720_gd4c2d1ccee_200    rapidsai-nightly
libcucim                  23.08.00a       cuda11_230718_gfc33e56_16    rapidsai-nightly
libcudf                   23.08.00a       cuda11_230720_gd4c2d1ccee_200    rapidsai-nightly
libcudf_kafka             23.08.00a       cuda11_230720_gd4c2d1ccee_200    rapidsai-nightly
libcugraph                23.08.00a       cuda11_230720_g145a92bb_57    rapidsai-nightly
libcugraph_etl            23.08.00a       cuda11_230720_g145a92bb_57    rapidsai-nightly
libcugraphops             23.08.00a       cuda11_230720_g5f2e4953_19    rapidsai-nightly
libcuml                   23.08.00a       cuda11_230720_ga47c73e04_37    rapidsai-nightly
libcumlprims              23.08.00a       cuda11_230710_gd32fef7_2    rapidsai-nightly
libcuspatial              23.08.00a       cuda11_230720_gb0657dda_47    rapidsai-nightly
libkvikio                 23.08.00a       cuda11_230718_gf624203_28    rapidsai-nightly
libraft                   23.08.00a       cuda11_230720_g0abedc66_63    rapidsai-nightly
libraft-headers           23.08.00a       cuda11_230720_g0abedc66_63    rapidsai-nightly
libraft-headers-only      23.08.00a       cuda11_230720_g0abedc66_63    rapidsai-nightly
librmm                    23.08.00a       cuda11_230720_g0a5b86cf_24    rapidsai-nightly
libxgboost                1.7.5dev.rapidsai23.08        cuda11_0    rapidsai-nightly
py-xgboost                1.7.5dev.rapidsai23.08  cuda11_py310_0    rapidsai-nightly
pylibcugraph              23.08.00a       cuda11_py310_230720_g145a92bb_57    rapidsai-nightly
pylibraft                 23.08.00a       cuda11_py310_230720_g0abedc66_63    rapidsai-nightly
raft-dask                 23.08.00a       cuda11_py310_230720_g0abedc66_63    rapidsai-nightly
rapids                    23.08.00a       cuda11_py310_230713_g393c3ad_19    rapidsai-nightly
rapids-xgboost            23.08.00a       cuda11_py310_230713_g393c3ad_19    rapidsai-nightly
rmm                       23.08.00a       cuda11_py310_230720_g0a5b86cf_24    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.33.00a        py310_230720_g8eef032_17    rapidsai-nightly
xgboost                   1.7.5dev.rapidsai23.08  cuda11_py310_0    rapidsai-nightly
beckernick added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Jul 24, 2023
Contributor

wence- commented Jul 25, 2023

This is a little tricky. The reason that copy fails but concat succeeds (in the sense of the error being propagated to Python) is that concat calls a C++ function that is annotated in Cython as `except +`, which means that C++ exceptions should be caught and converted to Python ones.

copy, on the other hand, calls a C++ constructor (itself marked `except +`), but does so via `std::make_unique<column>(column_view)`. This can (as noted) throw an out-of-memory error.

In Cython, this is spelt `make_unique[column](input_column_view)`. So far, so good, but `make_unique` in Cython is not annotated with `except +`. That line arrived in cython/cython#487, where there's some discussion about why the exception is not annotated, linking to this discussion thread.

As a consequence, some Cython code like this:

from libcpp.utility cimport move
from libcpp.memory cimport make_unique, unique_ptr
from libcpp.vector cimport vector


def some_function(int n):
    cdef unique_ptr[vector[double]] vec

    with nogil:
        vec = move(make_unique[vector[double]](n))

    return vec.get().size()

is transpiled to:

      /*try:*/ {

        /* "cython_except.pyx":10
 * 
 *     with nogil:
 *         vec = move(make_unique[vector[double]](n))             # <<<<<<<<<<<<<<
 * 
 *     return vec.get().size()
 */
        __pyx_v_vec = cython_std::move<std::unique_ptr<std::vector<double> > >(std::make_unique<std::vector<double> >(__pyx_v_n));
      }

Notice how the (potentially throwing) std::make_unique is not in a try/catch block.
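To make the difference concrete, here is a minimal, self-contained C++ sketch of the kind of guard that `except +` asks Cython to generate. `Buffer`, `Result`, and `make_buffer_guarded` are hypothetical names invented for this illustration (they are not part of cuDF, RMM, or Cython), and the error string merely stands in for the Python exception that Cython would set. Cython's default translation maps `std::bad_alloc` to `MemoryError`, which matches the error seen in the concat traceback above.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <string>

// Hypothetical stand-in for a type whose constructor allocates memory
// and may fail (as copying a column can when the device is full).
struct Buffer {
    explicit Buffer(bool fail) {
        if (fail) throw std::bad_alloc();
    }
};

// Hypothetical result type: an owned buffer, or the name of the error.
struct Result {
    std::unique_ptr<Buffer> value;
    std::string error;
};

// Guarded construction: exceptions stop at this boundary and become data,
// instead of escaping uncaught and ending in std::terminate.
Result make_buffer_guarded(bool fail) {
    Result r;
    try {
        r.value = std::make_unique<Buffer>(fail);
    } catch (const std::bad_alloc&) {
        r.error = "MemoryError";
    } catch (...) {
        r.error = "RuntimeError";
    }
    return r;
}

// Example use: the failure is reported and the process keeps running.
inline void guarded_example() {
    Result r = make_buffer_guarded(true);
    assert(!r.value && r.error == "MemoryError");
}
```

Without the `try`/`catch`, the same `std::bad_alloc` would leave `make_buffer_guarded` uncaught, which is the situation the unguarded `make_unique` call is in.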

The original reason for not annotating it with `except +` in Cython was that the tracking of temporaries didn't work. The code that should have been generated was:

try {
    tmp = std::move(std::make_unique<...>(...));
} catch (...) {
   ...
}
result = std::move(tmp);

But when the PR was introduced, we would instead get:

try {
    tmp = std::move(std::make_unique<...>(...));
} catch (...) {
   ...
}
result = tmp; // no std::move here, bad!
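The missing `std::move` matters because `std::unique_ptr` is move-only, so a plain assignment from the temporary is rejected by the compiler. A minimal sketch, in which `VecPtr` and `take_ownership` are hypothetical names written just for this illustration:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

using VecPtr = std::unique_ptr<std::vector<double>>;

// Hypothetical helper: hand the contents of a temporary over to the result.
VecPtr take_ownership(VecPtr& tmp) {
    VecPtr result;
    // result = tmp;          // does not compile: copy assignment is deleted
    result = std::move(tmp);  // compiles: ownership is transferred
    assert(tmp == nullptr);   // a moved-from unique_ptr is guaranteed empty
    return result;
}
```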

But that was seven years ago, so let's just try adding `except +` in Cython's standard headers... If I do that, I get the perfectly correct-looking:

        try {
          __pyx_t_1 = std::make_unique<std::vector<double> >(__pyx_v_n);
        } catch(...) {
          #ifdef WITH_THREAD
          PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
          #endif
          __Pyx_CppExn2PyErr();
          #ifdef WITH_THREAD
          __Pyx_PyGILState_Release(__pyx_gilstate_save);
          #endif
          __PYX_ERR(0, 10, __pyx_L4_error)
        }
        __pyx_v_vec = cython_std::move<std::unique_ptr<std::vector<double> > >(__PYX_STD_MOVE_IF_SUPPORTED(__pyx_t_1));

(That was with Cython 3; with Cython < 3 I get:)

        try {
          __pyx_t_2 = std::make_unique<cudf::column>(__pyx_v_input_column_view);
        } catch(...) {
          #ifdef WITH_THREAD
          PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
          #endif
          __Pyx_CppExn2PyErr();
          #ifdef WITH_THREAD
          __Pyx_PyGILState_Release(__pyx_gilstate_save);
          #endif
          __PYX_ERR(0, 86, __pyx_L4_error)
        }
        __pyx_v_c_result = cython_std::move<std::unique_ptr<cudf::column> >(__pyx_t_2);

Which also looks fine (cython_std::move just wraps up std::move).

And with that, I get the Python-level memory error for copy.

wence- added the Python (Affects Python cuDF API.) and Cython labels and removed the Needs Triage (Need team to review and classify) label on Jul 25, 2023
Contributor

bdice commented Jul 25, 2023

@wence- Could we shift this logic to C++ and call make_unique in C++ instead? Seems like a viable workaround to just get back a unique_ptr from a factory in C++ (where we control exception behavior of its Cython binding) that calls make_unique for us.

(If the Cython fix is delayed, for example, this could be a short-term solution.)
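A minimal sketch of what such a factory could look like on the C++ side. The name `make_unique_factory` and the `Pair` and `AlwaysFails` types are hypothetical, chosen for illustration only; this is not the code that was eventually merged. The idea is that the Cython declaration for a function like this would live in the project's own `.pxd` file, where it can be marked `except +`.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Hypothetical factory: forwards its arguments to std::make_unique.
// Any exception thrown by T's constructor propagates to the caller.
template <typename T, typename... Args>
std::unique_ptr<T> make_unique_factory(Args&&... args) {
    return std::make_unique<T>(std::forward<Args>(args)...);
}

// Illustrative types only.
struct Pair {
    int first;
    int second;
    Pair(int a, int b) : first(a), second(b) {}
};

struct AlwaysFails {
    AlwaysFails() { throw std::bad_alloc(); }
};

// Returns true if default-constructing T through the factory
// throws std::bad_alloc.
template <typename T>
bool factory_throws_bad_alloc() {
    try {
        make_unique_factory<T>();
    } catch (const std::bad_alloc&) {
        return true;
    }
    return false;
}

// Example use: normal construction works, and a failing constructor
// surfaces as an ordinary, catchable C++ exception.
inline void factory_example() {
    auto p = make_unique_factory<Pair>(1, 2);
    assert(p->first == 1 && p->second == 2);
    assert(factory_throws_bad_alloc<AlwaysFails>());
}
```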

Contributor

shwina commented Jul 25, 2023

We could also consider writing some custom C++ that is inlined in the Cython, if adding a function to libcudf here introduces too much overhead (in terms of human time) or is too intrusive.

wence- added a commit to wence-/cudf that referenced this issue Jul 25, 2023
make_unique in Cython's libcpp headers is not annotated with `except
+`. As a consequence, if the constructor throws, we do not catch it in
Python. To work around this (see cython/cython#5560 for details),
provide our own implementation.

Due to the way assignments occur to temporaries, we need to now
explicitly wrap all calls to `make_unique` in `move`, but that is
arguably preferable to not being able to catch exceptions.

- Closes rapidsai#13743
wence- added a commit to wence-/cudf that referenced this issue Jul 26, 2023
make_unique in Cython's libcpp headers is not annotated with `except
+`. As a consequence, if the constructor throws, we do not catch it in
Python. To work around this (see cython/cython#5560 for details),
provide our own implementation.

Due to the way assignments occur to temporaries, we need to now
explicitly wrap all calls to `make_unique` in `move`, but that is
arguably preferable to not being able to catch exceptions, and will
not be necessary once we move to Cython 3.

- Closes rapidsai#13743
rapids-bot bot pushed a commit that referenced this issue Jul 26, 2023
`make_unique` in Cython's libcpp headers is not annotated with `except +`. As a consequence, if the constructor throws, we do not catch it in Python. To work around this (see cython/cython#5560 for details), provide our own implementation.

Due to the way assignments occur to temporaries, we need to now explicitly wrap all calls to `make_unique` in `move`, but that is arguably preferable to not being able to catch exceptions, and will not be necessary once we move to Cython 3.

- Closes #13743

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #13746
Contributor

shwina commented Jul 27, 2023

Can this be closed?

Contributor

wence- commented Jul 27, 2023

Yes. It wasn't auto-closed because we didn't merge to 23.08.

shwina closed this as completed on Jul 27, 2023