[BUG] Unpickling objects with pd.read_pickle() doesn't work with cudf.pandas enabled #15459

Closed
shwina opened this issue Apr 3, 2024 · 1 comment · Fixed by #16105
Labels: 1 - On Deck (to be worked on next), bug (something isn't working), cudf.pandas (issues specific to cudf.pandas)

Comments

Contributor

shwina commented Apr 3, 2024

Describe the bug
When cudf.pandas is enabled, we can pickle and unpickle objects using pickle.dump/load or pickle.dumps/loads. But if we unpickle with pd.read_pickle instead, the object we get back is broken. Here's a minimal reproducer:

import pandas as pd
from io import BytesIO
import pickle

pdf = pd.DataFrame({'a': [1.0, 2.0, None, 3.0]})

with open("pickled_pdf.pkl", "wb") as f:
    pickle.dump(pdf, f)

with open("pickled_pdf.pkl", "rb") as f:
    df = pd.read_pickle(f)

print(df)

Running it in IPython with cudf.pandas loaded:

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: from io import BytesIO
   ...: import pickle
   ...: 
   ...: pdf = pd.DataFrame({'a': [1.0, 2.0, None, 3.0]})
   ...: 
   ...: with open("pickled_pdf.pkl", "wb") as f:
   ...:     pickle.dump(pdf, f)
   ...: 
   ...: with open("pickled_pdf.pkl", "rb") as f:
   ...:     df = pd.read_pickle(f)
   ...: 
   ...: print(df)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:888, in _fast_slow_function_call(func, *args, **kwargs)
    883 with nvtx.annotate(
    884     "EXECUTE_FAST",
    885     color=_CUDF_PANDAS_NVTX_COLORS["EXECUTE_FAST"],
    886     domain="cudf_pandas",
    887 ):
--> 888     fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
    889     result = func(*fast_args, **fast_kwargs)

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:1007, in _fast_arg(arg)
   1006 seen: Set[int] = set()
-> 1007 return _transform_arg(arg, "_fsproxy_fast", seen)

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:934, in _transform_arg(arg, attribute_name, seen)
    932 if type(arg) is tuple:
    933     # Must come first to avoid infinite recursion
--> 934     return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
    935 elif hasattr(arg, "__getnewargs_ex__"):
    936     # Partial implementation of to reconstruct with
    937     # transformed pieces
    938     # This handles scipy._lib._bunch._make_tuple_bunch

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:934, in <genexpr>(.0)
    932 if type(arg) is tuple:
    933     # Must come first to avoid infinite recursion
--> 934     return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
    935 elif hasattr(arg, "__getnewargs_ex__"):
    936     # Partial implementation of to reconstruct with
    937     # transformed pieces
    938     # This handles scipy._lib._bunch._make_tuple_bunch

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:917, in _transform_arg(arg, attribute_name, seen)
    916 if isinstance(arg, (_FastSlowProxy, _FastSlowProxyMeta, _FunctionProxy)):
--> 917     typ = getattr(arg, attribute_name)
    918     if typ is _Unusable:

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:553, in _FastSlowProxy.__getattr__(self, name)
    550 if name.startswith("_fsproxy"):
    551     # an AttributeError was raised when trying to evaluate
    552     # an internal attribute, we just need to propagate this
--> 553     _raise_attribute_error(self.__class__.__name__, name)
    554 if name in {
    555     "_ipython_canary_method_should_not_exist_",
    556     "_ipython_display_",
   (...)
    568     # This is somewhat delicate to the order in which IPython
    569     # implements special display fallbacks.

File ~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py:392, in _raise_attribute_error(obj, name)
    387 """
    388 Raise an AttributeError with a message that is consistent with
    389 the error raised by Python for a non-existent attribute on a
    390 proxy object.
    391 """
--> 392 raise AttributeError(f"'{obj}' object has no attribute '{name}'")

AttributeError: 'function' object has no attribute '_fsproxy_fast'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-3-deda8b8b446c> in ?()
      8 
      9 with open("pickled_pdf.pkl", "rb") as f:
     10     df = pd.read_pickle(f)
     11 
---> 12 print(df)

~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py in ?(self, *args, **kwargs)
    836     def __call__(self, *args, **kwargs) -> Any:
--> 837         result, _ = _fast_slow_function_call(
    838             # We cannot directly call self here because we need it to be
    839             # converted into either the fast or slow object (by
    840             # _fast_slow_function_call) to avoid infinite recursion.

~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py in ?(func, *args, **kwargs)
    898             domain="cudf_pandas",
    899         ):
    900             slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs)
    901             with disable_module_accelerator():
--> 902                 result = func(*slow_args, **slow_kwargs)
    903     return _maybe_wrap_result(result, func, *args, **kwargs), fast

~/mroeschke-cudf/python/cudf/cudf/pandas/fast_slow_proxy.py in ?(fn, args, kwargs)
     29 def call_operator(fn, args, kwargs):
---> 30     return fn(*args, **kwargs)

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/frame.py in ?(self)
   1199             self.info(buf=buf)
   1200             return buf.getvalue()
   1201 
   1202         repr_params = fmt.get_dataframe_repr_params()
-> 1203         return self.to_string(**repr_params)

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/util/_decorators.py in ?(*args, **kwargs)
    329                     msg.format(arguments=_format_argument_list(allow_args)),
    330                     FutureWarning,
    331                     stacklevel=find_stack_level(),
    332                 )
--> 333             return func(*args, **kwargs)

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/frame.py in ?(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width, min_rows, max_colwidth, encoding)
   1361         """
   1362         from pandas import option_context
   1363 
   1364         with option_context("display.max_colwidth", max_colwidth):
-> 1365             formatter = fmt.DataFrameFormatter(
   1366                 self,
   1367                 columns=columns,
   1368                 col_space=col_space,

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/io/formats/format.py in ?(self, frame, columns, col_space, header, index, na_rep, formatters, justify, float_format, sparsify, index_names, max_rows, min_rows, max_cols, show_dimensions, decimal, bold_rows, escape)
    443         bold_rows: bool = False,
    444         escape: bool = True,
    445     ) -> None:
    446         self.frame = frame
--> 447         self.columns = self._initialize_columns(columns)
    448         self.col_space = self._initialize_colspace(col_space)
    449         self.header = header
    450         self.index = index

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/io/formats/format.py in ?(self, columns)
    552             cols = ensure_index(columns)
    553             self.frame = self.frame[cols]
    554             return cols
    555         else:
--> 556             return self.frame.columns

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6292             and name not in self._accessors
   6293             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6294         ):
   6295             return self[name]
-> 6296         return object.__getattribute__(self, name)

properties.pyx in ?()
---> 65 'Could not get source, probably due dynamically evaluated source code.'

~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6292             and name not in self._accessors
   6293             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6294         ):
   6295             return self[name]
-> 6296         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute '_mgr'

We can (and do) control what happens when objects are pickled and unpickled via the pickle protocol (pickle.dump and pickle.load).

And pandas' read_pickle does call the "regular" pickle.load function.
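
As a reminder of how the pickle protocol lets a type control its own reconstruction, here is a minimal sketch using `__reduce__`. The `Proxy` class and the `_rebuild_proxy` helper are hypothetical names made up for illustration; they are not cudf.pandas internals, and the real proxy types may hook into pickling differently.

```python
import pickle


def _rebuild_proxy(wrapped):
    # Hypothetical reconstruction hook: pickle calls this at load time
    # with the arguments returned from Proxy.__reduce__.
    return Proxy(wrapped)


class Proxy:
    """Toy wrapper standing in for a proxy type (illustrative only)."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __reduce__(self):
        # Tell pickle to rebuild this object by calling
        # _rebuild_proxy(self.wrapped) instead of the default mechanism.
        return (_rebuild_proxy, (self.wrapped,))


# Round trip through the regular pickle protocol.
restored = pickle.loads(pickle.dumps(Proxy([1.0, 2.0, None, 3.0])))
```

Because the hook runs inside `pickle.load`/`pickle.loads`, any caller that ends up in the regular unpickler goes through it, which is why `read_pickle` failing is surprising at first glance.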

So what's going on?

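When we call pd.read_pickle in cudf.pandas mode, the call first tries cudf.read_pickle (which doesn't exist) and then falls back to the real pandas.read_pickle. Importantly, during fallback we disable ourselves: the slow path runs under disable_module_accelerator(), as the second traceback above shows. That means our special pickle protocol handling doesn't kick in, and the object we get back is not reconstructed correctly (hence the final "'DataFrame' object has no attribute '_mgr'" error).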

Solutions

The only solution I could think of is to vendor pandas.read_pickle, so that we stay enabled while it runs.
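
A rough sketch of what a vendored reader could look like follows. It is an assumption-laden simplification, not the fix that was eventually merged: `read_pickle_vendored` is a hypothetical name, and the sketch omits what the real pandas.read_pickle also handles (compression, storage options, compatibility fallbacks). The idea is only that the function calls `pickle.load` itself rather than being dispatched through the fallback path.

```python
import os
import pickle
import tempfile


def read_pickle_vendored(filepath_or_buffer):
    """Load a pickle from a path or an open binary file object.

    Hypothetical, stripped-down stand-in for a vendored read_pickle.
    """
    if hasattr(filepath_or_buffer, "read"):
        # Already an open binary handle: unpickle from it directly.
        return pickle.load(filepath_or_buffer)
    with open(filepath_or_buffer, "rb") as f:
        return pickle.load(f)


# Usage with plain Python data, mirroring the reproducer's two call styles.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pickled_obj.pkl")
    with open(path, "wb") as f:
        pickle.dump({"a": [1.0, 2.0, None, 3.0]}, f)

    from_path = read_pickle_vendored(path)
    with open(path, "rb") as f:
        from_handle = read_pickle_vendored(f)
```

Until such a change is available, the description above suggests a user-level workaround: unpickle with `pickle.load` directly instead of `pd.read_pickle`.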

@shwina shwina added the bug Something isn't working label Apr 3, 2024
@galipremsagar galipremsagar added the cudf.pandas Issues specific to cudf.pandas label Apr 15, 2024
@galipremsagar galipremsagar added this to the Proxying - cudf.pandas milestone Apr 15, 2024
@Matt711 Matt711 self-assigned this Jun 13, 2024
@Matt711 Matt711 added the 1 - On Deck To be worked on next label Jun 13, 2024
@vyasr vyasr moved this to Todo in cuDF Python Jun 26, 2024
@GPUtester GPUtester moved this from Todo to In Progress in cuDF Python Jun 26, 2024
Contributor

Matt711 commented Jun 26, 2024

In 24.08, the example gets stuck. My guess is it leads to infinite recursion.

@rapids-bot rapids-bot bot closed this as completed in 341e014 Jul 9, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jul 9, 2024