Fix Dataframe.__setitem__ slow-downs #17222

Merged: 15 commits, Nov 12, 2024
49 changes: 48 additions & 1 deletion python/cudf/cudf/pandas/fast_slow_proxy.py
@@ -33,6 +33,20 @@ def call_operator(fn, args, kwargs):
"EXECUTE_SLOW": 0x0571B0,
}

# This is a dict of functions whose arguments need to be transformed from
# fast to slow only, i.e., some cudf functions error when passed a device
# object but accept the equivalent host object.
# For example, DataFrame.__setitem__(arg, value) errors when arg is a
# cudf.Index object but works when arg is a pd.Index object. Hence we need
# to transform arg from fast to slow only. So, we use a dictionary like:
# {"DataFrame.__setitem__": {0}}
# where the keys are the function names and the values are the 0-based
# indices of the arguments that need the fast-to-slow transformation.

_SPECIAL_FUNCTIONS_ARGS_MAP = {
"DataFrame.__setitem__": {0},
}
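A rough standalone sketch of how such a map can drive per-argument transformation (the helper name _transform_special_args is hypothetical; the real change below works on the (method_proxy, args, kwargs) tuples seen by _transform_arg, but the selection logic is the same idea):

def _transform_special_args(qualname, args, default_transform, slow_transform):
    # Apply slow_transform only to the argument positions registered for
    # qualname in _SPECIAL_FUNCTIONS_ARGS_MAP; every other argument keeps
    # the transform the caller originally asked for.
    special_indices = _SPECIAL_FUNCTIONS_ARGS_MAP.get(qualname, set())
    return tuple(
        slow_transform(a) if i in special_indices else default_transform(a)
        for i, a in enumerate(args)
    )

For "DataFrame.__setitem__" this means only the key (index 0) is forced to its host/pandas representation; the assigned value keeps whichever transform the caller requested.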

_WRAPPER_ASSIGNMENTS = tuple(
attr
@@ -875,6 +889,10 @@ def __name__(self, value):
pass
setattr(self._fsproxy_slow, "__name__", value)

@property
def _customqualname(self):
return self._fsproxy_slow.__qualname__
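This qualified name of the slow (pandas) attribute is what gets matched against the keys of _SPECIAL_FUNCTIONS_ARGS_MAP above; a quick illustration of the key format, using plain pandas with no proxying involved:

import pandas as pd

# pandas method qualnames follow the "Class.method" pattern used as map keys.
assert pd.DataFrame.__setitem__.__qualname__ == "DataFrame.__setitem__"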


def _assert_fast_slow_eq(left, right):
if _is_final_type(type(left)) or type(left) in NUMPY_TYPES:
@@ -1011,7 +1029,36 @@ def _transform_arg(
# use __reduce_ex__ instead...
if type(arg) is tuple:
# Must come first to avoid infinite recursion
return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
if (
len(arg) > 0
and isinstance(arg[0], _MethodProxy)
Comment on lines +1032 to +1034
Contributor:

We only want to follow this branch when we call DataFrame.__setitem__ and the underlying wrapped object is a pd.DataFrame (so we avoid the DtoH transfer), right? Would a proxy object that wraps cudf.DataFrame also follow this code path?

Contributor Author:

Yes and yes. This is currently written so that we could easily add any other function and parameter to the dict map above.

Contributor:

Can you let me know if I'm misunderstanding? If a wrapped cudf.DataFrame goes through this code path, then it will incorrectly call "_fsproxy_slow" when it should have stayed "fast". E.g., where df is a wrapped cudf.DataFrame:
df.__setitem__("A", pd.Index([4,5,6]))

Contributor Author (@galipremsagar, Nov 9, 2024):

df.__setitem__("A", pd.Index([4,5,6])) will not force df into slow path. It will only force"A"(at 0th index) in slow path which is also "A", now lets see this example:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]})), this will force pd.Index(["a", "b"]) to slow path with is a true pandas index rather than passing a cudf Index.

Contributor Author:

I documented a bit more thoroughly here: 04ec5b7

Contributor Author:

For the following case:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

It will be converted to this and tried with cudf:

df.__setitem__(pd.Index(["a", "b"]), cudf.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

And then, if the above fails, the following will be tried on pandas:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

Contributor:

No, it doesn't fail via the public API; I used CUDF_PANDAS_FALLBACK_DEBUGGING=True.

Contributor:

It turns out (2.) doesn't work even w/o cudf.pandas

In [1]: import cudf

In [2]: import pandas as pd

In [3]: df = cudf.DataFrame()

In [4]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

ValueError: Data must be 1-dimensional

In [5]: df = pd.DataFrame()

In [6]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [7]: df
Out[7]: 
   a   b
0  4  10
1  5  11
2  6  12

Contributor:

Yep there's fallback in this case anyway.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame

In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [6]: type(df._fsproxy_wrapped)
Out[6]: pandas.core.frame.DataFrame

Contributor:

This is what I get with your changes. (Running with CUDF_PANDAS_FALLBACK_DEBUGGING=True does not fail.) This looks correct to me.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame

In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [6]: type(df._fsproxy_wrapped)
Out[6]: cudf.core.dataframe.DataFrame

and arg[0]._customqualname in _SPECIAL_FUNCTIONS_ARGS_MAP
):
indices_map = _SPECIAL_FUNCTIONS_ARGS_MAP[
arg[0]._customqualname
]
method_proxy, original_args, original_kwargs = arg

original_args = tuple(
_transform_arg(a, "_fsproxy_slow", seen)
if i - 1 in indices_map
else _transform_arg(a, attribute_name, seen)
for i, a in enumerate(original_args)
)
original_kwargs = _transform_arg(
original_kwargs, attribute_name, seen
)
return tuple(
(
_transform_arg(method_proxy, attribute_name, seen),
original_args,
original_kwargs,
)
)
else:
return tuple(
_transform_arg(a, attribute_name, seen) for a in arg
)
elif hasattr(arg, "__getnewargs_ex__"):
# Partial implementation to reconstruct with
# transformed pieces
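At the user level, the pattern this change targets looks roughly like the following (a hedged sketch assuming a cudf.pandas-enabled session; cudf.pandas.install() is the programmatic way to turn on the proxy layer):

import cudf.pandas

cudf.pandas.install()  # enable fast/slow proxying before importing pandas
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10)}, dtype="float64")
new_df = df + 1
# The __setitem__ key (df.columns) is transformed fast-to-slow only, so cudf
# receives a host pandas Index it accepts and the assignment does not fall
# back to pandas with a full device-to-host copy of df.
df[df.columns] = new_df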
23 changes: 23 additions & 0 deletions python/cudf/cudf_pandas_tests/test_cudf_pandas.py
@@ -12,6 +12,7 @@
import pickle
import subprocess
import tempfile
import time
import types
from io import BytesIO, StringIO

@@ -1795,3 +1796,25 @@ def test_iter_doesnot_raise(monkeypatch):
monkeycontext.setenv("CUDF_PANDAS_FAIL_ON_FALLBACK", "True")
for _ in s:
pass


def test_dataframe_setitem_slowdown():
# We are explicitly testing the slowdown of the setitem operation
Contributor:
Suggested change:
# We are explicitly testing the slowdown of the setitem operation
# by eliminating the DtoH transfer performed by df[df.columns] = ...
# We do this by ensuring the df.columns argument in the setitem
# operation remains a slow object.

df = xpd.DataFrame(
{"a": [1, 2, 3] * 100000, "b": [1, 2, 3] * 100000}
).astype("float64")
df = xpd.DataFrame({"a": df["a"].repeat(1000), "b": df["b"].repeat(1000)})
new_df = df + 1
start_time = time.time()
df[df.columns] = new_df
end_time = time.time()
delta = int(end_time - start_time)
if delta > 5:
pytest.fail(f"Test took too long to run, runtime: {delta}")


def test_dataframe_setitem():
df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}).astype("float64")
new_df = df + 1
df[df.columns] = new_df
tm.assert_equal(df, new_df)