Fix `Dataframe.__setitem__` slow-downs #17222

galipremsagar · 2024-10-31T04:42:51Z

Description

This PR fixes slow-downs in DataFrame.__setitem__ by passing in CPU objects where they are needed, instead of passing a GPU object, failing, and then performing a GPU -> CPU transfer.

The first argument to DataFrame.__setitem__ can be a set of column labels (pd.Index). In our fast path this is converted to a cudf.Index, which causes a failure on the cudf side; the data is then transferred to the CPU and the slow path executes. This is the primary reason for the slowdown. This PR maintains a dict mapping of such special functions, where we shouldn't convert the objects for the fast path.
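The idea of a dict that marks which arguments must stay as CPU objects can be sketched roughly as follows. This is a toy illustration and not the code in fast_slow_proxy.py: `FastIndex`, `SlowIndex`, `to_fast`, `KEEP_SLOW_ARGS`, and `prepare_fast_args` are all hypothetical names invented for this example.

```python
# Toy sketch only; every name here is hypothetical and not part of cudf.pandas.

class FastIndex:
    """Stand-in for a GPU-backed index (playing the role of cudf.Index)."""
    def __init__(self, labels):
        self.labels = list(labels)


class SlowIndex:
    """Stand-in for a CPU-backed index (playing the role of pd.Index)."""
    def __init__(self, labels):
        self.labels = list(labels)


def to_fast(obj):
    """Convert CPU objects to their GPU counterparts; leave others untouched."""
    if isinstance(obj, SlowIndex):
        return FastIndex(obj.labels)
    return obj


# Maps a method name to the positional-argument indices that must remain
# CPU objects when the fast path is attempted.
KEEP_SLOW_ARGS = {"DataFrame.__setitem__": {0}}


def prepare_fast_args(func_name, args):
    """Convert args for the fast path, skipping the protected positions."""
    keep = KEEP_SLOW_ARGS.get(func_name, set())
    return tuple(
        arg if i in keep else to_fast(arg)
        for i, arg in enumerate(args)
    )


key = SlowIndex(["a", "b"])
value = SlowIndex([1, 2])
converted = prepare_fast_args("DataFrame.__setitem__", (key, value))
```

Here the first argument keeps its CPU type while the second is still converted. Supporting another function would only require a new entry in the dict.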

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cudf/cudf/pandas/fast_slow_proxy.py

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

python/cudf/cudf/pandas/fast_slow_proxy.py

bdice · 2024-11-08T22:55:15Z

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

+
+@pytest.mark.timeout(5)
+def test_dataframe_setitem_slowdown():
+    # We are explictily testing the slowdown of the setitem operation


Please give another sentence or two about how this works. Specifically, that we expect the input transformation to be skipped.

Suggested change

# We are explictily testing the slowdown of the setitem operation

# We are explicitly testing the slowdown of the setitem operation.
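The test in this diff guards against the slowdown with a wall-clock limit (`@pytest.mark.timeout(5)`). A minimal standard-library sketch of the same idea is shown below; `run_within_budget` is a hypothetical helper and not part of the PR. Unlike a timeout plugin, it only checks the elapsed time after the call returns and does not interrupt a hung call.

```python
import time


def run_within_budget(fn, budget_seconds):
    """Run fn() and raise TimeoutError if it took longer than the budget.

    Hypothetical helper for illustration; it measures after the fact and
    cannot stop a call that never returns.
    """
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise TimeoutError(
            f"took {elapsed:.3f}s, budget was {budget_seconds:.3f}s"
        )
    return result


# A cheap operation comfortably fits in a 5 second budget.
total = run_within_budget(lambda: sum(range(1000)), 5.0)
```

A time-based check like this is sensitive to machine load, so the budget needs generous headroom over the expected fast-path runtime.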

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

python/cudf/cudf/pandas/fast_slow_proxy.py

galipremsagar · 2024-11-09T00:01:49Z

@bdice Addressed all your reviews.

Matt711

Sorry for all the comments, just trying to make sure I understand the implications of this change.

python/cudf/cudf/pandas/fast_slow_proxy.py

Matt711 · 2024-11-09T01:51:43Z

python/cudf/cudf/pandas/fast_slow_proxy.py

+            if (
+                len(arg) > 0
+                and isinstance(arg[0], _MethodProxy)


We only want to follow this branch when we call DataFrame.__setitem__ and the underlying wrapped object is a pd.DataFrame (so we avoid the DtoH transfer), right? Would a proxy object that wraps cudf.DataFrame also follow this code path?

Yes and yes. This is currently written so that we can easily add any other function and parameter to the dict map above.

Can you let me know if I'm misunderstanding? If a wrapped cudf.DataFrame goes through this code path, then it will incorrectly call "_fsproxy_slow" when it should have stayed "fast". E.g., where df is a wrapped cudf.DataFrame:
df.__setitem__("A", pd.Index([4,5,6]))

df.__setitem__("A", pd.Index([4,5,6])) will not force df into the slow path. It will only force "A" (the argument at index 0) into the slow path, where it is still "A". Now let's look at this example:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
This forces pd.Index(["a", "b"]) onto the slow path, so a true pandas Index is passed rather than a cudf Index.

I documented a bit more thoroughly here: 04ec5b7

For the following case:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

It will be converted to this and tried with cudf:

df.__setitem__(pd.Index(["a", "b"]), cudf.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

And then, if the above fails, the following will be tried on pandas:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
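The try-on-cudf, then retry-on-pandas sequence described here can be sketched with plain Python stand-ins. This is an assumption-laden toy, not the cudf.pandas dispatch code: `call_with_fallback`, `fast_setitem`, and `slow_setitem` are hypothetical, and the "fast" backend's restriction to a single string key is invented purely to trigger a fallback.

```python
def call_with_fallback(fast_fn, slow_fn, args):
    """Try the fast implementation; on any failure, retry on the slow one.

    Returns a (path, result) pair so the caller can see which path ran.
    """
    try:
        return "fast", fast_fn(*args)
    except Exception:
        # In a real GPU/CPU proxy this is where data would have to be
        # copied from device to host before retrying (assumption).
        return "slow", slow_fn(*args)


def fast_setitem(key, value):
    # Invented restriction: the toy fast backend accepts one string label.
    if not isinstance(key, str):
        raise TypeError("fast backend cannot handle this key")
    return {key: value}


def slow_setitem(key, value):
    # The toy slow backend also accepts a list of labels with a dict of columns.
    if isinstance(key, str):
        return {key: value}
    return {k: value[k] for k in key}


path1, result1 = call_with_fallback(fast_setitem, slow_setitem, ("A", [4, 5, 6]))
path2, result2 = call_with_fallback(
    fast_setitem,
    slow_setitem,
    (["a", "b"], {"a": [4, 5, 6], "b": [10, 11, 12]}),
)
```

The second call still produces the right result, but only after a failed fast attempt. In the real system that fallback is where the device-to-host copy is paid.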

No, it doesn't fail via the public API; I used CUDF_PANDAS_FALLBACK_DEBUGGING=True.

It turns out (2.) doesn't work even w/o cudf.pandas

In [1]: import cudf
In [2]: import pandas as pd
In [3]: df = cudf.DataFrame()
In [4]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
ValueError: Data must be 1-dimensional
In [5]: df = pd.DataFrame()
In [6]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [7]: df
Out[7]:
   a   b
0  4  10
1  5  11
2  6  12

Yep there's fallback in this case anyway.

In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
In [3]: df = pd.DataFrame()
In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame
In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [6]: type(df._fsproxy_wrapped)
Out[6]: pandas.core.frame.DataFrame

This is what I get with your changes (and running with CUDF_PANDAS_DEBUGGING=True does not fail). This looks correct to me.

In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
In [3]: df = pd.DataFrame()
In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame
In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [6]: type(df._fsproxy_wrapped)
Out[6]: cudf.core.dataframe.DataFrame
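The sessions above inspect `type(df._fsproxy_wrapped)` to see whether a proxy currently holds a cudf or a pandas object. A small helper along those lines might look like the sketch below. `wrapped_backend` and `ToyProxy` are hypothetical names; only the `_fsproxy_wrapped` attribute is taken from the sessions, and the example uses standard-library objects so it runs without a GPU.

```python
from fractions import Fraction


def wrapped_backend(obj):
    """Return the top-level module of the object a proxy wraps.

    Falls back to the object itself when it has no `_fsproxy_wrapped`
    attribute, i.e. when it is not a proxy.
    """
    wrapped = getattr(obj, "_fsproxy_wrapped", obj)
    return type(wrapped).__module__.split(".")[0]


class ToyProxy:
    """Hypothetical stand-in for a fast-slow proxy object."""
    def __init__(self, wrapped):
        self._fsproxy_wrapped = wrapped


proxy_backend = wrapped_backend(ToyProxy(Fraction(1, 2)))
plain_backend = wrapped_backend(3)
```

With cudf.pandas loaded, calling such a helper on a DataFrame before and after an operation would presumably report "cudf" or "pandas", matching the `type(...)` outputs shown in the sessions. That has not been verified here.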

Co-authored-by: Matthew Murray <[email protected]>

Matt711

This is making more sense to me.

Matt711 · 2024-11-11T21:19:27Z

python/cudf/cudf/pandas/fast_slow_proxy.py

+            if (
+                len(arg) > 0
+                and isinstance(arg[0], _MethodProxy)


Thanks for adding that! I have a question about this example:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

So df can be either a wrapped pandas dataframe or a wrapped cudf dataframe.

1. df is a wrapped pandas dataframe.
In this case, pd.Index(["a", "b"]) is forced to take the slow path. This makes sense to me because if we keep it as a fast-slow proxy object, it will be converted to cudf.Index(["a", "b"]), which causes df.__setitem__(cudf.Index(["a", "b"]), ...) to fail because you can't pass a GPU object to pandas like this. Because this would fail, we'd trigger a DtoH transfer. This data transfer is the main contributing factor to the slowdown. This case makes sense to me.

2. df is a wrapped cudf dataframe.
In this case, pd.Index(["a", "b"]) is also forced to take the slow path. Is this okay because __setitem__ also fails in this case? Trying this locally, I get:

TypeError: Index object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.

python/cudf/cudf/pandas/fast_slow_proxy.py

Co-authored-by: Matthew Murray <[email protected]>

Matt711

Thanks @galipremsagar for explaining that. I'm not sure if you want to wait for another review, but it LGTM.

Matt711 · 2024-11-11T22:44:35Z

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

+
+
+def test_dataframe_setitem_slowdown():
+    # We are explicitly testing the slowdown of the setitem operation


Suggested change

# We are explicitly testing the slowdown of the setitem operation

# We are explicitly testing the slowdown of the setitem operation

# by eliminating the DtoH transfer performed by df[df.columns] = ...

# We do this by ensuring the df.columns argument in the setitem

# operation remains a slow object.

galipremsagar · 2024-11-12T00:06:19Z

Thanks @galipremsagar for explaining that. I'm not sure if you want to wait for another review, but it LGTM.

Thanks for the review, Matt! I'll go ahead and merge. I'll also open a follow-up to fix the cudf bug you surfaced in this discussion.

galipremsagar · 2024-11-12T00:06:29Z

/merge

test fix

4679553

github-actions bot assigned galipremsagar Oct 31, 2024

github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Oct 31, 2024

galipremsagar added 3 commits October 31, 2024 19:48

Merge remote-tracking branch 'upstream/branch-24.12' into 17140

9edf365

Merge remote-tracking branch 'upstream/branch-24.12' into 17140

9d7cfd3

Fix properly

a5a2263

galipremsagar added bug Something isn't working non-breaking Non-breaking change labels Nov 5, 2024

galipremsagar added 5 commits November 8, 2024 14:58

Merge remote-tracking branch 'upstream/branch-24.12' into 17140

52b1585

Merge remote-tracking branch 'upstream/branch-24.12' into 17140

ba50715

make the fix generic

99e9c68

Fix and add tests

c3debb5

Merge branch 'branch-24.12' into 17140

554cfdf

galipremsagar marked this pull request as ready for review November 8, 2024 22:19

galipremsagar requested review from a team as code owners November 8, 2024 22:19

galipremsagar requested review from jameslamb, bdice and mroeschke November 8, 2024 22:19

Matt711 reviewed Nov 8, 2024

python/cudf/cudf/pandas/fast_slow_proxy.py

Matt711 reviewed Nov 8, 2024

python/cudf/cudf/pandas/fast_slow_proxy.py

Matt711 reviewed Nov 8, 2024

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

bdice reviewed Nov 8, 2024

Matt711 reviewed Nov 8, 2024

python/cudf/cudf/pandas/fast_slow_proxy.py

address reviews

d8b8296

Matt711 reviewed Nov 9, 2024

galipremsagar and others added 2 commits November 8, 2024 19:55

Update fast_slow_proxy.py

216c85c

Co-authored-by: Matthew Murray <[email protected]>

document more

04ec5b7

galipremsagar added 2 commits November 9, 2024 15:13

Merge branch '17140' of https://github.com/galipremsagar/cudf into 17140

0d11b03

Merge branch 'branch-24.12' into 17140

2c9bce5

Matt711 reviewed Nov 11, 2024

Update fast_slow_proxy.py

5f000f9

Co-authored-by: Matthew Murray <[email protected]>

Matt711 approved these changes Nov 11, 2024

rapids-bot bot merged commit 84743c3 into rapidsai:branch-24.12 Nov 12, 2024
102 checks passed
