Fix Dataframe.__setitem__ slow-downs #17222

Merged
rapids-bot merged 15 commits into rapidsai:branch-24.12 on Nov 12, 2024

Conversation

Contributor

@galipremsagar galipremsagar commented Oct 31, 2024

Description

Fixes: #17140

This PR fixes slow-downs in DataFrame.__setitem__ by passing CPU objects where they are needed, instead of passing a GPU object, failing, and then performing a GPU -> CPU transfer.

The first argument to DataFrame.__setitem__ can be the columns (a pd.Index). In our fast path this is converted to a cudf.Index, so the call fails on the cudf side, and then the transfer to CPU and the slow path execute. This is the primary reason for the slowdown. This PR maintains a dict mapping of such special functions whose arguments should not be converted to fast-path objects.
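
As a rough illustration of the idea (not the code in this PR), the sketch below keeps a small mapping from method name to the positional arguments that should stay as CPU objects. The names `SLOW_ARG_POSITIONS`, `to_fast`, and `prepare_args` are hypothetical, and plain strings stand in for real pandas/cudf objects.

```python
# Hypothetical sketch: names and structure are assumptions, not cudf's API.
# Maps a method name to the positional-argument indices that must NOT be
# converted to their fast (GPU) form before the call.
SLOW_ARG_POSITIONS = {"__setitem__": {0}}


def to_fast(obj):
    """Stand-in for a CPU -> GPU conversion; tags the value for illustration."""
    return ("fast", obj)


def prepare_args(method_name, args):
    """Convert args to fast form, skipping positions listed for this method."""
    keep_slow = SLOW_ARG_POSITIONS.get(method_name, set())
    return tuple(
        arg if i in keep_slow else to_fast(arg) for i, arg in enumerate(args)
    )


key, value = "columns-index", "new-values"
setitem_args = prepare_args("__setitem__", (key, value))  # key stays as-is
other_args = prepare_args("merge", (key, value))  # both args converted
```

With a mapping like this, supporting another function or parameter would only need one more dictionary entry.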

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the Python (Affects Python cuDF API.) and cudf.pandas (Issues specific to cudf.pandas) labels Oct 31, 2024
@galipremsagar galipremsagar added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 5, 2024
@galipremsagar galipremsagar marked this pull request as ready for review November 8, 2024 22:19
@galipremsagar galipremsagar requested review from a team as code owners November 8, 2024 22:19
python/cudf/cudf/pandas/fast_slow_proxy.py Outdated Show resolved Hide resolved

@pytest.mark.timeout(5)
def test_dataframe_setitem_slowdown():
    # We are explictily testing the slowdown of the setitem operation
Contributor

Please give another sentence or two about how this works. Specifically, that we expect the input transformation to be skipped.

Suggested change
-    # We are explictily testing the slowdown of the setitem operation
+    # We are explicitly testing the slowdown of the setitem operation.

python/cudf/cudf_pandas_tests/test_cudf_pandas.py Outdated Show resolved Hide resolved
@galipremsagar
Contributor Author

@bdice Addressed all your reviews.

Contributor

@Matt711 Matt711 left a comment

Sorry for all the comments, just trying to make sure I understand the implications of this change.

python/cudf/cudf/pandas/fast_slow_proxy.py Outdated Show resolved Hide resolved
Comment on lines +1021 to +1023
if (
    len(arg) > 0
    and isinstance(arg[0], _MethodProxy)
Contributor

We only want to follow this branch when we call DataFrame.__setitem__ and the underlying wrapped object is a pd.DataFrame (so we avoid the DtoH transfer), right? Would a proxy object that wraps cudf.DataFrame also follow this code path?

Contributor Author

Yes and yes. This is currently written so that we can easily add any other function and parameter to the dict mapping above.

Contributor

Can you lmk if I'm misunderstanding? If a wrapped cudf.DataFrame goes through this code path, then it will incorrectly call "_fsproxy_slow" when it should have stayed "fast". E.g., where df is a wrapped cudf.DataFrame:
df.__setitem__("A", pd.Index([4,5,6]))

Contributor Author

@galipremsagar galipremsagar Nov 9, 2024

df.__setitem__("A", pd.Index([4,5,6])) will not force df onto the slow path. It will only force "A" (the argument at index 0) onto the slow path, and the slow form of "A" is still "A". Now consider this example:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
This forces pd.Index(["a", "b"]) onto the slow path, so a true pandas Index is passed rather than a cudf Index.

Contributor Author

I documented a bit more thoroughly here: 04ec5b7

Contributor Author

For the following case:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

It will be converted to this and tried with cudf:

df.__setitem__(pd.Index(["a", "b"]), cudf.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

And then, if the above fails, the following will be tried on pandas:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
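
To make the order of attempts concrete, here is a minimal toy sketch of "try the fast implementation, then fall back to the slow one". Everything in it (`fast_setitem`, `slow_setitem`, `setitem_with_fallback`, and the rule that the fast path only accepts string keys) is a hypothetical stand-in, not cudf.pandas internals, and it omits the GPU -> CPU transfer that a real fallback would involve.

```python
# Hypothetical sketch of "try fast, then fall back to slow".
# fast_setitem / slow_setitem stand in for the cudf and pandas calls.
def fast_setitem(frame, key, value):
    # Toy rule: the fast backend only accepts plain string keys.
    if not isinstance(key, str):
        raise TypeError("fast backend cannot handle this key")
    frame[key] = value
    return "fast"


def slow_setitem(frame, key, value):
    # Toy slow path: also accepts a sequence of keys with matching values.
    keys = [key] if isinstance(key, str) else list(key)
    values = [value] if isinstance(key, str) else list(value)
    for k, v in zip(keys, values):
        frame[k] = v
    return "slow"


def setitem_with_fallback(frame, key, value):
    """Try the fast implementation first; on any failure, use the slow one."""
    try:
        return fast_setitem(frame, key, value)
    except Exception:
        return slow_setitem(frame, key, value)


frame = {}
path_single = setitem_with_fallback(frame, "a", [4, 5, 6])
path_multi = setitem_with_fallback(frame, ("b", "c"), ([1, 2, 3], [7, 8, 9]))
```

Here `path_single` reports the fast path and `path_multi` reports the slow path, since the tuple key is rejected by the toy fast backend.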

Contributor

No, it doesn't fail via the public API. I used CUDF_PANDAS_FALLBACK_DEBUGGING=True.

Contributor

It turns out (2.) doesn't work even w/o cudf.pandas

In [1]: import cudf

In [2]: import pandas as pd

In [3]: df = cudf.DataFrame()

In [4]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

ValueError: Data must be 1-dimensional

In [5]: df = pd.DataFrame()

In [6]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [7]: df
Out[7]: 
   a   b
0  4  10
1  5  11
2  6  12

Contributor

Yep there's fallback in this case anyway.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame

In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [6]: type(df._fsproxy_wrapped)
Out[6]: pandas.core.frame.DataFrame

Contributor

This is what I get with your changes. (And running with CUDF_PANDAS_DEBUGGING=True does not fail.) This looks correct to me.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame

In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

In [6]: type(df._fsproxy_wrapped)
Out[6]: cudf.core.dataframe.DataFrame
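
The check used in these sessions can be wrapped in a small helper. This is only a sketch: `wrapped_backend` and `FakeProxy` are hypothetical names, and the only thing taken from the sessions above is the `_fsproxy_wrapped` attribute. `FakeProxy` exists so the snippet runs without cudf or a GPU.

```python
# Hypothetical helper; relies only on the `_fsproxy_wrapped` attribute
# shown in the sessions above.
def wrapped_backend(proxy):
    """Return the top-level module name of the object a proxy wraps."""
    wrapped = proxy._fsproxy_wrapped
    return type(wrapped).__module__.split(".")[0]


class FakeProxy:
    """Stand-in for a fast-slow proxy so the sketch runs anywhere."""

    def __init__(self, wrapped):
        self._fsproxy_wrapped = wrapped


# A dict is wrapped here, so the reported module is Python's own.
backend = wrapped_backend(FakeProxy({"a": [4, 5, 6]}))
```

Given the types printed above, calling it on a real proxy would be expected to give "cudf" while the frame stays on the GPU and "pandas" after a fallback.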

Contributor

@Matt711 Matt711 left a comment

This is making more sense to me.

Comment on lines +1021 to +1023
if (
    len(arg) > 0
    and isinstance(arg[0], _MethodProxy)
Contributor

@Matt711 Matt711 Nov 11, 2024

Thanks for adding that! I have a question about this example:

df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))

So df can be either a wrapped pandas DataFrame or a wrapped cudf DataFrame.

  1. df is a wrapped pandas dataframe.
    In this case, pd.Index(["a", "b"]) is forced to take the slow path. This makes sense to me because if we keep it as a fast-slow proxy object, it will be converted to cudf.Index(["a", "b"]), which causes df.__setitem__(cudf.Index(["a", "b"]), ...) to fail because you can't pass a GPU object to pandas like this. Because this would fail, we'd trigger a DtoH transfer. This data transfer is the main contributing factor to the slowdown.
  2. df is a wrapped cudf dataframe.
    In this case, pd.Index(["a", "b"]) is also forced to take the slow path. Is this okay because __setitem__ also fails in this case? Trying this locally, I get:
TypeError: Index object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.

python/cudf/cudf/pandas/fast_slow_proxy.py Outdated Show resolved Hide resolved
Co-authored-by: Matthew Murray <[email protected]>
Contributor

@Matt711 Matt711 left a comment

Thanks @galipremsagar for explaining that. I'm not sure if you want to wait for another review, but it LGTM.



def test_dataframe_setitem_slowdown():
    # We are explicitly testing the slowdown of the setitem operation
Contributor

Suggested change
-    # We are explicitly testing the slowdown of the setitem operation
+    # We are explicitly testing the slowdown of the setitem operation
+    # by eliminating the DtoH transfer performed by df[df.columns] = ...
+    # We do this by ensuring the df.columns argument in the setitem
+    # operation remains a slow object.
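
For readers unfamiliar with time-budget tests, the sketch below shows the general shape using only the standard library. It is an assumption-laden stand-in, not the test in this PR: `run_within` is a hypothetical helper that checks elapsed time after the call returns, whereas the `timeout` marker on the real test aborts a run that exceeds its limit. The `sum` call is a toy workload in place of `df[df.columns] = ...`.

```python
import time


def run_within(budget_seconds, func, *args, **kwargs):
    """Call func and raise AssertionError if it took longer than the budget."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise AssertionError(
            f"took {elapsed:.3f}s, budget was {budget_seconds:.3f}s"
        )
    return result


# Toy workload standing in for the setitem operation under test.
total = run_within(5.0, sum, range(1000))
```

A regression that reintroduces the extra transfer would show up as the budget being exceeded rather than as a wrong result.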

@galipremsagar
Contributor Author

> Thanks @galipremsagar for explaining that. I'm not sure if you want to wait for another review, but it LGTM.

Thanks for the review, Matt! I'll go ahead and merge. I'll also open a follow-up to fix the cudf bug you surfaced in this discussion.

@galipremsagar
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 84743c3 into rapidsai:branch-24.12 Nov 12, 2024
102 checks passed
Labels
bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] Slow Performance of cuDF Pandas on L4
3 participants