-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Dataframe.__setitem__
slow-downs
#17222
Conversation
|
||
@pytest.mark.timeout(5) | ||
def test_dataframe_setitem_slowdown(): | ||
# We are explictily testing the slowdown of the setitem operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please give another sentence or two about how this works. Specifically, that we expect the input transformation to be skipped.
# We are explictily testing the slowdown of the setitem operation | |
# We are explicitly testing the slowdown of the setitem operation. |
@bdice Addressed all your reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for all the comments, just trying to make I understand the implications of this change.
if ( | ||
len(arg) > 0 | ||
and isinstance(arg[0], _MethodProxy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only want to follow this branch when we call DataFrame.__setitem__
and the underlying wrapped object is a pd.DataFrame
(so we avoid the DtoH transfer), right? Would a proxy object that wraps cudf.DataFrame
also follow this code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes. This is currently written in a way we could easily add to the dict map above for any other function and parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you lmk if I'm misunderstanding? If a wrapped cudf.DataFrame goes through this code path, then it will incorrectly call "_fsproxy_slow" when it should have stayed "fast". Eg. where df is a wrapped cudf.DataFrame
df.__setitem__("A", pd.Index([4,5,6]))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.__setitem__("A", pd.Index([4,5,6]))
will not force df
into slow path. It will only force"A"
(at 0th index) in slow path which is also "A"
, now lets see this example:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
, this will force pd.Index(["a", "b"])
to slow path with is a true pandas index rather than passing a cudf Index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I documented a bit more thoroughly here: 04ec5b7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the following case:
df.setitem(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
It will be converted to this and tried with cudf:
df.setitem(pd.Index(["a", "b"]), cudf.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
And then if the above fails the following will be tried on pandas:
df.setitem(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it doesn't fail via the public API, I used CUDF_PANDAS_FALLBACK_DEBUGGING=True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out (2.) doesn't work even w/o cudf.pandas
In [1]: import cudf
In [2]: import pandas as pd
In [3]: df = cudf.DataFrame()
In [4]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
ValueError: Data must be 1-dimensional
In [5]: df = pd.DataFrame()
In [6]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [7]: df
Out[7]:
a b
0 4 10
1 5 11
2 6 12
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep there's fallback in this case anyway.
In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
In [3]: df = pd.DataFrame()
In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame
In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [6]: type(df._fsproxy_wrapped)
Out[6]: pandas.core.frame.DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This what I get with your changes. (And running with CUDF_PACDAS_DEBUGGING=True
does not fail). This looks correct to me
In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
In [3]: df = pd.DataFrame()
In [4]: type(df._fsproxy_wrapped)
Out[4]: cudf.core.dataframe.DataFrame
In [5]: df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
In [6]: type(df._fsproxy_wrapped)
Out[6]: cudf.core.dataframe.DataFrame
Co-authored-by: Matthew Murray <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is making more sense to me.
if ( | ||
len(arg) > 0 | ||
and isinstance(arg[0], _MethodProxy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding that! I have a question about this example:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
So df
can either be a wrapped a pandas or cudf dataframe.
df
is a wrapped pandas dataframe.
In this case,pd.Index(["a", "b"])
is forced to take the slow path. This makes sense to because if we keep it a fast-slow proxy object, then it will be converted tocudf.Index(["a", "b"])
and then causedf.__setitem__(cudf.Index(["a", "b"]), ...)
to fail because you can't pass a GPU object to pandas like this. Because this would fail, we'd trigger a DtoH transfer. This data transfer is the main contributing factor to the slow down. This case makes sense to me.df
is a wrapped cudf dataframe.
In this case,pd.Index(["a", "b"])
is also forced to take the slow path. Is this okay because__setitem__
also fails in this case? Trying this locally, I get
TypeError: Index object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.
Co-authored-by: Matthew Murray <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @galipremsagar for explaining that. I'm not sure if you want to wait for another review, but it LGTM.
|
||
|
||
def test_dataframe_setitem_slowdown(): | ||
# We are explicitly testing the slowdown of the setitem operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# We are explicitly testing the slowdown of the setitem operation | |
# We are explicitly testing the slowdown of the setitem operation | |
# by eliminating the DtoH transfer performed by df[df.columns] = ... | |
# We do this by ensuring the df.columns argument in the setitem | |
# operation remains a slow object. |
Thanks for the review, Matt! I'll go ahead and merge. I'll also open a follow-up to fix the cudf bug you surfaced in this discussion. |
/merge |
Description
Fixes: #17140
This PR fixes slow-downs in
DataFrame.__seitem__
by properly passing in CPU objects where needed instead of passing a GPU object and then failing and performing a GPU -> CPU transfer.DataFrame.__setitem__
first argument can be a column(pd.Index), in our fast path this will be converted tocudf.Index
and thus there will be failure from cudf side and then the transfer to CPU + slow-path executes, this is the primary reason for slowdown. This PR maintains a dict mapping of such special functions where we shouldn't be converting the objects to fast path.Checklist