Fix `Dataframe.__setitem__` slow-downs #17222
We only want to follow this branch when we call `DataFrame.__setitem__` and the underlying wrapped object is a `pd.DataFrame` (so we avoid the DtoH transfer), right? Would a proxy object that wraps a `cudf.DataFrame` also follow this code path?
Yes and yes. This is currently written so that we could easily extend the dict map above to cover any other function and parameter.
Can you let me know if I'm misunderstanding? If a wrapped `cudf.DataFrame` goes through this code path, then it will incorrectly call `_fsproxy_slow` when it should have stayed "fast". E.g., where `df` is a wrapped `cudf.DataFrame`: `df.__setitem__("A", pd.Index([4,5,6]))`
`df.__setitem__("A", pd.Index([4,5,6]))` will not force `df` onto the slow path. It will only force `"A"` (the argument at index 0) onto the slow path, and the result is still `"A"`. Now let's look at this example: `df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))`. This will force `pd.Index(["a", "b"])` onto the slow path, so a true pandas index is passed rather than a cudf Index.
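The behavior described above can be illustrated with a minimal, library-free sketch. It is not the code from this PR: `Proxy`, `FORCE_SLOW_ARGS`, and `prepare_args` are hypothetical names, and the real proxy machinery in `cudf.pandas` is more involved. The sketch only shows the idea of a dict that maps a method name to the argument positions that are unwrapped to their slow (pandas) object before the call.

```python
# Hypothetical sketch; none of these names are taken from cudf.pandas.

class Proxy:
    """Stand-in for a proxy that holds a fast (GPU) and a slow (CPU) object."""

    def __init__(self, fast, slow):
        self.fast = fast
        self.slow = slow


# Maps a method name to the positional-argument indices that should be
# unwrapped to their slow object before the method is called.
FORCE_SLOW_ARGS = {"__setitem__": (0,)}


def prepare_args(method_name, args):
    """Return args with the configured positions replaced by their slow object.

    Non-proxy arguments (for example a plain string key) pass through unchanged.
    """
    forced = FORCE_SLOW_ARGS.get(method_name, ())
    return tuple(
        arg.slow if i in forced and isinstance(arg, Proxy) else arg
        for i, arg in enumerate(args)
    )


# A plain string key is untouched; a proxied index key becomes its slow object.
key = Proxy(fast="cudf-index", slow="pandas-index")
value = Proxy(fast="cudf-frame", slow="pandas-frame")
string_key_args = prepare_args("__setitem__", ("A", value))
index_key_args = prepare_args("__setitem__", (key, value))
```

Only the argument at index 0 is affected in this sketch; the value argument and the receiving object are left as they are, which mirrors the point that `df` itself is not forced onto the slow path.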
I documented a bit more thoroughly here: 04ec5b7
For the following case:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
It will be converted to this and tried with cudf:
df.__setitem__(pd.Index(["a", "b"]), cudf.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
And then, if the above fails, the following will be tried on pandas:
df.__setitem__(pd.Index(["a", "b"]), pd.DataFrame({'a':[4,5,6], 'b':[10, 11, 12]}))
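The try-fast-then-fall-back sequence above can be sketched generically as follows. This is an illustrative assumption, not the actual `cudf.pandas` dispatch code: `call_with_fallback`, `fast_setitem`, and `slow_setitem` are hypothetical, and plain dicts stand in for the GPU and CPU frames.

```python
# Hypothetical sketch of a fast-then-slow dispatch; not the cudf.pandas code.

def call_with_fallback(fast_fn, slow_fn, *args, **kwargs):
    """Try the fast implementation first; on any exception, use the slow one.

    Returns a (result, path) pair so the caller can see which path ran.
    """
    try:
        return fast_fn(*args, **kwargs), "fast"
    except Exception:
        return slow_fn(*args, **kwargs), "slow"


def fast_setitem(frame, key, value):
    # Stand-in for the GPU path: assume it only supports string keys.
    if not isinstance(key, str):
        raise NotImplementedError("fast path supports only string keys")
    return {**frame, key: value}


def slow_setitem(frame, key, value):
    # Stand-in for the CPU path: accepts a string key or a list of keys.
    keys = [key] if isinstance(key, str) else list(key)
    return {**frame, **{k: value for k in keys}}


fast_result, fast_path = call_with_fallback(
    fast_setitem, slow_setitem, {}, "a", [4, 5, 6]
)
slow_result, slow_path = call_with_fallback(
    fast_setitem, slow_setitem, {}, ["a", "b"], [4, 5, 6]
)
```

Returning the path taken alongside the result is a choice made for this sketch so that a fallback is visible to the caller; it is not a claim about how the real proxy reports fallbacks.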
No, it doesn't fail via the public API. I used `CUDF_PANDAS_FALLBACK_DEBUGGING=True`.
It turns out (2.) doesn't work even without `cudf.pandas`.
Yep there's fallback in this case anyway.
This is what I get with your changes. (And running with `CUDF_PANDAS_DEBUGGING=True` does not fail.) This looks correct to me.