Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add
Series.drop
api #7304Add
Series.drop
api #7304Changes from 7 commits
d3be46f
85bf4f8
b301cd8
818a8bb
bc0f5f2
1c38141
a1abaa1
188e11d
fd70f2b
2765eab
5325fcc
a61119f
e8d494a
43acf56
b72ac3e
3cb9ede
bef984a
3145f2b
f4b72d3
1c2e8d0
2b893fa
a4f9007
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - very elegant!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the join parameters are fixed here, it might be worth jumping straight to the cython without first creating a dataframe. Something like
This way you avoid all the python overhead of constructing a dataframe and then a bunch of unecessary checking and typecasting that the merge codepath uses, which isn't necessarily needed in this case. The downside is you might want to throw if the labels passed dont match the dtype of the target series index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a fast path that could exist in the merge code that we're not currently enabling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how useful it would be for users, but after the upcoming join improvements it might be useful as an internal utility for
join
at least, since join is just a narrow case of merge withleft_index=True
,right_index=True
and noon
parameters.A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally if we know something about what our cython API expects and are sure what we are passing to it is right. It would also help if
cudf._lib.join.join
had a better "developer facing" API, so the actual call to it from didn't seem so weird.The counterargument is that to avoid libcudf itself erroring in an invalid case, we have to do a dtype check, to make sure the labels and the index are the same type. It could be argued that we might as well let the full user facing merge code error since it's already built to error in that case. But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't we still need dtype validation, etc., here? The user-provided inputs to
drop()
could be of different dtype from the index columns.Agree that we need a better developer-facing join API. FWIW, that won't be coming from Cython though, as the Cython API is going to be very simple (just returning a tuple of gathermaps).
Maybe we could wrap the call to
join
here in atry...except
and raise a more informative error message?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that, I guess my only question is in general, should we avoid using our own user facing API entrypoints internally in favor of a lower level API, just to avoid all the extra overhead we put in the user facing APIs that are meant to check every edge case of everything they could possibly be doing wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're in 100% agreement that we should use internal APIs that bypass Pandas semantics for efficiency. My only point is that doesn't always have to be calling all the way down to Cython. For example
_gather
is an internal API at the Python level.Here specifically, it looks like we need a
join
API without all the bells and whistles of the top-levelmerge()
method, and we currently don't really have that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw in drop there is this line:
which I think guards against the dtype mismatch issue? @brandon-b-miller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offline sync with @brandon-b-miller : There is room to optimize for join parameter logic checks, but should defer until the
join
refactor is finished.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, can we add a TODO here to optimize later by calling a lower-level join API? Ideally, the TODO should document exactly what logic checks we would like to avoid here.