Fix default name conversion in ToFrame
#1044
dask_expr/_expr.py
Outdated
    @functools.cached_property
    def unique_partition_mapping_columns_from_shuffle(self):
        unique_mapping = self.frame.unique_partition_mapping_columns_from_shuffle
I don't see how this test covers the added function here? Could you elaborate?
Yeah, that's a fair question. This problem is still a bit confusing to me :)
When an unnamed Series is shuffled and then converted to a DataFrame, its `unique_partition_mapping_columns_from_shuffle` result will be something like `{None}` instead of a set containing the real (default) column name (`{0}`). This results in a `KeyError` when `RenameFrame` tries to select the `None` key instead of `0`.
There seem to be several ways to avoid the error. However, I think the root problem is that `ToFrame` must properly account for the name of the column it creates.
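To make the failure mode above concrete, here is a minimal pure-Python sketch. It does not use dask-expr at all: the `frame` dict, `select`, and `remap_default_name` are hypothetical stand-ins chosen for illustration, and the sketch only assumes what the comment states, namely that the mapping holds `None` while the converted frame's column is labelled `0`.

```python
# Hypothetical stand-in for a one-column frame built from an unnamed Series:
# the only column carries the default label 0, not None.
frame = {0: [10, 20, 30]}


def select(frame, names):
    """Pick columns by label; raises KeyError for a label that does not exist."""
    return {name: frame[name] for name in names}


def remap_default_name(unique_mapping, new=0):
    """Replace the unnamed-Series label (None) with the label the frame really has.

    Hypothetical helper, not the dask-expr implementation.
    """
    return {new if name is None else name for name in unique_mapping}


stale = {None}  # mapping carried over unchanged from the unnamed Series
try:
    select(frame, stale)
    failed = False
except KeyError:
    failed = True  # the lookup of None fails because the column is named 0

fixed = remap_default_name(stale)
selected = select(frame, fixed)
```

The point of the sketch is only that the rename has to happen where the column is created; it says nothing about how the actual fix is wired into `ToFrame`.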
Yes, agreed, but this check might not properly account for that:
`set(self.frame.columns) == unique_mapping`
`unique_mapping` could be a tuple of one column, so I think we have to be a bit more elaborate here.
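One possible reading of this concern, which the thread does not confirm, is that an entry of the set may itself be a one-element tuple such as `(None,)`. Under that assumption, a plain set comparison against the column names would not match. The sketch below is pure Python; `rename_none` is a hypothetical helper, not part of dask-expr.

```python
def rename_none(unique_mapping, new=0):
    """Rename None to `new` in a set whose entries are labels or tuples of labels.

    Hypothetical helper: assumes entries may be tuples, which is one
    interpretation of the review comment, not a confirmed fact.
    """
    def fix(entry):
        if isinstance(entry, tuple):
            return tuple(new if name is None else name for name in entry)
        return new if entry is None else entry

    return {fix(entry) for entry in unique_mapping}


columns = [None]      # name of the unnamed Series
plain = {None}        # entry stored as a bare label
wrapped = {(None,)}   # entry stored as a one-element tuple (assumed case)

# The equality check only matches the bare-label form.
naive_plain = set(columns) == plain
naive_wrapped = set(columns) == wrapped

# Renaming entry by entry handles both forms.
fixed_plain = rename_none(plain)
fixed_wrapped = rename_none(wrapped)
```

Here `naive_wrapped` is `False` because `None` and `(None,)` are different set elements, while `rename_none` rewrites either form.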
Okay, I could use your help here if the current solution is wrong/incomplete. I was thinking that the only case we need to catch here is when we are converting to a dataframe from an unnamed Index or Series, but I didn't dig into the Index case at all. Is that what you have in mind?
> unique_mapping could be a tuple of one column

Doesn't `unique_partition_mapping_columns_from_shuffle` always return a set?
@phofl - Do you have a use case in mind where this still fails? I'd like to make sure this fix (or something better) is included in the next release.
For future PRs: we need tests like the one I added if we change the partitioning implementation
thx
Oh cool - I didn't see
Hmm - Seems like the new
good point, #1052 That part of the test didn't make much sense |
**[WIP]** I'm using this PR to debug/add support for `DASK_DATAFRAME__QUERY_PLANNING=True`.

**NOTES**:
- Depends on dask/dask-expr#1041 [Merged]
- Depends on dask/dask-expr#1044

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)
- Ray Douglass (https://github.com/raydouglass)

URL: #4325
Possible fix for a subtle optimization bug that shows up when an unnamed `Series` is shuffled and then converted to a `DataFrame` and merged. Definitely a bit of a "corner case", but does show up in `cugraph` CI.