Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix default name conversion in ToFrame #1044

Merged
merged 10 commits into from
May 2, 2024
Merged

Conversation

rjzamora
Copy link
Member

Possible fix for a subtle optimization bug that shows up when an unnamed Series is shuffled and then converted to a DataFrame and merged. Definitely a bit of a "corner case", but does show up in cugraph CI.

@rjzamora rjzamora added the bug Something isn't working label Apr 26, 2024
@rjzamora rjzamora self-assigned this Apr 26, 2024

@functools.cached_property
def unique_partition_mapping_columns_from_shuffle(self):
unique_mapping = self.frame.unique_partition_mapping_columns_from_shuffle
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this test covers the added function here? Could you elaborate?

Copy link
Member Author

@rjzamora rjzamora Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a fair question. This problem is still a bit confusing to me :)

When an un-named Series is shuffled, and then converted to a DataFrame, it's unique_partition_mapping_columns_from_shuffle result will be something like {None} instead of a set containing the real (default) column name ({0}). This results in a KeyError when RenameFrame tries to select the None key instead of 0.

There seem to be several ways to avoid the error. However, I think the root problem is that ToFrame must properly account for the name of the column it creates.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agree, but this check might not properly account for that

set(self.frame.columns) == unique_mapping

unique_mapping could be a tuple of one column, I think we have to be a bit more elaborate here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I could use your help here if the current solution is wrong/incomplete. I was thinking that the only case we need to catch here is when we are converting to a dataframe from an unnamed Index or Series, but I didn't dig into the Index case at all. Is that what you have in mind?

unique_mapping could be a tuple of one column

Doesn't unique_partition_mapping_columns_from_shuffle always return a set?

@rjzamora
Copy link
Member Author

@phofl - Do you have a use case in mind where this still fails? I'd like to make sure this fix (or something better) is included in the next release.

@phofl
Copy link
Collaborator

phofl commented May 2, 2024

For future PRs: we need tests like the one I added if we change the partitioning implementation

@phofl phofl merged commit 26728a4 into dask:main May 2, 2024
7 checks passed
@phofl
Copy link
Collaborator

phofl commented May 2, 2024

thx

@rjzamora
Copy link
Member Author

rjzamora commented May 2, 2024

Oh cool - I didn't see test_partitioning_knowledge.py before. Thanks for the help here @phofl !

@rjzamora rjzamora deleted the fix-rename-then-merge branch May 2, 2024 13:21
@rjzamora
Copy link
Member Author

rjzamora commented May 2, 2024

Hmm - Seems like the new test_merge_groupby_to_frame test is failing in #1049 for 3.9

@phofl
Copy link
Collaborator

phofl commented May 2, 2024

good point, #1052

That part of the test didn't make much sense

rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request May 28, 2024
**[WIP]** I'm using this PR to debug/add support for `DASK_DATAFRAME__QUERY_PLANNING=True`.

**NOTES**:
- Depends on dask/dask-expr#1041 [Merged]
- Depends on dask/dask-expr#1044

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Ray Douglass (https://github.com/raydouglass)

URL: #4325
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants