Fix default name conversion in `ToFrame` #1044

rjzamora · 2024-04-26T17:38:20Z

Possible fix for a subtle optimization bug that shows up when an unnamed Series is shuffled and then converted to a DataFrame and merged. Definitely a bit of a "corner case", but does show up in cugraph CI.

dask_expr/tests/test_merge.py

phofl · 2024-04-27T11:36:25Z

dask_expr/_expr.py

+
+    @functools.cached_property
+    def unique_partition_mapping_columns_from_shuffle(self):
+        unique_mapping = self.frame.unique_partition_mapping_columns_from_shuffle


I don't see how this test covers the added function here? Could you elaborate?

Yeah, that's a fair question. This problem is still a bit confusing to me :)

When an unnamed Series is shuffled and then converted to a DataFrame, its unique_partition_mapping_columns_from_shuffle result will be something like {None} instead of a set containing the real (default) column name ({0}). This results in a KeyError when RenameFrame tries to select the None key instead of 0.

There seem to be several ways to avoid the error. However, I think the root problem is that ToFrame must properly account for the name of the column it creates.
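The name mismatch described above can be reproduced with plain pandas. This is a minimal sketch, not the dask-expr implementation: `stale_mapping` and `translate_unnamed` are hypothetical names used only for illustration, and the set stands in for what `unique_partition_mapping_columns_from_shuffle` might hold after an unnamed Series is shuffled.

```python
import pandas as pd

# An unnamed Series has name None, but to_frame() gives its column
# the default name 0.
unnamed = pd.Series([1, 2, 3])
frame = unnamed.to_frame()
default_name = frame.columns[0]

# Hypothetical stand-in for the partitioning knowledge recorded before
# the conversion: it still refers to the Series name (None).
stale_mapping = {None}


def translate_unnamed(mapping, new_name):
    """Replace the None entry with the column name the frame really has.

    Illustrative helper only; it is not part of dask-expr.
    """
    return {new_name if col is None else col for col in mapping}


fixed_mapping = translate_unnamed(stale_mapping, default_name)

print(None in frame.columns)   # the stale key is not a real column
print(fixed_mapping)           # the translated mapping uses the default name
```

Selecting columns with the stale mapping fails because `None` is not a column of the frame, while the translated mapping only refers to columns that exist.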

Yes agree, but this check might not properly account for that

set(self.frame.columns) == unique_mapping

unique_mapping could be a tuple of one column, I think we have to be a bit more elaborate here

Okay, I could use your help here if the current solution is wrong/incomplete. I was thinking that the only case we need to catch here is when we are converting to a dataframe from an unnamed Index or Series, but I didn't dig into the Index case at all. Is that what you have in mind?

unique_mapping could be a tuple of one column

Doesn't unique_partition_mapping_columns_from_shuffle always return a set?
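To make the concern about tuples concrete, here is a small sketch. It assumes (this is an assumption, not confirmed from the dask-expr source) that the mapping is a set whose entries may be either a single column name or a tuple of column names. `rename_in_mapping` is a hypothetical helper, not a dask-expr function.

```python
def rename_in_mapping(mapping, old, new):
    """Rename `old` to `new` in a set of partitioning entries.

    Entries are assumed to be either a plain column name or a tuple of
    column names. Illustrative only; not the dask-expr implementation.
    """
    renamed = set()
    for entry in mapping:
        if isinstance(entry, tuple):
            # Rename inside the tuple, keeping order and length.
            renamed.add(tuple(new if col == old else col for col in entry))
        else:
            renamed.add(new if entry == old else entry)
    return renamed


columns = [0]

# A direct set comparison misses the one-column tuple case:
tuple_mapping = {(None,)}
print(set(columns) == rename_in_mapping(tuple_mapping, None, 0))

# The plain-name case compares equal after renaming:
plain_mapping = {None}
print(set(columns) == rename_in_mapping(plain_mapping, None, 0))
```

Under that assumption, a check like `set(self.frame.columns) == unique_mapping` would not match an entry stored as a one-element tuple, so such entries need separate handling.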

rjzamora · 2024-04-30T15:11:37Z

@phofl - Do you have a use case in mind where this still fails? I'd like to make sure this fix (or something better) is included in the next release.

phofl · 2024-05-02T11:56:06Z

For future PRs: we need tests like the one I added if we change the partitioning implementation

phofl · 2024-05-02T12:52:41Z

thx

rjzamora · 2024-05-02T13:21:18Z

Oh cool - I didn't see test_partitioning_knowledge.py before. Thanks for the help here @phofl !

rjzamora · 2024-05-02T14:27:22Z

Hmm - Seems like the new test_merge_groupby_to_frame test is failing in #1049 for 3.9

phofl · 2024-05-02T14:35:46Z

good point, #1052

That part of the test didn't make much sense

**[WIP]** I'm using this PR to debug/add support for `DASK_DATAFRAME__QUERY_PLANNING=True`.

**NOTES**:
- Depends on dask/dask-expr#1041 [Merged]
- Depends on dask/dask-expr#1044

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)
- Ray Douglass (https://github.com/raydouglass)

URL: #4325

rjzamora added 2 commits April 26, 2024 10:31

fix subtle issue with renaming before a merge

a0e0e43

check formatting

b73938d

rjzamora added the bug Something isn't working label Apr 26, 2024

rjzamora self-assigned this Apr 26, 2024

rjzamora commented Apr 26, 2024


dask_expr/tests/test_merge.py (outdated, resolved)

Update dask_expr/tests/test_merge.py

aea6002

rjzamora mentioned this pull request Apr 26, 2024

Enable expression-based Dask Dataframe support rapidsai/cugraph#4325

Merged

phofl reviewed Apr 27, 2024


rjzamora added 2 commits April 29, 2024 08:58

Merge remote-tracking branch 'upstream/main' into fix-rename-then-merge

9ee4059

test index case

d46447d

rjzamora and others added 5 commits May 1, 2024 08:59

Merge remote-tracking branch 'upstream/main' into fix-rename-then-merge

83adc93

handle tuples

c15e5d7

Add test

9a63d29

Add test

de1e678

Update

4977083

phofl merged commit 26728a4 into dask:main May 2, 2024
7 checks passed

rjzamora deleted the fix-rename-then-merge branch May 2, 2024 13:21
