Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix default name conversion in ToFrame #1044

Merged
merged 10 commits into from
May 2, 2024
Merged
9 changes: 8 additions & 1 deletion dask_expr/_expr.py
Original file line number Diff line number Diff line change
Expand Up @@ -1613,7 +1613,14 @@ class ToFrame(Elemwise):
_keyword_only = ["name"]
operation = M.to_frame
_filter_passthrough = True
_preserves_partitioning_information = True

@functools.cached_property
def unique_partition_mapping_columns_from_shuffle(self):
unique_mapping = self.frame.unique_partition_mapping_columns_from_shuffle
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this test covers the added function here? Could you elaborate?

Copy link
Member Author

@rjzamora rjzamora Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a fair question. This problem is still a bit confusing to me :)

When an un-named Series is shuffled, and then converted to a DataFrame, it's unique_partition_mapping_columns_from_shuffle result will be something like {None} instead of a set containing the real (default) column name ({0}). This results in a KeyError when RenameFrame tries to select the None key instead of 0.

There seem to be several ways to avoid the error. However, I think the root problem is that ToFrame must properly account for the name of the column it creates.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agree, but this check might not properly account for that

set(self.frame.columns) == unique_mapping

unique_mapping could be a tuple of one column, I think we have to be a bit more elaborate here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I could use your help here if the current solution is wrong/incomplete. I was thinking that the only case we need to catch here is when we are converting to a dataframe from an unnamed Index or Series, but I didn't dig into the Index case at all. Is that what you have in mind?

unique_mapping could be a tuple of one column

Doesn't unique_partition_mapping_columns_from_shuffle always return a set?

if set(self.frame.columns) == unique_mapping:
# Account for default name conversion
return set(self.columns)
return unique_mapping


class ToFrameIndex(Elemwise):
Expand Down
15 changes: 15 additions & 0 deletions dask_expr/tests/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -1011,3 +1011,18 @@ def test_merge_suffix_projections():
expected = df.merge(df, on="a")
expected = expected[expected["c_x"] == "A"]["c_y"]
assert_eq(result, expected)


def test_merge_after_rename():
pleft = pd.Series(range(10))
pleft = pleft.drop_duplicates().to_frame()
pleft.columns = ["a"]

left = from_pandas(pd.Series(range(10)), npartitions=2)
left = left.drop_duplicates().to_frame()
left.columns = ["a"]

right = pd.DataFrame({"a": [1, 2] * 5})
expected = pleft.merge(right, how="inner")
result = left.merge(right, how="inner")
assert_eq(result, expected)
rjzamora marked this conversation as resolved.
Show resolved Hide resolved
Loading