Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#7254: Support right merge/join #42

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 27 additions & 4 deletions modin/core/storage_formats/pandas/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,11 @@ def row_axis_merge(
if how in ["left", "inner"] and left_index is False and right_index is False:
kwargs["sort"] = False

reverted = False
if how == "inner" and left._modin_frame._partitions.shape[0] == 1:
left, right = right, left
reverted = True
Comment on lines +133 to +135
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

category Functionality

The new condition to swap operands in the merge operation is a good optimization for performance. However, we need to ensure that this swap doesn't affect the order of columns in the final result. Consider adding a step after the merge to reorder the columns if the operands were swapped. This will maintain consistency with the expected output when the operands are not swapped.

Chat with Korbit by mentioning @korbit-ai, and give a 👍 or 👎 to help Korbit improve your reviews.


def should_keep_index(left, right):
keep_index = False
if left_on is not None and right_on is not None:
Expand All @@ -147,7 +152,10 @@ def should_keep_index(left, right):
def map_func(
left, right, *axis_lengths, kwargs=kwargs, **service_kwargs
): # pragma: no cover
df = pandas.merge(left, right, **kwargs)
if reverted:
df = pandas.merge(right, left, **kwargs)
else:
df = pandas.merge(left, right, **kwargs)

if kwargs["how"] == "left":
partition_idx = service_kwargs["partition_idx"]
Expand All @@ -170,9 +178,24 @@ def map_func(
on = list(on) if is_list_like(on) else [on]

right_to_broadcast = right._modin_frame.combine()
new_columns, new_dtypes = cls._compute_result_metadata(
left, right, on, left_on, right_on, kwargs.get("suffixes", ("_x", "_y"))
)
if not reverted:
new_columns, new_dtypes = cls._compute_result_metadata(
left,
right,
on,
left_on,
right_on,
kwargs.get("suffixes", ("_x", "_y")),
)
else:
new_columns, new_dtypes = cls._compute_result_metadata(
right,
left,
on,
right_on,
left_on,
kwargs.get("suffixes", ("_x", "_y")),
)
Comment on lines +181 to +198
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

category Functionality

The _compute_result_metadata method has been updated to handle the case when operands are swapped. However, it's important to ensure that this change doesn't introduce any inconsistencies in the computed metadata. Consider adding unit tests specifically for the swapped operands case to verify that the computed metadata (columns and dtypes) is correct in both swapped and non-swapped scenarios.

Chat with Korbit by mentioning @korbit-ai, and give a 👍 or 👎 to help Korbit improve your reviews.


# We rebalance when the ratio of the number of existing partitions to
# the ideal number of partitions is smaller than this threshold. The
Expand Down