[BUG] cudf incorrect when merging on both index level and column when specifying left_on and right_on
#11550
Comments
This includes a slow workaround for rapidsai/cudf#11550
This issue has been labeled |
Note that this behavior was added to pandas 0.23: and was added to Dask here: This is an important optimization for |
This is a bug in cudf's treatment of merge when left_on and right_on name both an index level and a column:
import cudf
df = cudf.DataFrame({'a': [1, 2, 1, 2], 'b': [2, 3, 3, 4]}).set_index('a')
df2 = cudf.DataFrame({'a': [1, 2, 1, 3], 'b': [2, 30, 3, 4]}).set_index('a')
df2['c'] = 10
# Merging with on= gives the expected result: one "b" column.
expected = df2.merge(df, on=["a", "b"], how="outer")
#     b     c
# a
# 1   2    10
# 1   3    10
# 2  30    10
# 3   4    10
# 2   3  <NA>
# 2   4  <NA>
# Merging with left_on=/right_on= on the same keys duplicates "b" as b_x and b_y.
got = df2.merge(df, left_on=["a", "b"], right_on=["a", "b"], how="outer")
#    b_x     c   b_y
# a
# 1    2    10     2
# 1    3    10     3
# 2   30    10  <NA>
# 3    4    10  <NA>
# 2    3  <NA>     3
# 2    4  <NA>     4
I think this is because the set of same-named key columns is left empty whenever either side joins on an index level. This patch might be right:
diff --git a/python/cudf/cudf/core/join/join.py b/python/cudf/cudf/core/join/join.py
index 0e5ac8dc02..18f02170bc 100644
--- a/python/cudf/cudf/core/join/join.py
+++ b/python/cudf/cudf/core/join/join.py
@@ -147,12 +147,13 @@ class Merge:
         self._key_columns_with_same_name = (
             set(_coerce_to_tuple(on))
             if on
-            else set()
-            if (self._using_left_index or self._using_right_index)
             else {
                 lkey.name
                 for lkey, rkey in zip(self._left_keys, self._right_keys)
                 if lkey.name == rkey.name
+                and not (
+                    isinstance(lkey, _IndexIndexer) or isinstance(rkey, _IndexIndexer)
+                )
             }
         )
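The effect of the proposed change can be modelled without a GPU. The following is a minimal pure-Python sketch, not cudf's implementation: `ColumnKey` and `IndexKey` are hypothetical stand-ins for cudf's column and index key indexers, and the "before" rule assumes that the two `_using_*_index` flags are true whenever any key on that side is an index key, as the fix description in this thread suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnKey:
    """Hypothetical stand-in for a join key that names a regular column."""
    name: str


@dataclass(frozen=True)
class IndexKey:
    """Hypothetical stand-in for a join key that names an index level."""
    name: str


def same_name_keys_before(left_keys, right_keys):
    """Pre-patch rule (assumed): any index key on either side empties the set."""
    if any(isinstance(k, IndexKey) for k in (*left_keys, *right_keys)):
        return set()
    return {l.name for l, r in zip(left_keys, right_keys) if l.name == r.name}


def same_name_keys_after(left_keys, right_keys):
    """Patched rule: skip only the key pairs that involve an index key."""
    return {
        l.name
        for l, r in zip(left_keys, right_keys)
        if l.name == r.name
        and not (isinstance(l, IndexKey) or isinstance(r, IndexKey))
    }


# Keys from the reproducer: "a" is an index level, "b" is a column.
left_keys = [IndexKey("a"), ColumnKey("b")]
right_keys = [IndexKey("a"), ColumnKey("b")]

before = same_name_keys_before(left_keys, right_keys)
after = same_name_keys_after(left_keys, right_keys)
print(before)  # set()
print(after)   # {'b'}
```

With the patched rule, "b" is recognised as a shared key column even though "a" is joined through the index.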
Previously, if any of the join keys were indices, we assumed that they all were, and provided an empty set of key columns with matching names in the left and right dataframe. This does the wrong thing for mixed join keys (on a combination of index and normal columns), producing more output columns than is correct. To avoid this, only skip matching key names if they name indices. Closes rapidsai#11550.
Authors: Lawrence Mitchell (https://github.com/wence-). Approvers: Ashwin Srinath (https://github.com/shwina). URL: #12271
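To see why an empty set of matching key names produces extra output columns, here is a simplified sketch of how result column names could be derived from that set. `output_columns` is a hypothetical helper written for illustration, not a cudf function; it assumes the default `_x`/`_y` suffixes and ignores the index.

```python
def output_columns(left_cols, right_cols, same_name_keys, suffixes=("_x", "_y")):
    """Hypothetical helper: name the non-index output columns of a merge."""
    out = []
    for name in left_cols:
        # A left column is suffixed only if it clashes and is not a shared key.
        clash = name in right_cols and name not in same_name_keys
        out.append(name + suffixes[0] if clash else name)
    for name in right_cols:
        if name in same_name_keys:
            continue  # shared key: already represented by the left-hand column
        clash = name in left_cols
        out.append(name + suffixes[1] if clash else name)
    return out


left_cols = ["b", "c"]  # df2 in the reproducer: index "a", columns "b" and "c"
right_cols = ["b"]      # df in the reproducer: index "a", column "b"

buggy = output_columns(left_cols, right_cols, set())
fixed = output_columns(left_cols, right_cols, {"b"})
print(buggy)  # ['b_x', 'c', 'b_y']
print(fixed)  # ['b', 'c']
```

The first result matches the columns reported for the left_on/right_on merge in this issue, and the second matches the columns of the on= merge.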
Describe the bug
dask_cudf does not give the expected result when merging two dataframes on index and column. pandas, dask.dataframe, and cudf all behave the same.
Steps/Code to reproduce bug
In the merge below, note that for on=["a", "b"], "a" is the index and "b" is a column.
The two dataframes are:
Expected behavior
Same code as above that works when using pandas and dask.dataframe instead:
Environment overview (please complete the following information)
Environment details
Additional context
Encountered this when implementing functionality for MG PropertyGraph: rapidsai/cugraph#2523