Allow merging index column with data column using keyword "on" #7736

Merged
merged 190 commits into from
Apr 2, 2021
Changes from 189 commits
4a4b4af
Merge branch 'branch-0.17' into branch-0.18
shwina Dec 11, 2020
223f2b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 15, 2020
abd6ad2
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 17, 2020
18863b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 4, 2021
0fbdd31
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
dc9b943
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
d586aa7
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 7, 2021
996fda8
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 8, 2021
2808a5c
Add a compute_hash_join_indices that returns just the join indices
shwina Jan 11, 2021
ef0baee
Don't need common_columns stuff for join that returns a gathermap
shwina Jan 11, 2021
18f3074
Add hash_join_impl methods that return gathermaps
shwina Jan 11, 2021
70abf48
Add overloads to public hash_join class
shwina Jan 11, 2021
13dff67
Add top-level join APIs that return gathermaps
shwina Jan 11, 2021
3300fe1
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 12, 2021
7ed694c
Use device_uvector instead of device_vector in join
shwina Jan 12, 2021
636c2ea
Undo some API changes
shwina Jan 12, 2021
b79da68
Add join_result
shwina Jan 13, 2021
380aa59
Add APIs that return join_result
shwina Jan 13, 2021
3cbb2b4
Remove column_in_common
shwina Jan 13, 2021
53ae7c9
Add an inner join API that returns gathermaps
shwina Jan 14, 2021
fde172b
Add remaining APIs to return gathermaps
shwina Jan 14, 2021
4a286dd
Add gathermap join test
shwina Jan 18, 2021
c756db9
Replace -1 with INT_MIN
shwina Jan 18, 2021
6a3d23e
Make join_result columns instead of column_views
shwina Jan 20, 2021
5dfc2a0
Replace join_result with a pair of columns
shwina Jan 20, 2021
362829b
Add gathermap test for outer join
shwina Jan 20, 2021
4e4380c
Add and pass full join gathermap test
shwina Jan 20, 2021
339a13d
Begin Python-side refactor
shwina Jan 21, 2021
2b07802
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 25, 2021
0d5a19c
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 28, 2021
fdbdc12
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 1, 2021
5dd5d29
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 5, 2021
6b20429
Merge branch 'branch-0.19' into gathermap-based-join-apis
shwina Feb 8, 2021
044eac1
Add left_semi and left_anti join APIs that return gathermaps
shwina Feb 8, 2021
555d5ec
Add Cython bindings
shwina Feb 8, 2021
56ae616
full -> outer
shwina Feb 9, 2021
dd05121
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 9, 2021
d447924
Progress
shwina Feb 9, 2021
484512e
More progress on py refactor
shwina Feb 9, 2021
5227582
Remove breakpoint
shwina Feb 10, 2021
9cd870e
Fix neg index handling
shwina Feb 10, 2021
8e4f193
Use nullify gather in join
shwina Feb 10, 2021
29fe140
Handle outer joins better
shwina Feb 10, 2021
b634055
Fix index construction
shwina Feb 10, 2021
cd53d6c
Fix sorting behaviour
shwina Feb 10, 2021
75f1efd
Fix Index.join
shwina Feb 10, 2021
1f5d6ad
Progress on semi/anti joins
shwina Feb 10, 2021
de30520
Add simple join test
shwina Feb 10, 2021
66a0de5
Semi-join fix
shwina Feb 11, 2021
ca72295
Only combine key columns in outer join if they have the same name
shwina Feb 11, 2021
ee2242d
Handle when both _on and _index are provided
shwina Feb 11, 2021
e531725
Fix sorting join result
shwina Feb 11, 2021
c8b4948
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 11, 2021
674095c
whitespace
shwina Feb 12, 2021
cbd9dc3
Make construct_join_output_df work with column views
shwina Feb 12, 2021
3f3c3cb
Get rid of hash_join::left_join
shwina Feb 12, 2021
01415fc
More join C++ cleanup
shwina Feb 12, 2021
6185492
Even more cleaning
shwina Feb 17, 2021
d736d1c
More join tests
shwina Feb 18, 2021
b58591d
Fix all join tests
shwina Feb 18, 2021
be560bb
Python regressions
shwina Feb 18, 2021
efb60d6
Revert
shwina Feb 18, 2021
fe6d0b8
Invalid -> Unkown
shwina Feb 18, 2021
547027c
Don't mutate lhs/rhs
shwina Feb 18, 2021
5f93d23
Fix join tests
shwina Feb 19, 2021
b7bf821
Fix semi/anti join trivial cases
shwina Feb 19, 2021
50a2fb2
When testing join results, use a helper that sorts values
shwina Feb 19, 2021
ff0ae79
Totally broken commit
shwina Feb 19, 2021
07cd052
Cleanup
shwina Feb 20, 2021
bd6bf77
Warnings
shwina Feb 20, 2021
a40063e
Cleanup
shwina Feb 22, 2021
ccef9d0
Cleanup
shwina Feb 22, 2021
210244b
Cleanup
shwina Feb 22, 2021
b57348c
Add typing for join helpers
shwina Feb 22, 2021
5c2c9b3
Typing for Join class
shwina Feb 22, 2021
558aa15
Simplify joiner API
shwina Feb 22, 2021
3184896
Example doc
shwina Feb 22, 2021
d3535dc
Refactor join APIs to return a device_uvector
shwina Feb 25, 2021
3b0a2a5
Merge tag 'branch-0.19-latest' of https://github.com/rapidsai/cudf in…
shwina Mar 1, 2021
b82181d
docs
shwina Mar 3, 2021
77d2bfd
Finish up docs?
shwina Mar 3, 2021
0bf34e8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 4, 2021
26a3fb0
Fix join tests
shwina Mar 4, 2021
8a60d62
Refactor join APIs to work with unique_ptr<rmm::device_uvector>>
shwina Mar 5, 2021
387a953
Update join Cython
shwina Mar 5, 2021
6cd6433
Need to resize the gathermap
shwina Mar 5, 2021
c67dcce
Doc
shwina Mar 5, 2021
30c22ed
Changelog
shwina Mar 5, 2021
f73199d
Add helper to convert gather_map_type->Column
shwina Mar 9, 2021
393c06a
Update python/cudf/cudf/core/frame.py
shwina Mar 9, 2021
e91f554
Cannot specify both column and index
shwina Mar 9, 2021
0185896
Vaildate how
shwina Mar 9, 2021
b232f85
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 9, 2021
1eb495d
Can't use a set
shwina Mar 9, 2021
4f1f072
Avoid function local import
shwina Mar 10, 2021
4aa8fec
False -> NotImplementedError
shwina Mar 10, 2021
ae0e5f9
Update cpp/include/cudf/join.hpp
shwina Mar 10, 2021
f47cf7e
Reuse some join logic
shwina Mar 10, 2021
2a201c3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 10, 2021
230ca08
Formatting
shwina Mar 10, 2021
498a621
Update cpp/include/cudf/join.hpp
shwina Mar 11, 2021
2de26f3
Docs?
shwina Mar 11, 2021
d6f128c
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 11, 2021
b7d8d8a
Use mr
shwina Mar 11, 2021
9efc761
Docs
shwina Mar 15, 2021
8779bc7
Simplify suffix handling
shwina Mar 16, 2021
4c651ac
Simplify joiner requirements
shwina Mar 17, 2021
b4f4d7c
Do less work in SemiJoin._merge_results
shwina Mar 17, 2021
d353c92
Doc
shwina Mar 17, 2021
580a346
Doc
shwina Mar 17, 2021
328dafd
Return None from semi_join
shwina Mar 17, 2021
297d20a
Init common_type
shwina Mar 17, 2021
e388dd6
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 18, 2021
935648b
Move validation directly into set_by_label and use a raw dict to stor…
vyasr Mar 19, 2021
806a3ef
Remove all references to OrderedColumnDict.
vyasr Mar 19, 2021
40a7b17
Move validation to separate method and use in both set_by_label and c…
vyasr Mar 19, 2021
a1c576e
Format with black.
vyasr Mar 19, 2021
788d9d6
Expose parameter to make validation optional.
vyasr Mar 19, 2021
6a64285
Coerce constructor input to dict before calling items.
vyasr Mar 19, 2021
e7d0981
Make construction safe.
vyasr Mar 19, 2021
c39932c
Final cleanup and documentation.
vyasr Mar 19, 2021
4ff09fc
Address style issues.
vyasr Mar 19, 2021
35c63ec
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 22, 2021
9433582
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
shwina Mar 22, 2021
74f2884
Merge remote-tracking branch 'origin/branch-0.19' into feature/optimi…
vyasr Mar 22, 2021
0178127
CA fix
shwina Mar 22, 2021
5c0f202
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
c8d2364
Don't validate on gathers
shwina Mar 22, 2021
efea63d
Prioritize numeric columns
shwina Mar 22, 2021
898a3d8
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
c3b6444
Lazily compute and delete column length on demand.
vyasr Mar 22, 2021
01b2cf5
Remove redundant clear cache in setitem.
vyasr Mar 22, 2021
8899258
Remove mypy annotation for column length.
vyasr Mar 22, 2021
c6cd415
Optimize casting logic
shwina Mar 22, 2021
3507785
Merge branch 'feature/optimize_accessor_copy' of github.com:vyasr/cud…
shwina Mar 22, 2021
7f8e1cd
Undo
shwina Mar 22, 2021
f2e4609
Don't validate when copying type metadata
shwina Mar 22, 2021
5d378c2
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
83cc407
ImportError
shwina Mar 22, 2021
72598fb
Prioritize numeric dtypes in is_numerical_dtype
shwina Mar 22, 2021
fa220b6
Add unsafe CA ctor
shwina Mar 22, 2021
6572cd3
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
f7dc417
Revert "Prioritize numeric dtypes in is_numerical_dtype"
shwina Mar 22, 2021
3760077
Revert "Prioritize numeric dtypes in is_numerical_dtype"
shwina Mar 22, 2021
01cdfcf
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
de9ca28
Change error message back so that tests pass.
vyasr Mar 23, 2021
e35d03b
Faster is_numerical_dtype
shwina Mar 23, 2021
e2fd533
Faster is_numerical_dtype
shwina Mar 23, 2021
9044d62
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 23, 2021
64ca702
Even faster is_numerical_dtype
shwina Mar 23, 2021
749edf1
Enable fast path for constructing a Buffer from a DeviceBuffer
shwina Mar 23, 2021
7526e4a
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 23, 2021
ca772b8
Small fix
shwina Mar 23, 2021
739ec57
Add validation option to insert and standardize error message.
vyasr Mar 23, 2021
498b70e
Fix style.
vyasr Mar 23, 2021
3cd012b
Merge remote-tracking branch 'vyasr/feature/optimize_accessor_copy' i…
shwina Mar 23, 2021
660afa6
Merge branch 'various-py-optimizations' into join-bench
shwina Mar 23, 2021
f8ac22f
Merge branch 'gathermap-based-join-apis' into join-bench
shwina Mar 23, 2021
c28866c
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into v…
shwina Mar 23, 2021
01e13fa
Undo formatting change
shwina Mar 23, 2021
89a0301
Add TODO
shwina Mar 23, 2021
26f4cc8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 23, 2021
f2036eb
Merge branch 'various-py-optimizations' into join-bench
shwina Mar 23, 2021
5e73de7
init->create + doc
shwina Mar 24, 2021
e0c50b5
Merge branch 'various-py-optimizations' into gathermap-based-join-apis
shwina Mar 24, 2021
fa880c1
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 24, 2021
58bdecd
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 25, 2021
ed1b434
Merge branch 'join-bench' into gathermap-based-join-apis
shwina Mar 25, 2021
ca116a3
Only gather the index if necessary
shwina Mar 25, 2021
ce03918
Don't copy type metadata for the index unless we need to
shwina Mar 25, 2021
b7c6b19
Use validate=False in a few more places
shwina Mar 25, 2021
3e15f54
fresh new start
skirui-source Mar 26, 2021
671a0e0
Import
shwina Mar 26, 2021
797087b
Review
shwina Mar 26, 2021
5ad531f
Coerce to tuple first
shwina Mar 26, 2021
f7e94fb
Replace hasattr with isinstance
shwina Mar 26, 2021
1cb9448
Handle renamed indexes
shwina Mar 26, 2021
cc89360
Fix to names setter
shwina Mar 26, 2021
4ca1238
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 26, 2021
9cebf2e
Update cpp/src/join/hash_join.cu
shwina Mar 26, 2021
1584b86
Better example
shwina Mar 26, 2021
3977b79
Remove std::moves
shwina Mar 26, 2021
67919a3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 26, 2021
7bf6561
Fix formatting error
shwina Mar 26, 2021
d84fbf0
all tests passing now
skirui-source Mar 27, 2021
4cf98e4
added comments for better comprehension
skirui-source Mar 27, 2021
a07e66a
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
skirui-source Mar 30, 2021
e3d3ac8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into m…
skirui-source Mar 30, 2021
a2c38bc
added tests. ready for review
skirui-source Apr 1, 2021
53c6f15
addressed review comments
skirui-source Apr 2, 2021
55 changes: 40 additions & 15 deletions python/cudf/cudf/core/join/join.py
@@ -196,14 +196,14 @@ def perform_merge(self) -> Frame:
 
     def _compute_join_keys(self):
         # Computes self._keys
+        left_keys = []
+        right_keys = []
         if (
             self.left_index
             or self.right_index
             or self.left_on
             or self.right_on
         ):
-            left_keys = []
-            right_keys = []
             if self.left_index:
                 left_keys.extend(
                     [
@@ -234,14 +234,35 @@ def _compute_join_keys(self):
                         for on in _coerce_to_tuple(self.right_on)
                     ]
                 )
+        elif self.on:
+            on_names = _coerce_to_tuple(self.on)
+            for on in on_names:
+                # If `on` is provided, checks whether merging on
+                # columns, indexes or merging index with column
+                if on in self.lhs._data and on not in self.rhs._data:
+                    # case1: merge on lhs column with rhs index
+                    left_keys.append(_Indexer(name=on, column=True))
+                    right_keys.append(_Indexer(name=on, index=True))
+
+                elif on not in self.lhs._data and on in self.rhs._data:
+                    # case2: merge on rhs column with lhs index
+                    left_keys.append(_Indexer(name=on, index=True))
+                    right_keys.append(_Indexer(name=on, column=True))
+
+                elif on not in self.lhs._data and on not in self.rhs._data:
+                    # case3: merge on lhs index with rhs index
+                    left_keys.append(_Indexer(name=on, index=True))
+                    right_keys.append(_Indexer(name=on, index=True))
+
+                else:
+                    # case4: merge on lhs column with rhs column
+                    left_keys.append(_Indexer(name=on, column=True))
+                    right_keys.append(_Indexer(name=on, column=True))
+
         else:
-            # Use `on` if provided. Otherwise,
-            # implicitly use identically named columns as the key columns:
-            on_names = (
-                _coerce_to_tuple(self.on)
-                if self.on is not None
-                else set(self.lhs._data) & set(self.rhs._data)
-            )
+            # if `on` not provided and not merging index with column or on
+            # both indexes, then use intersection of columns in both frames
+            on_names = set(self.lhs._data) & set(self.rhs._data)
             left_keys = [_Indexer(name=on, column=True) for on in on_names]
             right_keys = [_Indexer(name=on, column=True) for on in on_names]
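The four cases in the new `elif self.on:` branch reduce to one rule per side: a name found among that frame's columns is a column key, otherwise it is treated as an index key. A minimal, self-contained sketch of that rule follows. `Indexer` and `resolve_on_keys` are hypothetical stand-ins written for illustration, not cuDF's `_Indexer` or any cuDF API, and plain lists of column names stand in for `self.lhs._data` and `self.rhs._data`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Indexer:
    """Hypothetical stand-in for the PR's `_Indexer`: a key name and where it lives."""

    name: str
    column: bool = False
    index: bool = False


def resolve_on_keys(
    on: Iterable[str],
    lhs_columns: Iterable[str],
    rhs_columns: Iterable[str],
) -> Tuple[List[Indexer], List[Indexer]]:
    """Classify each `on` name as a column key or an index key, per side."""
    lhs_cols, rhs_cols = set(lhs_columns), set(rhs_columns)
    left_keys: List[Indexer] = []
    right_keys: List[Indexer] = []
    for name in on:
        in_lhs = name in lhs_cols
        in_rhs = name in rhs_cols
        # A name that is not a column on a side is assumed to be an index level there.
        left_keys.append(Indexer(name, column=in_lhs, index=not in_lhs))
        right_keys.append(Indexer(name, column=in_rhs, index=not in_rhs))
    return left_keys, right_keys


# Mirrors one parametrization from the tests: lhs columns A, B and rhs columns B, C,
# both indexed by L0, merged on ["B", "L0"].
left, right = resolve_on_keys(["B", "L0"], ["A", "B"], ["B", "C"])
```

Here `B` resolves to a column key on both sides (case 4) and `L0` to an index key on both sides (case 3). The sketch does not check that an index of that name exists; in the diff, that check is left to `_Indexer`.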

@@ -384,12 +405,16 @@ def _validate_merge_params(
         if how not in {"left", "inner", "outer", "leftanti", "leftsemi"}:
             raise NotImplementedError(f"{how} merge not supported yet")
 
-        # Passing 'on' with 'left_on' or 'right_on' is ambiguous
-        if on and (left_on or right_on):
-            raise ValueError(
-                'Can only pass argument "on" OR "left_on" '
-                'and "right_on", not a combination of both.'
-            )
+        if on:
+            if left_on or right_on:
+                # Passing 'on' with 'left_on' or 'right_on' is ambiguous
+                raise ValueError(
+                    'Can only pass argument "on" OR "left_on" '
+                    'and "right_on", not a combination of both.'
+                )
+            else:
+                # the validity of 'on' being checked by _Indexer
+                return
 
         # Can't merge on unnamed Series
         if (isinstance(lhs, cudf.Series) and not lhs.name) or (
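The validation change makes `on` short-circuit the remaining checks: combining `on` with `left_on` or `right_on` still raises, but `on` by itself now returns early, so name lookups are deferred to key resolution. A small sketch of that control flow, assuming a hypothetical helper `check_on_arguments` that returns a flag instead of returning early from a larger validator:

```python
from typing import Optional, Sequence, Union

Keys = Optional[Union[str, Sequence[str]]]


def check_on_arguments(
    on: Keys, left_on: Keys = None, right_on: Keys = None
) -> bool:
    """Return True when `on` alone decides the keys; raise on ambiguous input.

    Hypothetical helper for illustration; not part of cuDF.
    """
    if on:
        if left_on or right_on:
            # Passing `on` together with `left_on`/`right_on` is ambiguous.
            raise ValueError(
                'Can only pass argument "on" OR "left_on" '
                'and "right_on", not a combination of both.'
            )
        # `on` given by itself: skip the remaining checks.
        return True
    # No `on`: the caller continues with its other validation steps.
    return False
```

One consequence of the early return is that a name in `on` that matches neither a column nor an index is not rejected here, which is why the new tests expect a `KeyError` from the merge itself.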
55 changes: 55 additions & 0 deletions python/cudf/cudf/tests/test_joining.py
@@ -1738,3 +1738,58 @@ def test_join_renamed_index():
     )
     got = df.merge(df, left_index=True, right_index=True, how="inner")
     assert_join_results_equal(expect, got, how="inner")
+
+
+@pytest.mark.parametrize(
+    "lhs_col, lhs_idx, rhs_col, rhs_idx, on",
+    [
+        (["A", "B"], "L0", ["B", "C"], "L0", ["B"]),
+        (["A", "B"], "L0", ["B", "C"], "L0", ["L0"]),
+        (["A", "B"], "L0", ["B", "C"], "L0", ["B", "L0"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["A"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["L0"]),
+        (["A", "B"], "L0", ["C", "L0"], "A", ["A", "L0"]),
+    ],
+)
+@pytest.mark.parametrize(
+    "how", ["left", "inner", "right", "outer", "leftanti", "leftsemi"]
+)
+def test_join_merge_with_on(lhs_col, lhs_idx, rhs_col, rhs_idx, on, how):
+    lhs_data = {col_name: [] for col_name in lhs_col}
+    lhs_index = cudf.Index([], name=lhs_idx)
+
+    rhs_data = {col_name: [] for col_name in rhs_col}
+    rhs_index = cudf.Index([], name=rhs_idx)
+
+    gd_left = cudf.DataFrame(lhs_data, lhs_index)
+    gd_right = cudf.DataFrame(rhs_data, rhs_index)
+    pd_left = gd_left.to_pandas()
+    pd_right = gd_right.to_pandas()
+
+    expect = pd_left.merge(pd_right, on=on)
+    got = gd_left.merge(gd_right, on=on)
+
+    assert_join_results_equal(expect, got, how=how)
+
+
+@pytest.mark.parametrize(
+    "on", ["A", "L0"],
+)
+@pytest.mark.parametrize(
+    "how", ["left", "inner", "right", "outer", "leftanti", "leftsemi"]
+)
+def test_join_merge_invalid_keys(on, how):
+    with pytest.raises(KeyError):
+        gd_left = cudf.DataFrame(
+            {"A": [], "B": []}, index=cudf.Index([], name="C")
+        )
+        gd_right = cudf.DataFrame(
+            {"D": [], "E": []}, index=cudf.Index([], name="F")
+        )
+        pd_left = gd_left.to_pandas()
+        pd_right = gd_right.to_pandas()
+
+        expect = pd_left.merge(pd_right, on=on)
+        got = gd_left.merge(gd_right, on=on)
+
+        assert_join_results_equal(expect, got, how=how)
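A note on `test_join_merge_invalid_keys`: if both merge calls sit inside the single `pytest.raises(KeyError)` block (the scraped diff has lost its indentation, so this is an assumption), then once the pandas call raises, the cuDF call after it is never executed and the test passes without exercising cuDF. The sketch below shows that control-flow effect with standard-library code only. `reference_merge`, `candidate_merge`, and `raises_key_error` are hypothetical stand-ins for the pandas call, the cuDF call, and `pytest.raises`.

```python
from typing import Callable, List


def make_mergers(log: List[str]):
    """Build two hypothetical merge stand-ins that record calls and reject the key."""

    def reference_merge(on: str) -> None:
        log.append("reference")
        raise KeyError(on)

    def candidate_merge(on: str) -> None:
        log.append("candidate")
        raise KeyError(on)

    return reference_merge, candidate_merge


def raises_key_error(func: Callable[[], None]) -> bool:
    """Minimal stand-in for `pytest.raises(KeyError)`: True if `func` raises KeyError."""
    try:
        func()
    except KeyError:
        return True
    return False


def run_shared(on: str) -> List[str]:
    """Both calls share one raising block, as in the test shown above."""
    log: List[str] = []
    reference_merge, candidate_merge = make_mergers(log)

    def block() -> None:
        reference_merge(on)
        candidate_merge(on)  # not reached: the line above already raised

    assert raises_key_error(block)
    return log


def run_separate(on: str) -> List[str]:
    """Each call gets its own raising block, so both are exercised."""
    log: List[str] = []
    reference_merge, candidate_merge = make_mergers(log)
    assert raises_key_error(lambda: reference_merge(on))
    assert raises_key_error(lambda: candidate_merge(on))
    return log
```

With one shared block only the first stand-in is recorded; with separate blocks both are. Giving each library's merge its own `pytest.raises` block would check both independently.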