Add JNI for set operations #11143

Merged on Jul 28, 2022 (291 commits)
2e5d41a
Rewrite comments
ttnghia Jun 8, 2022
8393809
Reverse tests
ttnghia Jun 8, 2022
5cdefa3
Fix old tests
ttnghia Jun 8, 2022
3626efb
Add a new overload for `cudf::distinct`
ttnghia Jun 8, 2022
bcc4abe
Reverse back breaking changes
ttnghia Jun 8, 2022
edbcc78
Fix compile error
ttnghia Jun 8, 2022
2f1ce5a
Reverse benchmark
ttnghia Jun 8, 2022
882a67a
Complete `StringKeyColumn` tests
ttnghia Jun 8, 2022
cdae2ac
Fix tests
ttnghia Jun 8, 2022
5b21d88
Fix tests
ttnghia Jun 8, 2022
3f18057
Rename function
ttnghia Jun 9, 2022
21456e7
Add `NonNullTable` tests
ttnghia Jun 9, 2022
8a17581
Add `SlicedNonNullTable` tests
ttnghia Jun 9, 2022
e05ad48
Add `InputWithNulls` tests
ttnghia Jun 9, 2022
03fb093
Change variable
ttnghia Jun 9, 2022
3c12942
Refactor
ttnghia Jun 9, 2022
6ab9673
Add `BasicList` tests
ttnghia Jun 9, 2022
37dfdcb
Add `NullableLists` tests
ttnghia Jun 9, 2022
b78cf5b
Add `ListsOfStructs` tests
ttnghia Jun 9, 2022
8de0948
Add `SlicedStructsOfLists` tests
ttnghia Jun 9, 2022
9e8c4a5
Misc
ttnghia Jun 9, 2022
7fa65ee
Add `ListsOfEmptyStructs` tests
ttnghia Jun 9, 2022
ff6e03e
Modify `EmptyDeepList` tests
ttnghia Jun 9, 2022
374545a
Add `StructsOfLists` tests
ttnghia Jun 9, 2022
9bf540a
Use `distinct` in Cython
ttnghia Jun 9, 2022
e1c3cd5
Merge branch 'branch-22.08' into refactor_stream_compaction
ttnghia Jun 9, 2022
70d3164
Fix Python style
ttnghia Jun 9, 2022
bba15c2
Revert "Fix Python style"
ttnghia Jun 9, 2022
d895f48
Revert "Use `distinct` in Cython"
ttnghia Jun 9, 2022
56e791c
Fix compiling errors due to merging
ttnghia Jun 10, 2022
fff65c1
Add doxygen group
ttnghia Jun 10, 2022
6ffc9b0
Fix doxygen
ttnghia Jun 10, 2022
dd8c845
Rewrite comment and rename variable
ttnghia Jun 10, 2022
0dcff06
Use customized cuco
ttnghia Jun 10, 2022
4d2ce5c
Cleanup
ttnghia Jun 10, 2022
c06f1b9
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 10, 2022
361464f
Reimplement `set_overlap`
ttnghia Jun 10, 2022
f74c3c8
Reimplement `set_intersect`
ttnghia Jun 10, 2022
27aaa6e
Reimplement `set_difference`
ttnghia Jun 10, 2022
1a41c0d
Fix all compile errors
ttnghia Jun 11, 2022
dc2754f
Support `nan_equality` in `create_map`
ttnghia Jun 11, 2022
e7b3022
Support `nan_equality` in `check_contains`
ttnghia Jun 11, 2022
a0046f5
Drop duplicate from the results
ttnghia Jun 11, 2022
22f38c0
Support most functionalities
ttnghia Jun 12, 2022
1963a3c
Use `pair_contains`
ttnghia Jun 12, 2022
b4a5dc6
Unify function
ttnghia Jun 12, 2022
b77beb9
Add comments
ttnghia Jun 12, 2022
450f638
Reorganize code
ttnghia Jun 12, 2022
cb8119b
Fixing null mask
ttnghia Jun 12, 2022
e60c81c
Avoid inserting nulls if compare unequal
ttnghia Jun 12, 2022
79f7906
Remove added code
ttnghia Jun 12, 2022
326f8d4
Add member function interface
ttnghia Jun 13, 2022
26958d5
Fix stale comment
ttnghia Jun 13, 2022
5a22c2b
Initial implementation
ttnghia Jun 13, 2022
910e05f
Switch to use new implementation
ttnghia Jun 14, 2022
e4622b1
All test passed
ttnghia Jun 14, 2022
be85cd2
Add public and detail API
ttnghia Jun 14, 2022
f299f4f
Cleanup and add comments
ttnghia Jun 14, 2022
82dc340
Fix style
ttnghia Jun 14, 2022
15c8daf
Rename function and variables
ttnghia Jun 14, 2022
5ae3ef8
Fix a serious bug
ttnghia Jun 14, 2022
58fb9d7
Optimize null insertion
ttnghia Jun 14, 2022
933d650
Remove constructor
ttnghia Jun 14, 2022
ee77c27
Misc
ttnghia Jun 14, 2022
3599820
WIP
ttnghia Jun 14, 2022
edc7897
Fix a bug in accumulating nested columns
ttnghia Jun 14, 2022
cb28355
Fix error that makes tests failed
ttnghia Jun 15, 2022
d9c0ab9
Address review comments
ttnghia Jun 15, 2022
7770265
Remove one overload
ttnghia Jun 15, 2022
96a36c4
Fix benchmark
ttnghia Jun 16, 2022
0210228
Rename struct, and use CTAD
ttnghia Jun 16, 2022
65190cc
Add comment
ttnghia Jun 16, 2022
a4db720
Rename variable
ttnghia Jun 16, 2022
7e0315b
Remove added code
ttnghia Jun 16, 2022
126886b
Update cuco
ttnghia Jun 16, 2022
b5a7450
Reverse changes
ttnghia Jun 16, 2022
6c90c53
Add a parameter
ttnghia Jun 16, 2022
4cc2f2e
Fix compiling errors
ttnghia Jun 16, 2022
55895e7
WIP
ttnghia Jun 16, 2022
df05dc8
Misc
ttnghia Jun 16, 2022
ec48856
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 16, 2022
5f7d778
WIP
ttnghia Jun 16, 2022
154645a
Rewrite doxygen
ttnghia Jun 16, 2022
8f04d50
Remove `keys` parameter from `get_distinct_indices`
ttnghia Jun 16, 2022
d806278
Rewrite doxygen
ttnghia Jun 16, 2022
3734344
Use another version of `gather`
ttnghia Jun 16, 2022
f731d35
Fix wrong doxygen
ttnghia Jun 16, 2022
e44c85d
Fix wrong doxygen again
ttnghia Jun 16, 2022
a74f71e
Misc
ttnghia Jun 16, 2022
f9de181
Update doxygen
ttnghia Jun 16, 2022
a339d83
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 16, 2022
68652f4
Implementation is complete
ttnghia Jun 16, 2022
6cec1eb
Define hash_map and add todo
ttnghia Jun 16, 2022
661400a
Rename variable
ttnghia Jun 17, 2022
700e465
Fix doxygen
ttnghia Jun 17, 2022
1c783e8
Rename tests
ttnghia Jun 17, 2022
64e03f6
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 17, 2022
13ad653
Add `NoNullsTableWithNans` test
ttnghia Jun 17, 2022
7811611
Add `InputWithNullsAndNaNs` tests
ttnghia Jun 17, 2022
47c5eec
Fix a bug when comparing nulls as unequal
ttnghia Jun 18, 2022
4db34db
Add `InputWithNullsUnequal` tests
ttnghia Jun 19, 2022
fab367b
Add `ListsWithNullsUnequal` tests
ttnghia Jun 19, 2022
9ec27af
Rewrite doxygen
ttnghia Jun 19, 2022
1359ee0
Rewrite doxygen for `duplicate_keep_option` and add back performance …
ttnghia Jun 19, 2022
aa0a4ed
Remove redundant docsc
ttnghia Jun 19, 2022
01e03b6
Rename functor
ttnghia Jun 20, 2022
cdc3000
Modify comments
ttnghia Jun 20, 2022
45dec2a
Merge branch 'branch-22.08' into refactor_stream_compaction
ttnghia Jun 20, 2022
16ba20c
Add header
ttnghia Jun 20, 2022
37a23e4
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 20, 2022
38603fc
Merge remote-tracking branch 'nghia/fix_compile_errors' into distinct…
ttnghia Jun 20, 2022
e32daf4
Add `InputWithNaNs*` tests
ttnghia Jun 20, 2022
cba4759
Merge branch 'branch-22.08' into refactor_stream_compaction
ttnghia Jun 20, 2022
7247101
Attempt to split files, not yet cleanup
ttnghia Jun 20, 2022
120377b
Cleanup
ttnghia Jun 20, 2022
68133d4
Change functor name
ttnghia Jun 20, 2022
aefdadf
Add doxygen
ttnghia Jun 20, 2022
faf6778
Reorganize code
ttnghia Jun 20, 2022
e839323
Fix headers
ttnghia Jun 20, 2022
f5646b3
Fix header
ttnghia Jun 20, 2022
538ff08
Fix `mr` usage, and rewrite some comments
ttnghia Jun 21, 2022
8201835
Reverse `join.hpp` files
ttnghia Jun 21, 2022
54d6e35
Write doxygen
ttnghia Jun 21, 2022
8149a08
Add new source file
ttnghia Jun 21, 2022
0cd9bd6
Complete implementation
ttnghia Jun 21, 2022
9254612
Cleanup headers
ttnghia Jun 21, 2022
e456b0b
Add headers
ttnghia Jun 21, 2022
bb703c6
Temporary use a cuco commit
ttnghia Jun 21, 2022
a755bea
Pass `std::shared_ptr` by value
ttnghia Jun 21, 2022
61df0ac
Rename lambda
ttnghia Jun 21, 2022
136b490
Merge branch 'branch-22.08' into refactor_semijoin
ttnghia Jun 21, 2022
9c2fb25
Draft for doxygen
ttnghia Jun 21, 2022
b8d43dc
Implement `check_compatibility`
ttnghia Jun 21, 2022
a2db48b
Using `pair_contains_if`
ttnghia Jun 21, 2022
d66a213
Update cuco
ttnghia Jun 21, 2022
adf8965
Fix null handling
ttnghia Jun 22, 2022
f0ee266
Fix doxygen and change function name
ttnghia Jun 22, 2022
0b35671
Update doxygen
ttnghia Jun 22, 2022
9320cf3
Fix nan handling
ttnghia Jun 22, 2022
29599b4
Merge branch 'branch-22.08' into refactor_semijoin
ttnghia Jun 22, 2022
db886ea
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 22, 2022
1ac6501
Merge branch 'branch-22.08' into refactor_stream_compaction
ttnghia Jun 22, 2022
d0af0e6
Merge branch 'refactor_stream_compaction' into distinct_with_nans_equ…
ttnghia Jun 22, 2022
9ddcc93
Add column into benchmark
ttnghia Jun 22, 2022
f712db6
Set benchmark min time
ttnghia Jun 22, 2022
29d15d4
Don't check for nulls of the needles table
ttnghia Jun 22, 2022
a4d15d6
Use asterisk
ttnghia Jun 22, 2022
ec00f0a
Remove redundant variable
ttnghia Jun 22, 2022
be3b2fe
Merge branch 'branch-22.08' into distinct_with_nans_equality
ttnghia Jun 22, 2022
c121268
Remove redundant declaration
ttnghia Jun 22, 2022
2410c08
Change default behavior
ttnghia Jun 22, 2022
489060f
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 22, 2022
7ee00ad
Rename function
ttnghia Jun 22, 2022
d3c404e
Merge branch 'distinct_with_nans_equality' into set_operations
ttnghia Jun 22, 2022
a3b2539
Fix compile errors
ttnghia Jun 22, 2022
c2c9a30
Remove temporary function
ttnghia Jun 22, 2022
fc40b55
Merge branch 'refactor_semijoin' into set_operations
ttnghia Jun 22, 2022
137d2ae
Remove all temporary functions
ttnghia Jun 22, 2022
58d36df
Rewrite `list_distinct`
ttnghia Jun 22, 2022
9a29248
Rewrite `list_overlap`
ttnghia Jun 22, 2022
84df3c5
Rewrite `set_intersect`
ttnghia Jun 22, 2022
c72b406
Rewrite `set_union`
ttnghia Jun 22, 2022
10e26b0
Rewrite all
ttnghia Jun 22, 2022
25d2635
Add detail header
ttnghia Jun 22, 2022
4fd850b
Change default value for nan comparison
ttnghia Jun 22, 2022
6436a54
Fix compile error
ttnghia Jun 22, 2022
f488969
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 23, 2022
613e9ba
Update meta.yaml
ttnghia Jun 23, 2022
75e567d
Write more doxygen
ttnghia Jun 23, 2022
acf7bc9
Misc
ttnghia Jun 23, 2022
f5769ae
Rename file
ttnghia Jun 23, 2022
4354312
Add headers for test files
ttnghia Jun 23, 2022
f181ac2
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 23, 2022
14d4f68
Add `TrivialTest` tests
ttnghia Jun 23, 2022
6925c02
Fix label generation
ttnghia Jun 23, 2022
541ebf9
Generate labels with nullmask
ttnghia Jun 23, 2022
da3f525
Revert "Generate labels with nullmask"
ttnghia Jun 23, 2022
6a28643
Fix validity check
ttnghia Jun 23, 2022
f901e39
Fix non-empty null lists
ttnghia Jun 23, 2022
814522d
All tests pass
ttnghia Jun 23, 2022
aecdf31
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 23, 2022
55bfa69
Add C++ JNI
ttnghia Jun 23, 2022
ee53dfc
Add Java APIs
ttnghia Jun 23, 2022
aebf434
Add comments
ttnghia Jun 24, 2022
9266be4
Add doxygen
ttnghia Jun 24, 2022
4571c6d
Rewrite doxygen
ttnghia Jun 24, 2022
c0ac406
Rename function
ttnghia Jun 24, 2022
3a32498
Rename function
ttnghia Jun 24, 2022
f3a4e84
Add `utilities.*` files
ttnghia Jun 24, 2022
2b7386c
Extract `distinct`
ttnghia Jun 24, 2022
c8f3da0
Add test file for `cudf::lists::distinct`
ttnghia Jun 24, 2022
1d7e8e0
Add new implementation and test files
ttnghia Jun 24, 2022
51b80db
Fix compile error
ttnghia Jun 24, 2022
08a76ad
Rename function
ttnghia Jun 27, 2022
16101f7
Implement `cudf::detail::stable_distinct` and `lists::distinct`
ttnghia Jun 27, 2022
5ec13d6
Rewrite doxygen
ttnghia Jun 27, 2022
6c5b738
Rename variable
ttnghia Jun 27, 2022
5b70eee
Rewrite comment
ttnghia Jun 27, 2022
238248d
Rename files
ttnghia Jun 27, 2022
ba6bf6b
Implement float tests
ttnghia Jun 27, 2022
3845c95
Implement string tests
ttnghia Jun 27, 2022
507c82d
Implement tests for `ListDistinctTypedTest`
ttnghia Jun 28, 2022
2cb8347
Complete the remaining tests
ttnghia Jun 28, 2022
7efdea0
Merge branch 'branch-22.08' into add_lists_distinct
ttnghia Jun 28, 2022
4388637
Rewrite doxygen
ttnghia Jun 28, 2022
277c110
Merge branch 'add_lists_distinct' into set_operations
ttnghia Jun 28, 2022
1992561
Rewrite all
ttnghia Jun 28, 2022
818c85f
Fix compatibility check
ttnghia Jun 28, 2022
66e28ca
Misc
ttnghia Jun 28, 2022
279f04e
Remove files
ttnghia Jun 28, 2022
ee334c1
Implement floating point tests
ttnghia Jun 28, 2022
434d35c
Fix a bug
ttnghia Jun 28, 2022
99d526b
Add string tests
ttnghia Jun 28, 2022
66fce07
Implement typed tests
ttnghia Jun 28, 2022
8e2ff3d
Implement nested structs tests
ttnghia Jun 28, 2022
d4b7d6c
Cleanup
ttnghia Jun 28, 2022
669aa9e
Implement floating point tests and string tests
ttnghia Jun 29, 2022
f42a976
Implement typed tests
ttnghia Jun 29, 2022
e63b27d
Misc
ttnghia Jun 29, 2022
f0d839a
Implement nested structs tests
ttnghia Jun 29, 2022
21654db
Add blank lines
ttnghia Jun 29, 2022
ae36981
Implement `set_intersect_tests`
ttnghia Jun 29, 2022
b087720
Misc
ttnghia Jun 29, 2022
7a07e2c
Implement `set_union_tests`
ttnghia Jun 29, 2022
0f8f8e2
Remove files
ttnghia Jun 29, 2022
38b50bc
Rename files
ttnghia Jun 29, 2022
35cba3a
Fix a bug
ttnghia Jun 29, 2022
55fc6e5
Update default stream
ttnghia Jun 30, 2022
fa286a6
Add identity tests
ttnghia Jun 30, 2022
5764ae0
Merge branch 'branch-22.08' into set_operations
ttnghia Jun 30, 2022
678377d
Merge branch 'set_operations' into set_ops_jni
ttnghia Jun 30, 2022
1fceafd
Change default value for `list_overlap`
ttnghia Jul 5, 2022
2e8271b
Explicitly specify null and nan comparison parameters
ttnghia Jul 5, 2022
faa4048
Fix doxygen
ttnghia Jul 5, 2022
a24e206
Add Javadoc
ttnghia Jul 5, 2022
e7702a8
Add Java tests
ttnghia Jul 5, 2022
b2790dc
Modify comments
ttnghia Jul 6, 2022
21083e3
Add special post-processing for `list_overlap`
ttnghia Jul 7, 2022
2849114
Fix null handling error
ttnghia Jul 7, 2022
4cd3ba3
Optimize post-processing
ttnghia Jul 7, 2022
e192914
Add comment
ttnghia Jul 7, 2022
c639636
Merge branch 'branch-22.08' into set_ops_jni
ttnghia Jul 8, 2022
34465fc
Combine code
ttnghia Jul 8, 2022
374ed1f
Merge branch 'branch-22.08' into set_ops_jni
ttnghia Jul 26, 2022
e5c3803
Still resolve merge conflict
ttnghia Jul 26, 2022
949c875
Change function names
ttnghia Jul 26, 2022
638e52c
Fix typo
ttnghia Jul 26, 2022
272df92
Fix doxygen
ttnghia Jul 26, 2022
1ea9877
Remove empty line
ttnghia Jul 28, 2022
711ad52
Merge branch 'branch-22.08' into set_ops_jni
ttnghia Jul 28, 2022
99 changes: 94 additions & 5 deletions java/src/main/java/ai/rapids/cudf/ColumnView.java
@@ -3260,12 +3260,11 @@ public final Table extractRe(String pattern) throws CudfException {
}

/**
* Extracts all strings that match the given regular expression and correspond to the
* regular expression group index. Any null inputs also result in null output entries.
*
* For supported regex patterns refer to:
* @link https://docs.rapids.ai/api/libcudf/nightly/md_regex.html

* @param pattern The regex pattern
* @param idx The regex group index
* @return A new column vector of extracted matches
@@ -3313,7 +3312,7 @@ public final ColumnVector urlEncode() throws CudfException {
}

private static void assertIsSupportedMapKeyType(DType keyType) {
boolean isSupportedKeyType =
!keyType.equals(DType.EMPTY) && !keyType.equals(DType.LIST) && !keyType.equals(DType.STRUCT);
assert isSupportedKeyType : "Map lookup by STRUCT and LIST keys is not supported.";
}
@@ -3331,7 +3330,7 @@ public final ColumnVector getMapValue(ColumnView keys) {
return new ColumnVector(mapLookupForKeys(getNativeView(), keys.getNativeView()));
}

/**
* Given a column of type List<Struct<X, Y>> and a key of type X, return a column of type Y,
* where each row in the output column is the Y value corresponding to the X key.
* If the key is not found, the corresponding output value is null.
@@ -3542,6 +3541,88 @@ public final ColumnVector listSortRows(boolean isDescending, boolean isNullSmall
return new ColumnVector(listSortRows(getNativeView(), isDescending, isNullSmallest));
}

/**
* For each pair of lists from the input lists columns, check if they have any common non-null
* elements.
*
* A null input row in any of the input columns will result in a null output row. During checking
* for common elements, nulls within each list are considered as different values while
* floating-point NaN values are considered as equal.
*
* The input lists columns must have the same size and same data type.
*
* @param lhs The input lists column for one side
* @param rhs The input lists column for the other side
* @return A column of type BOOL8 containing the check result
*/
public static ColumnVector listsHaveOverlap(ColumnView lhs, ColumnView rhs) {
assert lhs.getType().equals(DType.LIST) && rhs.getType().equals(DType.LIST) :
"Input columns type must be of type LIST";
assert lhs.getRowCount() == rhs.getRowCount() : "Input columns must have the same size";
return new ColumnVector(listsHaveOverlap(lhs.getNativeView(), rhs.getNativeView()));
}

/**
* Find the intersection, without duplicates, between lists at each row of the given lists columns.
*
* A null input row in any of the input lists columns will result in a null output row. During
* finding list intersection, nulls and floating-point NaN values within each list are
* considered as equal values.
*
* The input lists columns must have the same size and same data type.
*
* @param lhs The input lists column for one side
* @param rhs The input lists column for the other side
* @return A lists column containing the intersection result
*/
public static ColumnVector listsIntersectDistinct(ColumnView lhs, ColumnView rhs) {
assert lhs.getType().equals(DType.LIST) && rhs.getType().equals(DType.LIST) :
"Input columns type must be of type LIST";
assert lhs.getRowCount() == rhs.getRowCount() : "Input columns must have the same size";
return new ColumnVector(listsIntersectDistinct(lhs.getNativeView(), rhs.getNativeView()));
}

/**
* Find the union, without duplicates, between lists at each row of the given lists columns.
*
* A null input row in any of the input lists columns will result in a null output row. During
* finding list union, nulls and floating-point NaN values within each list are considered as
* equal values.
*
* The input lists columns must have the same size and same data type.
*
* @param lhs The input lists column for one side
* @param rhs The input lists column for the other side
* @return A lists column containing the union result
*/
public static ColumnVector listsUnionDistinct(ColumnView lhs, ColumnView rhs) {
assert lhs.getType().equals(DType.LIST) && rhs.getType().equals(DType.LIST) :
"Input columns type must be of type LIST";
assert lhs.getRowCount() == rhs.getRowCount() : "Input columns must have the same size";
return new ColumnVector(listsUnionDistinct(lhs.getNativeView(), rhs.getNativeView()));
}

/**
* Find the difference of lists of the left column against lists of the right column.
* Specifically, find the elements (without duplicates) from each list of the left column that
* do not exist in the corresponding list of the right column.
*
* A null input row in any of the input lists columns will result in a null output row. During
* finding, nulls and floating-point NaN values within each list are considered as equal values.
*
* The input lists columns must have the same size and same data type.
*
* @param lhs The input lists column for one side
* @param rhs The input lists column for the other side
* @return A lists column containing the difference result
*/
public static ColumnVector listsDifferenceDistinct(ColumnView lhs, ColumnView rhs) {
assert lhs.getType().equals(DType.LIST) && rhs.getType().equals(DType.LIST) :
"Input columns type must be of type LIST";
assert lhs.getRowCount() == rhs.getRowCount() : "Input columns must have the same size";
return new ColumnVector(listsDifferenceDistinct(lhs.getNativeView(), rhs.getNativeView()));
}
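
Note: the null and NaN equality rules stated in the Javadoc for the three set operations above can be modeled on plain Java collections. The following is a minimal sketch under stated assumptions, not the cudf implementation. The class `ListSetOpsSketch` and its methods are hypothetical names introduced for illustration; one `List<Double>` stands in for one row of a lists column, and a Java `null` list stands in for a null row. The sketch relies on `Double.equals` treating NaN as equal to NaN and on `LinkedHashSet` accepting a single null element, which together reproduce "nulls and NaNs are equal". The element order shown comes from `LinkedHashSet`; this diff does not say whether cudf guarantees any output order. Note also that `Double.equals` distinguishes `0.0` from `-0.0`, which may differ from cudf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reference model for one row of the set operations; not part of cudf.
public class ListSetOpsSketch {

  // Distinct elements of lhs that also occur in rhs (null == null, NaN == NaN).
  public static List<Double> intersectDistinct(List<Double> lhs, List<Double> rhs) {
    if (lhs == null || rhs == null) {
      return null; // a null input row gives a null output row
    }
    Set<Double> result = new LinkedHashSet<>(lhs);
    result.retainAll(new LinkedHashSet<>(rhs));
    return new ArrayList<>(result);
  }

  // Distinct elements that occur in lhs or rhs.
  public static List<Double> unionDistinct(List<Double> lhs, List<Double> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Set<Double> result = new LinkedHashSet<>(lhs);
    result.addAll(rhs);
    return new ArrayList<>(result);
  }

  // Distinct elements of lhs that do not occur in rhs.
  public static List<Double> differenceDistinct(List<Double> lhs, List<Double> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Set<Double> result = new LinkedHashSet<>(lhs);
    result.removeAll(new LinkedHashSet<>(rhs));
    return new ArrayList<>(result);
  }

  public static void main(String[] args) {
    List<Double> lhs = Arrays.asList(1.0, Double.NaN, null, 1.0);
    List<Double> rhs = Arrays.asList(Double.NaN, null, 2.0);
    System.out.println(intersectDistinct(lhs, rhs));  // [NaN, null]
    System.out.println(unionDistinct(lhs, rhs));      // [1.0, NaN, null, 2.0]
    System.out.println(differenceDistinct(lhs, rhs)); // [1.0]
  }
}
```

The difference is not symmetric: swapping `lhs` and `rhs` in `differenceDistinct` yields `[2.0]` for the inputs above.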

/**
* Generate list offsets from sizes of each list.
* NOTICE: This API only works for INT32. Otherwise, the behavior is undefined. And no null and negative value is allowed.
@@ -4089,6 +4170,14 @@ private static native long stringReplaceWithBackrefs(long columnView, String pat

private static native long listSortRows(long nativeView, boolean isDescending, boolean isNullSmallest);

private static native long listsHaveOverlap(long lhsViewHandle, long rhsViewHandle);

private static native long listsIntersectDistinct(long lhsViewHandle, long rhsViewHandle);

private static native long listsUnionDistinct(long lhsViewHandle, long rhsViewHandle);

private static native long listsDifferenceDistinct(long lhsViewHandle, long rhsViewHandle);

private static native long getElement(long nativeView, int index);

private static native long castTo(long nativeHandle, int type, int scale);
67 changes: 67 additions & 0 deletions java/src/main/native/src/ColumnViewJni.cpp
@@ -33,6 +33,7 @@
#include <cudf/lists/extract.hpp>
#include <cudf/lists/gather.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/lists/set_operations.hpp>
#include <cudf/lists/sorting.hpp>
#include <cudf/lists/stream_compaction.hpp>
#include <cudf/null_mask.hpp>
@@ -595,6 +596,72 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_generateListOffsets(JNIEn
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_listsHaveOverlap(JNIEnv *env, jclass,
jlong lhs_handle,
jlong rhs_handle) {
JNI_NULL_CHECK(env, lhs_handle, "lhs_handle is null", 0)
JNI_NULL_CHECK(env, rhs_handle, "rhs_handle is null", 0)
try {
cudf::jni::auto_set_device(env);
auto const lhs = reinterpret_cast<cudf::column_view const *>(lhs_handle);
auto const rhs = reinterpret_cast<cudf::column_view const *>(rhs_handle);
auto overlap_result =
cudf::lists::have_overlap(cudf::lists_column_view{*lhs}, cudf::lists_column_view{*rhs},
cudf::null_equality::UNEQUAL, cudf::nan_equality::ALL_EQUAL);
cudf::jni::post_process_list_overlap(*lhs, *rhs, overlap_result);
return release_as_jlong(overlap_result);
}
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_listsIntersectDistinct(JNIEnv *env, jclass,
jlong lhs_handle,
jlong rhs_handle) {
JNI_NULL_CHECK(env, lhs_handle, "lhs_handle is null", 0)
JNI_NULL_CHECK(env, rhs_handle, "rhs_handle is null", 0)
try {
cudf::jni::auto_set_device(env);
auto const lhs = reinterpret_cast<cudf::column_view const *>(lhs_handle);
auto const rhs = reinterpret_cast<cudf::column_view const *>(rhs_handle);
return release_as_jlong(cudf::lists::intersect_distinct(
cudf::lists_column_view{*lhs}, cudf::lists_column_view{*rhs}, cudf::null_equality::EQUAL,
cudf::nan_equality::ALL_EQUAL));
}
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_listsUnionDistinct(JNIEnv *env, jclass,
jlong lhs_handle,
jlong rhs_handle) {
JNI_NULL_CHECK(env, lhs_handle, "lhs_handle is null", 0)
JNI_NULL_CHECK(env, rhs_handle, "rhs_handle is null", 0)
try {
cudf::jni::auto_set_device(env);
auto const lhs = reinterpret_cast<cudf::column_view const *>(lhs_handle);
auto const rhs = reinterpret_cast<cudf::column_view const *>(rhs_handle);
return release_as_jlong(
cudf::lists::union_distinct(cudf::lists_column_view{*lhs}, cudf::lists_column_view{*rhs},
cudf::null_equality::EQUAL, cudf::nan_equality::ALL_EQUAL));
}
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_listsDifferenceDistinct(JNIEnv *env, jclass,
jlong lhs_handle,
jlong rhs_handle) {
JNI_NULL_CHECK(env, lhs_handle, "lhs_handle is null", 0)
JNI_NULL_CHECK(env, rhs_handle, "rhs_handle is null", 0)
try {
cudf::jni::auto_set_device(env);
auto const lhs = reinterpret_cast<cudf::column_view const *>(lhs_handle);
auto const rhs = reinterpret_cast<cudf::column_view const *>(rhs_handle);
return release_as_jlong(cudf::lists::difference_distinct(
cudf::lists_column_view{*lhs}, cudf::lists_column_view{*rhs}, cudf::null_equality::EQUAL,
cudf::nan_equality::ALL_EQUAL));
}
CATCH_STD(env, 0);
}

JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_ColumnView_stringSplit(JNIEnv *env, jclass,
jlong input_handle,
jstring pattern_obj,
90 changes: 90 additions & 0 deletions java/src/main/native/src/ColumnViewJni.cu
@@ -14,6 +14,8 @@
* limitations under the License.
*/

#include <vector>

#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/copying.hpp>
@@ -22,12 +24,17 @@
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/stream_compaction.hpp>
#include <cudf/detail/valid_if.cuh>
#include <cudf/lists/list_device_view.cuh>
#include <cudf/lists/lists_column_device_view.cuh>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/functional.h>
#include <thrust/logical.h>
#include <thrust/scan.h>
#include <thrust/tabulate.h>

#include "ColumnViewJni.hpp"

@@ -81,6 +88,89 @@ std::unique_ptr<cudf::column> generate_list_offsets(cudf::column_view const &lis
return offsets_column;
}

namespace {

/**
* @brief Check if the input list has any null elements.
*
* @param list The input list.
* @return The boolean result indicating if the input list has null elements.
*/
__device__ bool list_has_nulls(list_device_view list) {
return thrust::any_of(thrust::seq, thrust::make_counting_iterator(0),
thrust::make_counting_iterator(list.size()),
[&list](auto const idx) { return list.is_null(idx); });
}

} // namespace

void post_process_list_overlap(cudf::column_view const &lhs, cudf::column_view const &rhs,
std::unique_ptr<cudf::column> const &overlap_result,
rmm::cuda_stream_view stream) {
// If neither input column has nulls in its child column, there is nothing to do here.
if (!lists_column_view{lhs}.child().has_nulls() && !lists_column_view{rhs}.child().has_nulls()) {
return;
}

auto const overlap_cv = overlap_result->view();
auto const lhs_cdv_ptr = column_device_view::create(lhs, stream);
auto const rhs_cdv_ptr = column_device_view::create(rhs, stream);
auto const overlap_cdv_ptr = column_device_view::create(overlap_cv, stream);

// Create a new bitmask to satisfy Spark's arrays_overlap's special behavior.
auto validity = rmm::device_uvector<bool>(overlap_cv.size(), stream);
thrust::tabulate(rmm::exec_policy(stream), validity.begin(), validity.end(),
[lhs = cudf::detail::lists_column_device_view{*lhs_cdv_ptr},
rhs = cudf::detail::lists_column_device_view{*rhs_cdv_ptr},
overlap_result = *overlap_cdv_ptr] __device__(auto const idx) {
if (overlap_result.is_null(idx) ||
overlap_result.template element<bool>(idx)) {
return true;
}

// `lhs_list` and `rhs_list` should not be null, otherwise
// `overlap_result[idx]` is null and that has been handled above.
auto const lhs_list = list_device_view{lhs, idx};
auto const rhs_list = list_device_view{rhs, idx};

// Only proceed if both lists are non-empty.
if (lhs_list.size() == 0 || rhs_list.size() == 0) {
return true;
}

// Only proceed if at least one list has nulls.
if (!list_has_nulls(lhs_list) && !list_has_nulls(rhs_list)) {
return true;
}

// Here, the input lists satisfy all the conditions below so we output a
// null:
// - The input lists have no non-null common element, and
// - They are both non-empty, and
// - Either of them contains null elements.
return false;
});

// Create a new nullmask from the validity data.
auto [new_null_mask, new_null_count] =
cudf::detail::valid_if(validity.begin(), validity.end(), thrust::identity{});

if (new_null_count > 0) {
// If the `overlap_result` column is nullable, perform `bitmask_and` of its nullmask and the
// new nullmask.
if (overlap_cv.nullable()) {
auto [null_mask, null_count] = cudf::detail::bitmask_and(
std::vector<bitmask_type const *>{
overlap_cv.null_mask(), static_cast<bitmask_type const *>(new_null_mask.data())},
std::vector<cudf::size_type>{0, 0}, overlap_cv.size(), stream);
overlap_result->set_null_mask(std::move(null_mask), null_count);
} else {
// Just set the output nullmask as the new nullmask.
overlap_result->set_null_mask(std::move(new_null_mask), new_null_count);
}
}
}
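
Note: the decision made per row by `have_overlap` plus `post_process_list_overlap` can be summarized as three-valued logic. The following is a minimal sketch of that rule on plain Java lists, assuming the behavior described in the comments of this diff; it is not the cudf or Spark implementation. The class `ArraysOverlapSketch` and the method `overlapLikeSpark` are hypothetical names introduced for illustration. A `null` list stands in for a null row, and a `null` return value stands in for a null output element. Nulls inside a list never match each other, while NaN matches NaN because boxed `Double.equals` treats NaN values as equal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical reference model for one pair of rows; not part of cudf or Spark.
public class ArraysOverlapSketch {

  // Returns TRUE, FALSE, or null (unknown) for one pair of lists.
  public static Boolean overlapLikeSpark(List<Double> lhs, List<Double> rhs) {
    // A null row on either side gives a null result.
    if (lhs == null || rhs == null) {
      return null;
    }

    // Collect the non-null elements of lhs. Nulls are excluded because they compare unequal.
    Set<Double> lhsNonNull = new HashSet<>();
    for (Double v : lhs) {
      if (v != null) {
        lhsNonNull.add(v);
      }
    }

    // A shared non-null element means the lists overlap.
    for (Double v : rhs) {
      if (v != null && lhsNonNull.contains(v)) {
        return Boolean.TRUE;
      }
    }

    // No common non-null element: the answer is unknown (null) only when both lists
    // are non-empty and at least one of them contains a null element.
    boolean bothNonEmpty = !lhs.isEmpty() && !rhs.isEmpty();
    boolean eitherHasNull =
        lhs.stream().anyMatch(Objects::isNull) || rhs.stream().anyMatch(Objects::isNull);
    if (bothNonEmpty && eitherHasNull) {
      return null;
    }
    return Boolean.FALSE;
  }

  public static void main(String[] args) {
    System.out.println(overlapLikeSpark(Arrays.asList(1.0, null), Arrays.asList(1.0, 2.0))); // true
    System.out.println(overlapLikeSpark(Arrays.asList(1.0, null), Arrays.asList(2.0)));      // null
    System.out.println(overlapLikeSpark(Arrays.asList(1.0), Arrays.asList(2.0)));            // false
  }
}
```

In the native code the same outcome is reached in two steps: `cudf::lists::have_overlap` produces `true`, `false`, or null for null rows, and the post-processing then turns the qualifying `false` entries into nulls by combining a new validity mask with the existing one.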

std::unique_ptr<cudf::column> lists_distinct_by_key(cudf::lists_column_view const &input,
rmm::cuda_stream_view stream) {
if (input.is_empty()) {
22 changes: 22 additions & 0 deletions java/src/main/native/src/ColumnViewJni.hpp
@@ -53,6 +53,28 @@ std::unique_ptr<cudf::column>
generate_list_offsets(cudf::column_view const &list_length,
rmm::cuda_stream_view stream = cudf::default_stream_value);

/**
* @brief Perform a special treatment for the results of `cudf::lists::have_overlap` to produce the
* results that match with Spark's `arrays_overlap`.
*
* The function `arrays_overlap` of Apache Spark has a special behavior that needs to be addressed.
* In particular, the result of checking overlap between two lists will be a null element instead of
* a `false` value (as output by `cudf::lists::have_overlap`) if:
* - The input lists have no non-null common element, and
* - They are both non-empty, and
* - Either of them contains null elements.
*
* This function performs post-processing on the results of `cudf::lists::have_overlap`, adding
* special treatment to produce an output column that matches with the behavior described above.
*
* @param lhs The input lists column for one side.
* @param rhs The input lists column for the other side.
* @param overlap_result The result column generated by checking list overlap in cudf.
*/
void post_process_list_overlap(cudf::column_view const &lhs, cudf::column_view const &rhs,
std::unique_ptr<cudf::column> const &overlap_result,
rmm::cuda_stream_view stream = cudf::default_stream_value);

/**
* @brief Generates lists column by copying elements that are distinct by key from each input list
* row to the corresponding output row.