Allow merging index column with data column using keyword "on" #7569

Closed
122 commits
4a4b4af
Merge branch 'branch-0.17' into branch-0.18
shwina Dec 11, 2020
223f2b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 15, 2020
abd6ad2
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 17, 2020
18863b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 4, 2021
0fbdd31
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
dc9b943
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
d586aa7
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 7, 2021
996fda8
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 8, 2021
2808a5c
Add a compute_hash_join_indices that returns just the join indices
shwina Jan 11, 2021
ef0baee
Don't need common_columns stuff for join that returns a gathermap
shwina Jan 11, 2021
18f3074
Add hash_join_impl methods that return gathermaps
shwina Jan 11, 2021
70abf48
Add overloads to public hash_join class
shwina Jan 11, 2021
13dff67
Add top-level join APIs that return gathermaps
shwina Jan 11, 2021
3300fe1
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 12, 2021
7ed694c
Use device_uvector instead of device_vector in join
shwina Jan 12, 2021
636c2ea
Undo some API changes
shwina Jan 12, 2021
b79da68
Add join_result
shwina Jan 13, 2021
380aa59
Add APIs that return join_result
shwina Jan 13, 2021
3cbb2b4
Remove column_in_common
shwina Jan 13, 2021
53ae7c9
Add an inner join API that returns gathermaps
shwina Jan 14, 2021
fde172b
Add remaining APIs to return gathermaps
shwina Jan 14, 2021
4a286dd
Add gathermap join test
shwina Jan 18, 2021
c756db9
Replace -1 with INT_MIN
shwina Jan 18, 2021
6a3d23e
Make join_result columns instead of column_views
shwina Jan 20, 2021
5dfc2a0
Replace join_result with a pair of columns
shwina Jan 20, 2021
362829b
Add gathermap test for outer join
shwina Jan 20, 2021
4e4380c
Add and pass full join gathermap test
shwina Jan 20, 2021
339a13d
Begin Python-side refactor
shwina Jan 21, 2021
2b07802
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 25, 2021
0d5a19c
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 28, 2021
fdbdc12
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 1, 2021
5dd5d29
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 5, 2021
6b20429
Merge branch 'branch-0.19' into gathermap-based-join-apis
shwina Feb 8, 2021
044eac1
Add left_semi and left_anti join APIs that return gathermaps
shwina Feb 8, 2021
555d5ec
Add Cython bindings
shwina Feb 8, 2021
56ae616
full -> outer
shwina Feb 9, 2021
dd05121
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 9, 2021
d447924
Progress
shwina Feb 9, 2021
484512e
More progress on py refactor
shwina Feb 9, 2021
5227582
Remove breakpoint
shwina Feb 10, 2021
9cd870e
Fix neg index handling
shwina Feb 10, 2021
8e4f193
Use nullify gather in join
shwina Feb 10, 2021
29fe140
Handle outer joins better
shwina Feb 10, 2021
b634055
Fix index construction
shwina Feb 10, 2021
cd53d6c
Fix sorting behaviour
shwina Feb 10, 2021
75f1efd
Fix Index.join
shwina Feb 10, 2021
1f5d6ad
Progress on semi/anti joins
shwina Feb 10, 2021
de30520
Add simple join test
shwina Feb 10, 2021
66a0de5
Semi-join fix
shwina Feb 11, 2021
ca72295
Only combine key columns in outer join if they have the same name
shwina Feb 11, 2021
ee2242d
Handle when both _on and _index are provided
shwina Feb 11, 2021
e531725
Fix sorting join result
shwina Feb 11, 2021
c8b4948
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 11, 2021
674095c
whitespace
shwina Feb 12, 2021
cbd9dc3
Make construct_join_output_df work with column views
shwina Feb 12, 2021
3f3c3cb
Get rid of hash_join::left_join
shwina Feb 12, 2021
01415fc
More join C++ cleanup
shwina Feb 12, 2021
6185492
Even more cleaning
shwina Feb 17, 2021
d736d1c
More join tests
shwina Feb 18, 2021
b58591d
Fix all join tests
shwina Feb 18, 2021
be560bb
Python regressions
shwina Feb 18, 2021
efb60d6
Revert
shwina Feb 18, 2021
fe6d0b8
Invalid -> Unkown
shwina Feb 18, 2021
547027c
Don't mutate lhs/rhs
shwina Feb 18, 2021
5f93d23
Fix join tests
shwina Feb 19, 2021
b7bf821
Fix semi/anti join trivial cases
shwina Feb 19, 2021
50a2fb2
When testing join results, use a helper that sorts values
shwina Feb 19, 2021
ff0ae79
Totally broken commit
shwina Feb 19, 2021
07cd052
Cleanup
shwina Feb 20, 2021
bd6bf77
Warnings
shwina Feb 20, 2021
a40063e
Cleanup
shwina Feb 22, 2021
ccef9d0
Cleanup
shwina Feb 22, 2021
210244b
Cleanup
shwina Feb 22, 2021
b57348c
Add typing for join helpers
shwina Feb 22, 2021
5c2c9b3
Typing for Join class
shwina Feb 22, 2021
558aa15
Simplify joiner API
shwina Feb 22, 2021
3184896
Example doc
shwina Feb 22, 2021
d3535dc
Refactor join APIs to return a device_uvector
shwina Feb 25, 2021
3b0a2a5
Merge tag 'branch-0.19-latest' of https://github.com/rapidsai/cudf in…
shwina Mar 1, 2021
b82181d
docs
shwina Mar 3, 2021
77d2bfd
Finish up docs?
shwina Mar 3, 2021
0bf34e8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 4, 2021
26a3fb0
Fix join tests
shwina Mar 4, 2021
8a60d62
Refactor join APIs to work with unique_ptr<rmm::device_uvector>>
shwina Mar 5, 2021
387a953
Update join Cython
shwina Mar 5, 2021
6cd6433
Need to resize the gathermap
shwina Mar 5, 2021
c67dcce
Doc
shwina Mar 5, 2021
30c22ed
Changelog
shwina Mar 5, 2021
f73199d
Add helper to convert gather_map_type->Column
shwina Mar 9, 2021
393c06a
Update python/cudf/cudf/core/frame.py
shwina Mar 9, 2021
e91f554
Cannot specify both column and index
shwina Mar 9, 2021
0185896
Vaildate how
shwina Mar 9, 2021
b232f85
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 9, 2021
1eb495d
Can't use a set
shwina Mar 9, 2021
4f1f072
Avoid function local import
shwina Mar 10, 2021
4aa8fec
False -> NotImplementedError
shwina Mar 10, 2021
ae0e5f9
Update cpp/include/cudf/join.hpp
shwina Mar 10, 2021
f47cf7e
Reuse some join logic
shwina Mar 10, 2021
2a201c3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 10, 2021
230ca08
Formatting
shwina Mar 10, 2021
1ad18b0
Closed PR 6453; created new PR to implement merging index column with…
skirui-source Mar 11, 2021
498a621
Update cpp/include/cudf/join.hpp
shwina Mar 11, 2021
2de26f3
Docs?
shwina Mar 11, 2021
d6f128c
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 11, 2021
b7d8d8a
Use mr
shwina Mar 11, 2021
3988897
.
skirui-source Mar 12, 2021
9efc761
Docs
shwina Mar 15, 2021
8779bc7
Simplify suffix handling
shwina Mar 16, 2021
4c71c6c
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
skirui-source Mar 16, 2021
7a8f83b
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into m…
skirui-source Mar 16, 2021
4c651ac
Simplify joiner requirements
shwina Mar 17, 2021
b4f4d7c
Do less work in SemiJoin._merge_results
shwina Mar 17, 2021
d353c92
Doc
shwina Mar 17, 2021
580a346
Doc
shwina Mar 17, 2021
328dafd
Return None from semi_join
shwina Mar 17, 2021
297d20a
Init common_type
shwina Mar 17, 2021
d9e291b
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
skirui-source Mar 17, 2021
049a332
handles merging index columns with data columns. failing tests
skirui-source Mar 18, 2021
ec1ff7a
Merge branch 'branch-0.19' into mergeindexondata
skirui-source Mar 23, 2021
2cf80b6
fix merge conflict in frame.py
skirui-source Mar 23, 2021
f13d1b5
Merge branch 'mergeindexondata' of github.com:skirui-source/cudf into…
skirui-source Mar 23, 2021
c64aa1d
fix merge conflict
skirui-source Mar 25, 2021
8 changes: 2 additions & 6 deletions cpp/benchmarks/join/join_benchmark.cu
@@ -105,12 +105,8 @@ static void BM_join(benchmark::State &state)
for (auto _ : state) {
cuda_event_timer raii(state, true, 0);

- auto result = cudf::inner_join(probe_table,
- build_table,
- columns_to_join,
- columns_to_join,
- {{0, 0}},
- cudf::null_equality::UNEQUAL);
+ auto result = cudf::inner_join(
+ probe_table, build_table, columns_to_join, columns_to_join, cudf::null_equality::UNEQUAL);
}
}

440 changes: 252 additions & 188 deletions cpp/include/cudf/join.hpp

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions cpp/include/cudf/table/table_view.hpp
@@ -126,6 +126,11 @@ class table_view_base {
*/
size_type num_rows() const noexcept { return _num_rows; }

+ /**
+ * @brief Returns true if `num_columns()` returns zero, or false otherwise
+ */
+ size_type is_empty() const noexcept { return num_columns() == 0; }
+
table_view_base() = default;

~table_view_base() = default;
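
The new `is_empty()` accessor is defined purely in terms of the column count, so a view with columns but zero rows is not considered empty. A minimal host-side sketch of that semantics follows. `table_shape` is a hypothetical stand-in for `table_view_base`, not the library class, and it returns `bool`, a deliberate departure from the `size_type` return type declared in the diff.

```cpp
#include <cassert>
#include <cstdint>

using size_type = std::int32_t;  // assumption: mirrors cudf::size_type

// Hypothetical stand-in for table_view_base; only the shape is modelled.
struct table_shape {
  size_type columns;
  size_type rows;

  size_type num_columns() const noexcept { return columns; }
  size_type num_rows() const noexcept { return rows; }

  // Empty means "no columns"; the row count is not consulted.
  bool is_empty() const noexcept { return num_columns() == 0; }
};

// Convenience constructor that checks the shape is non-negative.
inline table_shape make_shape(size_type columns, size_type rows)
{
  assert(columns >= 0 && rows >= 0);
  return table_shape{columns, rows};
}
```

Under this definition `make_shape(2, 0)` has no rows but is still non-empty, which matters when a join needs to distinguish "no key columns" from "no rows to probe".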
4 changes: 1 addition & 3 deletions cpp/src/copying/gather.cu
@@ -43,9 +43,7 @@ std::unique_ptr<table> gather(table_view const& source_table,

if (neg_indices == negative_index_policy::ALLOWED) {
cudf::size_type n_rows = source_table.num_rows();
- auto idx_converter = [n_rows] __device__(size_type in) {
- return ((in % n_rows) + n_rows) % n_rows;
- };
+ auto idx_converter = [n_rows] __device__(size_type in) { return in < 0 ? in + n_rows : in; };
return gather(source_table,
thrust::make_transform_iterator(map_begin, idx_converter),
thrust::make_transform_iterator(map_end, idx_converter),
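
The hunk above replaces the double-modulo index conversion with a single conditional add. The two mappings agree for every index in `[-n_rows, n_rows)` and differ only outside that range, where the old form wraps and the new form leaves the value out of bounds. The following host-side sketch compares them; the function names `wrap_modulo` and `wrap_conditional` are hypothetical, and `size_type` is assumed to be a 32-bit signed integer.

```cpp
#include <cassert>
#include <cstdint>

using size_type = std::int32_t;  // assumption: mirrors cudf::size_type

// Old mapping from the diff: wraps any integer into [0, n_rows).
inline size_type wrap_modulo(size_type in, size_type n_rows)
{
  assert(n_rows > 0);  // modulo by zero is undefined
  return ((in % n_rows) + n_rows) % n_rows;
}

// New mapping from the diff: shifts negative indices by n_rows and
// leaves non-negative indices untouched.
inline size_type wrap_conditional(size_type in, size_type n_rows)
{
  return in < 0 ? in + n_rows : in;
}
```

With 5 rows, both functions map `-1` to `4`. An index of `7` becomes `2` under the old mapping but stays `7` under the new one, and `-7` becomes `3` versus `-2`. The new form avoids two integer modulo operations per element, at the cost of no longer wrapping out-of-range values, so a large negative sentinel stays out of range instead of landing on a valid row.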
333 changes: 133 additions & 200 deletions cpp/src/join/hash_join.cu

Large diffs are not rendered by default.

143 changes: 52 additions & 91 deletions cpp/src/join/hash_join.cuh
@@ -15,6 +15,9 @@
*/
#pragma once

+ #include <cudf/detail/concatenate.cuh>
+ #include <cudf/detail/gather.cuh>
+ #include <cudf/detail/gather.hpp>
#include <join/join_common_utils.hpp>
#include <join/join_kernels.cuh>

@@ -25,7 +28,7 @@
#include <cudf/table/table_view.hpp>

#include <rmm/cuda_stream_view.hpp>
- #include <rmm/device_vector.hpp>
+ #include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/sequence.h>
@@ -178,19 +181,29 @@ size_type estimate_join_output_size(table_device_view build_table,
*
* @param left Table of left columns to join
* @param stream CUDA stream used for device memory operations and kernel launches
+ * @param mr Device memory resource used to allocate the result
*
* @return Join output indices vector pair
*/
- inline std::pair<rmm::device_vector<size_type>, rmm::device_vector<size_type>>
- get_trivial_left_join_indices(table_view const& left, rmm::cuda_stream_view stream)
+ inline std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
+ get_trivial_left_join_indices(
+ table_view const& left,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
{
- rmm::device_vector<size_type> left_indices(left.num_rows());
- thrust::sequence(rmm::exec_policy(stream), left_indices.begin(), left_indices.end(), 0);
- rmm::device_vector<size_type> right_indices(left.num_rows());
- thrust::fill(rmm::exec_policy(stream), right_indices.begin(), right_indices.end(), JoinNoneValue);
+ auto left_indices = std::make_unique<rmm::device_uvector<size_type>>(left.num_rows(), stream, mr);
+ thrust::sequence(rmm::exec_policy(stream), left_indices->begin(), left_indices->end(), 0);
+ auto right_indices =
+ std::make_unique<rmm::device_uvector<size_type>>(left.num_rows(), stream, mr);
+ thrust::fill(
+ rmm::exec_policy(stream), right_indices->begin(), right_indices->end(), JoinNoneValue);
return std::make_pair(std::move(left_indices), std::move(right_indices));
}

+ std::pair<std::unique_ptr<table>, std::unique_ptr<table>> get_empty_joined_table(
+ table_view const& probe, table_view const& build);
+
std::unique_ptr<cudf::table> combine_table_pair(std::unique_ptr<cudf::table>&& left,
std::unique_ptr<cudf::table>&& right);
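
`get_trivial_left_join_indices` in this hunk builds the gather maps for a left join in which no left row has a match: the left map is the sequence `0, 1, ..., n-1` and the right map is filled with a "no match" sentinel. The sketch below reproduces that shape on the host with `std::vector` instead of `rmm::device_uvector`. The names `trivial_left_join_indices` and `kJoinNoneValue` are hypothetical, and the sentinel value is an assumption based on the commit "Replace -1 with INT_MIN"; the diff itself does not show how `JoinNoneValue` is defined.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

using size_type = std::int32_t;  // assumption: mirrors cudf::size_type

// Assumed "no match" sentinel; the real JoinNoneValue is not shown in the diff.
constexpr size_type kJoinNoneValue = std::numeric_limits<size_type>::min();

// Host-side analogue of the trivial left join: every left row is kept
// and paired with the sentinel on the right.
inline std::pair<std::vector<size_type>, std::vector<size_type>>
trivial_left_join_indices(size_type left_num_rows)
{
  assert(left_num_rows >= 0);
  std::vector<size_type> left_indices(left_num_rows);
  std::iota(left_indices.begin(), left_indices.end(), 0);  // 0, 1, ..., n-1
  std::vector<size_type> right_indices(left_num_rows, kJoinNoneValue);
  return std::make_pair(std::move(left_indices), std::move(right_indices));
}
```

Both maps always have the same length, one entry per output row, so a caller can gather the left and right tables independently and place the results side by side.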

@@ -207,106 +220,52 @@ struct hash_join::hash_join_impl {

private:
cudf::table_view _build;
- cudf::table_view _build_selected;
- std::vector<size_type> _build_on;
std::unique_ptr<cudf::detail::multimap_type, std::function<void(cudf::detail::multimap_type*)>>
_hash_table;

public:
/**
- * @brief Constructor that internally builds the hash table based on the given `build` table and
- * column indices specified by `build_on` for subsequent probe calls.
+ * @brief Constructor that internally builds the hash table based on the given `build` table
*
* @throw cudf::logic_error if the number of columns in `build` table is 0.
* @throw cudf::logic_error if the number of rows in `build` table exceeds MAX_JOIN_SIZE.
- * @throw std::out_of_range if elements of `build_on` exceed the number of columns in the `build`
- * table.
*
* @param build The build table, from which the hash table is built.
- * @param build_on The column indices from `build` to join on.
* @param compare_nulls Controls whether null join-key values should match or not.
*/
hash_join_impl(cudf::table_view const& build,
- std::vector<size_type> const& build_on,
null_equality compare_nulls,
rmm::cuda_stream_view stream = rmm::cuda_stream_default);

- std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::table>> inner_join(
- cudf::table_view const& probe,
- std::vector<size_type> const& probe_on,
- std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
- common_columns_output_side common_columns_output_side,
- null_equality compare_nulls,
- rmm::cuda_stream_view stream,
- rmm::mr::device_memory_resource* mr) const;
-
- std::unique_ptr<cudf::table> left_join(
- cudf::table_view const& probe,
- std::vector<size_type> const& probe_on,
- std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
- null_equality compare_nulls,
- rmm::cuda_stream_view stream,
- rmm::mr::device_memory_resource* mr) const;
-
- std::unique_ptr<cudf::table> full_join(
- cudf::table_view const& probe,
- std::vector<size_type> const& probe_on,
- std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
- null_equality compare_nulls,
- rmm::cuda_stream_view stream,
- rmm::mr::device_memory_resource* mr) const;
+ std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
+ inner_join(cudf::table_view const& probe,
+ null_equality compare_nulls,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr) const;
+
+ std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
+ left_join(cudf::table_view const& probe,
+ null_equality compare_nulls,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr) const;
+
+ std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
+ full_join(cudf::table_view const& probe,
+ null_equality compare_nulls,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr) const;

private:
- /**
- * @brief Performs hash join by probing the columns provided in `probe` as per
- * the joining indices given in `probe_on` and returns a (`probe`, `_build`) table pair, which
- * contains the probe and build portions of the logical joined table respectively.
- *
- * @throw cudf::logic_error if `columns_in_common` contains a pair of indices
- * (`P`, `B`) where `P` does not exist in `probe_on` or `B` does not exist in
- * `_build_on`.
- * @throw cudf::logic_error if `columns_in_common` contains a pair of indices
- * (`P`, `B`) such that the location of `P` within `probe_on` is not equal to
- * the location of `B` within `_build_on`.
- * @throw cudf::logic_error if the number of elements in `probe_on` and
- * `_build_on` are not equal.
- * @throw cudf::logic_error if the number of columns in `probe` is 0.
- * @throw cudf::logic_error if the number of rows in `probe` table exceeds MAX_JOIN_SIZE.
- * @throw std::out_of_range if elements of `probe_on` exceed the number of columns in the `probe`
- * table.
- * @throw cudf::logic_error if types do not match between joining columns.
- *
- * @tparam JoinKind The type of join to be performed.
- *
- * @param probe The probe table.
- * @param probe_on The column's indices from `probe` to join on.
- * Column `i` from `probe_on` will be compared against column `i` of `_build_on`.
- * @param columns_in_common is a vector of pairs of column indices into
- * `probe` and `_build`, respectively, that are "in common". For "common"
- * columns, only a single output column will be produced, which is gathered
- * from `probe_on` columns. Else, for every column in `probe_on` and `_build_on`,
- * an output column will be produced. For each of these pairs (P, B), P
- * should exist in `probe_on` and B should exist in `_build_on`.
- * @param common_columns_output_side @see cudf::hash_join::common_columns_output_side.
- * @param compare_nulls Controls whether null join-key values should match or not.
- * @param mr Device memory resource used to allocate the returned table's device memory.
- * @param stream CUDA stream used for device memory operations and kernel launches.
- *
- * @return Table pair of (`probe`, `_build`) of joining both tables on the columns
- * specified by `probe_on` and `_build_on`. The resulting table pair will be joined columns of
- * (`probe(including common columns)`, `_build(excluding common columns)`) if
- * `common_columns_output_side` is `PROBE`, or (`probe(excluding common columns)`,
- * `_build(including common columns)`) if `common_columns_output_side` is `BUILD`.
- */
template <cudf::detail::join_kind JoinKind>
- std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::table>> compute_hash_join(
- cudf::table_view const& probe,
- std::vector<size_type> const& probe_on,
- std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
- common_columns_output_side common_columns_output_side,
- null_equality compare_nulls,
- rmm::cuda_stream_view stream,
- rmm::mr::device_memory_resource* mr) const;
+ std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
+ compute_hash_join(cudf::table_view const& probe,
+ null_equality compare_nulls,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr) const;

/**
* @brief Probes the `_hash_table` built from `_build` for tuples in `probe_table`,
@@ -320,15 +279,17 @@ struct hash_join::hash_join_impl {
* @param probe_table Table of probe side columns to join.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param stream CUDA stream used for device memory operations and kernel launches.
+ * @param mr Device memory resource used to allocate the returned vectors.
*
* @return Join output indices vector pair.
*/
template <cudf::detail::join_kind JoinKind>
std::enable_if_t<JoinKind != cudf::detail::join_kind::FULL_JOIN,
- std::pair<rmm::device_vector<size_type>, rmm::device_vector<size_type>>>
+ std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+ std::unique_ptr<rmm::device_uvector<size_type>>>
probe_join_indices(cudf::table_view const& probe,
null_equality compare_nulls,
- rmm::cuda_stream_view stream) const;
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr) const;
};

} // namespace cudf
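
Across this header, the join methods stop returning materialized tables and instead return a pair of index vectors (gather maps), leaving it to the caller to gather whichever columns it needs. The following host-side sketch illustrates that split for a left join on a single integer key. It is an illustration under stated assumptions, not the libcudf implementation: `std::unordered_multimap` stands in for `multimap_type`, `std::vector` for `rmm::device_uvector`, and the names `left_join_indices`, `gather_nullify`, `nullable_column`, and `kJoinNoneValue` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

using size_type  = std::int32_t;  // assumption: mirrors cudf::size_type
using gather_map = std::vector<size_type>;

// Assumed "no match" sentinel; the real JoinNoneValue is not shown in the diff.
constexpr size_type kJoinNoneValue = std::numeric_limits<size_type>::min();

// Step 1: build a hash table on the build keys, probe it with the probe keys,
// and return only the row indices of each side (the gather maps).
inline std::pair<gather_map, gather_map> left_join_indices(std::vector<int> const& probe,
                                                           std::vector<int> const& build)
{
  std::unordered_multimap<int, size_type> hash_table;
  for (size_type i = 0; i < static_cast<size_type>(build.size()); ++i) {
    hash_table.emplace(build[i], i);
  }

  gather_map probe_idx;
  gather_map build_idx;
  for (size_type i = 0; i < static_cast<size_type>(probe.size()); ++i) {
    auto range = hash_table.equal_range(probe[i]);
    if (range.first == range.second) {
      // Left join keeps unmatched probe rows, paired with the sentinel.
      probe_idx.push_back(i);
      build_idx.push_back(kJoinNoneValue);
      continue;
    }
    for (auto it = range.first; it != range.second; ++it) {
      probe_idx.push_back(i);
      build_idx.push_back(it->second);
    }
  }
  return std::make_pair(std::move(probe_idx), std::move(build_idx));
}

// A column with a per-row validity flag, standing in for a null mask.
template <typename T>
struct nullable_column {
  std::vector<T> data;
  std::vector<bool> valid;
};

// Step 2: materialize a column from a gather map. Out-of-bounds indices,
// including the sentinel, produce a null row instead of reading memory.
template <typename T>
nullable_column<T> gather_nullify(std::vector<T> const& column, gather_map const& map)
{
  nullable_column<T> out;
  auto const n = static_cast<size_type>(column.size());
  for (size_type idx : map) {
    bool const in_bounds = idx >= 0 && idx < n;
    out.data.push_back(in_bounds ? column[idx] : T{});
    out.valid.push_back(in_bounds);
  }
  return out;
}
```

Probing keys `{1, 2, 3}` against build keys `{3, 1}` gives the probe map `{0, 1, 2}` and a build map whose middle entry is the sentinel; gathering a build-side payload through that map yields a null in the unmatched row. Returning indices rather than tables lets the caller decide which columns to gather and how to combine key columns, which is the part of the work the Python-side commits in this PR take over.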