Implement all methods of groupby rank aggregation in libcudf, python #9569

Merged 56 commits on Apr 28, 2022
Changes from 46 commits
Commits
7085ad1
move RANK, DENSE_RANK into single RANK aggregation
karthikeyann Oct 29, 2021
24a11c9
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
karthikeyann Nov 8, 2021
007bafb
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
karthikeyann Nov 10, 2021
dca492a
style fix clang-format
karthikeyann Nov 15, 2021
d970133
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
karthikeyann Nov 15, 2021
406faed
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
karthikeyann Dec 7, 2021
e5e97fd
fix factories usage
karthikeyann Dec 7, 2021
fa1cd22
fix throw message
karthikeyann Dec 7, 2021
0e67a17
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
karthikeyann Jan 3, 2022
b79fe0c
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
karthikeyann Jan 10, 2022
af85093
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
karthikeyann Feb 26, 2022
5bab9e2
update copyright year
karthikeyann Feb 26, 2022
8a7da0d
rename PERCENT_RANK to ANSI_SQL_PERCENT_RANK
karthikeyann Feb 26, 2022
b731422
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
karthikeyann Mar 18, 2022
a3bd6ad
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
karthikeyann Mar 22, 2022
b69fe85
move rank_method to aggregation headers from types headers
karthikeyann Mar 28, 2022
96d2b61
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Mar 28, 2022
c5c080d
add cython, python code for groupby rank
karthikeyann Mar 28, 2022
7690d28
fix bug - assumes values is presorted, presorted is for keys, not va…
karthikeyann Mar 30, 2022
d55d80f
reuse rank_min aggregation for ANSI_SQL_PERCENT_RANK
karthikeyann Mar 30, 2022
1578742
add max, first, average rank agg in sort groupby
karthikeyann Mar 30, 2022
6ac6c3e
add pytest for groupby rank
karthikeyann Mar 30, 2022
8cea0bd
implement percentage to groupby rank
karthikeyann Mar 31, 2022
29e6917
add percentage tests, with nulls
karthikeyann Mar 31, 2022
17a6da1
fix failing tests is_presorted assumes values sorted too. (TODO fix)
karthikeyann Apr 1, 2022
7913298
add groupby rank benchmark
karthikeyann Apr 1, 2022
3f9ccca
cleanup and documentation
karthikeyann Apr 1, 2022
9dc2af5
fix moving RankMethod
karthikeyann Apr 1, 2022
ea3a6da
add first, average, max rank documentation
karthikeyann Apr 1, 2022
5c83f1a
update documentation
karthikeyann Apr 1, 2022
58a1460
Merge branch 'branch-22.06' into fea-groupby_rank_full
karthikeyann Apr 4, 2022
674a5ad
null include default for spark
karthikeyann Apr 4, 2022
5d1070f
address review comments (mythrocks)
karthikeyann Apr 5, 2022
2bb65f6
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Apr 5, 2022
d0ba38a
address review comments
karthikeyann Apr 7, 2022
4f290af
pytest coverage
karthikeyann Apr 7, 2022
145c9b1
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Apr 7, 2022
50660af
address review comments (mythrocks, vyasr)
karthikeyann Apr 7, 2022
00f5dbe
named lambdas
karthikeyann Apr 7, 2022
b5b3234
add forward tparam to use forward or reverse iterator for scan_by_key
karthikeyann Apr 7, 2022
f7af03a
simplify lambda
karthikeyann Apr 7, 2022
75ddf65
fix ansi_sql_rank_aggregation rename
karthikeyann Apr 7, 2022
aacf74e
fix style black
karthikeyann Apr 7, 2022
bc5f7b6
remove unused header
karthikeyann Apr 11, 2022
21048b4
address review comments (vyasr)
karthikeyann Apr 14, 2022
9853cf6
address review comments (vyasr)
karthikeyann Apr 14, 2022
b810a84
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Apr 18, 2022
fbdb80e
fix merge issues
karthikeyann Apr 18, 2022
fef487c
move groupby rank gbench to nvbench
karthikeyann Apr 18, 2022
51a4fe6
documentation about orderby sorting requirement in rank aggregations
karthikeyann Apr 18, 2022
ee88304
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Apr 21, 2022
c988d8f
rename ANSI_SQL_PERCENT_RANK to MIN_0_INDEXED rank_method
karthikeyann Apr 25, 2022
d269d2d
Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …
karthikeyann Apr 25, 2022
fc8b625
add rank_percentage, replace MIN_0_INDEXED
karthikeyann Apr 26, 2022
fb8c096
update percentage enum in java, python usages
karthikeyann Apr 26, 2022
36c7a53
fix group_size==1 as zero rank percentage
karthikeyann Apr 27, 2022
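
The later commits replace the ANSI SQL percent-rank method with a separate rank_percentage option and special-case single-row groups. A minimal host-side sketch of that normalization follows, for illustration only. The enum and function names are hypothetical and not taken from libcudf. The one-normalized formula is the ANSI SQL PERCENT_RANK definition, (rank - 1) / (rows - 1); the zero-normalized formula, rank / rows, is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical percentage modes; names are illustrative, not libcudf's API.
enum class percentage_mode { NONE, ZERO_NORMALIZED, ONE_NORMALIZED };

// Convert a 1-based rank within a group of `group_size` rows to a percentage.
double to_percentage(double rank, std::int64_t group_size, percentage_mode mode)
{
  assert(group_size > 0);
  assert(rank >= 1.0 && rank <= static_cast<double>(group_size));
  switch (mode) {
    case percentage_mode::ZERO_NORMALIZED:
      // Assumed definition: rank / rows, so results lie in (0, 1].
      return rank / static_cast<double>(group_size);
    case percentage_mode::ONE_NORMALIZED:
      // ANSI SQL PERCENT_RANK: (rank - 1) / (rows - 1), in [0, 1].
      // A single-row group would divide 0 by 0, so report 0 instead.
      return group_size == 1 ? 0.0 : (rank - 1.0) / static_cast<double>(group_size - 1);
    case percentage_mode::NONE: break;
  }
  return rank;  // NONE: leave the rank unchanged
}
```

With these definitions, a single-row group yields 0 under the one-normalized mode, which is the case the commit "fix group_size==1 as zero rank percentage" addresses.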
10 changes: 8 additions & 2 deletions cpp/benchmarks/CMakeLists.txt
@@ -194,8 +194,14 @@ ConfigureBench(FILL_BENCH filling/repeat.cpp)
 # ##################################################################################################
 # * groupby benchmark -----------------------------------------------------------------------------
 ConfigureBench(
-  GROUPBY_BENCH groupby/group_sum.cu groupby/group_nth.cu groupby/group_shift.cu
-  groupby/group_struct.cu groupby/group_no_requests.cu groupby/group_scan.cu
+  GROUPBY_BENCH
+  groupby/group_sum.cu
+  groupby/group_nth.cu
+  groupby/group_shift.cu
+  groupby/group_struct.cu
+  groupby/group_no_requests.cu
+  groupby/group_scan.cu
+  groupby/group_rank_benchmark.cu
 )
 
 # ##################################################################################################
123 changes: 123 additions & 0 deletions cpp/benchmarks/groupby/group_rank_benchmark.cu
@@ -0,0 +1,123 @@
/*
 * Copyright (c) 2022, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include <benchmarks/common/generate_input.hpp>
#include <benchmarks/fixture/benchmark_fixture.hpp>
#include <benchmarks/synchronization/synchronization.hpp>

#include <cudf/groupby.hpp>
#include <cudf/sorting.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>

class Groupby : public cudf::benchmark {
};

template <cudf::rank_method method>
void BM_group_rank(benchmark::State& state)
{
  using namespace cudf;

  const size_type column_size{(size_type)state.range(0)};
  const int num_groups = 100;

  data_profile profile;
  profile.set_null_frequency(std::nullopt);
  profile.set_cardinality(0);
  profile.set_distribution_params<int64_t>(
    cudf::type_to_id<int64_t>(), distribution_id::UNIFORM, 0, num_groups);

  auto source_table = create_random_table(
    {cudf::type_to_id<int64_t>(), cudf::type_to_id<int64_t>()}, row_count{column_size}, profile);

  // TODO values to be sorted too for groupby rank
  // auto sorted_table = cudf::sort(*source_table);

  auto agg = cudf::make_rank_aggregation<groupby_scan_aggregation>(method);
  std::vector<groupby::scan_request> requests;
  requests.emplace_back(groupby::scan_request());
  requests[0].values = source_table->view().column(1);
  requests[0].aggregations.push_back(std::move(agg));

  groupby::groupby gb_obj(
    table_view{{source_table->view().column(0)}}, null_policy::EXCLUDE, sorted::NO);

  for (auto _ : state) {
    cuda_event_timer timer(state, true);
    // groupby scan uses sort implementation
    auto result = gb_obj.scan(requests);
  }
}
//

BENCHMARK_DEFINE_F(Groupby, rank_dense)(::benchmark::State& state)
{
  BM_group_rank<cudf::rank_method::DENSE>(state);
}

BENCHMARK_REGISTER_F(Groupby, rank_dense)
  ->Arg(1'000'000)
  ->Arg(10'000'000)
  ->Arg(100'000'000)
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);

BENCHMARK_DEFINE_F(Groupby, rank_min)(::benchmark::State& state)
{
  BM_group_rank<cudf::rank_method::MIN>(state);
}

BENCHMARK_REGISTER_F(Groupby, rank_min)
  ->Arg(1'000'000)
  ->Arg(10'000'000)
  ->Arg(100'000'000)
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);

BENCHMARK_DEFINE_F(Groupby, rank_max)(::benchmark::State& state)
{
  BM_group_rank<cudf::rank_method::MAX>(state);
}

BENCHMARK_REGISTER_F(Groupby, rank_max)
  ->Arg(1'000'000)
  ->Arg(10'000'000)
  ->Arg(100'000'000)
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);

BENCHMARK_DEFINE_F(Groupby, rank_first)(::benchmark::State& state)
{
  BM_group_rank<cudf::rank_method::FIRST>(state);
}

BENCHMARK_REGISTER_F(Groupby, rank_first)
  ->Arg(1'000'000)
  ->Arg(10'000'000)
  ->Arg(100'000'000)
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);

BENCHMARK_DEFINE_F(Groupby, rank_average)(::benchmark::State& state)
{
  BM_group_rank<cudf::rank_method::AVERAGE>(state);
}

BENCHMARK_REGISTER_F(Groupby, rank_average)
  ->Arg(1'000'000)
  ->Arg(10'000'000)
  ->Arg(100'000'000)
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);
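
For readers unfamiliar with the five methods the benchmark instantiates, here is a small host-side reference sketch of what each one is conventionally taken to compute within a group. It is plain C++ for illustration, not the libcudf implementation. The `method` enum and `group_rank` function are hypothetical names, and the tie-handling definitions follow the usual pandas-style conventions, which is an assumption about the intended semantics. It also assumes rows are already sorted by (key, value), the ordering requirement the TODO in the benchmark refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the five rank methods; not libcudf's enum.
enum class method { FIRST, AVERAGE, MIN, MAX, DENSE };

// Rank `values` within each run of equal `keys` (1-based ranks).
// Assumes rows are already sorted by (key, value) in ascending order.
std::vector<double> group_rank(std::vector<int> const& keys,
                               std::vector<int> const& values,
                               method m)
{
  assert(keys.size() == values.size());
  std::vector<double> out(keys.size());
  std::size_t i = 0;
  while (i < keys.size()) {
    // [i, group_end) is one group of equal keys.
    std::size_t group_end = i;
    while (group_end < keys.size() && keys[group_end] == keys[i]) { ++group_end; }
    std::size_t dense = 0;  // number of distinct values seen so far in the group
    std::size_t j     = i;
    while (j < group_end) {
      // [j, tie_end) is one run of tied values inside the group.
      std::size_t tie_end = j;
      while (tie_end < group_end && values[tie_end] == values[j]) { ++tie_end; }
      ++dense;
      auto const lo = static_cast<double>(j - i + 1);    // position of first tied row
      auto const hi = static_cast<double>(tie_end - i);  // position of last tied row
      for (std::size_t k = j; k < tie_end; ++k) {
        switch (m) {
          case method::FIRST: out[k] = static_cast<double>(k - i + 1); break;
          case method::MIN: out[k] = lo; break;
          case method::MAX: out[k] = hi; break;
          case method::AVERAGE: out[k] = (lo + hi) / 2.0; break;
          case method::DENSE: out[k] = static_cast<double>(dense); break;
        }
      }
      j = tie_end;
    }
    i = group_end;
  }
  return out;
}
```

For keys {0, 0, 0, 0, 1, 1} and values {10, 20, 20, 30, 5, 5}, this sketch gives MIN ranks 1, 2, 2, 4, 1, 1 and DENSE ranks 1, 2, 2, 3, 1, 1. The methods differ only in how tied rows are numbered.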