Rework logic in cudf::strings::split_record to improve performance (#12729)

Updates the `cudf::strings::split_record` logic to match the more optimized code in `cudf::strings::split`.
The optimized code performs much better for longer strings (>64 bytes) by parallelizing over the character bytes to find delimiters before determining split tokens.
This led to refactoring the code so that both APIs can share the optimized code.
Also fixes a bug found when using overlapped delimiters.
Additional tests were added for multi-byte delimiters which can overlap and span multiple adjacent strings.
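
The two-phase idea described above (find delimiter positions per byte first, then resolve them into tokens) can be illustrated with a minimal host-side sketch. This is not the libcudf implementation: the function names `find_delimiter_positions` and `split_tokens` are hypothetical, the code handles a single string rather than a column, it runs serially on the CPU, and the rule used for overlapping matches (keep the leftmost, skip any match that starts inside it) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Phase 1 (hypothetical helper): inspect every byte position independently
// and record where the delimiter matches. Each iteration depends only on its
// own position, which is what makes this step suitable for one thread per byte.
std::vector<std::size_t> find_delimiter_positions(std::string_view chars, std::string_view delim)
{
  std::vector<std::size_t> positions;
  if (delim.empty() || chars.size() < delim.size()) { return positions; }
  for (std::size_t i = 0; i + delim.size() <= chars.size(); ++i) {
    if (chars.compare(i, delim.size(), delim) == 0) { positions.push_back(i); }
  }
  return positions;
}

// Phase 2 (hypothetical helper): walk the candidate positions in order and
// turn them into tokens. A candidate that starts inside the previously
// accepted delimiter is skipped (assumed rule for overlapping matches,
// e.g. "aa" found at offsets 0 and 1 of "aaa").
std::vector<std::string> split_tokens(std::string_view chars, std::string_view delim)
{
  std::vector<std::string> tokens;
  std::size_t token_begin = 0;
  for (std::size_t pos : find_delimiter_positions(chars, delim)) {
    if (pos < token_begin) { continue; }  // overlaps the last accepted delimiter
    tokens.emplace_back(chars.substr(token_begin, pos - token_begin));
    token_begin = pos + delim.size();
  }
  tokens.emplace_back(chars.substr(token_begin));  // trailing token
  return tokens;
}
```

In this sketch the first phase reports both overlapping candidates and the second phase decides which ones count, so the byte scan itself needs no coordination between positions.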

Closes #12694

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec

URL: #12729
davidwendt authored Feb 21, 2023
1 parent c2f0161 commit 7da233b
Showing 5 changed files with 565 additions and 546 deletions.
14 changes: 7 additions & 7 deletions cpp/benchmarks/string/split.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2021-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -57,12 +57,12 @@ static void BM_split(benchmark::State& state, split_type rt)
 
 static void generate_bench_args(benchmark::internal::Benchmark* b)
 {
-  int const min_rows = 1 << 12;
-  int const max_rows = 1 << 24;
-  int const row_mult = 8;
-  int const min_rowlen = 1 << 5;
-  int const max_rowlen = 1 << 13;
-  int const len_mult = 4;
+  int constexpr min_rows = 1 << 12;
+  int constexpr max_rows = 1 << 24;
+  int constexpr row_mult = 8;
+  int constexpr min_rowlen = 1 << 5;
+  int constexpr max_rowlen = 1 << 13;
+  int constexpr len_mult = 2;
   for (int row_count = min_rows; row_count <= max_rows; row_count *= row_mult) {
     for (int rowlen = min_rowlen; rowlen <= max_rowlen; rowlen *= len_mult) {
       // avoid generating combinations that exceed the cudf column limit
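
The diff is cut off before the loop body, so the following is only a standalone sketch of how such a loop might enumerate benchmark arguments. The helper name `enumerate_bench_args` is hypothetical, and the skip condition is an assumption: it treats the column limit as the maximum of a 32-bit signed integer, on the basis that `cudf::size_type` is a 32-bit signed type. The constants are the updated values from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper: collects (row_count, rowlen) pairs instead of
// registering them with the benchmark framework.
std::vector<std::pair<int, int>> enumerate_bench_args()
{
  int constexpr min_rows   = 1 << 12;
  int constexpr max_rows   = 1 << 24;
  int constexpr row_mult   = 8;
  int constexpr min_rowlen = 1 << 5;
  int constexpr max_rowlen = 1 << 13;
  int constexpr len_mult   = 2;

  // Assumed limit: total character bytes must fit in a 32-bit signed size type.
  auto constexpr max_bytes = static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::max());

  std::vector<std::pair<int, int>> args;
  for (int row_count = min_rows; row_count <= max_rows; row_count *= row_mult) {
    for (int rowlen = min_rowlen; rowlen <= max_rowlen; rowlen *= len_mult) {
      // Widen before multiplying so the product cannot overflow int.
      if (static_cast<std::int64_t>(row_count) * rowlen >= max_bytes) { continue; }
      args.emplace_back(row_count, rowlen);
    }
  }
  return args;
}
```

Halving `len_mult` from 4 to 2 makes the inner loop visit every power of two between `min_rowlen` and `max_rowlen`, so the benchmark samples row lengths more densely around the 64-byte threshold mentioned in the commit message.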