Improve performance for cudf::strings::count_characters for long strings #12779

davidwendt · 2023-02-15T01:14:54Z

Description

Adds more efficient counting algorithm specifically for columns with long strings--greater than 64 bytes on average.
The internal detail method will be used to help improve performance in other strings functions.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2023-02-15T01:17:28Z

Benchmark results show most improvement for string widths greater 64 bytes on average.

|   rows   | width|   Ref Time |    Cmp Time |          Diff |   %Diff |
|----------|------|------------|-------------|---------------|---------|
|     4096 |   32 |  37.670 us |   37.532 us |     -0.138 us |  -0.37% |
|    32768 |   32 |  38.356 us |   38.225 us |     -0.131 us |  -0.34% |
|   262144 |   32 |  49.001 us |   48.659 us |     -0.342 us |  -0.70% |
|  2097152 |   32 | 332.821 us |  333.535 us |      0.714 us |   0.21% |
| 16777216 |   32 |   2.418 ms |    2.434 ms |     15.780 us |   0.65% |
|     4096 |   64 |  38.041 us |   38.462 us |      0.421 us |   1.11% |
|    32768 |   64 |  39.032 us |   39.507 us |      0.475 us |   1.22% |
|   262144 |   64 |  56.326 us |   56.501 us |      0.175 us |   0.31% |
|  2097152 |   64 | 381.756 us |  380.597 us |     -1.159 us |  -0.30% |
| 16777216 |   64 |   2.808 ms |    2.789 ms |    -19.355 us |  -0.69% |
|     4096 |  128 |  41.175 us |   41.297 us |      0.122 us |   0.30% |
|    32768 |  128 |  45.130 us |   45.286 us |      0.156 us |   0.35% |
|   262144 |  128 |  77.200 us |   77.078 us |     -0.122 us |  -0.16% |
|  2097152 |  128 | 497.075 us |  492.847 us |     -4.228 us |  -0.85% |
|     4096 |  256 |  49.684 us |   37.710 us |    -11.974 us | -24.10% |
|    32768 |  256 |  53.379 us |   49.197 us |     -4.182 us |  -7.83% |
|   262144 |  256 | 152.252 us |  139.544 us |    -12.709 us |  -8.35% |
|  2097152 |  256 | 963.613 us |    1.090 ms |    126.674 us |  13.15% |
|     4096 |  512 |  69.779 us |   41.161 us |    -28.618 us | -41.01% |
|    32768 |  512 |  75.668 us |   55.073 us |    -20.595 us | -27.22% |
|   262144 |  512 | 405.729 us |  166.213 us |   -239.516 us | -59.03% |
|  2097152 |  512 |   2.551 ms |    1.297 ms |  -1254.632 us | -49.18% |
|     4096 | 1024 | 106.060 us |   42.641 us |    -63.418 us | -59.79% |
|    32768 | 1024 | 127.396 us |   68.891 us |    -58.505 us | -45.92% |
|   262144 | 1024 |   1.223 ms |  237.241 us |   -986.090 us | -80.61% |
|     4096 | 2048 | 172.035 us |   46.194 us |   -125.841 us | -73.15% |
|    32768 | 2048 | 228.790 us |   94.118 us |   -134.671 us | -58.86% |
|   262144 | 2048 |   6.446 ms |  422.672 us |  -6023.192 us | -93.44% |
|     4096 | 4096 | 309.562 us |   52.891 us |   -256.671 us | -82.91% |
|    32768 | 4096 | 425.600 us |  138.327 us |   -287.273 us | -67.50% |
|   262144 | 4096 |  14.901 ms |  797.709 us | -14103.759 us | -94.65% |

ttnghia · 2023-02-17T23:38:41Z

Will the performance with strings shorter than 64 bytes be impacted?

davidwendt · 2023-02-18T00:13:08Z

Will the performance with strings shorter than 64 bytes be impacted?

No. The optimized code did not perform as well as the original code under 64 bytes so the solution implemented here is to include both code paths.

cpp/src/strings/attributes.cu

…o count-string-lengths

cpp/src/strings/attributes.cu

bdice · 2023-02-24T16:57:45Z

cpp/tests/strings/attrs_tests.cpp

+  auto strings_view = cudf::strings_column_view(strings);
+
+  auto results = cudf::strings::count_characters(strings_view);
+  std::vector<int32_t> h_expected(h_strings.size(), 64);


I expected that these tests should all be using cudf::size_type for the counts, but I also see that count_characters is defined to return an INT32 in its docstring. That seems undesirable, right? Shouldn't we be using cudf::size_type instead of hardcoding int32_t for the function return value and test comparisons?

Since there is no SIZE_TYPE column type, I think it is more correct to use the appropriate data type.

If changing the definition of size type would affect the correctness or consistency of this code, we should use size type.

I disagree but this seems out of scope for this PR.
There are at least 6 APIs that do this today.
Perhaps you can create a separate issue if you want to reopen the discussion.

We document what data-type is used for the output of the column of these APIs.
I'm not comfortable saying the column type is whatever size-type equates to at the moment.

Discussed a bit offline. The latest documented status is #3958 (comment) but from conversation since that was posted, @GregoryKimball seemed interested in moving towards option 3. I won't hold this PR up on that account.

…o count-string-lengths

davidwendt · 2023-02-27T18:46:42Z

/merge

Adds section to developer guide about `cudf::size_type` and adds links to it from other relevant parts of the document. The fundamental nature of this type seems important enough to mention in the developer guide since it is the basis for how much of the code is designed and implemented. Also updates some doxygen for public APIs that are return `size_type` column values but had cited `INT32` specifically. Reference: #12779 (comment) Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Lawrence Mitchell (https://github.com/wence-) - Yunsong Wang (https://github.com/PointKernel) - Karthikeyan (https://github.com/karthikeyann) URL: #12904

Improve performance for cudf::strings::count_characters for long strings

d0542d0

davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 15, 2023

davidwendt self-assigned this Feb 15, 2023

github-actions bot added the CMake CMake build issue label Feb 15, 2023

davidwendt and others added 5 commits February 15, 2023 17:29

Merge branch 'branch-23.04' into count-string-lengths

782ead7

Merge branch 'branch-23.04' into count-string-lengths

d097b2e

Merge branch 'branch-23.04' into count-string-lengths

d6b04f2

refactor lambda to functor

4e5030d

Merge branch 'branch-23.04' into count-string-lengths

3cf9dd1

davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 17, 2023

davidwendt marked this pull request as ready for review February 17, 2023 00:04

davidwendt requested a review from a team as a code owner February 17, 2023 00:04

davidwendt requested review from robertmaynard and nvdbaranec February 17, 2023 00:04

davidwendt marked this pull request as draft February 19, 2023 00:39

davidwendt added 3 commits February 21, 2023 11:33

Merge branch 'branch-23.04' into count-string-lengths

8e2f8d9

change functor to kernel for shared-memory usage

2d6a2ba

Merge branch 'branch-23.04' into count-string-lengths

219563e

davidwendt marked this pull request as ready for review February 22, 2023 00:10

davidwendt added 3 commits February 22, 2023 16:14

Merge branch 'branch-23.04' into count-string-lengths

4dc1cd4

Merge branch 'branch-23.04' into count-string-lengths

390242f

fix global-writes calc

3a70d09

davidwendt added 3 commits February 23, 2023 09:27

fix const vars

bb78362

remove unneeded commented out line

d7562cd

fix style violation

440fe8b

ttnghia reviewed Feb 23, 2023

View reviewed changes

cpp/src/strings/attributes.cu Outdated Show resolved Hide resolved

davidwendt added 3 commits February 24, 2023 07:42

Merge branch 'count-string-lengths' of github.com:davidwendt/cudf int…

bcb4cc0

…o count-string-lengths

Merge branch 'branch-23.04' into count-string-lengths

85a3594

add more text is comment for set_null_count

bd1c7d9

ttnghia approved these changes Feb 24, 2023

View reviewed changes

bdice requested changes Feb 24, 2023

View reviewed changes

Remove extra space.

e110492

bdice approved these changes Feb 24, 2023

View reviewed changes

davidwendt added 3 commits February 24, 2023 15:04

remove extra space from comment

3f270ba

Merge branch 'branch-23.04' into count-string-lengths

9f997e9

Merge branch 'count-string-lengths' of github.com:davidwendt/cudf int…

e30e067

…o count-string-lengths

rapids-bot bot merged commit 2025783 into rapidsai:branch-23.04 Feb 27, 2023

davidwendt deleted the count-string-lengths branch February 27, 2023 18:46

davidwendt mentioned this pull request Mar 8, 2023

Update libcudf counting functions to specify cudf::size_type #12904

Merged

3 tasks

GregoryKimball mentioned this pull request Apr 3, 2023

[FEA] Story - Improve performance with long strings #13048

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance for cudf::strings::count_characters for long strings #12779

Improve performance for cudf::strings::count_characters for long strings #12779

davidwendt commented Feb 15, 2023

davidwendt commented Feb 15, 2023

ttnghia commented Feb 17, 2023

davidwendt commented Feb 18, 2023

bdice Feb 24, 2023

davidwendt Feb 24, 2023

bdice Feb 24, 2023

davidwendt Feb 24, 2023

bdice Feb 24, 2023

davidwendt commented Feb 27, 2023

Improve performance for cudf::strings::count_characters for long strings #12779

Improve performance for cudf::strings::count_characters for long strings #12779

Conversation

davidwendt commented Feb 15, 2023

Description

Checklist

davidwendt commented Feb 15, 2023

ttnghia commented Feb 17, 2023

davidwendt commented Feb 18, 2023

bdice Feb 24, 2023

Choose a reason for hiding this comment

davidwendt Feb 24, 2023

Choose a reason for hiding this comment

bdice Feb 24, 2023

Choose a reason for hiding this comment

davidwendt Feb 24, 2023

Choose a reason for hiding this comment

bdice Feb 24, 2023

Choose a reason for hiding this comment

davidwendt commented Feb 27, 2023