Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve degenerate performance in create_structs_data #14761

Merged
merged 8 commits into from
Jan 18, 2024

Conversation

SurajAralihalli
Copy link
Contributor

@SurajAralihalli SurajAralihalli commented Jan 16, 2024

Resolves issue #14716

  • Eliminated unnecessary recursive self-calls in the superimpose_nulls_no_sanitize function, addressing performance issues in make_structs_column.
  • Introduced STRUCT_CREATION_NVBENCH to assess the performance of the create_structs_data function.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Jan 16, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 16, 2024
@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Jan 16, 2024

Example structure definition: STRUCT<STRUCT<INT32>> : [ {{Null}}, {Null}, Null, {{3}} ]

In the code:

P {
  type = STRUCT
  null_mask = [1, 1, 0, 1]
  null_count = 1
  children =
    C_1 {
      type = STRUCT
      null_mask = [1, 0, 0, 1]
      null_count = 2
      children =
        C_2 {
          type = INT32
          data =       [X, X, X, 3]
          null_mask  = [0, 0, 0, 1]
          null_count = 3
        }
    }
}

The make_structs_column function applies the null mask of the parent (P) to its immediate children (C_1) at line

for (auto& child : child_columns) {
.
The superimpose_nulls_no_sanitize function recursively applies a correct null mask to all nested children (C_2) at line
std::for_each(content.children.begin(),
.
This function uses make_structs_column to reconstruct the parent column (P) which applies C_1's null mask to its children (C_2) at line
return cudf::make_structs_column(num_rows,
However, C_2's null_mask is already correct.

I believe this redundancy can be eliminated by removing the code at line 266.

@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Jan 16, 2024

Before:
Screenshot 2024-01-15 at 10 59 16 PM

After:
Screenshot 2024-01-15 at 11 01 23 PM

Thank you @karthikeyann for sharing scope based NVTX ranges #11864 (comment). This utility greatly helped in debugging.

@SurajAralihalli SurajAralihalli marked this pull request as ready for review January 16, 2024 11:24
@SurajAralihalli SurajAralihalli requested a review from a team as a code owner January 16, 2024 11:24
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 16, 2024
@davidwendt
Copy link
Contributor

/ok to test

@davidwendt
Copy link
Contributor

Is there a benchmark that could be run to show the performance improvement here?

@SurajAralihalli
Copy link
Contributor Author

Is there a benchmark that could be run to show the performance improvement here?

I ran the rank_struct benchmark described by @GregoryKimball in #14716 (comment). I added the NVTX regions to superimpose_nulls_no_sanitize to compute the following results.

Before:
Screenshot 2024-01-16 at 10 46 26 AM

After:
Screenshot 2024-01-16 at 10 42 30 AM

@ttnghia
Copy link
Contributor

ttnghia commented Jan 16, 2024

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 16, 2024

How come there were 130k calls to superimpose_nulls_no_sanitize? It's more than even quadratic redundancy (not a thing, but hopefully you know what I meant :)) would cause.

I think this change should include a new benchmark.

Signed-off-by: Suraj Aralihalli <[email protected]>
@github-actions github-actions bot added the CMake CMake build issue label Jan 16, 2024
@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Jan 17, 2024

How come there were 130k calls to superimpose_nulls_no_sanitize? It's more than even quadratic redundancy (not a thing, but hopefully you know what I meant :)) would cause.

I think this change should include a new benchmark.

Absolutely @vuule! It would be interesting to derive the relationship between calls to make_structs_column and superimpose_nulls_no_sanitize in terms of depth. Also, I've created create_structs benchmark.

./STRUCT_CREATION_NVBENCH -b create_structs --axis NumRows=[10] --axis Depth=[16] --axis Nulls=[0]

Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Jan 17, 2024

Comparing benchmarking results:

## [0] NVIDIA RTX A5000

|  NumRows  |  Depth  |  Nulls  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |            Diff |   %Diff |  Status  |
|-----------|---------|---------|------------|-------------|------------|-------------|-----------------|---------|----------|
|   2^10    |    1    |    0    |  42.087 us |       7.79% |  39.435 us |       6.45% |       -2.651 us |  -6.30% |   PASS   |
|   2^18    |    1    |    0    |   7.681 ms |       3.37% |   7.582 ms |       2.55% |      -98.575 us |  -1.28% |   PASS   |
|   2^26    |    1    |    0    |    2.346 s |       0.23% |    2.357 s |       0.55% |       11.391 ms |   0.49% |   FAIL   |
|   2^10    |    8    |    0    |   8.984 ms |       1.35% | 824.409 us |       2.86% |    -8159.408 us | -90.82% |   FAIL   |
|   2^18    |    8    |    0    |  47.787 ms |       1.30% |  38.508 ms |       1.60% |    -9278.895 us | -19.42% |   FAIL   |
|   2^26    |    8    |    0    |   10.222 s |        inf% |   10.222 s |        inf% |       29.785 us |   0.00% |   PASS   |
|   2^10    |   16    |    0    |    2.455 s |       0.58% |   3.206 ms |       1.87% | -2452204.726 us | -99.87% |   FAIL   |
|   2^18    |   16    |    0    |    2.779 s |       0.23% |  75.308 ms |       1.82% | -2703843.785 us | -97.29% |   FAIL   |
|   2^26    |   16    |    0    |   26.935 s |        inf% |   19.043 s |        inf% | -7891943.359 us | -29.30% |   PASS   |
|   2^10    |    1    |    1    |  56.118 us |       5.24% |  56.982 us |       6.84% |        0.864 us |   1.54% |   PASS   |
|   2^18    |    1    |    1    |   7.839 ms |       1.51% |   7.985 ms |       3.48% |      146.030 us |   1.86% |   FAIL   |
|   2^26    |    1    |    1    |    2.447 s |       1.11% |    2.434 s |       0.48% |   -12658.447 us |  -0.52% |   FAIL   |
|   2^10    |    8    |    1    |   9.043 ms |       1.18% | 837.039 us |       2.92% |    -8206.161 us | -90.74% |   FAIL   |
|   2^18    |    8    |    1    |  48.371 ms |       1.75% |  38.803 ms |       1.66% |    -9568.242 us | -19.78% |   FAIL   |
|   2^26    |    8    |    1    |   10.312 s |        inf% |   10.285 s |        inf% |   -26727.539 us |  -0.26% |   PASS   |
|   2^10    |   16    |    1    |    2.411 s |       0.27% |   3.235 ms |       1.79% | -2407859.781 us | -99.87% |   FAIL   |
|   2^18    |   16    |    1    |    2.762 s |       0.40% |  75.156 ms |       0.44% | -2686969.612 us | -97.28% |   FAIL   |
|   2^26    |   16    |    1    |   27.152 s |        inf% |   19.126 s |        inf% | -8025222.656 us | -29.56% |   PASS   |

@ttnghia
Copy link
Contributor

ttnghia commented Jan 17, 2024

/ok to test

cpp/benchmarks/CMakeLists.txt Outdated Show resolved Hide resolved
@SurajAralihalli
Copy link
Contributor Author

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 17, 2024

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 66c3e8e into rapidsai:branch-24.02 Jan 18, 2024
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants