
JSON tokenizer memory optimizations #16978

Merged: 9 commits, Oct 23, 2024

Conversation

@shrshi (Contributor) commented Oct 2, 2024

Description

The full pushdown automaton that tokenizes the input JSON string, as well as the bracket-brace FST, over-estimates the total buffer size required for the translated output and indices. This PR splits the transduce call for each FST into two invocations: the first invocation estimates the sizes of the translated output buffer and the translated indices, and the second performs the actual DFA run, writing into buffers allocated at exactly those sizes.
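For illustration only, here is a minimal host-side sketch of the two-pass pattern under discussion; the `transduce` helper and its emit callback are hypothetical stand-ins, not the libcudf FST interface. The first pass only counts the outputs, and the second pass writes them into exactly-sized buffers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy transducer: emits one "token" per structural character it sees.
// Stand-in for an FST/DFA transduce pass over the input.
template <typename Emit>
void transduce(std::string const& json_in, Emit&& emit)
{
  for (std::size_t i = 0; i < json_in.size(); ++i) {
    char const c = json_in[i];
    if (c == '{' || c == '}' || c == '[' || c == ']') { emit(c, i); }
  }
}

int main()
{
  std::string const json_in = R"({"a":[1,2,{"b":3}]})";

  // Pass 1: size estimation only -- no output buffers are allocated yet.
  std::size_t num_tokens = 0;
  transduce(json_in, [&](char, std::size_t) { ++num_tokens; });

  // Pass 2: allocate exactly num_tokens entries, then run again to write the output.
  std::vector<char> tokens(num_tokens);
  std::vector<std::size_t> token_indices(num_tokens);
  std::size_t out = 0;
  transduce(json_in, [&](char c, std::size_t i) {
    tokens[out]        = c;
    token_indices[out] = i;
    ++out;
  });
  return 0;
}
```

The trade-off is that the transducer runs twice: the second pass redoes the work of the first in exchange for allocating no more memory than the output actually needs.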

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf (Affects libcudf C++/CUDA code) label on Oct 2, 2024
@shrshi added the cuIO, Performance, improvement, and non-breaking labels on Oct 5, 2024
@shrshi (Contributor, Author) commented Oct 5, 2024

Benchmark used for profiling study: https://github.com/karthikeyann/cudf/blob/enh-profile_memusage_json/wm_benchmark.py

Profiles before and after optimization: [profiler screenshots]

We see that the peak memory usage comes down from 20.8 GiB to 10.3 GiB, and the runtime of get_token_stream also drops from 1.028 s to 825.578 ms.

@shrshi marked this pull request as ready for review on October 5, 2024 01:12
@shrshi requested a review from a team as a code owner on October 5, 2024 01:12
@karthikeyann requested a review from elstehle on October 7, 2024 19:45
@karthikeyann (Contributor) left a comment

I ran the benchmarks. The benchmarks hit an out-of-memory allocation error for integral types:
./benchmarks/JSON_READER_NVBENCH --profile --benchmark json_read_data_type --axis data_type=INTEGRAL
I debugged further and noticed that propagate_writes_scan_bytes is a very large number: 18446744073707336447.
Here are some numbers to debug further:

json_in.size() = 2167967896
stack_ops_bufsize: 4922600
num_stack_ops: 4922600
stack_level_scan_bytes: 11519
stack_level_sort_bytes: 608511
match_level_scan_bytes: 11519
propagate_writes_scan_bytes: 18446744073707336447
num_symbols_in: 4922600
num_symbols_out: 2167967896
total_temp_storage_bytes: 18446744073707336447 (FAILURE is in this allocation)

@karthikeyann (Contributor)
Debugged further: this issue is unrelated to this PR.
The allocation size goes negative because a value larger than 2**31 stored in an int32_t becomes negative.
json_in.size() = 2167967896 is more than 2 GiB (2.019 GiB).
We should have safeguards to prevent inputs larger than 2 GiB from being passed to read_json, given the int32_t limitations in the CUB algorithms.
Created issue #17017.
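As a standalone illustration of the wrap-around described above (a hypothetical snippet, not cudf code): on typical platforms, storing a size above 2**31 in a 32-bit signed integer produces a negative value, and converting that negative value to an unsigned 64-bit size yields a number near 2**64, which is the magnitude of the temp-storage size reported in the failing run.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
  std::size_t const json_in_size = 2167967896;  // > 2 GiB, as in the report above
  // Narrowing to int32_t wraps around on typical two's-complement platforms.
  auto const as_int32  = static_cast<std::int32_t>(json_in_size);
  // Converting the negative value back to an unsigned 64-bit size lands near 2^64.
  auto const as_size_t = static_cast<std::size_t>(as_int32);

  std::printf("int32_t view: %" PRId32 "\n", as_int32);
  std::printf("size_t view : %zu\n", as_size_t);
  return 0;
}
```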

@karthikeyann (Contributor) left a comment

LGTM 👍

@lamarrr (Contributor) left a comment

Looks good to me.

@shrshi (Contributor, Author) commented Oct 11, 2024

> Debugged further: this issue is unrelated to this PR. The allocation size goes negative because a value larger than 2**31 stored in an int32_t becomes negative. json_in.size() = 2167967896 is more than 2 GiB (2.019 GiB). We should have safeguards to prevent inputs larger than 2 GiB from being passed to read_json, given the int32_t limitations in the CUB algorithms. Created issue #17017.

PR #17057 has been opened to resolve this overflow error in the benchmarks. To summarize the fix: batching is enabled only for JSON Lines inputs, not regular JSON, so running the benchmark with regular JSON inputs larger than 2 GiB triggers overflow bugs in the parser.

@elstehle (Contributor)
The changes look good to me. I am just concerned about the potential performance downside of running the FSTs repeatedly and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially in scenarios where the cost of the redundant computation is not offset by the reduced allocations this PR achieves.

For the end-to-end runs on a V100 for JSON_READER_NVBENCH.nested_json_gpu_parser_depth, I get the following performance numbers:


|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|   2^1   |     2^20      |   2.848 ms |      10.80% |   2.964 ms |      10.46% | 115.818 us |   4.07% |   PASS   |
|   2^2   |     2^20      |   2.817 ms |       8.62% |   2.952 ms |       8.29% | 135.483 us |   4.81% |   PASS   |
|   2^3   |     2^20      |   8.325 ms |       4.53% |   8.367 ms |       4.19% |  41.742 us |   0.50% |   PASS   |
|   2^4   |     2^20      |  10.352 ms |       3.47% |  10.475 ms |       3.89% | 122.863 us |   1.19% |   PASS   |
|   2^1   |     2^22      |   3.707 ms |       4.67% |   3.977 ms |       3.27% | 269.364 us |   7.27% |   FAIL   |
|   2^2   |     2^22      |   3.694 ms |       4.71% |   3.982 ms |       4.77% | 288.939 us |   7.82% |   FAIL   |
|   2^3   |     2^22      |   9.137 ms |       2.21% |   9.411 ms |       2.80% | 273.779 us |   3.00% |   FAIL   |
|   2^4   |     2^22      |  11.413 ms |       2.87% |  11.683 ms |       2.62% | 269.444 us |   2.36% |   PASS   |
|   2^1   |     2^24      |   7.800 ms |       2.31% |   8.490 ms |       0.46% | 689.604 us |   8.84% |   FAIL   |
|   2^2   |     2^24      |   7.805 ms |       2.38% |   8.512 ms |       2.10% | 707.728 us |   9.07% |   FAIL   |
|   2^3   |     2^24      |  12.846 ms |       1.73% |  13.629 ms |       1.94% | 782.743 us |   6.09% |   FAIL   |
|   2^4   |     2^24      |  16.278 ms |       1.88% |  17.111 ms |       1.46% | 832.984 us |   5.12% |   FAIL   |
|   2^1   |     2^26      |  23.387 ms |       0.31% |  25.726 ms |       0.33% |   2.339 ms |  10.00% |   FAIL   |
|   2^2   |     2^26      |  23.339 ms |       0.41% |  25.718 ms |       0.20% |   2.380 ms |  10.20% |   FAIL   |
|   2^3   |     2^26      |  28.909 ms |       0.89% |  31.439 ms |       1.14% |   2.530 ms |   8.75% |   FAIL   |
|   2^4   |     2^26      |  37.511 ms |       1.30% |  39.965 ms |       0.45% |   2.454 ms |   6.54% |   FAIL   |
|   2^1   |     2^28      |  85.176 ms |       0.08% |  94.190 ms |       0.09% |   9.014 ms |  10.58% |   FAIL   |
|   2^2   |     2^28      |  85.214 ms |       0.50% |  94.241 ms |       0.06% |   9.027 ms |  10.59% |   FAIL   |
|   2^3   |     2^28      | 110.835 ms |       0.22% | 119.999 ms |       0.49% |   9.164 ms |   8.27% |   FAIL   |
|   2^4   |     2^28      | 142.120 ms |       0.25% | 151.918 ms |       0.41% |   9.798 ms |   6.89% |   FAIL   |

So, for larger inputs, we see an end-to-end performance regression in the ballpark of 10%. I just want to make sure that reducing the memory footprint is pressing enough for us to accept this performance penalty for the time being.

@lamarrr self-requested a review on October 14, 2024 06:00
@karthikeyann (Contributor)
> The changes look good to me. I am just concerned about the potential performance downside of running the FSTs repeatedly and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially in scenarios where the cost of the redundant computation is not offset by the reduced allocations this PR achieves.
>
> So, for larger inputs, we see an end-to-end performance regression in the ballpark of 10%. I just want to make sure that reducing the memory footprint is pressing enough for us to accept this performance penalty for the time being.

@elstehle I created issue #17114 for computing the total FST output size without redundant computation. Assigned it to you.

@shrshi (Contributor, Author) commented Oct 22, 2024

> The changes look good to me. I am just concerned about the potential performance downside of running the FSTs repeatedly and redundantly computing some values [...]
>
> So, for larger inputs, we see an end-to-end performance regression in the ballpark of 10%. I just want to make sure that reducing the memory footprint is pressing enough for us to accept this performance penalty for the time being.

From offline discussions, the current plan is to accept the end-to-end performance regression in favor of the reduction in memory footprint. The next step is to introduce a multi-stage FST implementation (#17114) that separates the transduced buffer size estimation from the final DFA run, removing the performance downside of redundantly running the FSTs.

@shrshi (Contributor, Author) commented Oct 23, 2024

/merge

@rapids-bot (bot) merged commit cff1296 into rapidsai:branch-24.12 on Oct 23, 2024
102 checks passed