Refactors JSON reader's pushdown automaton #13716

elstehle · 2023-07-18T11:14:48Z

Description

This PR simplifies and cleans up the JSON reader's pushdown automaton.

The pushdown automaton takes as input two arrays:

- The JSON's input characters
- The stack context for each character (`{` - JSON object, `[` - JSON array, `_` - root of JSON)

Previously, we were fusing the two arrays and materializing them straight to the symbol group id for each combination. A symbol group id serves as the column of the transition table. The symbol group ids array was then used as input to the finite state transducer (FST).
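For illustration, the previous approach can be sketched on the host in plain C++. This is a minimal sketch under stated assumptions, not the reader's actual code: the symbol groups, `to_symbol_group`, and `materialize_symbol_group_ids` are hypothetical names, and the real implementation runs on the GPU with a different set of symbol groups.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical symbol groups; each id is a column index of the transition table.
enum SymbolGroup : std::uint8_t {
  OPEN_BRACKET,
  CLOSE_BRACKET,
  COMMA_IN_OBJECT,
  COMMA_IN_ARRAY,
  OTHER,
  NUM_SYMBOL_GROUPS
};

// Maps one (input character, stack context) combination to its symbol group id.
constexpr std::uint8_t to_symbol_group(char symbol, char stack_context)
{
  switch (symbol) {
    case '{':
    case '[': return OPEN_BRACKET;
    case '}':
    case ']': return CLOSE_BRACKET;
    case ',':
      if (stack_context == '{') return COMMA_IN_OBJECT;
      if (stack_context == '[') return COMMA_IN_ARRAY;
      return OTHER;
    default: return OTHER;
  }
}

// The extra pass: fuse both arrays into one materialized array of symbol group ids,
// which would then be handed to the transducer as its input.
std::vector<std::uint8_t> materialize_symbol_group_ids(std::string_view json_input,
                                                       std::string_view stack_context)
{
  assert(json_input.size() == stack_context.size());
  std::vector<std::uint8_t> ids(json_input.size());
  for (std::size_t i = 0; i < ids.size(); ++i) {
    ids[i] = to_symbol_group(json_input[i], stack_context[i]);
  }
  return ids;
}
```

The intermediate `ids` array is as long as the input, so this design costs one additional read and write of the whole input before the transducer even starts.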

After the recent refactor of the FST lookup tables, the FST has become more flexible. It now supports arbitrary iterators and the symbol group id lookup table (that maps a symbol to a symbol group id) can now be implemented by a simple function object.

This PR takes advantage of the FST's ability to take fancy iterators. We now zip the json_input and stack_context symbols and pass that zip_iterator to the FST.
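The zipped variant can be sketched the same way. This is a host-side illustration only: `ZippedInput` is a hand-written stand-in for a zip iterator (it is not Thrust's `zip_iterator`), and `SymbolGroupLookup`, `zip`, `count_symbol_groups`, and the symbol groups themselves are hypothetical names chosen for the example. The loop merely counts symbol groups where a real transducer would index its transition table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>

// Hypothetical symbol groups; each id is a column index of the transition table.
enum SymbolGroup : std::uint8_t {
  OPEN_BRACKET,
  CLOSE_BRACKET,
  COMMA_IN_OBJECT,
  COMMA_IN_ARRAY,
  OTHER,
  NUM_SYMBOL_GROUPS
};

// Function object playing the role of the symbol group id lookup table:
// it maps a zipped (character, stack context) symbol to a symbol group id.
struct SymbolGroupLookup {
  constexpr std::uint8_t operator()(std::pair<char, char> symbol) const
  {
    char const character     = symbol.first;
    char const stack_context = symbol.second;
    switch (character) {
      case '{':
      case '[': return OPEN_BRACKET;
      case '}':
      case ']': return CLOSE_BRACKET;
      case ',':
        if (stack_context == '{') return COMMA_IN_OBJECT;
        if (stack_context == '[') return COMMA_IN_ARRAY;
        return OTHER;
      default: return OTHER;
    }
  }
};

// Minimal stand-in for a zip iterator: pairs are formed lazily on access,
// so no fused array is ever materialized.
struct ZippedInput {
  std::string_view json_input;
  std::string_view stack_context;
  std::size_t size() const { return json_input.size(); }
  std::pair<char, char> operator[](std::size_t i) const
  {
    return {json_input[i], stack_context[i]};
  }
};

ZippedInput zip(std::string_view json_input, std::string_view stack_context)
{
  assert(json_input.size() == stack_context.size());
  return ZippedInput{json_input, stack_context};
}

// Toy consumer: looks up the symbol group at the moment each symbol is read.
template <typename Input, typename Lookup>
std::array<std::size_t, NUM_SYMBOL_GROUPS> count_symbol_groups(Input const& input, Lookup lookup)
{
  std::array<std::size_t, NUM_SYMBOL_GROUPS> counts{};
  for (std::size_t i = 0; i < input.size(); ++i) {
    ++counts[lookup(input[i])];
  }
  return counts;
}
```

Because the lookup happens inside the consumer's loop, the two input arrays are read once and the separate pass that wrote the fused array disappears.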

Checklist

- I am familiar with the Contributing Guidelines.
- New or existing tests cover these changes.
- The documentation is up to date with these changes.


elstehle · 2023-07-20T05:54:26Z

Some perf numbers on a V100 for end-to-end JSON reading. Overall, there are slight improvements (up to 4.63%) because we save an extra pass over the data; the largest regression is 1.83%, for the STRING data type.


|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|     2^20      |   5.213 ms |       8.36% |   5.045 ms |       6.48% |  -167.466 us |  -3.21% |   PASS   |
|     2^21      |   5.260 ms |       5.46% |   5.188 ms |       3.96% |   -71.470 us |  -1.36% |   PASS   |
|     2^22      |   5.702 ms |       5.85% |   5.493 ms |       5.02% |  -208.683 us |  -3.66% |   PASS   |
|     2^23      |   6.530 ms |       3.48% |   6.465 ms |       3.78% |   -64.807 us |  -0.99% |   PASS   |
|     2^24      |   8.673 ms |       2.72% |   8.511 ms |       1.99% |  -161.422 us |  -1.86% |   PASS   |
|     2^25      |  12.917 ms |       2.90% |  12.989 ms |       2.50% |    72.088 us |   0.56% |   PASS   |
|     2^26      |  21.124 ms |       1.43% |  20.701 ms |       1.86% |  -422.552 us |  -2.00% |   FAIL   |
|     2^27      |  37.776 ms |       1.49% |  38.016 ms |       1.44% |   239.457 us |   0.63% |   PASS   |
|     2^28      |  71.597 ms |       1.32% |  68.284 ms |       1.43% | -3312.907 us |  -4.63% |   FAIL   |
|     2^29      | 133.911 ms |       1.26% | 130.880 ms |       1.29% | -3030.951 us |  -2.26% |   FAIL   |
|     2^30      | 263.870 ms |       1.02% | 257.406 ms |       1.36% | -6463.997 us |  -2.45% |   FAIL   |

# nested_json_gpu_parser_depth

## [0] Tesla V100-SXM2-32GB

|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   2^1   |     2^20      |   5.182 ms |       6.49% |   5.114 ms |       6.36% |   -67.564 us |  -1.30% |   PASS   |
|   2^2   |     2^20      |   5.298 ms |       3.67% |   5.165 ms |       2.78% |  -132.755 us |  -2.51% |   PASS   |
|   2^3   |     2^20      |  11.410 ms |       3.39% |  11.234 ms |       2.38% |  -176.428 us |  -1.55% |   PASS   |
|   2^4   |     2^20      |  13.756 ms |       3.69% |  13.416 ms |       0.43% |  -339.588 us |  -2.47% |   FAIL   |
|   2^1   |     2^22      |   6.144 ms |       4.83% |   6.040 ms |       2.64% |  -104.223 us |  -1.70% |   PASS   |
|   2^2   |     2^22      |   6.254 ms |       4.29% |   6.049 ms |       2.16% |  -204.954 us |  -3.28% |   FAIL   |
|   2^3   |     2^22      |  12.294 ms |       3.46% |  11.828 ms |       0.39% |  -465.820 us |  -3.79% |   FAIL   |
|   2^4   |     2^22      |  14.414 ms |       2.45% |  14.240 ms |       0.50% |  -173.966 us |  -1.21% |   FAIL   |
|   2^1   |     2^24      |  10.711 ms |       2.63% |  10.505 ms |       1.46% |  -206.131 us |  -1.92% |   FAIL   |
|   2^2   |     2^24      |  10.723 ms |       2.41% |  10.551 ms |       1.97% |  -171.670 us |  -1.60% |   PASS   |
|   2^3   |     2^24      |  16.042 ms |       2.27% |  15.799 ms |       0.33% |  -243.033 us |  -1.52% |   FAIL   |
|   2^4   |     2^24      |  19.984 ms |       2.58% |  19.491 ms |       0.46% |  -492.768 us |  -2.47% |   FAIL   |
|   2^1   |     2^26      |  27.660 ms |       1.70% |  27.320 ms |       0.86% |  -339.988 us |  -1.23% |   FAIL   |
|   2^2   |     2^26      |  27.618 ms |       0.91% |  27.308 ms |       0.49% |  -310.610 us |  -1.12% |   FAIL   |
|   2^3   |     2^26      |  34.423 ms |       0.76% |  34.234 ms |       0.32% |  -189.112 us |  -0.55% |   FAIL   |
|   2^4   |     2^26      |  43.460 ms |       0.92% |  43.217 ms |       0.83% |  -243.384 us |  -0.56% |   PASS   |
|   2^1   |     2^28      |  95.548 ms |       0.59% |  94.341 ms |       0.60% | -1206.699 us |  -1.26% |   FAIL   |
|   2^2   |     2^28      |  95.691 ms |       0.67% |  94.230 ms |       0.39% | -1461.410 us |  -1.53% |   FAIL   |
|   2^3   |     2^28      | 125.592 ms |       0.31% | 125.966 ms |       0.33% |   374.654 us |   0.30% |   PASS   |
|   2^4   |     2^28      | 160.581 ms |       0.26% | 158.317 ms |       0.50% | -2263.586 us |  -1.41% |   FAIL   |
|   2^1   |     2^30      | 370.073 ms |       0.45% | 367.691 ms |       0.34% | -2381.911 us |  -0.64% |   FAIL   |
|   2^2   |     2^30      | 369.708 ms |       0.49% | 367.838 ms |       0.32% | -1870.769 us |  -0.51% |   FAIL   |
|   2^3   |     2^30      | 482.927 ms |       0.09% | 479.960 ms |       0.11% | -2967.504 us |  -0.61% |   FAIL   |
|   2^4   |     2^30      | 605.768 ms |       0.11% | 602.818 ms |       0.10% | -2949.862 us |  -0.49% |   FAIL   |

# json_read_data_type

## [0] Tesla V100-SXM2-32GB

|  data_type  |      io       |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|    FLOAT    | DEVICE_BUFFER | 711.923 ms |       0.05% | 707.953 ms |       0.06% | -3970.032 us |  -0.56% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER | 836.850 ms |       0.13% | 834.176 ms |       0.05% | -2673.364 us |  -0.32% |   FAIL   |
|   STRING    | DEVICE_BUFFER | 317.871 ms |       0.14% | 323.675 ms |       0.05% |     5.805 ms |   1.83% |   FAIL   |
|    LIST     | DEVICE_BUFFER | 228.696 ms |       0.06% | 228.660 ms |       0.06% |   -35.715 us |  -0.02% |   PASS   |
|   STRUCT    | DEVICE_BUFFER | 889.568 ms |       0.15% | 882.394 ms |       0.06% | -7174.182 us |  -0.81% |   FAIL   |
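As a reading aid for the tables above, the derived columns can be reproduced with a few lines of arithmetic. `Diff` is `Cmp Time - Ref Time` and `%Diff` is `Diff / Ref Time`. The `Status` rule in this sketch is an inference from the rows shown here (every FAIL row has a `|%Diff|` larger than the smaller of the two noise values, every PASS row does not); it is not taken from the benchmark tool's documentation. Under that reading, FAIL flags a difference that exceeds the noise, including improvements, rather than a broken benchmark. The names `Comparison` and `compare` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Comparison {
  double diff_ms;      // Cmp Time - Ref Time, in milliseconds
  double pct_diff;     // Diff relative to Ref Time, in percent
  bool exceeds_noise;  // assumed meaning of Status == FAIL
};

// Times are in milliseconds, noise values in percent.
Comparison compare(double ref_ms, double ref_noise_pct, double cmp_ms, double cmp_noise_pct)
{
  assert(ref_ms > 0.0);
  double const diff_ms   = cmp_ms - ref_ms;
  double const pct_diff  = 100.0 * diff_ms / ref_ms;
  double const min_noise = std::min(ref_noise_pct, cmp_noise_pct);
  // Assumption inferred from the tables: a row is flagged when the relative
  // difference is larger than the smaller of the two noise estimates.
  return {diff_ms, pct_diff, std::abs(pct_diff) > min_noise};
}
```

For the 2^28 row of the first table, `compare(71.597, 1.32, 68.284, 1.43)` gives a difference of about -3.313 ms, about -4.63%, and flags the row, which matches the FAIL shown there even though the change is a speedup.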


karthikeyann

Looks good to me. 🚀

An article explaining the "CUB-style implementation" of TempStorage would be useful. It would be great if, in a future PR, this were changed to a simpler functor.

elstehle · 2023-08-09T19:17:14Z

/merge

refactors JSON reader's pushdown automaton

e37c8b0

elstehle requested a review from a team as a code owner July 18, 2023 11:14

elstehle requested review from bdice and nvdbaranec July 18, 2023 11:14

github-actions bot added the libcudf label Jul 18, 2023

elstehle added the cuIO, improvement, non-breaking, 3 - Ready for Review, and libcudf labels and removed the libcudf label Jul 18, 2023

elstehle requested review from karthikeyann and vuule July 18, 2023 11:25

ttnghia reviewed Jul 19, 2023


cpp/src/io/fst/lookup_tables.cuh

vuule reviewed Jul 19, 2023


cpp/src/io/fst/lookup_tables.cuh

cpp/src/io/fst/lookup_tables.cuh

vuule and others added 3 commits July 19, 2023 15:57

Merge branch 'branch-23.08' into enh/clean-up-pda

feb2437

corrects documentation

d208468

renames local variable

962f81d

Merge remote-tracking branch 'upstream/branch-23.08' into enh/clean-u…

5b57ffa

…p-pda

elstehle changed the title ~~Refactors JSON reader's pushdown automaton~~ Refactors JSON reader's pushdown automaton Jul 26, 2023

elstehle changed the base branch from branch-23.08 to branch-23.10 August 2, 2023 05:18

Merge remote-tracking branch 'upstream/branch-23.08' into enh/clean-u…

7337b20

…p-pda

elstehle requested review from a team as code owners August 7, 2023 09:30

github-actions bot added the Python and conda labels Aug 7, 2023

elstehle added 2 commits August 7, 2023 02:40

fixes a typo in a comment

72389d6

Merge remote-tracking branch 'upstream/branch-23.10' into enh/clean-u…

3710e5c

…p-pda

github-actions bot removed the Python and conda labels Aug 7, 2023

wence- removed request for a team August 7, 2023 15:10

karthikeyann approved these changes Aug 9, 2023


karthikeyann requested review from ttnghia and vuule August 9, 2023 17:27

Merge branch 'branch-23.10' into enh/clean-up-pda

d4fd02f

vuule approved these changes Aug 9, 2023


rapids-bot bot merged commit e8df037 into rapidsai:branch-23.10 Aug 9, 2023
