Refactors JSON reader's pushdown automaton #13716

Merged: 9 commits into rapidsai:branch-23.10 on Aug 9, 2023

@elstehle elstehle commented Jul 18, 2023

Description

This PR simplifies and cleans up the JSON reader's pushdown automaton.

The pushdown automaton takes as input two arrays:

  1. The JSON's input characters
  2. The stack context for each character ({ - JSON object, [ - JSON array, _ - Root of JSON)
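To make the second array concrete, here is a minimal sequential sketch in standard C++ (the reader itself computes this on the GPU). The helper name `stack_context` is hypothetical, and it assumes that each character is labelled with the symbol on top of the stack before that character is applied, that brackets inside quoted strings are ignored, and that no bracket matching is performed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): returns, for every input
// character, the symbol on top of the stack before that character is applied.
// '{' = inside a JSON object, '[' = inside a JSON array, '_' = root of JSON.
std::string stack_context(std::string const& json)
{
  std::vector<char> stack;
  std::string context;
  context.reserve(json.size());
  bool in_string = false;  // true while inside a quoted string
  bool escaped   = false;  // true right after a backslash inside a string

  for (char c : json) {
    context.push_back(stack.empty() ? '_' : stack.back());
    if (in_string) {
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;  // brackets inside strings never touch the stack
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      stack.push_back(c);
    } else if ((c == '}' || c == ']') && !stack.empty()) {
      stack.pop_back();  // assumption: no check that the bracket types match
    }
  }
  return context;
}
```

For the input `[{"a":1}]` this sketch labels the characters as `_[{{{{{{[`.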

Previously, we were fusing the two arrays and materializing them straight to the symbol group id for each combination. A symbol group id serves as the column of the transition table. The symbol group ids array was then used as input to the finite state transducer (FST).

After the recent refactor of the FST lookup tables, the FST has become more flexible. It now supports arbitrary iterators and the symbol group id lookup table (that maps a symbol to a symbol group id) can now be implemented by a simple function object.
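As a rough illustration of a symbol group id lookup written as a function object, the sketch below maps a (character, stack context) pair to a column index. The group layout, the constant `num_char_groups`, and the name `to_symbol_group_id` are assumptions made for this example; the reader's actual symbol groups differ.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical number of character groups, chosen only for illustration.
constexpr std::uint32_t num_char_groups = 6;

// Maps an input character to its (hypothetical) character group.
inline std::uint32_t char_group(char c)
{
  switch (c) {
    case '{': return 0;
    case '[': return 1;
    case '}': return 2;
    case ']': return 3;
    case '"': return 4;
    default: return 5;  // every other character
  }
}

// Maps a stack-context symbol to its (hypothetical) stack group.
inline std::uint32_t stack_group(char ctx)
{
  switch (ctx) {
    case '{': return 1;  // inside a JSON object
    case '[': return 2;  // inside a JSON array
    default: return 0;   // '_', the root of the JSON
  }
}

// Function object standing in for a materialized lookup table: it computes
// the transition-table column for one (character, stack context) pair.
struct to_symbol_group_id {
  std::uint32_t operator()(std::pair<char, char> const& symbol) const
  {
    return stack_group(symbol.second) * num_char_groups + char_group(symbol.first);
  }
};
```

Because the mapping is computed on demand, nothing has to be stored per input character. With this layout, a quote inside an array maps to column 2 * 6 + 4 = 16.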

This PR takes advantage of the FST's ability to take fancy iterators. We now zip the json_input and stack_context symbols and pass that zip_iterator to the FST.
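The idea can be sketched in standard C++ with a hand-rolled stand-in for a zip iterator. Everything below (`zip_chars`, `root_open_group`, `run_state_machine`, `count_root_values`, and the tiny transition table) is hypothetical and only mimics the shape of the approach: the state machine reads (character, stack context) pairs lazily and asks a function object for the transition-table column, so no fused symbol group id array is written out.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal stand-in for a zip iterator over two equally long character arrays.
struct zip_chars {
  char const* input;
  char const* context;
  std::pair<char, char> operator*() const { return {*input, *context}; }
  zip_chars& operator++()
  {
    ++input;
    ++context;
    return *this;
  }
  bool operator!=(zip_chars const& other) const { return input != other.input; }
};

// Hypothetical symbol groups: 0 = opening bracket at the root, 1 = anything else.
struct root_open_group {
  int operator()(std::pair<char, char> const& symbol) const
  {
    bool const is_open = symbol.first == '{' || symbol.first == '[';
    return (is_open && symbol.second == '_') ? 0 : 1;
  }
};

// Table-driven state machine that accepts any iterator yielding symbols.
// States: 0 = no root value seen, 1 = one root value seen, 2 = more than one.
template <typename SymbolIt, typename SymbolToGroup>
int run_state_machine(SymbolIt first, SymbolIt last, SymbolToGroup to_group)
{
  static constexpr int transition[3][2] = {{1, 0}, {2, 1}, {2, 2}};
  int state = 0;
  for (; first != last; ++first) {
    state = transition[state][to_group(*first)];
  }
  return state;
}

// Zips the two arrays on the fly and returns the final state.
int count_root_values(std::string const& json_input, std::string const& stack_context)
{
  assert(json_input.size() == stack_context.size());
  zip_chars first{json_input.data(), stack_context.data()};
  zip_chars last{json_input.data() + json_input.size(),
                 stack_context.data() + stack_context.size()};
  return run_state_machine(first, last, root_open_group{});
}
```

The two arrays stay separate in memory and are combined only at the moment the state machine dereferences the iterator, which is where the saved pass over the data would come from.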

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner July 18, 2023 11:14
@elstehle elstehle requested review from bdice and nvdbaranec July 18, 2023 11:14
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 18, 2023
@elstehle elstehle added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 18, 2023
@elstehle elstehle requested review from karthikeyann and vuule July 18, 2023 11:25
@elstehle
Contributor Author

Some perf numbers on V100 for end-to-end JSON reading. Overall slight improvements due to saving an extra pass over the data.


|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|     2^20      |   5.213 ms |       8.36% |   5.045 ms |       6.48% |  -167.466 us |  -3.21% |   PASS   |
|     2^21      |   5.260 ms |       5.46% |   5.188 ms |       3.96% |   -71.470 us |  -1.36% |   PASS   |
|     2^22      |   5.702 ms |       5.85% |   5.493 ms |       5.02% |  -208.683 us |  -3.66% |   PASS   |
|     2^23      |   6.530 ms |       3.48% |   6.465 ms |       3.78% |   -64.807 us |  -0.99% |   PASS   |
|     2^24      |   8.673 ms |       2.72% |   8.511 ms |       1.99% |  -161.422 us |  -1.86% |   PASS   |
|     2^25      |  12.917 ms |       2.90% |  12.989 ms |       2.50% |    72.088 us |   0.56% |   PASS   |
|     2^26      |  21.124 ms |       1.43% |  20.701 ms |       1.86% |  -422.552 us |  -2.00% |   FAIL   |
|     2^27      |  37.776 ms |       1.49% |  38.016 ms |       1.44% |   239.457 us |   0.63% |   PASS   |
|     2^28      |  71.597 ms |       1.32% |  68.284 ms |       1.43% | -3312.907 us |  -4.63% |   FAIL   |
|     2^29      | 133.911 ms |       1.26% | 130.880 ms |       1.29% | -3030.951 us |  -2.26% |   FAIL   |
|     2^30      | 263.870 ms |       1.02% | 257.406 ms |       1.36% | -6463.997 us |  -2.45% |   FAIL   |

# nested_json_gpu_parser_depth

## [0] Tesla V100-SXM2-32GB

|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   2^1   |     2^20      |   5.182 ms |       6.49% |   5.114 ms |       6.36% |   -67.564 us |  -1.30% |   PASS   |
|   2^2   |     2^20      |   5.298 ms |       3.67% |   5.165 ms |       2.78% |  -132.755 us |  -2.51% |   PASS   |
|   2^3   |     2^20      |  11.410 ms |       3.39% |  11.234 ms |       2.38% |  -176.428 us |  -1.55% |   PASS   |
|   2^4   |     2^20      |  13.756 ms |       3.69% |  13.416 ms |       0.43% |  -339.588 us |  -2.47% |   FAIL   |
|   2^1   |     2^22      |   6.144 ms |       4.83% |   6.040 ms |       2.64% |  -104.223 us |  -1.70% |   PASS   |
|   2^2   |     2^22      |   6.254 ms |       4.29% |   6.049 ms |       2.16% |  -204.954 us |  -3.28% |   FAIL   |
|   2^3   |     2^22      |  12.294 ms |       3.46% |  11.828 ms |       0.39% |  -465.820 us |  -3.79% |   FAIL   |
|   2^4   |     2^22      |  14.414 ms |       2.45% |  14.240 ms |       0.50% |  -173.966 us |  -1.21% |   FAIL   |
|   2^1   |     2^24      |  10.711 ms |       2.63% |  10.505 ms |       1.46% |  -206.131 us |  -1.92% |   FAIL   |
|   2^2   |     2^24      |  10.723 ms |       2.41% |  10.551 ms |       1.97% |  -171.670 us |  -1.60% |   PASS   |
|   2^3   |     2^24      |  16.042 ms |       2.27% |  15.799 ms |       0.33% |  -243.033 us |  -1.52% |   FAIL   |
|   2^4   |     2^24      |  19.984 ms |       2.58% |  19.491 ms |       0.46% |  -492.768 us |  -2.47% |   FAIL   |
|   2^1   |     2^26      |  27.660 ms |       1.70% |  27.320 ms |       0.86% |  -339.988 us |  -1.23% |   FAIL   |
|   2^2   |     2^26      |  27.618 ms |       0.91% |  27.308 ms |       0.49% |  -310.610 us |  -1.12% |   FAIL   |
|   2^3   |     2^26      |  34.423 ms |       0.76% |  34.234 ms |       0.32% |  -189.112 us |  -0.55% |   FAIL   |
|   2^4   |     2^26      |  43.460 ms |       0.92% |  43.217 ms |       0.83% |  -243.384 us |  -0.56% |   PASS   |
|   2^1   |     2^28      |  95.548 ms |       0.59% |  94.341 ms |       0.60% | -1206.699 us |  -1.26% |   FAIL   |
|   2^2   |     2^28      |  95.691 ms |       0.67% |  94.230 ms |       0.39% | -1461.410 us |  -1.53% |   FAIL   |
|   2^3   |     2^28      | 125.592 ms |       0.31% | 125.966 ms |       0.33% |   374.654 us |   0.30% |   PASS   |
|   2^4   |     2^28      | 160.581 ms |       0.26% | 158.317 ms |       0.50% | -2263.586 us |  -1.41% |   FAIL   |
|   2^1   |     2^30      | 370.073 ms |       0.45% | 367.691 ms |       0.34% | -2381.911 us |  -0.64% |   FAIL   |
|   2^2   |     2^30      | 369.708 ms |       0.49% | 367.838 ms |       0.32% | -1870.769 us |  -0.51% |   FAIL   |
|   2^3   |     2^30      | 482.927 ms |       0.09% | 479.960 ms |       0.11% | -2967.504 us |  -0.61% |   FAIL   |
|   2^4   |     2^30      | 605.768 ms |       0.11% | 602.818 ms |       0.10% | -2949.862 us |  -0.49% |   FAIL   |

# json_read_data_type

## [0] Tesla V100-SXM2-32GB

|  data_type  |      io       |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|    FLOAT    | DEVICE_BUFFER | 711.923 ms |       0.05% | 707.953 ms |       0.06% | -3970.032 us |  -0.56% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER | 836.850 ms |       0.13% | 834.176 ms |       0.05% | -2673.364 us |  -0.32% |   FAIL   |
|   STRING    | DEVICE_BUFFER | 317.871 ms |       0.14% | 323.675 ms |       0.05% |     5.805 ms |   1.83% |   FAIL   |
|    LIST     | DEVICE_BUFFER | 228.696 ms |       0.06% | 228.660 ms |       0.06% |   -35.715 us |  -0.02% |   PASS   |
|   STRUCT    | DEVICE_BUFFER | 889.568 ms |       0.15% | 882.394 ms |       0.06% | -7174.182 us |  -0.81% |   FAIL   |

@elstehle elstehle changed the title Refactors JSON reader's pushdown automaton Refactors JSON reader's pushdown automaton Jul 26, 2023
@elstehle elstehle changed the base branch from branch-23.08 to branch-23.10 August 2, 2023 05:18
@elstehle elstehle requested review from a team as code owners August 7, 2023 09:30
@github-actions github-actions bot added Python Affects Python cuDF API. conda labels Aug 7, 2023
@github-actions github-actions bot removed Python Affects Python cuDF API. conda labels Aug 7, 2023
@wence- wence- removed request for a team August 7, 2023 15:10
Contributor

@karthikeyann karthikeyann left a comment


Looks good to me. 🚀

If there is an article explaining the "CUB-style implementation" of TempStorage, it would be useful to link it. It would be great if, in a future PR, this were changed to a simpler functor.

@karthikeyann karthikeyann requested review from ttnghia and vuule August 9, 2023 17:27
@elstehle
Contributor Author

elstehle commented Aug 9, 2023

/merge

@rapids-bot rapids-bot bot merged commit e8df037 into rapidsai:branch-23.10 Aug 9, 2023
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

4 participants