Implement get_json_object_multiple_paths #2233

Merged
merged 43 commits into from
Jul 18, 2024


ttnghia
Collaborator

@ttnghia ttnghia commented Jul 17, 2024

This re-implements get_json_object to support extracting multiple JSON objects from the same JSON input. By processing multiple JSON paths in a single kernel call, we can improve the performance of SQL queries that make a large number of calls to get_json_object.

The building blocks of this work are very simple:

  • Employ an entire warp (32 threads) to process a single row. This eliminates thread divergence and prevents threads from blocking each other when they would otherwise execute different code paths on different inputs.
  • Improve memory access by having consecutive warps process the same input row, each with a different JSON path.
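
The work assignment described in the two points above can be sketched on the host side as plain index arithmetic. This is only an illustration, not the code in get_json_object.cu: the names `WarpTask`, `map_thread`, and `total_threads` are hypothetical, and the row-major layout (warp `w` handles row `w / num_paths` and path `w % num_paths`) is an assumption chosen so that consecutive warps read the same input row.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; none of this is taken from get_json_object.cu.
constexpr std::int64_t kWarpSize = 32;

struct WarpTask {
  std::int64_t row;   // input row handled by the thread's warp
  std::int64_t path;  // JSON path handled by the thread's warp
  std::int64_t lane;  // position of the thread within its warp (0..31)
};

// Map a flat global thread index to (row, path, lane).
// Assumption: warps are laid out row-major, so warps w, w+1, ... within the
// same row differ only in the JSON path they evaluate.
inline WarpTask map_thread(std::int64_t global_tid, std::int64_t num_paths) {
  assert(global_tid >= 0 && num_paths > 0);
  std::int64_t const warp = global_tid / kWarpSize;
  return WarpTask{warp / num_paths, warp % num_paths, global_tid % kWarpSize};
}

// Number of threads needed so that every (row, path) pair gets one full warp.
inline std::int64_t total_threads(std::int64_t num_rows, std::int64_t num_paths) {
  assert(num_rows >= 0 && num_paths > 0);
  return num_rows * num_paths * kWarpSize;
}
```

In a kernel, `global_tid` would typically be derived from the block and thread indices. With 80 paths, warps 0 through 79 all map to row 0, so neighbouring warps touch the same input string, which is the memory access pattern the second point aims for.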

Depends on:


Benchmark on a dataset of 200'000 rows, stored in a 6.7 GB Parquet file, with 80 JSON paths. The data is read using the Parquet chunked reader with a 256 MB output limit.

Baseline (before this work): 410 seconds.
With this work: 44 seconds.

@ttnghia ttnghia force-pushed the get_json_object_long_strings branch from 8814752 to e4d1352 Compare July 17, 2024 07:02
Signed-off-by: Nghia Truong <[email protected]>
ttnghia added 2 commits July 17, 2024 10:16
Signed-off-by: Nghia Truong <[email protected]>
# Conflicts:
#	src/main/cpp/src/get_json_object.cu
ttnghia added 4 commits July 17, 2024 14:17
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
…_object_long_strings

# Conflicts:
#	src/main/cpp/src/get_json_object.hpp
@ttnghia
Collaborator Author

ttnghia commented Jul 18, 2024

build

ttnghia added 2 commits July 17, 2024 19:54
@ttnghia
Collaborator Author

ttnghia commented Jul 18, 2024

build

@ttnghia ttnghia marked this pull request as ready for review July 18, 2024 05:40
@ttnghia ttnghia requested review from thirtiseven and res-life July 18, 2024 05:41
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Collaborator Author

ttnghia commented Jul 18, 2024

build

Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@res-life
Collaborator

LGTM.

@res-life
Collaborator

We may add a test case for the out-of-bounds write scenario with multiple paths in a follow-up PR.

@ttnghia
Collaborator Author

ttnghia commented Jul 18, 2024

build

@ttnghia ttnghia merged commit 05b05a0 into NVIDIA:branch-24.08 Jul 18, 2024
3 checks passed
@ttnghia ttnghia deleted the get_json_object_long_strings branch July 18, 2024 16:58
3 participants