[PERF] Performance impact of mixed_type_as_string JSON reader option in reading JSON lines #15196

Closed
shrshi opened this issue Feb 29, 2024 · 3 comments
Assignees
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@shrshi
Contributor

shrshi commented Feb 29, 2024

This report presents preliminary findings on how four JSON lines reader options affect performance: quote normalization, mixed-type handling, byte-range reading, and error recovery. For valid JSON input (the generated data is not modified) read as a single chunk (the byte range covers all records), we expect enabling these options to have no significant performance impact.

Benchmarks were run on an A100 80GB GPU with every combination of the above options enabled and disabled. Enabling mixed_types_as_string degraded performance by 98%, with normalize_single_quotes=NO, row_selection=ALL, and recovery_mode=RECOVER_WITH_NULL held constant between the two experiments. Refer to the figure for the performance comparison.
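As a rough reading of the 98% figure: assuming it refers to a drop in throughput (the report does not state which metric was used), the implied slowdown factor can be worked out as below. The function name is illustrative only.

```python
def slowdown_factor(degradation: float) -> float:
    """Convert a fractional throughput drop into a runtime multiplier.

    Assumes degradation = 1 - (throughput_enabled / throughput_disabled),
    which is an interpretation, not something the report confirms.
    """
    if not 0.0 <= degradation < 1.0:
        raise ValueError("degradation must be in [0, 1)")
    return 1.0 / (1.0 - degradation)


# 0.98 is the degradation reported above; the result is roughly 50x.
factor = slowdown_factor(0.98)
print(f"{factor:.1f}x slower")
```

Under that assumption, a read that took one unit of time with the option disabled would take about fifty units with it enabled.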

To investigate the impact of enabling mixed_types_as_string, the benchmark was profiled with --axis normalize_single_quotes=NO --axis row_selection=ALL --axis mixed_types_as_string=YES --axis recovery_mode=RECOVER_WITH_NULL.
The infer_column_type_kernel appears to be the bottleneck: stalled warps limit its achieved occupancy to 9.3% (nsys and ncu profiles below).
Screenshot 2024-02-29 105007
Screenshot 2024-02-29 105434
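To put the 9.3% figure in context: achieved occupancy is the ratio of the average number of active warps per SM to the maximum the SM supports. A minimal sketch of that arithmetic, assuming the A100 limit of 64 resident warps per SM (compute capability 8.0); the helper name is made up for illustration.

```python
A100_MAX_WARPS_PER_SM = 64  # 2048 resident threads / 32 threads per warp


def average_active_warps(achieved_occupancy: float,
                         max_warps_per_sm: int = A100_MAX_WARPS_PER_SM) -> float:
    """Average active warps per SM implied by an achieved-occupancy ratio."""
    return achieved_occupancy * max_warps_per_sm


# 0.093 is the achieved occupancy reported for infer_column_type_kernel.
warps = average_active_warps(0.093)
print(f"about {warps:.1f} of {A100_MAX_WARPS_PER_SM} warps active per SM")
```

That works out to roughly 6 active warps per SM on average, which leaves little room to hide latency when those warps stall.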

Next steps

  • More investigation required to improve performance of reader with mixed types option

Related information

@shrshi shrshi changed the title [PERF] Performance impact of JSON reader options in reading JSON lines [PERF] Performance impact of mixed_type_as_string JSON reader option in reading JSON lines Feb 29, 2024
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue labels Feb 29, 2024
@GregoryKimball GregoryKimball modified the milestones: json, Nested JSON reader Feb 29, 2024
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Feb 29, 2024
@GregoryKimball
Contributor

@karthikeyann, would you please work with @shrshi on this performance hotspot?

@karthikeyann
Contributor

Provided @shrshi with a minor patch to benchmark. The kernel could be skipped entirely for those cases.

rapids-bot bot pushed a commit that referenced this issue Mar 8, 2024
…n is enabled (#15236)

Addresses #15196 by applying a patch from @karthikeyann to skip the `infer_column_type_kernel` by forcing the mixed types column to be a string. 
With this optimization, we see a significant improvement in performance. Please refer to the [comment](#15236 (comment)) for a visualization of the results before and after applying this optimization as obtained from the [JSON lines benchmarking exercise](#15124).

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15236
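The idea behind the change referenced above can be sketched in plain Python. This is not the libcudf code: `resolve_column_type`, `infer_leaf_type`, `inference_log`, and the `is_mixed` flag are hypothetical stand-ins, and the sketch assumes the reader already knows which columns hold mixed types before inference would run.

```python
import json

inference_log = []  # records each time the costly inference step runs


def infer_leaf_type(raw_values):
    """Stand-in for per-value type inference (the expensive step)."""
    inference_log.append(len(raw_values))
    parsed = [json.loads(v) for v in raw_values]
    if all(isinstance(p, bool) for p in parsed):
        return "bool"
    if all(isinstance(p, int) and not isinstance(p, bool) for p in parsed):
        return "int64"
    if all(isinstance(p, (int, float)) and not isinstance(p, bool) for p in parsed):
        return "float64"
    return "string"


def resolve_column_type(raw_values, is_mixed, mixed_types_as_string):
    """Pick a column type, skipping inference when the answer is forced."""
    if is_mixed and mixed_types_as_string:
        # The column will be returned as strings regardless of content,
        # so inspecting every value would be wasted work.
        return "string"
    return infer_leaf_type(raw_values)


# A column holding both an object and a string: inference is skipped.
print(resolve_column_type(['{"a": 1}', '"x"'], True, True))
# A uniform integer column: inference still runs.
print(resolve_column_type(["1", "2"], False, True))
```

The early return is the whole point of the sketch: when the option forces the result, the per-value pass never starts.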
@GregoryKimball
Contributor

Closed by #15236


3 participants