[FEA] JSON reader performance projects #17718

Open
GregoryKimball opened this issue Jan 10, 2025 · 0 comments
Labels
cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

GregoryKimball commented Jan 10, 2025

Background

RAPIDS cuDF released Python support for nested JSON reading in 23.02 and completed the work for Spark-RAPIDS integration in 24.12. We are aware of several improvements that cuDF could make to minimize pre- and post-processing in Spark-RAPIDS, and of some JSON reader features that would benefit NeMo Curator. We have also collected ideas for performance gains and improved code quality. This story issue highlights the most significant outstanding issues; the full issue set is documented in the Nested JSON reader milestone.

Improvements for Spark-RAPIDS

| Status | Issue | Outlook |
|---|---|---|
| | Empty lines #5712 | Currently Spark-RAPIDS checks and replaces empty lines with {} |
| | Add some validation options #15222 | We may need cuDF changes to allow validation of a new token type |
| 🔄 #17575 | Root-level list support | See issue in NVIDIA/spark-rapids#11717 |
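
The empty-line workaround described in the first row can be illustrated on the host with plain Python. This is only a sketch of the idea, not the Spark-RAPIDS implementation: the function name `replace_empty_lines` is hypothetical, and it assumes newline-delimited JSON where a blank or whitespace-only line should become an empty object `{}` so that it still produces a row.

```python
import json


def replace_empty_lines(jsonl_text, placeholder="{}"):
    """Replace blank or whitespace-only lines with an empty JSON object.

    Hypothetical helper for illustration; assumes "\n"-delimited JSON Lines.
    """
    lines = jsonl_text.split("\n")
    # A trailing newline yields one final empty string that is not a real row.
    if lines and lines[-1] == "":
        lines.pop()
    return "\n".join(line if line.strip() else placeholder for line in lines)


raw = '{"a": 1}\n\n{"a": 2}\n   \n{"a": 3}\n'
patched = replace_empty_lines(raw)
rows = [json.loads(line) for line in patched.split("\n")]
```

Handling empty lines inside the reader (#5712) would make this kind of pre-processing pass over the input unnecessary.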

Improvements for NeMo Curator

| Status | Issue | Outlook |
|---|---|---|
| | Add a column with filenames index in cudf.read_json #15960 | To be addressed in #17480 (confirmation needed). Difficult due to multi-source row tracking. |
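
To make the multi-source row tracking problem concrete, here is a minimal host-side sketch in plain Python. It is not how cuDF reads multiple sources; it assumes each non-empty line is exactly one row, and the names `read_with_source_index`, `row_ends`, and `source_index` are hypothetical. The idea is to record the cumulative row count at the end of each source, then map every row of the concatenated result back to its source with a binary search.

```python
import json
from bisect import bisect_right
from itertools import accumulate


def read_with_source_index(sources):
    """Parse several JSON Lines buffers and tag each row with its source index.

    Hypothetical helper for illustration; `sources` is a list of strings.
    """
    per_source_lines = [
        [line for line in text.split("\n") if line.strip()]
        for text in sources
    ]
    # row_ends[k] is the total number of rows in sources 0..k (inclusive).
    row_ends = list(accumulate(len(lines) for lines in per_source_lines))
    rows = [json.loads(line) for lines in per_source_lines for line in lines]
    # The first source whose cumulative end exceeds i owns row i.
    source_index = [bisect_right(row_ends, i) for i in range(len(rows))]
    return rows, source_index


filenames = ["a.jsonl", "b.jsonl", "c.jsonl"]  # example names only
sources = ['{"x": 1}\n{"x": 2}\n', "", '{"x": 3}\n']
rows, source_index = read_with_source_index(sources)
```

Storing an index into the filename list, rather than the filename string itself, keeps the extra column small; the lookup `filenames[source_index[i]]` recovers the name when needed.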

Improvements for cuDF-python

| Status | Issue | Outlook |
|---|---|---|
| | Make cuDF JSON writer default #16993 | |

Performance projects

| Status | Issue | Outlook |
|---|---|---|
| | Refactoring JSON reader tree algorithms with Compressed Sparse Row (CSR) #15903 | #15979 introduced the CSR data structure, and 🔄 #16205 focuses on constructing device JSON columns. Also related to #16965. |
| | Improved parsing kernel #16965 | Especially for wide/deep tables, we would benefit from processing multiple columns per kernel instead of today's column-per-kernel implementation. This should improve performance in cases without much pruning. |
| | Faster total symbol calculation in FST #17114 | |
| | Optimizations for flat JSON | Needs scoping to estimate the potential for performance improvements |
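
For readers unfamiliar with CSR, the sketch below shows one common way to store a tree's child lists in Compressed Sparse Row form, starting from a parent-index array. It is a plain-Python illustration under stated assumptions, not the data structure from #15979: the function name `parents_to_csr` and the convention that the root has parent `-1` are hypothetical choices made for this example.

```python
def parents_to_csr(parent):
    """Build CSR child lists from a parent-index array.

    parent[i] is the parent node id of node i, or -1 for the root.
    Returns (row_offsets, col_indices): the children of node i are
    col_indices[row_offsets[i]:row_offsets[i + 1]].
    """
    n = len(parent)
    # Count the children of every node.
    counts = [0] * n
    for p in parent:
        if p >= 0:
            counts[p] += 1
    # An exclusive prefix sum of the counts gives the row offsets.
    row_offsets = [0] * (n + 1)
    for i in range(n):
        row_offsets[i + 1] = row_offsets[i] + counts[i]
    # Scatter each child into its parent's slot range.
    col_indices = [0] * row_offsets[n]
    cursor = row_offsets[:n]  # next free slot per parent (a copy)
    for child, p in enumerate(parent):
        if p >= 0:
            col_indices[cursor[p]] = child
            cursor[p] += 1
    return row_offsets, col_indices


# Node tree for {"a": {"b": 1, "c": 2}, "d": 3}:
# 0 = root struct, 1 = a, 2 = b, 3 = c, 4 = d
parent = [-1, 0, 1, 1, 0]
row_offsets, col_indices = parents_to_csr(parent)
children_of_root = col_indices[row_offsets[0]:row_offsets[1]]
children_of_a = col_indices[row_offsets[1]:row_offsets[2]]
```

The count, prefix-sum, and scatter steps each map naturally onto data-parallel primitives, which is one reason CSR is a common layout for tree and graph traversal on GPUs.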
GregoryKimball added the cuIO, feature request, and libcudf labels Jan 10, 2025
GregoryKimball added this to the Nested JSON reader milestone Jan 10, 2025