[FEA] JSON reader performance projects #17718

GregoryKimball · 2025-01-10T23:19:04Z

Background

RAPIDS cuDF released Python support for nested JSON reading in 23.02 and completed the work for Spark-RAPIDS integration in 24.12. We are aware of several improvements that cuDF could make to minimize pre- and post-processing in Spark-RAPIDS. Some JSON reader features would also benefit NeMo Curator. Finally, we have collected ideas for performance gains and improved code quality. This story issue highlights the most significant outstanding issues; the full issue set is documented in the Nested JSON reader milestone.

Improvements for Spark-RAPIDS

Status	Issue	Outlook
	Empty lines #5712	Currently Spark-RAPIDS checks and replaces empty lines with `{}`
	Add some validation options #15222	We may need cuDF changes to allow validation of a new token type
🔄 #17575	Root-level list support	See issue in NVIDIA/spark-rapids#11717
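
For illustration, the empty-line workaround in the first row can be sketched on the host in plain Python. This is not the Spark-RAPIDS implementation: the helper name `fill_empty_lines` is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
import json


def fill_empty_lines(jsonl_text):
    """Replace empty lines in a JSON Lines buffer with `{}` (hypothetical helper)."""
    lines = jsonl_text.split("\n")
    if lines and lines[-1] == "":
        # A trailing newline terminates the last record; it is not an empty row.
        lines.pop()
    # Assumption: whitespace-only lines count as empty.
    return "\n".join(line if line.strip() else "{}" for line in lines)


raw = '{"a": 1}\n\n{"a": 2}\n'
cleaned = fill_empty_lines(raw)
# Every line in the cleaned buffer now parses as a JSON value.
records = [json.loads(line) for line in cleaned.split("\n")]
```

With native handling of empty lines in the reader, a pre-processing pass like this would no longer be needed.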

Improvements for NeMo Curator

Status	Issue	Outlook
	Add a column with filenames index in cudf.read_json #15960	To be addressed in #17480 (confirmation needed). Difficult due to multi-source row tracking.
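
As a rough picture of what the requested source-index column means, here is a minimal host-side sketch in plain Python. It does not use `cudf.read_json`; the function name `read_with_source_index` and the parallel-list return shape are hypothetical. Each source is parsed separately here, which sidesteps the row tracking that a reader working on a single concatenated buffer would need.

```python
import json


def read_with_source_index(sources):
    """Parse several JSON Lines buffers; return (records, source_indices).

    Hypothetical helper: source_indices[i] is the position in `sources`
    of the buffer that produced records[i].
    """
    records = []
    source_indices = []
    for source_idx, text in enumerate(sources):
        for line in text.split("\n"):
            if not line.strip():
                continue  # skip blank lines in this sketch
            records.append(json.loads(line))
            source_indices.append(source_idx)
    return records, source_indices


buffers = ['{"a": 1}\n{"a": 2}\n', '{"a": 3}\n']
records, source_indices = read_with_source_index(buffers)
```

Mapping `source_indices` through a list of file names would give the filename column described in the issue title.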

Improvements for cuDF-python

Status	Issue	Outlook
	Make cuDF JSON writer default #16993

Performance projects

Status	Issue	Outlook
	Refactoring JSON reader tree algorithms with Compressed Sparse Row (CSR) #15903	#15979 introduced the CSR data structure, and 🔄 #16205 focuses on constructing device JSON columns. Also related to #16965
	Improved parsing kernel #16965	Especially for wide/deep tables, we would benefit from processing multiple columns per kernel instead of today's column-per-kernel implementation. This should improve performance in cases without much pruning.
	Faster total symbol calculation in FST #17114
	Optimizations for flat JSON	Needs scoping to estimate the potential for performance improvements
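
For readers unfamiliar with CSR as a tree representation, the sketch below converts a parent-index array into CSR-style adjacency (row offsets plus child indices). It is a generic host-side illustration in Python, not the layout or algorithm used in #15979 or #16205; the function name `tree_to_csr` and the convention that the root has parent `-1` are assumptions.

```python
def tree_to_csr(parents):
    """Build CSR adjacency (row_offsets, col_indices) from a parent array.

    Assumption: parents[i] is the parent of node i, and the root has -1.
    The children of node p are col_indices[row_offsets[p]:row_offsets[p + 1]].
    """
    n = len(parents)

    # Count the children of each node.
    counts = [0] * n
    for p in parents:
        if p >= 0:
            counts[p] += 1

    # An exclusive prefix sum of the counts gives the row offsets.
    row_offsets = [0] * (n + 1)
    for i in range(n):
        row_offsets[i + 1] = row_offsets[i] + counts[i]

    # Scatter each child into its parent's slot range.
    col_indices = [0] * row_offsets[n]
    cursor = row_offsets[:n]  # next free slot per parent (a copy)
    for child, p in enumerate(parents):
        if p >= 0:
            col_indices[cursor[p]] = child
            cursor[p] += 1
    return row_offsets, col_indices


# Node 0 is the root; nodes 1 and 2 are its children, and so on.
row_offsets, col_indices = tree_to_csr([-1, 0, 0, 1, 1, 2])
children_of_1 = col_indices[row_offsets[1]:row_offsets[2]]
```

The count, prefix-sum, and scatter steps are the usual way to build CSR, and each maps onto a common data-parallel primitive.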


GregoryKimball added the cuIO (cuIO issue), feature request (New feature or request), and libcudf (Affects libcudf (C++/CUDA) code) labels Jan 10, 2025

GregoryKimball added this to the Nested JSON reader milestone Jan 10, 2025

GregoryKimball mentioned this issue Jan 10, 2025

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Closed
