Especially for wide/deep tables, we would benefit from processing multiple columns per kernel, instead of today's column-per-kernel implementation. This would improve performance for cases without much pruning.
Background
RAPIDS cuDF released Python support for nested JSON reading in 23.02 and completed the work for Spark-RAPIDS integration in 24.12. We are aware of several improvements that cuDF could make to minimize pre- and post-processing in Spark-RAPIDS. There are also some features in the JSON reader that would benefit NeMo Curator. Finally, we have collected ideas for performance gains and improved code quality. This story issue highlights the most significant outstanding issues; the full issue set is documented in the Nested JSON reader milestone.
Improvements for Spark-RAPIDS
Improvements for NeMo Curator
Improvements for cuDF-python
Performance projects