
[BUG] JSON reader has no option to return the columns only for the requested schema #13473

Closed
revans2 opened this issue May 30, 2023 · 6 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented May 30, 2023

Describe the bug
Pandas and Spark are very different in what gets returned when reading a JSON file. In pandas you essentially provide type hints, but it returns data for all columns in the file, and only for the columns in the file.

Spark has a schema that it wants, and when given a schema it reads only the columns that match that schema and nothing else.

This causes two problems for Spark. The first is a performance issue: if we want to read only one column out of 100, CUDF is going to materialize all 100 columns and then we throw away 99 of them. It would be great if we didn't need the memory for all of the extra columns, or the computation needed to create them.
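The pandas behavior described above can be sketched like this (a minimal illustration with made-up data; the issue itself is about libcudf's C++ reader, not pandas):

```python
import io
import pandas as pd

# Hypothetical JSON Lines input containing three columns.
data = io.StringIO('{"a": 1, "b": "x", "c": true}\n{"a": 2, "b": "y", "c": false}\n')

# dtype is only a type hint for "a"; pandas still materializes
# every column it finds in the file.
df = pd.read_json(data, lines=True, dtype={"a": "int64"})
print(list(df.columns))  # all three columns come back, not just "a"
```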

The second problem is really an odd corner case. CUDF does not support a table with rows but no columns, whereas Spark does. So if there is a JSON file with no columns, but rows:

```
{}
{}
```

we have no way to actually read that without some help (see #5712).

It really would be nice to have a mode similar to how the Parquet, ORC, or CSV readers work, where only the requested columns are returned and all of the requested columns are returned, even if their values are all nulls.
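For comparison, here is the projection semantics the CSV path already offers, sketched with pandas `usecols` on made-up data (note that, unlike the behavior requested here, `usecols` raises on a missing column rather than returning it as all nulls):

```python
import io
import pandas as pd

csv = io.StringIO("a,b,c\n1,x,true\n2,y,false\n")

# Only the requested column is returned; "b" and "c" are skipped.
df = pd.read_csv(csv, usecols=["a"])
print(list(df.columns))
```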

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels May 30, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 1, 2023
This moves the logic that updates the columns returned from the JSON reader into Java. It also updates the code to handle requested columns that are not in the data. It is not perfect, because it will not work if the input file has no columns at all in it.

```
{}
{}
```

But it fixes the issue for a file that has valid columns in it, none of which are the columns that we requested.

This is a workaround for #13473, but it is not perfect.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - MithunR (https://github.com/mythrocks)

URL: #13477
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Jun 2, 2023
@GregoryKimball
Contributor

Thank you @revans2 for posting this. Would you please share whether you were exploring JSON file reading or JSON strings column parsing when you encountered these issues?

The two topics, column projection and empty rows, seem like they could be separate issues. Would you please let me know if there is a reason to combine them?

@karthikeyann
Contributor

The peak memory usage of the JSON reader will not be reduced if we add this feature. JSON tokenization, tree construction, and traversal will still be the same. Only datatype parsing will be eliminated for non-selected columns (this saves some time, but not much).

A quick analysis of the JSON reader benchmark:
In the attached screenshot of a benchmark run with 64 columns and a 1.28 GB file, datatype parsing takes ~21% of the time (green). Also note that get_token_stream and get_tree_representation account for the peak memory usage, not datatype parsing (in green).
[screenshot: benchmark profile of the JSON reader]
Speedup could be <20%.
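Treating that ~21% as the eliminable fraction, an Amdahl-style bound on the speedup can be sketched like this (the 0.21 figure comes from the benchmark above; assuming, optimistically, that projection eliminates all datatype parsing):

```python
# Amdahl-style upper bound: if datatype parsing is ~21% of runtime
# and projection could eliminate all of it, the best case is bounded.
parse_fraction = 0.21  # from the benchmark run described above
speedup = 1.0 / (1.0 - parse_fraction)
print(round(speedup, 2))  # well under 1.3x, i.e. <~21% time savings
```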

@revans2
Contributor Author

revans2 commented Jun 5, 2023

@GregoryKimball this was when we were reading a file, but it is the same for both code paths in this case.

@karthikeyann

Thanks for the info. I understand that this might not have a huge impact on memory usage or computation time. The majority of the memory usage and computation goes into tokenizing and parsing the data. But even then, if I have a case like the following:

```
{"B": true, "A": [100, 200, 300, 400, 500, ... hundreds of values ...]}
... thousands of rows ...
{"B": false, "A": [1, 2, 3, 4, 5, ...]}
```

If all I save is not having to materialize the output for A, that is a win. It may not be something that you want to prioritize highly from a performance standpoint.
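The saving described above can be sketched in plain Python with hypothetical data: drop unrequested keys as each line is parsed, so the large "A" arrays are never retained as output columns.

```python
import json

# Hypothetical rows where "A" carries large arrays we do not need.
lines = [
    '{"B": true, "A": [100, 200, 300, 400, 500]}',
    '{"B": false, "A": [1, 2, 3, 4, 5]}',
]

requested = {"B"}  # the Spark-style schema: only "B" is wanted

# Keep only requested keys per row; the "A" values are discarded
# immediately instead of being materialized into output columns.
rows = [{k: v for k, v in json.loads(line).items() if k in requested}
        for line in lines]
print(rows)
```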

The big problem for us is a really odd corner case, which we did run into in one of our integration tests, and which I am nervous we will run into with a customer at some point. By default, when Spark writes out JSON data it removes entries that are null. It is a space-saving optimization, and not that uncommon among JSON optimizations. So if we end up with a row whose values are all null, the output row is {}. The problem shows up when all of the rows in a batch look this way:

```
{}
{}
{}
```

I don't think this is likely to happen for large runs of rows, but with Spark it is very possible to have a few rows at the end of a file that look like this, and because of splits they would end up in a single batch. CUDF is unable to parse that batch and hand us back the number of rows. The best workaround we could come up with is to count the number of input rows before sending the data to CUDF and catch the resulting exception after it happens. But I am not 100% sure that it will work in all cases, especially if CUDF supports comments, because Spark strips out lines that are entirely comments. Then we would not know how many rows there are without some help.
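The row-counting workaround can be sketched like this (a host-side illustration on made-up input; the real workaround counts rows before handing the batch to CUDF):

```python
import json

# A batch where every row serialized to "{}" because all values were null.
batch = "{}\n{}\n{}\n"

# Count input rows up front, since the reader cannot report a
# row count for a batch that yields zero columns.
lines = [ln for ln in batch.splitlines() if ln.strip()]
row_count = len(lines)

# Confirm the batch is all empty objects (the failing corner case).
all_empty = all(len(json.loads(ln)) == 0 for ln in lines)
print(row_count, all_empty)
```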

Again, probably not the highest priority, but working around the lack of this type of feature does end up being rather hacky.

@GregoryKimball
Contributor

GregoryKimball commented Jun 5, 2023

Linking this issue to #5712 - which seems to capture the empty row issue well.

Let's leave this issue to focus on the column projection optimization.

@karthikeyann
Contributor

Do PRs #14996 and #17029 fix this issue?

@GregoryKimball
Contributor

Closed by #14996
