
[BUG] JSON reader has no option to return the columns only for the requested schema #13473

Closed
revans2 opened this issue May 30, 2023 · 6 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented May 30, 2023

Describe the bug
Pandas and Spark are very different in what gets returned when reading a JSON file. In pandas you essentially provide type hints, but it returns data for all columns in the file, and only for the columns in the file.

Spark has a schema that it wants, and when given a schema it reads only the columns that match that schema and nothing else.

This causes two problems for Spark. The first is a performance issue: if we want to read only one column out of 100, CUDF is going to materialize all 100 columns and then we throw away 99 of them. It would be great if we didn't need the memory for all of the extra columns, or the computation needed to create them.
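The pandas behavior described above can be sketched like this (a minimal illustration with made-up data; the issue itself is about libcudf's C++ reader, not pandas):

```python
import io
import pandas as pd

# Hypothetical JSON Lines input containing three columns.
data = io.StringIO('{"a": 1, "b": "x", "c": true}\n{"a": 2, "b": "y", "c": false}\n')

# dtype is only a type hint for "a"; pandas still materializes
# every column it finds in the file.
df = pd.read_json(data, lines=True, dtype={"a": "int64"})
print(list(df.columns))  # all three columns come back, not just "a"
```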

The second problem is really an odd corner case. CUDF does not support a table with rows but no columns, whereas Spark does. So if there is a JSON file with no columns, but rows:

```
{}
{}
```

we have no way to actually read that without some help (see #5712).

It really would be nice to have a mode similar to how the Parquet, ORC, or CSV readers work, where only the requested columns are returned and all of the requested columns are returned, even if their values are all nulls.
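For comparison, here is the projection semantics the CSV path already offers, sketched with pandas `usecols` on made-up data (note that, unlike the behavior requested here, `usecols` raises on a missing column rather than returning it as all nulls):

```python
import io
import pandas as pd

csv = io.StringIO("a,b,c\n1,x,true\n2,y,false\n")

# Only the requested column is returned; "b" and "c" are skipped.
df = pd.read_csv(csv, usecols=["a"])
print(list(df.columns))
```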

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels May 30, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 1, 2023
This moves the logic that updates the columns returned from the JSON reader into Java. It also updates the code to handle requested columns that are not in the data. It is not perfect, because it will not work if the input file has no columns at all in it.

```
{}
{}
```

But it fixes the issue for a file that has valid columns in it, none of which are the columns that we requested.

This is a workaround for #13473, but it is not perfect.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - MithunR (https://github.com/mythrocks)

URL: #13477
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Jun 2, 2023
@GregoryKimball
Contributor

Thank you @revans2 for posting this. Would you please share whether you were exploring JSON file reading or JSON strings column parsing when you encountered these issues?

The two topics, column projection and empty rows, seem like they could be separate issues. Would you please let me know if there is a reason to combine them?

@karthikeyann
Contributor

The peak memory usage of the JSON reader will not be reduced if we add this feature. JSON tokenization, tree construction, and traversal will still be the same. Only datatype parsing will be eliminated for non-selected columns (this saves some time, but not much).

A quick analysis of the JSON reader benchmark:
In the attached screenshot of a benchmark run with 64 columns and a 1.28 GB file, datatype parsing takes ~21% of the time (green). Also note that get_token_stream and get_tree_representation account for the peak memory usage, not datatype parsing (in green).
[screenshot: benchmark profile of the JSON reader]
Speedup could be <20%.
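Treating that ~21% as the eliminable fraction, an Amdahl-style bound on the speedup can be sketched like this (the 0.21 figure comes from the benchmark above; assuming, optimistically, that projection eliminates all datatype parsing):

```python
# Amdahl-style upper bound: if datatype parsing is ~21% of runtime
# and projection could eliminate all of it, the best case is bounded.
parse_fraction = 0.21  # from the benchmark run described above
speedup = 1.0 / (1.0 - parse_fraction)
print(round(speedup, 2))  # well under 1.3x, i.e. <~21% time savings
```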

@revans2
Contributor Author

revans2 commented Jun 5, 2023

@GregoryKimball this was when we were reading a file, but it is the same for both code paths in this case.

@karthikeyann

Thanks for the info. I understand that this might not have a huge impact on memory usage or computation time. The majority of the memory usage and computation goes into tokenizing and parsing the data. But even then, if I have a case like the following:

```
{"B": true, "A": [100, 200, 300, 400, 500, ... hundreds of values ...]}
... thousands of rows ...
{"B": false, "A": [1, 2, 3, 4, 5, ...]}
```

If all I save is not having to materialize the output for A, that is a win. It may not be something that you want to prioritize highly from a performance standpoint.
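The saving described above can be sketched in plain Python with hypothetical data: drop unrequested keys as each line is parsed, so the large "A" arrays are never retained as output columns.

```python
import json

# Hypothetical rows where "A" carries large arrays we do not need.
lines = [
    '{"B": true, "A": [100, 200, 300, 400, 500]}',
    '{"B": false, "A": [1, 2, 3, 4, 5]}',
]

requested = {"B"}  # the Spark-style schema: only "B" is wanted

# Keep only requested keys per row; the "A" values are discarded
# immediately instead of being materialized into output columns.
rows = [{k: v for k, v in json.loads(line).items() if k in requested}
        for line in lines]
print(rows)
```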

The big problem for us is a really odd corner case, which we did run into in one of our integration tests, and which I am nervous we will run into with a customer at some point. By default, when Spark writes out JSON data it removes entries that are null. It is a space-saving optimization, and not that uncommon among JSON optimizations. So if we end up with a row whose values are all null, the output row is {}. The problem shows up when all of the rows in a batch look this way:

```
{}
{}
{}
```

I don't think this is likely to happen for large runs of rows, but with Spark it is very possible to have a few rows at the end of a file that look like this, and because of splits they would end up in a single batch. CUDF is unable to parse that batch and hand us back the number of rows. The best workaround we could come up with is to count the number of input rows before sending the data to CUDF and catch the resulting exception after it happens. But I am not 100% sure that it will work in all cases, especially if CUDF supports comments, because Spark strips out lines that are entirely comments. Then we would not know how many rows there are without some help.
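The row-counting workaround can be sketched like this (a host-side illustration on made-up input; the real workaround counts rows before handing the batch to CUDF):

```python
import json

# A batch where every row serialized to "{}" because all values were null.
batch = "{}\n{}\n{}\n"

# Count input rows up front, since the reader cannot report a
# row count for a batch that yields zero columns.
lines = [ln for ln in batch.splitlines() if ln.strip()]
row_count = len(lines)

# Confirm the batch is all empty objects (the failing corner case).
all_empty = all(len(json.loads(ln)) == 0 for ln in lines)
print(row_count, all_empty)
```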

Again, probably not the highest priority, but working around the lack of this type of feature does end up being rather hacky.

@GregoryKimball
Contributor

GregoryKimball commented Jun 5, 2023

Linking this issue to #5712 - which seems to capture the empty row issue well.

Let's leave this issue to focus on the column projection optimization.

@karthikeyann
Contributor

Do PRs #14996 and #17029 fix this issue?

@GregoryKimball
Contributor

Closed by #14996
