-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] JSON reader has no option to return the columns only for the requested schema #13473
Comments
This moves the logic to update the columns returned from the JSON reader to java. It also updated the code to be able to deal with requested columns that were not in the data. It is not perfect because it will not work if the input file had no columns at all in it. ``` {} {} ``` But it fixes issues for a file that has valid columns in it, but none of them are the columns that we requested. This is a work around for #13473, but is not perfect. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Jason Lowe (https://github.com/jlowe) - MithunR (https://github.com/mythrocks) URL: #13477
Thank you @revans2 for posting this. Would you please share if you exploring JSON file read or JSON strings column parsing when you encountered these issues? The two topics about column projection and empty rows seem like they could be different issues. Would you please let me know if there is a reason to combine them? |
@GregoryKimball this was when we were reading a file, but it is the same for both code paths in this case. Thanks for the info. I understand that this might not have a huge impact to the memory usage or computation time. The majority of the memory usage and computation would be going into tokenizing and parsing the data. But even then if I have a case like the following
If all I save is on not needing to materialize the output for A that is a win. May not be something that you want to prioritize super high for a performance standpoint. The big problem for us in a really odd corner case which we did run into in one of our integration tests, and I am nervous we will run into with a customer at some point. Spark when it writes out JSON data by default will remove entries that are null. It is a space savings optimization. So if we end up with a row that are all nulls we end up with an output row of
I don't think it is likely to happen for large runs of rows, but with Spark it is very possible to have a few rows at the end of a file that show up like this and because of splits they would end up in a single batch. CUDF is unable to parse that batch and had us back the number of rows. The best that we could do as a work around is to count the number of input rows before sending the data to CUDF and catch this exception after it happens. But I am not 100% sure that it will work in all cases, especially if CUDF support comments, because Spark strips out lines that are fully comments in files. Then we would not know how many rows there are without some help. Again probably not the highest priority, but it does end up being rather hacky to work around a lack of this type of feature. |
Linking this issue to #5712 - which seems to capture the empty row issue well. Let's leave this issue to focus on the column projection optimization. |
Closed by #14996 |
Describe the bug
Pandas and Spark are very different in what gets returned when reading a JSON file. In pandas you provide essentially type hints, but it will return data for all columns in the file and only for columns in the file.
Spark has a schema that they want and when given a schema they want to read only the columns that match the schema and nothing else.
This causes two problems for Spark. The first one is a performance issue. If we want to read only one column out of 100, CUDF is going to materialize 100 columns and then we are going to throw away 99 of them. It would be great if we didn't need the memory for all of the extra columns or the computation needed to create them.
The second problem is really an odd corner case. CUDF does not support a Table with just rows, but Spark does. So if there is a JSON file with no columns, but rows.{}{}We have no way to actually read that without some help.(see #5712)
It really would be nice to have a mode similar to how parquet, ORC, or CSV work where only the columns that are requested are returned and all of the requested columns are returned, even if the values are all nulls.
The text was updated successfully, but these errors were encountered: