[FEA] JSON reader: support multi-lines #10267
This is primarily to document what Spark supports. I don't see this being a high priority at any point in the future. This is because Spark cannot split files with this type of processing, and it would make it very difficult for us to be able to do this in an efficient way.
Is this expected to work with JSON Lines format?
By definition it is not the same. https://spark.apache.org/docs/latest/sql-data-sources-json.html explains some of this, but not very well. Even the example file that they point to is not in a multi-line format. https://github.com/apache/spark/blob/master/python/test_support/sql/people_array.json is a better example of a multi-line format. If we do want to support this, which I have my doubts is worth our time, I can get more details about this.
This issue has been labeled
Moved this to low priority to match what we do with CSV, where this is also low priority.
Thanks for elaborating, @revans2! We plan to have support for in case of
For
What is still unclear to me in case of Spark is whether I've seen the following Spark example (but am not sure about the options that were used while parsing):
This would also correspond to the new nested parser for What would Spark output for:
A lot of this depends on the schema passed into Spark, along with the schema that Spark picks if you don't provide one. For
If multi-line is not enabled, any line on its own that is not a valid JSON statement results in an error. If Spark is asked to generate the schema it is going to use, and it sees an error like this, it will insert a new column. The name of the column is configurable, but by default it is "_corrupt_record", so the schema it picks here is
and the data is
If we see a schema with this _corrupt_record in it, we fall back to the CPU for now. But if someone gives us a schema like
If we enable multiline here, only the first full JSON item from the file is parsed, and it does not see any errors.
If we switch over to
It behaves similarly, in that it sees
If we enable multi-line, then we get back what you expect.
For the data set
If multiline is disabled I get back
But if it is enabled it will only parse the first line of data.
Spark is looking at the top-level item for each entry. If the top level is an array, then it will treat each item in the array as a separate row.
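The semantics described above can be sketched in a few lines of plain Python. This is only a toy model of the behaviour as I understand it from this thread — the column name _corrupt_record is Spark's default, but the function names and dict-per-row output are invented for illustration: the record is the whole file when multi-line is enabled, one physical line otherwise; a top-level array yields one row per element; and anything after the first valid JSON value in a record is ignored.

```python
import json

CORRUPT_COL = "_corrupt_record"  # Spark's default corrupt-column name
_decoder = json.JSONDecoder()

def parse_record(text):
    """Parse one record into a list of rows."""
    try:
        # raw_decode stops after the first valid JSON value;
        # any trailing content in the record is silently ignored.
        value, _ = _decoder.raw_decode(text.strip())
    except ValueError:
        # The whole record lands in the corrupt-record column.
        return [{CORRUPT_COL: text}]
    # A top-level array means one row per element.
    return value if isinstance(value, list) else [value]

def read_json(data, multi_line=False):
    """Toy reader: multi_line=True treats the whole file as one record;
    multi_line=False (ndjson) treats each line as a separate record."""
    if multi_line:
        return parse_record(data)
    rows = []
    for line in data.splitlines():
        if line.strip():  # skip blank lines
            rows.extend(parse_record(line))
    return rows
```

For example, `read_json('{"a": 1}\n{"a": 1,', multi_line=False)` yields one good row and one corrupt row, while `read_json('[{"a": 1}, {"a": 2}]', multi_line=True)` yields two rows.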
Thanks so much for putting together these examples, @revans2! I'm inferring from them that While the data parsed for
So far, seems reasonable. You try to parse one value per row. If you fail you put it into the
Still reasonable. I would infer that, if parsing of a line runs into an error at some point, that row will become null(?).
Data seems fine. Debatable whether you would want to emit a warning that the overall format isn't valid anymore, since you've encountered more than a single top-level item, instead of just silently ignoring all items that follow.
Makes sense. We'll run into an error parsing lines
Makes sense. Regular JSON, single top-level
This is where things get funky for me. I would expect that each JSON line becomes a row. Hence, single column where each row is a
This makes sense. Becoming a fan of
I agree. It would be good for Spark to output a warning. I would prefer for Spark to output a warning for any garbage data it finds at the end of a record after parsing valid JSON, but it does not do that. That is why the comma at the end of the lines did not make a difference. In Spark, all data after the first valid JSON item per record is ignored. It is not put into the corrupt column or anything; it is just ignored. In multi-line mode, the record is the entire file. In ndjson (multiline=false), each line is a separate record. At least that is how I think about it.
Spark parses the multi-line and the ndjson records almost identically. The big difference is in how the records are split up. Spark decided that a top-level array means a list of records, so it does that in all cases. If I have an ndjson file like
Spark will not be able to get any data out of it. It sees them as corrupt lines. If I give it a schema to try and force Spark to parse something out of it, it sees them as invalid. What is more, it does not see them as separate records, which is odd to me.
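The "trailing data is ignored" rule, and why a trailing comma made no difference, can be reproduced with Python's incremental JSON decoder. This is a standalone illustration of the behaviour described above, not Spark code:

```python
import json

decoder = json.JSONDecoder()

# A record with extra content after the first valid JSON value:
record = '{"a": 1},'
value, end = decoder.raw_decode(record)

# Parsing stops after the first complete value; everything past
# `end` (here, the trailing comma) is simply ignored.
print(value)         # {'a': 1}
print(record[end:])  # ','
```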
Thanks for the additional details! My current understanding of Spark's parsing behaviour is this:
That's how I could make sense of these examples:
If that's right, then the question would be, how wild this can be. E.g.:
If I enable multiline, then it sees the entire thing as corrupt. I think this is because the second item in the top level list is another list, not an object. If multiline is disabled then just the first line is corrupt and the second line can be parsed.
I am not sure that we have to make it match perfectly all of the time in all error cases. It really would be nice if we could do that, but I am much more concerned about making it work in the positive use cases.
Thanks, Bobby! Agreed. Let's focus on getting the correct cases right for now. After this example, I'm giving up on trying to develop an idea about the underlying logic for not-well-formatted inputs. After all, in case of
Btw., I suppose that #11574 will make big leaps towards meeting this feature request in the Specifically,
What may remain to be addressed are the corner cases raised by the fuzzy behaviour we are seeing from Spark's JSON parser. Mostly related to invalid JSON.
Can you comment on how relevant these corner cases are for you?
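For comparison, pandas exposes essentially this switch through read_json(..., lines=...): lines=True reads JSON Lines (one record per line), while the default parses the whole buffer as a single document, turning a top-level array into one row per element. This is only an analogy for the requested multiLine behaviour, not cuDF's or Spark's API:

```python
import io
import pandas as pd

multi_line_doc = '[{"a": 1}, {"a": 2}]'  # one document spanning the whole buffer
json_lines_doc = '{"a": 1}\n{"a": 2}'    # one JSON value per line (ndjson)

# Default: whole-buffer parse; the top-level array becomes one row per element.
df_multi = pd.read_json(io.StringIO(multi_line_doc))

# lines=True: each physical line is an independent record.
df_lines = pd.read_json(io.StringIO(json_lines_doc), lines=True)

# Both paths produce the same two-row frame.
assert df_multi.equals(df_lines)
```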
This is part of FEA of NVIDIA/spark-rapids#9
We have a JSON file
Spark can parse it when enabling multiLine, but cuDF parsing will throw an exception. We expect a configuration option, multiLine, to control this behavior.