
[FEA] JSON reader: support multi-lines #10267

Open
wbo4958 opened this issue Feb 10, 2022 · 13 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Contributor

wbo4958 commented Feb 10, 2022

This is part of the FEA in NVIDIA/spark-rapids#9.
We have a JSON file:

{"name":
   "Reynold Xin"}

Spark can parse it when multiLine is enabled.

cuDF parsing will throw an exception.

We expect a multiLine configuration option to control this behavior.
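As a rough illustration of the difference (a sketch in plain Python with the stdlib json module, not cuDF or Spark), per-line parsing fails on this file, while parsing the whole input as one document succeeds:

```python
import json

text = '{"name":\n   "Reynold Xin"}'

def parse_as_json_lines(s):
    # One record per line, as a JSON Lines reader would see it.
    rows = []
    for line in s.splitlines():
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            rows.append(None)  # corrupt record
    return rows

def parse_as_multiline(s):
    # Treat the whole input as a single JSON document.
    return [json.loads(s)]

print(parse_as_json_lines(text))  # [None, None]
print(parse_as_multiline(text))   # [{'name': 'Reynold Xin'}]
```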

@wbo4958 wbo4958 added feature request New feature or request Needs Triage Need team to review and classify labels Feb 10, 2022
@revans2 revans2 added the Spark Functionality that helps Spark RAPIDS label Feb 10, 2022
@revans2
Contributor

revans2 commented Feb 10, 2022

This is primarily to document what Spark supports. I don't see this being a high priority at any point in the future. This is because Spark cannot split files with this type of processing, and it would be very difficult for us to do this in an efficient way.

@galipremsagar galipremsagar added the cuIO cuIO issue label Feb 11, 2022
@vuule
Contributor

vuule commented Feb 15, 2022

Is this expected to work with JSON Lines format?

@revans2
Contributor

revans2 commented Feb 16, 2022

By definition it is not the same. https://spark.apache.org/docs/latest/sql-data-sources-json.html explains some of this, but not very well. Even the example file that they point to is not in a multi-line format.

https://github.com/apache/spark/blob/master/python/test_support/sql/people_array.json is a better example of a multi-line format. If we do want to support this, which I have my doubts is worth our time, I can get more details about this.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor

revans2 commented May 16, 2022

Moved this to low priority to match what we do with CSV.

@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jun 28, 2022
@elstehle
Contributor

https://github.com/apache/spark/blob/master/python/test_support/sql/people_array.json is a better example of a multi-line format. If we do want to support this, which I have my doubts is worth our time, I can get more details about this.

Thanks for elaborating, @revans2! We plan to have support for multiline by default in the new nested parser.

In the case of lines=true (parsing ndjson), we plan to support line breaks within nested structs, treating those newlines as whitespace. We will only treat newlines as record delimiters at the root level. The following will parse to three rows of two columns:

{"a":"col0_row0", "b":"col1_row0"}
{"a":"col0_row1", 
"b":"col1_row1"}
{"a":"col0_row2", "b":"col1_row2"}
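The plan described above (newlines inside nested structs treated as whitespace, newlines acting as record delimiters only at the root level) could be sketched roughly like this; the splitter below is an illustrative assumption, not the actual cuDF implementation:

```python
import json

def split_records_at_root(text):
    """Split ndjson-like input on newlines only at nesting depth 0, so a
    newline inside an object or array is just whitespace."""
    records, buf, depth, in_string, escaped = [], [], 0, False, False
    for ch in text:
        if in_string:
            buf.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            buf.append(ch)
        elif ch in "{[":
            depth += 1
            buf.append(ch)
        elif ch in "}]":
            depth -= 1
            buf.append(ch)
        elif ch == "\n" and depth == 0:
            if "".join(buf).strip():
                records.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    if "".join(buf).strip():
        records.append("".join(buf))
    return [json.loads(r) for r in records]

data = (
    '{"a":"col0_row0", "b":"col1_row0"}\n'
    '{"a":"col0_row1", \n'
    '"b":"col1_row1"}\n'
    '{"a":"col0_row2", "b":"col1_row2"}'
)
# Three rows, despite the line break inside the second record.
print(split_records_at_root(data))
```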

For lines=false (parsing regular JSON, which expects a single JSON value at the root of the document): outside of quotes, newlines are generally treated as whitespace and ignored. This gives the same data as above, but the enclosing brackets are required [...]

[
{"a":"col0_row0", "b":"col1_row0"}
{"a":"col0_row1", 
"b":"col1_row1"}
{"a":"col0_row2", "b":"col1_row2"}
]

What is still unclear to me in case of Spark is whether multiline would also influence whether Spark should expect ndjson or regular JSON.

I've seen the following Spark example (but am not sure about the options that were used while parsing):

[{"a": 1}, {"b": 2}, {"c": 3}]

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+

This would also correspond to the new nested parser for lines=false (i.e., regular JSON).

What would Spark output for:

[{"a": 1}, {"b": 2}, {"c": 3}]
[{"a": 21}, {"b": 22}, {"c": 23}]

@revans2
Contributor

revans2 commented Aug 17, 2022

A lot of this depends on the schema passed into Spark, along with the schema that Spark picks if you don't provide one.

For

{"a":"col0_row0", "b":"col1_row0"}
{"a":"col0_row1", 
"b":"col1_row1"}
{"a":"col0_row2", "b":"col1_row2"}

If multi-line is not enabled, any line on its own that is not a valid JSON value results in an error. If Spark is asked to generate the schema it is going to use, and it sees an error like this, it will insert a new column. The name of the column is configurable, but by default it is "_corrupt_record", so the schema it picks here is:

root                                                                            
 |-- _corrupt_record: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

and the data is

+------------------+---------+---------+
|   _corrupt_record|        a|        b|
+------------------+---------+---------+
|              null|col0_row0|col1_row0|
|{"a":"col0_row1", |     null|     null|
|  "b":"col1_row1"}|     null|     null|
|              null|col0_row2|col1_row2|
+------------------+---------+---------+
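The per-line behaviour described here could be sketched roughly as follows (an illustration of the semantics, not Spark's implementation; the helper name read_json_lines is made up):

```python
import json

def read_json_lines(lines, column_order):
    """One record per input line, with a _corrupt_record column holding any
    line that fails to parse as a JSON object."""
    rows = []
    for line in lines:
        try:
            obj = json.loads(line)
            if not isinstance(obj, dict):
                raise ValueError("not an object")
            rows.append([None] + [obj.get(c) for c in column_order])
        except (json.JSONDecodeError, ValueError):
            rows.append([line] + [None] * len(column_order))
    return rows  # columns: _corrupt_record, then column_order

lines = [
    '{"a":"col0_row0", "b":"col1_row0"}',
    '{"a":"col0_row1", ',
    '"b":"col1_row1"}',
    '{"a":"col0_row2", "b":"col1_row2"}',
]
table = read_json_lines(lines, ["a", "b"])
# Rows 1 and 2 land in _corrupt_record; rows 0 and 3 parse normally.
```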

If we see a schema with this _corrupt_record in it, we fall back to the CPU for now. But if someone gives us a schema like a STRING, b STRING, we end up with the same number of rows: 4.

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
|     null|     null|
|     null|     null|
|col0_row2|col1_row2|
+---------+---------+

If we enable multiline here, only the first full JSON item from the file is parsed, and it does not see any errors.

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
+---------+---------+

If we switch over to

[
{"a":"col0_row0", "b":"col1_row0"},
{"a":"col0_row1", 
"b":"col1_row1"},
{"a":"col0_row2", "b":"col1_row2"},
]

It behaves similarly, in that it sees ] and [ as corrupt records too. Note that I added commas after each JSON entry, or else it would not be a valid JSON file; without the commas, Spark sees this and throws an exception when doing schema discovery.

+------------------+---------+---------+                                        
|   _corrupt_record|        a|        b|
+------------------+---------+---------+
|                 [|     null|     null|
|              null|col0_row0|col1_row0|
|{"a":"col0_row1", |     null|     null|
| "b":"col1_row1"},|     null|     null|
|              null|col0_row2|col1_row2|
|                 ]|     null|     null|
+------------------+---------+---------+

If we enable multi-line, then we get back what you expect.

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
|col0_row1|col1_row1|
|col0_row2|col1_row2|
+---------+---------+

For the data set

[{"a": 1}, {"b": 2}, {"c": 3}]
[{"a": 21}, {"b": 22}, {"c": 23}]

If multiline is disabled I get back

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
|  21|null|null|
|null|  22|null|
|null|null|  23|
+----+----+----+

But if it is enabled it will only parse the first line of data.

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+

Spark is looking at the top level item for each entry. If the top level is an array, then it will treat each item in the array as a separate row.
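That top-level rule could be sketched as follows (an assumption based on this description, not Spark's code; to_rows and to_table are made-up helpers):

```python
import json

def to_rows(value):
    # Spark-like top-level handling: a top-level array yields one row per
    # element; anything else yields a single row.
    return list(value) if isinstance(value, list) else [value]

def to_table(rows):
    # Union of keys (in first-seen order) -> columns; missing keys -> None.
    cols = []
    for r in rows:
        for k in r:
            if k not in cols:
                cols.append(k)
    return cols, [[r.get(c) for c in cols] for r in rows]

ndjson = '[{"a": 1}, {"b": 2}, {"c": 3}]\n[{"a": 21}, {"b": 22}, {"c": 23}]'
rows = [r for line in ndjson.splitlines() for r in to_rows(json.loads(line))]
cols, table = to_table(rows)
# Six rows across columns a, b, c, matching the multiline-disabled output.
```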

@elstehle
Contributor

elstehle commented Aug 18, 2022

Thanks so much for putting together these examples, @revans2!

I'm inferring from these examples that multiline=true means parsing regular JSON and multiline=false means parsing ndjson.

While the data parsed for multiline=true seems reasonable to me, I cannot really make sense of all multiline=false examples.

For

{"a":"col0_row0", "b":"col1_row0"}
{"a":"col0_row1", 
"b":"col1_row1"}
{"a":"col0_row2", "b":"col1_row2"}

If multi-line is not enabled, any line on its own that is not a valid JSON value results in an error.

root                                                                            
 |-- _corrupt_record: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

and the data is

+------------------+---------+---------+
|   _corrupt_record|        a|        b|
+------------------+---------+---------+
|              null|col0_row0|col1_row0|
|{"a":"col0_row1", |     null|     null|
|  "b":"col1_row1"}|     null|     null|
|              null|col0_row2|col1_row2|
+------------------+---------+---------+

So far, seems reasonable. You try to parse one value per row. If you fail you put it into the corrupt column.

If we see a schema with this _corrupt_record in it we fall back to the CPU for now. But if someone gives us a schema like a STRING, b STRING we end up with the same number of rows 4.

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
|     null|     null|
|     null|     null|
|col0_row2|col1_row2|
+---------+---------+

Still reasonable. I would infer that, if parsing of a line runs into an error at some point, that row becomes null(?).

If we enable multiline here, only the first full JSON item from the file is parsed, and it does not see any errors.

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
+---------+---------+

Data seems fine. Debatable whether you would want to emit a warning that the overall format isn't valid anymore, since you've encountered more than a single top-level item, instead of just silently ignoring all items that follow.

If we switch over to

[
{"a":"col0_row0", "b":"col1_row0"},
{"a":"col0_row1", 
"b":"col1_row1"},
{"a":"col0_row2", "b":"col1_row2"},
]

It behaves similarly, in that it sees ] and [ as corrupt records too. Note that I added commas after each JSON entry, or else it would not be a valid JSON file; without the commas, Spark sees this and throws an exception when doing schema discovery.

+------------------+---------+---------+                                        
|   _corrupt_record|        a|        b|
+------------------+---------+---------+
|                 [|     null|     null|
|              null|col0_row0|col1_row0|
|{"a":"col0_row1", |     null|     null|
| "b":"col1_row1"},|     null|     null|
|              null|col0_row2|col1_row2|
|                 ]|     null|     null|
+------------------+---------+---------+

Makes sense. We'll run into an error parsing lines [0, 2, 3, 5]. We put those lines as string values into the _corrupt_record. But would the commas really be needed for multiline=false?

If we enable multi-line, then we get back what you expect.

+---------+---------+
|        a|        b|
+---------+---------+
|col0_row0|col1_row0|
|col0_row1|col1_row1|
|col0_row2|col1_row2|
+---------+---------+

Makes sense. Regular JSON, single top-level LIST item.

For the data set

[{"a": 1}, {"b": 2}, {"c": 3}]
[{"a": 21}, {"b": 22}, {"c": 23}]

If multiline is disabled I get back

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
|  21|null|null|
|null|  22|null|
|null|null|  23|
+----+----+----+

This is where things get funky for me. I would expect that each JSON line becomes a row. Hence, single column where each row is a list. The list being a list-of-{a:int,b:int,c:int}. This may, however, also relate to the question of how Spark would distribute deeper nesting amongst columns and rows in a table.

But if it is enabled it will only parse the first line of data.

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+

This makes sense. Becoming a fan of multiline=true 🙂. It parses a regular JSON, it only respects the very first value it finds in the JSON input which is [{"a": 1}, {"b": 2}, {"c": 3}]. Each object becomes a row. Each field becomes a column.

@revans2
Contributor

revans2 commented Aug 18, 2022

Data seems fine. Debatable whether you would want to emit a warning that the overall format isn't valid anymore, since you've encountered more than a single top-level item, instead of just silently ignoring all items that follow.

I agree. It would be good for Spark to output a warning. I would prefer for Spark to output a warning for any garbage data it finds at the end of a record after parsing valid JSON, but it does not do that. That is why the comma at the end of the lines did not make a difference. In Spark, all data after the first valid JSON item per record is ignored. Not put into corrupt anything. It is just ignored. In multi-line the record is the entire file. In ndjson (multiline=false), each line is a separate record. At least that is how I think about it.
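The "ignore everything after the first valid JSON item per record" behaviour can be approximated with Python's stdlib json.JSONDecoder.raw_decode, which parses one value and reports where it stopped (a sketch, not Spark itself):

```python
import json

def parse_first_item(text):
    # Parse only the first JSON value; anything after it is ignored,
    # mimicking the per-record behaviour described above.
    stripped = text.lstrip()
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:].strip()

value, trailing = parse_first_item('{"a": 1} this trailing text is ignored')
# value is the parsed object; trailing holds the ignored tail
```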

This is where things get funky for me. I would expect that each JSON line becomes a row. Hence, single column where each row is a list. The list being a list-of-{a:int,b:int,c:int}. This may, however, also relate to the question of how Spark would distribute deeper nesting amongst columns and rows in a table.

Spark parses the multi-line and the ndjson records almost identically. The big difference is in how the records are split up. Spark decided that a top-level array means a list of records, so it does that in all cases. If I have an ndjson file like:

["a", "b", "c"]
["x", "y", "z"]

Spark will not be able to get any data out of it. It sees them as corrupt lines. If I give it a schema to try and force Spark to parse something out of it, it sees them as invalid. What is more, it does not seem to see them as separate records, which is odd to me.

spark.read.schema("a Array<STRING>").json("./test.json").show()
+----+
|   a|
+----+
|null|
|null|
+----+

@elstehle
Contributor

Thanks for the additional details!

My current understanding of Spark's parsing behaviour is this:

  1. A JSON object ({...}) maps to a record.
  2. In order to parse a record, Spark parses the document until it encounters the first JSON object ({...}); it simply ignores other structures (i.e., enclosing lists) along the path to the first JSON object that will be mapped to the record. After that JSON object, it tries to continue parsing the next record.
  3. If multiline is enabled, it parses just the first item in the JSON. Within that first item, it probably follows the logic from (2).

That's how I could make sense of these examples:

[{"a": 1}, {"b": 2}, {"c": 3}]
[{"a": 21}, {"b": 22}, {"c": 23}]
multiline disabled
+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
|  21|null|null|
|null|  22|null|
|null|null|  23|
+----+----+----+

multiline enabled

+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+

If that's right, then the question would be how wild this can get, e.g.:

[{"a":1.1},
[{"b":2.2}]]

@revans2
Contributor

revans2 commented Aug 26, 2022

If I enable multiline, then it sees the entire thing as corrupt. I think this is because the second item in the top level list is another list, not an object. If multiline is disabled then just the first line is corrupt and the second line can be parsed.

+---------------+----+
|_corrupt_record|   b|
+---------------+----+
|    [{"a":1.1},|null|
|           null| 2.2|
+---------------+----+

I am not sure that we have to make it match perfectly all of the time in all error cases. It really would be nice if we could do that, but I am much more concerned about making it work in the positive use cases.

@elstehle
Contributor

If I enable multiline, then it sees the entire thing as corrupt. I think this is because the second item in the top level list is another list, not an object. If multiline is disabled then just the first line is corrupt and the second line can be parsed.

+---------------+----+
|_corrupt_record|   b|
+---------------+----+
|    [{"a":1.1},|null|
|           null| 2.2|
+---------------+----+

I am not sure that we have to make it match perfectly all of the time in all error cases. It really would be nice if we could do that, but I am much more concerned about making it work in the positive use cases.

Thanks, Bobby! Agreed. Let's focus on getting the correct cases right, for now.

After this example, I'm giving up on trying to develop an idea of the underlying logic for not-well-formatted inputs. After all, in the case of multiline=False, the two lines begin identically, yet the first row fails and the second one succeeds. 🤷

@elstehle
Contributor

elstehle commented Aug 28, 2022

Btw., I suppose that #11574 will make big leaps towards meeting this feature request in the experimental parser.

Specifically,

  • lines=True will correspond to multiline=False.
  • lines=False will correspond to multiline=True.

What may remain to be addressed are the corner cases raised by the fuzzy behaviour we are seeing from Spark's JSON parser, mostly related to invalid JSON.

  1. In particular for multiline=True, silently ignoring JSON items other than the first JSON item encountered in the input (which I'm not a fan of). Our current behaviour in the experimental parser is to fail after the first JSON item instead of ignoring any subsequent input:

Spark ignores any text content after the end of the JSON record, so we would need to be able to support that

  2. The fuzzy logic (🤷) that will parse records when they are not at the root of the line:
[{"a": 1}, {"b": 2}, {"c": 3}]
[{"a": 21}, {"b": 22}, {"c": 23}]
multiline disabled
+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
|  21|null|null|
|null|  22|null|
|null|null|  23|
+----+----+----+

Can you comment on how relevant these corner cases are for you?

@GregoryKimball GregoryKimball added the 0 - Backlog In queue waiting for assignment label Oct 26, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023