
[FEA] Switch to nested JSON reader #7518

Closed
revans2 opened this issue Jan 17, 2023 · 4 comments · Fixed by rapidsai/cudf#12732
Labels
task Work required that improves the product but is not user facing

Comments

revans2 commented Jan 17, 2023

Is your feature request related to a problem? Please describe.
rapidsai/cudf#12544 switched the default JSON parser to the nested parser, but left our java APIs on the "legacy" parser. In my testing all of the plugin integration tests pass with the new parser, but some of the CUDF unit tests don't. The unit tests are technically invalid JSON and are for a format that we don't care about from a spark perspective. They just worked and it was simple to use that format for tests. We should update the CUDF code to use the nested/default parser and the tests so that they pass with valid JSON data. We should also update the documentation for JSON support to indicate any of the differences between the new and old parser, like crashing on invalid data instead of returning nulls.

Added bonus points if we can easily add in simple nested support. If it does not just work out of the box then this should be a follow on issue after we switch over to the new parser.
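
The null-versus-crash difference mentioned above can be illustrated generically. This is a minimal Python sketch of the two behaviors using only the stdlib json module, not the cudf API:

```python
import json

lines = ['{"a": 1}', 'not valid json', '{"a": 3}']

def parse_lenient(line):
    # Legacy-style behavior: an invalid line becomes a null row
    # instead of failing the whole read.
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

def parse_strict(lines):
    # New-parser-style behavior: the first invalid line raises an error.
    return [json.loads(line) for line in lines]

lenient = [parse_lenient(line) for line in lines]
```

With this input, the lenient pass yields `[{'a': 1}, None, {'a': 3}]`, while the strict pass raises `JSONDecodeError` on the second line.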

@revans2 revans2 added ? - Needs Triage Need team to review and classify task Work required that improves the product but is not user facing labels Jan 17, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 17, 2023
@firestarman

A related issue: #7616

res-life commented Feb 3, 2023

I tried the new parser, and all of the JSON JNI test cases passed except the following one:

  @Test
  void testReadJSONBufferInferred() {
    JSONOptions opts = JSONOptions.builder()
        .withDayFirst(true)
        .build();
    byte[] data = ("[false,A,1,2,05/03/2001]\n" +
        // Note: the original line had a stray quote,
        //   "[true,B,2,3,31/10/2010]'\n"
        // which was fixed to:
        "[true,B,2,3,31/10/2010]\n" +
        "[false,C,3,4,20/10/1994]\n" +
        "[true,D,4,5,18/10/1990]").getBytes(StandardCharsets.UTF_8);
    try (Table expected = new Table.TestBuilder()
        .column(false, true, false, true)
        .column("A", "B", "C", "D")
        .column(1L, 2L, 3L, 4L)
        .column(2L, 3L, 4L, 5L)
        .timestampMillisecondsColumn(983750400000L, 1288483200000L, 782611200000L, 656208000000L)
        .build();
         Table table = Table.readJSON(Schema.INFERRED, opts, data)) {
      assertTablesAreEqual(expected, table);
    }
  }

Error is:

java.lang.AssertionError: ColumnVectors can't be null or empty
	at ai.rapids.cudf.Table.<init>(Table.java:57)
	at ai.rapids.cudf.Table.readJSON(Table.java:1058)
	at ai.rapids.cudf.Table.readJSON(Table.java:1008)
	at ai.rapids.cudf.Table.readJSON(Table.java:971)

Schema.INFERRED is actually an empty schema (new Schema()); the new parser did not infer any columns for this input.
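
The failure makes sense for this input shape: each line is a top-level JSON array, so there are no field names anywhere for the reader to derive column names from. A minimal Python sketch of the problem (stdlib json only, not cudf):

```python
import json

# Each input line is a top-level JSON array rather than an object.
lines = ['[false, "A", 1]', '[true, "B", 2]']

rows = [json.loads(line) for line in lines]

# The parsed rows are plain lists: no keys exist, so column names
# like "0", "1", "2" would have to be invented by the reader.
invented_names = [str(i) for i in range(len(rows[0]))]
```

Here `rows[0]` is `[False, 'A', 1]` with no column names attached, which is exactly the situation schema inference cannot resolve from the data alone.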

Spark also does not handle this kind of input. Spark's behavior:

$ cat /tmp/b.json 
[1,2,3]
[1,2,3]

>>> spark.read.json("/tmp/b.json").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/pyspark/sql/dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: 
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
      

We could delete this test case; what do you think?
@revans2

revans2 commented Feb 3, 2023

The tests were there to verify that we have the JNI set up correctly. That is not valid JSON, so the test is technically bogus to begin with; the only thing it adds is checking that we can get cuDF to infer the types. I am fine with replacing the test with one that still does type inference but uses correct JSON. Perhaps:

    byte[] data = ("[false,\"A\",1,2,\"05/03/2001\"]\n" +
        "[true,\"B\",2,3,\"31/10/2010\"]\n" +
        "[false,\"C\",3,4,\"20/10/1994\"]\n" +
        "[true,\"D\",4,5,\"18/10/1990\"]").getBytes(StandardCharsets.UTF_8);

or even

    byte[] data = ("{\"a\": false, \"b\": \"A\", \"c\": 1, \"d\": 2, \"e\": \"05/03/2001\"}\n" ...

The only problem with the latter is that we don't know the order in which the columns will be returned, so the order might change and still be a valid result, yet the test would fail.

A top level JSON array is not supported by Spark, but that is separate and not completely relevant because the point of the API is to expose CUDF and not to match what Spark does.
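
Both candidate inputs above can be sanity-checked outside the JNI path. This hypothetical Python sketch (stdlib json only) verifies that every line parses as valid JSON, and shows why the record-style variant carries its own column names:

```python
import json

# Simplified stand-ins for the two candidate test inputs above.
array_style = ('[false,"A",1,2,"05/03/2001"]\n'
               '[true,"B",2,3,"31/10/2010"]')
record_style = ('{"a": false, "b": "A", "c": 1}\n'
                '{"a": true, "b": "B", "c": 2}')

# Every line in both candidates is valid JSON.
for line in (array_style + "\n" + record_style).splitlines():
    json.loads(line)

# Record-style lines name their own columns, so a comparison that
# sorts column names sidesteps the ordering concern raised above.
records = [json.loads(line) for line in record_style.splitlines()]
column_names = sorted(records[0].keys())
```

Comparing tables by sorted column name rather than position is one way to make a record-style test robust to the ordering issue.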

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Feb 13, 2023
JNI switches to nested JSON reader
closes NVIDIA/spark-rapids#7518

Note:
The new reader reads `05/03/2001` as a String, so I removed the timestamps in the test case `testReadJSONBufferInferred`.

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12732
@sameerz sameerz changed the title [FEA] Switch to nested JSON reader [FEA] Switch to nested JSON reader [skip ci] Feb 22, 2023
@sameerz sameerz changed the title [FEA] Switch to nested JSON reader [skip ci] [FEA] Switch to nested JSON reader Feb 22, 2023
@res-life

I'd like to close this issue since #7791 and rapidsai/cudf#12732 were merged.
@revans2

@revans2 revans2 closed this as completed Feb 24, 2023
@mattahrens mattahrens added the feature request New feature or request label Feb 24, 2023
@sameerz sameerz removed the feature request New feature or request label Apr 14, 2023