
[FEA] Switch to nested JSON reader #7518

Closed
revans2 opened this issue Jan 17, 2023 · 4 comments · Fixed by rapidsai/cudf#12732
Labels
task Work required that improves the product but is not user facing

Comments

revans2 commented Jan 17, 2023

Is your feature request related to a problem? Please describe.
rapidsai/cudf#12544 switched the default JSON parser to the nested parser, but left our java APIs on the "legacy" parser. In my testing all of the plugin integration tests pass with the new parser, but some of the CUDF unit tests don't. The unit tests are technically invalid JSON and are for a format that we don't care about from a spark perspective. They just worked and it was simple to use that format for tests. We should update the CUDF code to use the nested/default parser and the tests so that they pass with valid JSON data. We should also update the documentation for JSON support to indicate any of the differences between the new and old parser, like crashing on invalid data instead of returning nulls.

Added bonus points if we can easily add in simple nested support. If it does not just work out of the box then this should be a follow on issue after we switch over to the new parser.
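
The null-versus-crash difference mentioned above can be illustrated generically. This is a minimal Python sketch of the two behaviors using only the stdlib json module, not the cudf API:

```python
import json

lines = ['{"a": 1}', 'not valid json', '{"a": 3}']

def parse_lenient(line):
    # Legacy-style behavior: an invalid line becomes a null row
    # instead of failing the whole read.
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

def parse_strict(lines):
    # New-parser-style behavior: the first invalid line raises an error.
    return [json.loads(line) for line in lines]

lenient = [parse_lenient(line) for line in lines]
```

With this input, the lenient pass yields `[{'a': 1}, None, {'a': 3}]`, while the strict pass raises `JSONDecodeError` on the second line.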

@revans2 revans2 added ? - Needs Triage Need team to review and classify task Work required that improves the product but is not user facing labels Jan 17, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 17, 2023
@firestarman

A related issue: #7616

res-life commented Feb 3, 2023

I tried the new parser, and all of the JSON JNI test cases passed except the following one:

  @Test
  void testReadJSONBufferInferred() {
    JSONOptions opts = JSONOptions.builder()
        .withDayFirst(true)
        .build();
    byte[] data = ("[false,A,1,2,05/03/2001]\n" +
        // Note: the original line had a stray quote,
        //   "[true,B,2,3,31/10/2010]'\n"
        // which was fixed to:
        "[true,B,2,3,31/10/2010]\n" +
        "[false,C,3,4,20/10/1994]\n" +
        "[true,D,4,5,18/10/1990]").getBytes(StandardCharsets.UTF_8);
    try (Table expected = new Table.TestBuilder()
        .column(false, true, false, true)
        .column("A", "B", "C", "D")
        .column(1L, 2L, 3L, 4L)
        .column(2L, 3L, 4L, 5L)
        .timestampMillisecondsColumn(983750400000L, 1288483200000L, 782611200000L, 656208000000L)
        .build();
         Table table = Table.readJSON(Schema.INFERRED, opts, data)) {
      assertTablesAreEqual(expected, table);
    }
  }

Error is:

java.lang.AssertionError: ColumnVectors can't be null or empty
	at ai.rapids.cudf.Table.<init>(Table.java:57)
	at ai.rapids.cudf.Table.readJSON(Table.java:1058)
	at ai.rapids.cudf.Table.readJSON(Table.java:1008)
	at ai.rapids.cudf.Table.readJSON(Table.java:971)

Schema.INFERRED is actually an empty schema (new Schema()); the new parser did not infer any columns for this input.
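
The failure makes sense for this input shape: each line is a top-level JSON array, so there are no field names anywhere for the reader to derive column names from. A minimal Python sketch of the problem (stdlib json only, not cudf):

```python
import json

# Each input line is a top-level JSON array rather than an object.
lines = ['[false, "A", 1]', '[true, "B", 2]']

rows = [json.loads(line) for line in lines]

# The parsed rows are plain lists: no keys exist, so column names
# like "0", "1", "2" would have to be invented by the reader.
invented_names = [str(i) for i in range(len(rows[0]))]
```

Here `rows[0]` is `[False, 'A', 1]` with no column names attached, which is exactly the situation schema inference cannot resolve from the data alone.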

Spark also does not handle this kind of input. Spark's behavior:

$ cat /tmp/b.json 
[1,2,3]
[1,2,3]

>>> spark.read.json("/tmp/b.json").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/pyspark/sql/dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/chongg/local-disk/progs/sparks/spark-home/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: 
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
      

We could delete this test case; what do you think?
@revans2

revans2 commented Feb 3, 2023

The tests were there to verify that we have the JNI set up correctly. That is not valid JSON, so the test is technically bogus to begin with; the only thing it adds is checking that we can get cuDF to infer the types. I am fine with replacing the test with one that still does type inference but uses correct JSON. Perhaps:

    byte[] data = ("[false,\"A\",1,2,\"05/03/2001\"]\n" +
        "[true,\"B\",2,3,\"31/10/2010\"]\n" +
        "[false,\"C\",3,4,\"20/10/1994\"]\n" +
        "[true,\"D\",4,5,\"18/10/1990\"]").getBytes(StandardCharsets.UTF_8);

or even

    byte[] data = ("{\"a\": false, \"b\": \"A\", \"c\": 1, \"d\": 2, \"e\": \"05/03/2001\"}\n" ...

The only problem with the latter is that we don't know the order in which the columns will be returned, so the order might change and still be a valid result, yet the test would fail.

A top level JSON array is not supported by Spark, but that is separate and not completely relevant because the point of the API is to expose CUDF and not to match what Spark does.
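
Both candidate inputs above can be sanity-checked outside the JNI path. This hypothetical Python sketch (stdlib json only) verifies that every line parses as valid JSON, and shows why the record-style variant carries its own column names:

```python
import json

# Simplified stand-ins for the two candidate test inputs above.
array_style = ('[false,"A",1,2,"05/03/2001"]\n'
               '[true,"B",2,3,"31/10/2010"]')
record_style = ('{"a": false, "b": "A", "c": 1}\n'
                '{"a": true, "b": "B", "c": 2}')

# Every line in both candidates is valid JSON.
for line in (array_style + "\n" + record_style).splitlines():
    json.loads(line)

# Record-style lines name their own columns, so a comparison that
# sorts column names sidesteps the ordering concern raised above.
records = [json.loads(line) for line in record_style.splitlines()]
column_names = sorted(records[0].keys())
```

Comparing tables by sorted column name rather than position is one way to make a record-style test robust to the ordering issue.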

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Feb 13, 2023
JNI switches to nested JSON reader
closes NVIDIA/spark-rapids#7518

Note:
The new reader reads `05/03/2001` as a String, so I removed the timestamps in the test case `testReadJSONBufferInferred`.

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12732
@sameerz sameerz changed the title [FEA] Switch to nested JSON reader [FEA] Switch to nested JSON reader [skip ci] Feb 22, 2023
@sameerz sameerz changed the title [FEA] Switch to nested JSON reader [skip ci] [FEA] Switch to nested JSON reader Feb 22, 2023
@res-life

I'd like to close this issue since #7791 and rapidsai/cudf#12732 were merged.
@revans2

@revans2 revans2 closed this as completed Feb 24, 2023
@mattahrens mattahrens added the feature request New feature or request label Feb 24, 2023
@sameerz sameerz removed the feature request New feature or request label Apr 14, 2023