[FEA] Switch to nested JSON reader #7518
A relevant issue: #7616
I tried the new parser, and all the JSON JNI cases passed except the following one.

Test case:

    @Test
    void testReadJSONBufferInferred() {
      JSONOptions opts = JSONOptions.builder()
          .withDayFirst(true)
          .build();
      byte[] data = ("[false,A,1,2,05/03/2001]\n" +
          // Note: the second row originally contained a stray trailing quote:
          // "[true,B,2,3,31/10/2010]'\n" +
          // ===>>>
          // "[true,B,2,3,31/10/2010]\n" +
          "[true,B,2,3,31/10/2010]\n" +
          "[false,C,3,4,20/10/1994]\n" +
          "[true,D,4,5,18/10/1990]").getBytes(StandardCharsets.UTF_8);
      try (Table expected = new Table.TestBuilder()
              .column(false, true, false, true)
              .column("A", "B", "C", "D")
              .column(1L, 2L, 3L, 4L)
              .column(2L, 3L, 4L, 5L)
              .timestampMillisecondsColumn(983750400000L, 1288483200000L, 782611200000L, 656208000000L)
              .build();
           Table table = Table.readJSON(Schema.INFERRED, opts, data)) {
        assertTablesAreEqual(expected, table);
      }
    }

Error is:
Spark also does not handle this kind of query:

We can delete this case. What do you think?
The tests were there to verify that we have the JNI set up correctly. That is not valid JSON, so the test is technically bogus to begin with; the only thing it adds is checking that we can get CUDF to infer the types. I am fine with replacing the test with one that still does type inference but uses correct JSON, perhaps:

or even:

The only problem with the latter one is that we don't know the order in which the columns are returned, so that might change and still be a valid result, yet we would fail. A top-level JSON array is not supported by Spark, but that is a separate concern and not entirely relevant, because the point of the API is to expose CUDF, not to match what Spark does.
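As discussed above, a replacement test would feed the reader valid JSON Lines instead of the bare-token rows. A minimal sketch (plain Java, no CUDF dependency) of what the same data could look like as valid JSON — the field names `a` through `e` are hypothetical, chosen only to illustrate the shape; they are not from the original test:

```java
// Sketch: the invalid test rows use unquoted strings and bare dates,
// e.g. [false,A,1,2,05/03/2001]. A valid JSON Lines equivalent quotes
// every string value so the parser can infer types from well-formed input.
public class ValidJsonLinesSketch {
    // Build one valid JSON Lines row; field names a..e are hypothetical.
    static String row(boolean b, String s, long x, long y, String date) {
        return String.format("{\"a\":%b,\"b\":\"%s\",\"c\":%d,\"d\":%d,\"e\":\"%s\"}",
                b, s, x, y, date);
    }

    public static void main(String[] args) {
        String data = row(false, "A", 1, 2, "05/03/2001") + "\n"
                + row(true, "B", 2, 3, "31/10/2010") + "\n"
                + row(false, "C", 3, 4, "20/10/1994") + "\n"
                + row(true, "D", 4, 5, "18/10/1990");
        // This string could then be passed to Table.readJSON via getBytes().
        System.out.println(data);
    }
}
```

Note this sidesteps the column-ordering concern above, since named fields make the expected schema explicit.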
JNI switches to nested JSON reader. Closes NVIDIA/spark-rapids#7518.

Note: the new reader reads `05/03/2001` as String, so I removed the timestamps in the test case `testReadJSONBufferInferred`.

Authors:
- Chong Gao (https://github.com/res-life)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)

URL: #12732
I'd like to close this issue since #7791 and rapidsai/cudf#12732 were merged.
Is your feature request related to a problem? Please describe.
rapidsai/cudf#12544 switched the default JSON parser to the nested parser but left our Java APIs on the "legacy" parser. In my testing, all of the plugin integration tests pass with the new parser, but some of the CUDF unit tests do not. Those unit tests use technically invalid JSON, in a format we don't care about from a Spark perspective; it just worked, and it was simple to use that format for tests. We should update the CUDF code to use the nested/default parser and update the tests so that they pass with valid JSON data. We should also update the JSON support documentation to describe the differences between the new and old parsers, such as crashing on invalid data instead of returning nulls.
Bonus points if we can easily add simple nested support. If it does not work out of the box, that should be a follow-on issue after we switch over to the new parser.