[BUG] from_json generated inconsistent result comparing with CPU for input column with nested json strings #8558
Comments
I just picked this issue up and ran the test in the issue description. The behavior has changed since this issue was filed. The query falls back to CPU due to:
|
Changing the output schema to
|
The first issue here is that GpuOverrides checks are incorrect and disable the JSON to struct functionality. We should allow
|
OK, I now realize I have come full circle on this and understand why the tests were xfailed and why the feature was disabled. To support parsing JSON to struct while reading some parts of the JSON as string (per the example here), we will need something like rapidsai/cudf#14239 |
we ask to read the
We either need cuDF to return this as unparsed string or we need to implement our own parsing, using cuDF for the JSON tokenizer. |
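To make the requirement concrete, here is a minimal pure-Python sketch of the behavior being asked for. It is not the plugin or cuDF code. When the requested schema types a field as string but the JSON holds a nested object there, the field should come back as JSON text rather than being parsed further. The helper name `read_field_as_string` is hypothetical, and both the compact re-serialization and the null-on-malformed-input handling are assumptions about the desired output.

```python
import json


def read_field_as_string(row_json, field):
    """Return `field` from one JSON row as a string.

    Nested objects and arrays are returned as compact JSON text instead of
    being parsed further (assumed behavior, for illustration only).
    """
    try:
        parsed = json.loads(row_json)
    except json.JSONDecodeError:
        return None  # assumption: a malformed row yields null
    if not isinstance(parsed, dict) or field not in parsed:
        return None
    value = parsed[field]
    if value is None:
        return None
    if isinstance(value, str):
        return value  # already a string, return unchanged
    # Nested struct, list, or scalar: hand back its JSON text.
    return json.dumps(value, separators=(",", ":"))


if __name__ == "__main__":
    print(read_field_as_string('{"a": {"b": 1, "c": [1, 2]}}', "a"))
```

The point of the sketch is only that the nested value must survive as text; a reader that always parses nested objects into structs has no way to hand that text back.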
The C++ for the JSON parser returns a table_with_metadata. https://github.com/rapidsai/cudf/blob/29556a2514f4d274164a27a80539410da7e132d6/cpp/include/cudf/io/types.hpp#L231 We strip off much of the metadata to keep the API consistent with the other reader APIs, which just return the data in the same layout as the schema we passed in. You could use that metadata, but then what happens if you have mixed data types, for example if one row happens to hold a string where the others hold structs? I think the only long-term solution is to have separate processing for JSON after the tokenization, similar to what the special map processing code already does. But in speaking with some people in cuDF, they are going to investigate whether this is something they want to support themselves or whether we just need to write our own parser after the cuDF tokenization. There are enough differences already that I am leaning towards our own custom parsing. |
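The mixed-type concern can be illustrated with a small pure-Python sketch. It is not cuDF or plugin code, and the helper name `field_kinds` and the sample rows are hypothetical. It reports which kinds of JSON value appear for one field across rows; when more than one kind shows up, a single struct-typed column cannot represent every row.

```python
import json


def field_kinds(rows, field):
    """Collect the kinds of JSON value seen for `field` across `rows`."""
    kinds = set()
    for row in rows:
        parsed = json.loads(row)
        if not isinstance(parsed, dict):
            continue  # only top-level objects are considered in this sketch
        value = parsed.get(field)
        if value is None:
            continue  # missing or null does not constrain the type
        if isinstance(value, dict):
            kinds.add("struct")
        elif isinstance(value, list):
            kinds.add("list")
        elif isinstance(value, str):
            kinds.add("string")
        else:
            kinds.add("scalar")
    return kinds


if __name__ == "__main__":
    rows = ['{"a": {"b": 1}}', '{"a": "text"}', '{"a": null}']
    print(sorted(field_kinds(rows, "a")))
```

With the sample rows above, field `a` is a struct in one row and a string in another, which is the case the comment describes.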
I started on a prototype for this issue in #10326 and this needs updating now that rapidsai/cudf#14954 has been merged |
With the most recent changes (including #10575) in we are now getting an exception instead of the wrong data. With
rapidsai/cudf#15278 is the issue that was filed to fix it in CUDF. |
@andygrove do you still plan on trying to fix this? |
I am not actively working on this, so have unassigned myself. |
The issue now appears to have been fixed:
Closing as fixed. |
Describe the bug
from_json generated inconsistent result comparing with CPU for input column with nested json strings.
Steps/Code to reproduce bug
Here is a repro case:
For GPU, the output is:
For CPU, the output is:
Expected behavior
Same as above CPU output.
Environment details (please complete the following information)