[FEA] JSON reader: Provide option to treat quoted strings as null values #10283
Comments
Moved this to P1 as it is a corner case that is not that common. It would be very nice to be able to support this sooner rather than later.
In this case, is it okay to keep the quotes on the values in the actual string columns as well?
If we keep the quotes, then we will have to perform an additional transformation in the plugin to remove them, so this doesn't seem ideal. If we can get the raw string value (without quotes) and an indication of whether the value was quoted or not, then I think we have everything we need.
Once #11574 is merged, the new nested JSON reader (currently available as Otherwise, I would need to better understand the expected behaviour.
Is this a mixup, or would spark really treat the second value as I think having a mapping of a tuple of ( |
Yes, keep_quotes would do what we want.
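For illustration, here is a rough sketch of the post-processing the plugin could do if the reader returned string values with their quotes kept. It is plain Python, not cuDF or plugin code; the helper names `cast_boolean` and `unquote_string` and the sample values are hypothetical, and the unquoting step deliberately ignores JSON escape sequences.

```python
from typing import Optional

def cast_boolean(raw: Optional[str]) -> Optional[bool]:
    """Cast a raw JSON token to a boolean.

    Assumption: quoted values arrive with their surrounding quotes intact,
    so a leading quote marks a JSON string, which becomes null (None).
    """
    if raw is None or raw.startswith('"'):
        return None
    # Anything other than the bare tokens true/false is also treated as null.
    return {"true": True, "false": False}.get(raw)

def unquote_string(raw: Optional[str]) -> Optional[str]:
    """Strip one pair of surrounding quotes for real string columns.

    Simplification: escape sequences inside the string are left untouched.
    """
    if raw is not None and len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw

# Hypothetical raw column values as they might look with quotes kept.
raw_values = ['true', '"true"', 'false', None]
booleans = [cast_boolean(v) for v in raw_values]
strings = [unquote_string(v) for v in raw_values]
```

The extra unquoting pass for string columns is the additional transformation mentioned earlier in the thread; the quote itself serves as the "was this value quoted" indicator.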
@revans2 the |
Sure
Is your feature request related to a problem? Please describe.
This is part of NVIDIA/spark-rapids#9
In order to be consistent with Spark when reading JSON on the GPU, we would like to ask cuDF to read non-string primitive values as strings and then cast them to the required type. This approach already works well for valid inputs but we do not have a way to treat quoted strings as null to match Spark's behavior.
Here is an example JSON input to demonstrate the problem.
The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.
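The difference can be sketched in plain Python with the standard `json` module. The input lines and the `flag` attribute name below are hypothetical stand-ins, not the example from the original report, and the two helper functions only mimic the behaviours described above; they are not cuDF or Spark APIs.

```python
import json

# Hypothetical input: a JSON boolean, then a JSON string with the same text.
lines = ['{"flag": true}', '{"flag": "true"}']

def read_as_string(line, key):
    """Mimic reading the attribute as a string: the quotes are dropped,
    so the boolean and the quoted string become indistinguishable."""
    value = json.loads(line)[key]
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def read_as_boolean_strict(line, key):
    """Mimic the Spark behaviour described above: only a real JSON boolean
    is accepted; a quoted string yields None (null)."""
    value = json.loads(line)[key]
    return value if isinstance(value, bool) else None

as_strings = [read_as_string(line, "flag") for line in lines]
as_booleans = [read_as_boolean_strict(line, "flag") for line in lines]
```

Once both entries have been read as strings, a later cast has no way to tell that the second one should be null, which is the gap this request is about.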
Describe the solution you'd like
There are a few possible approaches to this:
Describe alternatives you've considered
None
Additional context
None