Is your feature request related to a problem? Please describe.
In Spark we have a requirement to be able to pass in a column of strings and parse them as JSON. Ideally we would pass this directly to CUDF, but none of the input formats really support this, and neither do any of the pre-processing steps that the JSON reader provides. What we do today is first check whether a line separator (carriage return) appears in the data set. If there is one, we throw an exception. If not, we concatenate the lines into a single buffer with a line separator between the inputs. (We do some fixup for NULL/empty rows too.)
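A minimal host-side sketch of that workaround against libcudf's public reader (the `rows` vector and the `{}` placeholder for null rows are illustrative, and the exact `source_info` construction may differ across versions; the real plugin does this through JNI):

```cpp
#include <cudf/io/json.hpp>

#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the concat-then-parse workaround described above.
// `rows` stands in for a host copy of the input string column.
cudf::io::table_with_metadata parse_json_rows(std::vector<std::string> const& rows)
{
  std::string buffer;
  for (auto const& row : rows) {
    // A row containing the separator would corrupt record boundaries,
    // so today we have no choice but to throw.
    if (row.find('\n') != std::string::npos) {
      throw std::invalid_argument("line separator found inside a row");
    }
    // Fix up NULL/empty rows with an empty record to keep row alignment.
    buffer += row.empty() ? "{}" : row;
    buffer += '\n';
  }
  auto const opts =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{buffer.data(), buffer.size()})
      .lines(true)
      .build();
  return cudf::io::read_json(opts);
}
```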
The problem is that we throw an exception when we see the separator character in the data, even though it is perfectly valid for Spark data to contain it.
I think there are a few options we have to fix this kind of problem:
1. Expose the API that removes unneeded whitespace. We could then remove the unneeded data from the buffer and replace any remaining line separators with `\n`, because then they should only be in quoted strings. (We might need to do single-quote normalization too, because I am not sure which one comes first.)
2. Provide a way to set a different line separator, ideally something really unlikely to show up, like NUL (`\0`). This would not fix the problem 100%, but it would make it super rare, and I would feel okay with a solution like this.
3. Do nothing, and we just take the hit when we see a line with the separator in it. We would then have to pull those lines back to the CPU, process them there, and push them back to the GPU afterwards.
I personally like option 2, but I am likely to implement option 3 in the short term unless I hear from CUDF that this is simple to do and can be done really quickly.
…delimiter (#15556)
Addresses #15277
Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST to generate the EOL token for the passed delimiter instead of the currently hard-coded newline character.
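As a purely illustrative CPU analogue of that change (not the actual GPU FST implementation, where this lives in the transition table), the tokenizer emits EOL for a runtime `delim` byte outside quoted strings rather than for a hard-coded `\n`:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Illustrative sketch only: return the offset of each end-of-record
// token, treating `delim` (rather than '\n') as EOL whenever it
// appears outside a quoted string.
std::vector<std::size_t> eol_offsets(std::string_view buf, char delim)
{
  std::vector<std::size_t> offsets;
  bool in_string = false;
  bool escaped   = false;
  for (std::size_t i = 0; i < buf.size(); ++i) {
    char const c = buf[i];
    if (escaped) { escaped = false; continue; }           // skip escaped char
    if (c == '\\' && in_string) { escaped = true; continue; }
    if (c == '"') { in_string = !in_string; continue; }   // toggle string state
    if (c == delim && !in_string) { offsets.push_back(i); }
  }
  return offsets;
}
```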
This PR does not modify the whitespace normalization FST to strip out unquoted `\n` and `\r` (see #14865). Whitespace normalization will be handled in follow-up work.
Note that this is not a multi-object JSON reader, since we are not using the offsets data in the string column; hence there is no resetting of the start state at every row offset.
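Usage then looks roughly like this (a sketch; the `delimiter` setting on the options builder is the one added in this PR, while the other names follow the existing reader API):

```cpp
#include <cudf/io/json.hpp>

#include <string>

int main()
{
  // Two records separated by NUL instead of '\n'; newlines inside the
  // records no longer terminate a row.
  std::string const buffer =
    std::string{"{\"a\": 1}"} + '\0' + std::string{"{\"a\": 2}"};

  auto const opts =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{buffer.data(), buffer.size()})
      .lines(true)
      .delimiter('\0')  // runtime record delimiter introduced by this PR
      .build();
  auto const result = cudf::io::read_json(opts);
  return 0;
}
```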
Current status:
- [x] Semantic bracket/brace DFA
- [x] DFA removing excess characters after record in line
- [x] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)
URL: #15556