**Is your feature request related to a problem? Please describe.**
We have a version of `from_json` and `json_scan` that uses the CUDF tokenizer/parser, but we are seeing both performance and correctness issues. Not that the CUDF code is wrong; it is just that Spark parses/validates JSON differently, and getting all of the parts to match is not simple.
The `get_json_object` code is decently fast and accurate, so the goal is that once we have enough configuration options (#2031) and can support all of the nesting we need (#2033), we can figure out how to use that code to get out the data we need.
Please note that this task might be broken down into smaller tasks in order to fully support all of the features that we need, because when this is done we want to fully support `from_json` for the default configs on the GPU for close to all of the data types.
Because `get_json_object` outputs only strings, it runs in two passes. The first pass parses the input data, validates it, calculates the size of the output string for each row, and writes that size out. Next, a scan turns the sizes into offsets. Finally, a second kernel parses the input data, validates it again, and writes the data out.
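As a rough sketch of that two-pass shape (this is not the real `get_json_object` kernel; `parse_and_validate` below is a hypothetical stand-in that just copies bytes), the same kernel can be compiled in a size-only mode and a write mode:

```cuda
#include <cstdint>

// Stand-in for the real JSON path parser: it just copies the row so the
// two-pass machinery can be exercised. Returns the number of output bytes;
// writes them only when out is non-null.
__device__ int64_t parse_and_validate(char const* in, int64_t len, char* out)
{
  if (out != nullptr) {
    for (int64_t i = 0; i < len; ++i) { out[i] = in[i]; }
  }
  return len;
}

// Pass 1 (size_only = true): parse + validate, record the output sizes.
// Pass 2 (size_only = false): parse + validate again, write at the offsets
// that the scan produced between the two launches.
template <bool size_only>
__global__ void get_json_object_kernel(char const* const* row_data,
                                       int64_t const* row_lengths,
                                       int64_t const* out_offsets,  // pass 2 only
                                       int64_t* out_sizes,          // pass 1 only
                                       char* out_chars,             // pass 2 only
                                       int num_rows)
{
  int const row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) { return; }
  if constexpr (size_only) {
    out_sizes[row] = parse_and_validate(row_data[row], row_lengths[row], nullptr);
  } else {
    parse_and_validate(row_data[row], row_lengths[row], out_chars + out_offsets[row]);
  }
}
```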
Things get extra complicated here because we have to deal with multiple different nesting levels, but I still think we can do it all in two passes of the parsing kernel.

For example, let's say I have some input data and a schema like `STRUCT<a: Array<Array<String>>>`.
Before we process the data at all, we are going to allocate a column of sizes for each nesting level and call a kernel with pointers to that data. There is one size per input row, not per output row, at each nesting level.
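A minimal host-side sketch of that setup, assuming RMM for the allocations; `level_sizes_view` and `make_level_sizes` are illustrative names, not an existing API:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <cstdint>
#include <vector>

struct level_sizes_view {
  int64_t* levels[8];  // assumed hard-coded maximum nesting depth, see below
  int num_levels;
};

// One size column per nesting level, one entry per *input* row. The storage
// vector owns the buffers; the view is what gets passed to the kernel.
level_sizes_view make_level_sizes(std::vector<rmm::device_uvector<int64_t>>& storage,
                                  int num_levels, int num_rows,
                                  rmm::cuda_stream_view stream)
{
  level_sizes_view view{};
  view.num_levels = num_levels;
  for (int i = 0; i < num_levels; ++i) {
    storage.emplace_back(num_rows, stream);  // a_level0, a_level1, a_level2, ...
    view.levels[i] = storage.back().data();
  }
  return view;
}
```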
So after the kernel runs and validates the input, we end up with a size at each nesting level for every input row.
For the first row, `a_level0` has 3 in it, because there are 3 entries in the top-level array under `a`. `a_level1` has 4 in it, because there are a total of 4 entries across all of the arrays at the next level. `a_level2` has 6 in it, because a total of 6 bytes are needed to encode the leaf string values.
Once we have these values we will do a scan on each of the levels to get offsets on a per-row basis.
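That is the standard sizes-to-offsets step. A sketch using Thrust, assuming each level's size buffer is allocated with `num_rows + 1` entries so a single exclusive scan also produces the trailing total:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cstdint>

// Turn one level's per-row sizes into offsets. level_sizes is assumed to be
// allocated with num_rows + 1 entries; the exclusive scan then yields
// offsets[0] = 0 and offsets[num_rows] = total size in one pass (the value
// of the extra input entry never contributes to any output).
void sizes_to_offsets(int64_t const* level_sizes, int64_t* level_offsets,
                      int num_rows)
{
  thrust::exclusive_scan(thrust::device,
                         level_sizes, level_sizes + num_rows + 1,
                         level_offsets);
}
```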
With this information we can now allocate all of the final output buffers and run the kernel one last time. That kernel looks at the starting offset for each level of nesting and keeps track of where it is as it writes out the final offsets and data (a sketch of this bookkeeping follows the example output below). `a_level0` would be reused as the top-level offsets for `a`; the rest of the temp offset columns would be thrown away. For the example above, the final output buffers would look like:
```
a offsets_1: 3, 5, 5, 5
a offsets_2: 3, 3, 4, 4, 6, 6
a offsets_3: 1, 3, 4, 6, 7, 9, 9
a leaf_data: '1', '2', '0', '3', '4', '0', '5', '6', '0'
```
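Here is one way the per-level bookkeeping in that last kernel could look (a sketch with illustrative names, using the usual cudf offsets convention where entry `k + 1` records the end of element `k` and entry 0 is written as 0 once up front):

```cuda
#include <cstdint>

constexpr int MAX_LEVELS = 8;  // assumed hard-coded maximum nesting depth

struct row_cursors {
  int64_t pos[MAX_LEVELS];  // running position at each nesting level
};

// Load the starting offsets for one input row at every level. The
// scanned_offsets arrays are the per-level scan results from the first pass.
__device__ row_cursors start_of_row(int row, int num_levels,
                                    int64_t const* const* scanned_offsets)
{
  row_cursors c{};
  for (int l = 0; l < num_levels; ++l) { c.pos[l] = scanned_offsets[l][row]; }
  return c;
}

// Close one list at level l after its children were emitted at level l + 1:
// the list's end offset is wherever the child cursor now stands.
__device__ void close_list(row_cursors& c, int l, int64_t* offsets_out)
{
  offsets_out[c.pos[l] + 1] = c.pos[l + 1];
  c.pos[l] += 1;
}

// Append one leaf string at str_level: copy its bytes into the character
// buffer (tracked by the cursor at str_level + 1) and record its end offset.
__device__ void emit_string(row_cursors& c, int str_level, int64_t* str_offsets,
                            char* leaf_data, char const* s, int len)
{
  for (int i = 0; i < len; ++i) { leaf_data[c.pos[str_level + 1] + i] = s[i]; }
  c.pos[str_level + 1] += len;  // advance the character cursor
  str_offsets[c.pos[str_level] + 1] = c.pos[str_level + 1];
  c.pos[str_level] += 1;        // advance the string cursor
}
```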
Even though the algorithm does not appear to be too complex on the surface, we have the problem that we need a place to store all of the different offsets/counts for the different nesting levels. Happily, this is tied to the schema size/depth, so we can know it up front without running anything. We might need to start off with a hard-coded maximum size and fall back to the CPU if the schema needs more than that, then file a follow-on issue to look at how to support more.
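Since the temp-storage requirement depends only on the schema, that check can happen before anything is launched. A toy sketch (`toy_type` is illustrative; the real code would walk the Spark/cudf type tree):

```cpp
#include <algorithm>
#include <vector>

struct toy_type {
  std::vector<toy_type> children;  // empty for leaf types like STRING
};

// Maximum nesting depth of the schema, computable without reading any data.
int depth_of(toy_type const& t)
{
  int d = 0;
  for (auto const& c : t.children) { d = std::max(d, depth_of(c)); }
  return d + 1;
}

constexpr int MAX_LEVELS = 8;  // assumed hard-coded cap, with CPU fallback

bool fits_on_gpu(toy_type const& schema) { return depth_of(schema) <= MAX_LEVELS; }
```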
Be aware that we need to support `Map<STRING,STRING>` as a top-level data type for this. It would be great if we could also support top-level arrays, but that is not a requirement. For any data type that we cannot support out of the box, we need to file a follow-on issue for that support and make sure that we fall back to the CPU.