-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds type inference and type conversion for leaf-columns to the nested JSON parser #11574
Adds type inference and type conversion for leaf-columns to the nested JSON parser #11574
Conversation
…fea-json-col-cast
/* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {'x'}, {'x'}, {'x'}}}, | ||
/* TT_STR */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}}, | ||
/* TT_ESC */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}}}}; | ||
/* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {}, {}, {}}}, | ||
/* TT_STR */ {{{}, {}, {}, {}, {}, {}, {}}}, | ||
/* TT_ESC */ {{{}, {}, {}, {}, {}, {}, {}}}}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is x
not needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack supports a sparse representation. Basically: (stack_op
, op_index
) pairs, where op_index
describes at which index such stack operation happens. All omitted indexes are populated with what is on top of the stack.
This is described in more detail in the logical stack PR description:
#11078 (comment)
// Prepare iterator that returns (string_offset, string_length)-pairs needed by inference | ||
auto string_ranges_it = | ||
thrust::make_transform_iterator(offset_length_it, [] __device__(auto ip) { | ||
return thrust::pair<json_column::row_offset_t, std::size_t>{ | ||
thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))}; | ||
}); | ||
|
||
// Prepare iterator that returns (string_ptr, string_length)-pairs needed by type conversion | ||
auto string_spans_it = thrust::make_transform_iterator( | ||
offset_length_it, [data = d_input.data()] __device__(auto ip) { | ||
return thrust::pair<const char*, std::size_t>{ | ||
data + thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))}; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably in another PR, unless it is trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor corrections. Approving to expedite the merge, with the assumption that these will be addressed.
…f-column-type-conversion
@gpucibot merge |
Description
Adds type inference and type conversion for leaf-columns to the nested JSON parser
Note to the reviewers:
It's important to note that we're talking about two different stages of quote-stripping here.
true
using aconstexpr bool
)Currently, we always include quotes in the tokenizer stage (1), such that the type casting stage (2) can differentiate between string values and literals (e.g.
[true, "true"]
) and, based on the user-provided choice injson_reader_options::keep_quotes
, can strip off the quotes or keep them in the values returned to the user.In addition to adding type inference and type casting:
[null,{"a":1},"foo"] => List<Struct<a:int>> with struct col validity: 0, 1, 0
keep_quotes
to differentiate between string values and numeric & literal values, like (123.4
,true
,false
,null
).[{"b":1, "c":1}, {"a":1}] => order: <b, c, a>
Performance comparison
Tokenizer
The following is a comparison of the JSON tokenizer stage before this PR and after:
Before
After
JSON Parser
The overall parser performance is obviously impacted as we're now also doing type conversion instead of just returning string columns.
Before
After
Supported escape sequences:
Checklist