-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement from_json_to_structs
#2510
Conversation
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit dcb463e.
CC @karthikeyann. |
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
…eded Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
build |
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit 8a17651.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit cf9d6bf.
This reverts commit 553d7d0. Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about a possible issue, but otherwise looks good.
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit d2b6fb5.
Signed-off-by: Nghia Truong <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks god to me, but I am not C++/CUDA expert so I would love it if someone with more experience there could also look at this.
* @param opts The options for parsing JSON strings | ||
* @param isUSLocale Whether the current local is US locale, used when converting strings to | ||
* decimal types | ||
* @return A struct column in which each row is parsed from the corresponding json string | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this version of extractRawMapFromJsonString was just not used so it is okay for us to delete it.
auto valids = rmm::device_uvector<bool>(string_count, stream); | ||
|
||
// Since the strings store integer numbers, they should be very short. | ||
// As such, using one thread per string should be fine. | ||
thrust::tabulate(rmm::exec_policy_nosync(stream), | ||
valids.begin(), | ||
valids.end(), | ||
[chars = input_sv.chars_begin(stream), | ||
offsets = input_offsets_it, | ||
valid_input = valid_input_it] __device__(cudf::size_type idx) -> bool { | ||
if (!valid_input[idx]) { return false; } | ||
|
||
auto in_ptr = chars + offsets[idx]; | ||
auto const in_end = chars + offsets[idx + 1]; | ||
while (in_ptr != in_end) { | ||
if (*in_ptr == '.' || *in_ptr == 'e' || *in_ptr == 'E') { return false; } | ||
++in_ptr; | ||
} | ||
|
||
return true; | ||
}); | ||
|
||
auto const [null_mask, null_count] = | ||
cudf::detail::valid_if(valids.begin(), | ||
valids.end(), | ||
thrust::identity{}, | ||
stream, | ||
cudf::get_current_device_resource_ref()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be further condensed to single kernel call and direct valid mask.
auto chars_data = cudf::strings::detail::make_chars_buffer( | ||
offsets_column->view(), bytes, string_pairs.begin(), string_count, stream, mr); | ||
|
||
if (nullify_if_not_quoted) { | ||
auto output = cudf::make_strings_column(string_count, | ||
std::move(offsets_column), | ||
chars_data.release(), | ||
0, | ||
rmm::device_buffer{0, stream, mr}); | ||
|
||
auto [null_mask, null_count] = cudf::detail::valid_if( | ||
string_pairs.begin(), | ||
string_pairs.end(), | ||
[] __device__(string_index_pair const& pair) { return pair.first != nullptr; }, | ||
stream, | ||
mr); | ||
if (null_count > 0) { output->set_null_mask(std::move(null_mask), null_count); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: There is a make_strings_column
function to take care of the nulls directly. https://docs.rapids.ai/api/libcudf/nightly/group__column__factories.html#ga4081b9d80987d86cf9b43cfe4df2d0d1
This implements the function
from_json_to_structs
, which performs parsing a strings column into a structs column with a given data schema. This is the case of runningfrom_json
SQL function in Spark with struct schema.Closes #2468.
Contribute to NVIDIA/spark-rapids#11560.
Depends on:
string_to_decimal
and add more tests for non-empty nulls #2607