Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement from_json_to_structs #2510

Merged
merged 86 commits into from
Nov 23, 2024
Merged

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Oct 17, 2024

This implements the function from_json_to_structs, which performs parsing a strings column into a structs column with a given data schema. This is the case of running from_json SQL function in Spark with struct schema.

Closes #2468.

Contribute to NVIDIA/spark-rapids#11560.

Depends on:

Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Copy link
Collaborator Author

ttnghia commented Oct 21, 2024

CC @karthikeyann.

revans2
revans2 previously approved these changes Nov 15, 2024
src/main/cpp/src/from_json_to_structs.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 15, 2024

build

Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit 8a17651.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit cf9d6bf.
This reverts commit 553d7d0.

Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
hyperbolic2346
hyperbolic2346 previously approved these changes Nov 19, 2024
Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about a possible issue, but otherwise looks good.

src/main/cpp/src/from_json_to_structs.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 21, 2024

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks god to me, but I am not C++/CUDA expert so I would love it if someone with more experience there could also look at this.

* @param opts The options for parsing JSON strings
* @param isUSLocale Whether the current local is US locale, used when converting strings to
* decimal types
* @return A struct column in which each row is parsed from the corresponding json string
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that this version of extractRawMapFromJsonString was just not used so it is okay for us to delete it.

@ttnghia ttnghia merged commit 4080f49 into NVIDIA:branch-24.12 Nov 23, 2024
4 checks passed
@ttnghia ttnghia deleted the convert_table branch November 23, 2024 02:21
Comment on lines +217 to +244
auto valids = rmm::device_uvector<bool>(string_count, stream);

// Since the strings store integer numbers, they should be very short.
// As such, using one thread per string should be fine.
thrust::tabulate(rmm::exec_policy_nosync(stream),
valids.begin(),
valids.end(),
[chars = input_sv.chars_begin(stream),
offsets = input_offsets_it,
valid_input = valid_input_it] __device__(cudf::size_type idx) -> bool {
if (!valid_input[idx]) { return false; }

auto in_ptr = chars + offsets[idx];
auto const in_end = chars + offsets[idx + 1];
while (in_ptr != in_end) {
if (*in_ptr == '.' || *in_ptr == 'e' || *in_ptr == 'E') { return false; }
++in_ptr;
}

return true;
});

auto const [null_mask, null_count] =
cudf::detail::valid_if(valids.begin(),
valids.end(),
thrust::identity{},
stream,
cudf::get_current_device_resource_ref());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be further condensed to single kernel call and direct valid mask.

Comment on lines +571 to +587
auto chars_data = cudf::strings::detail::make_chars_buffer(
offsets_column->view(), bytes, string_pairs.begin(), string_count, stream, mr);

if (nullify_if_not_quoted) {
auto output = cudf::make_strings_column(string_count,
std::move(offsets_column),
chars_data.release(),
0,
rmm::device_buffer{0, stream, mr});

auto [null_mask, null_count] = cudf::detail::valid_if(
string_pairs.begin(),
string_pairs.end(),
[] __device__(string_index_pair const& pair) { return pair.first != nullptr; },
stream,
mr);
if (null_count > 0) { output->set_null_mask(std::move(null_mask), null_count); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: There is a make_strings_column function to take care of the nulls directly. https://docs.rapids.ai/api/libcudf/nightly/group__column__factories.html#ga4081b9d80987d86cf9b43cfe4df2d0d1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Implement a function to convert a list of strings columns into a structs column with desired schema
4 participants