Implement from_json_to_structs #2510

Merged 86 commits on Nov 23, 2024
Commits
1376061
Implement `castStringsToBooleans`
ttnghia Oct 16, 2024
ff2f340
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 16, 2024
c3fa10d
Implement `removeQuotes`
ttnghia Oct 16, 2024
ae2b41f
Rewrite using offsets and chars
ttnghia Oct 16, 2024
8d7ad2e
Fix empty input
ttnghia Oct 17, 2024
9e759c4
Misc
ttnghia Oct 17, 2024
2fff949
Add `nullifyIfNotQuoted` option for `removeQuotes`
ttnghia Oct 17, 2024
d09de41
Implement `castStringsToDecimals`
ttnghia Oct 18, 2024
576b65c
Implement `removeQuotesForFloats`
ttnghia Oct 18, 2024
2bd5335
Fix `removeQuotesForFloats`
ttnghia Oct 18, 2024
21c80a5
Implement `castStringsToIntegers`
ttnghia Oct 18, 2024
1a7d192
Implement non-legacy `castStringsToDates`
ttnghia Oct 18, 2024
dcb463e
WIP for `cast_strings_to_dates_legacy`
ttnghia Oct 21, 2024
f059c21
Revert "WIP for `cast_strings_to_dates_legacy`"
ttnghia Oct 21, 2024
207d6a3
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 23, 2024
07b23ea
Fix compile issues
ttnghia Oct 23, 2024
de83a25
WIP: Implement `from_json_to_structs`
ttnghia Oct 24, 2024
443ca38
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 24, 2024
6c2bd5e
Fix cmake
ttnghia Oct 24, 2024
904d857
Fix compile issues
ttnghia Oct 24, 2024
d84f1fe
Implement `castStringsToFloats`
ttnghia Oct 24, 2024
3024583
WIP
ttnghia Oct 24, 2024
d33d8e2
WIP: Implementing `fromJSONToStructs`
ttnghia Oct 25, 2024
295c36c
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 28, 2024
1ea9cc8
Fix compile errors
ttnghia Oct 29, 2024
c1bb2d4
Cleanup
ttnghia Oct 29, 2024
f6634b4
Revert code as we still need them
ttnghia Oct 29, 2024
06b2c19
Add error check
ttnghia Oct 29, 2024
2dcdd11
Add more comments
ttnghia Oct 29, 2024
f3c391b
Cleanup
ttnghia Oct 29, 2024
52c42a6
Return as-is if the column is date/time
ttnghia Oct 29, 2024
19c64be
Update test
ttnghia Oct 30, 2024
cb9d252
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 30, 2024
5d07db1
Update cudf
ttnghia Oct 30, 2024
39e3a9b
Revert "Update cudf"
ttnghia Oct 30, 2024
8628136
Merge branch 'branch-24.12' into convert_table
ttnghia Oct 30, 2024
df1428d
Update cudf
ttnghia Oct 30, 2024
0fd8d0e
Merge branch 'branch-24.12' into convert_table
ttnghia Nov 8, 2024
1d48906
Update cudf
ttnghia Nov 8, 2024
d9e1db5
Change header
ttnghia Nov 9, 2024
0f053a6
Rewrite JSONUtils.cpp
ttnghia Nov 9, 2024
8912e00
Implement a common function for converting column
ttnghia Nov 12, 2024
3614718
Rewrite `convert_data_type`
ttnghia Nov 12, 2024
6d9bbdc
Remove `cast_strings_to_dates`
ttnghia Nov 12, 2024
a832938
Implement `convert_data_type`
ttnghia Nov 13, 2024
44b885b
Fix compile errors
ttnghia Nov 13, 2024
ab45de8
Add `CUDF_FUNC_RANGE();`
ttnghia Nov 13, 2024
89e74a0
Fix schema
ttnghia Nov 13, 2024
27ef532
Complete `from_json_to_structs`
ttnghia Nov 13, 2024
5b65712
Fix null mask
ttnghia Nov 13, 2024
6788471
Write Javadoc
ttnghia Nov 13, 2024
49c78ce
Rewrite JNI
ttnghia Nov 13, 2024
9d16d43
Merge branch 'branch-24.12' into convert_table
ttnghia Nov 13, 2024
bb9029b
Remove deprecated function
ttnghia Nov 14, 2024
1243599
Revert test
ttnghia Nov 14, 2024
6f89fcd
Remove header
ttnghia Nov 14, 2024
deb3ebf
Rewrite Javadoc
ttnghia Nov 14, 2024
9dc641f
Rename variable
ttnghia Nov 14, 2024
53b121d
Rewrite docs
ttnghia Nov 14, 2024
69265b4
Revert test
ttnghia Nov 14, 2024
da4d1f6
Cleanup headers
ttnghia Nov 14, 2024
1d91e64
Cleanup
ttnghia Nov 14, 2024
d0fa2ae
Rewrite the conversion functions
ttnghia Nov 14, 2024
f375a4d
Move code
ttnghia Nov 14, 2024
034a5ec
Remove call to `make_structs_column`
ttnghia Nov 14, 2024
74d858c
Cleanup
ttnghia Nov 14, 2024
7a32b6f
Merge branch 'branch-24.12' into convert_table
ttnghia Nov 14, 2024
32edcbf
Optimize conversion further, avoiding to materialize column if not ne…
ttnghia Nov 15, 2024
fe8e359
Rewrite docs and change function name
ttnghia Nov 15, 2024
5a819d0
Reorganize code
ttnghia Nov 15, 2024
fa1946a
Handle schema mismatching
ttnghia Nov 15, 2024
553d7d0
Add test
ttnghia Nov 16, 2024
8a17651
Add another test
ttnghia Nov 16, 2024
3773a27
Revert "Add another test"
ttnghia Nov 16, 2024
34cc98d
Fix schema mismatch
ttnghia Nov 16, 2024
bfd461b
Cleanup
ttnghia Nov 16, 2024
cf9d6bf
Add another test
ttnghia Nov 16, 2024
11ff5a7
Revert "Add another test"
ttnghia Nov 16, 2024
b3f4882
Revert "Add test"
ttnghia Nov 16, 2024
45670a2
Add prefix `spark_rapids_jni::`
ttnghia Nov 16, 2024
23288da
Merge branch 'branch-24.12' into convert_table
ttnghia Nov 16, 2024
d2b6fb5
Remove handling for schema mismatching
ttnghia Nov 18, 2024
27f5551
Avoid materializing a column when converting strings
ttnghia Nov 19, 2024
32f2181
Merge branch 'branch-24.12' into convert_table
ttnghia Nov 21, 2024
efd68dc
Revert "Remove handling for schema mismatching"
ttnghia Nov 21, 2024
c725394
Fix handling for schema mismatching in case of `column_view` input
ttnghia Nov 21, 2024
3 changes: 2 additions & 1 deletion src/main/cpp/CMakeLists.txt
@@ -207,13 +207,14 @@ add_library(
   src/bloom_filter.cu
   src/case_when.cu
   src/cast_decimal_to_string.cu
-  src/format_float.cu
   src/cast_float_to_string.cu
   src/cast_string.cu
   src/cast_string_to_float.cu
   src/datetime_rebase.cu
   src/decimal_utils.cu
+  src/format_float.cu
   src/from_json_to_raw_map.cu
+  src/from_json_to_structs.cu
   src/get_json_object.cu
   src/histogram.cu
   src/json_utils.cu
126 changes: 97 additions & 29 deletions src/main/cpp/src/JSONUtilsJni.cpp
@@ -166,50 +166,118 @@ JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_extractRawMap
   CATCH_STD(env, 0);
 }
 
-JNIEXPORT jlongArray JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_concatenateJsonStrings(
-  JNIEnv* env, jclass, jlong j_input)
+JNIEXPORT jlong JNICALL
+Java_com_nvidia_spark_rapids_jni_JSONUtils_fromJSONToStructs(JNIEnv* env,
+                                                             jclass,
+                                                             jlong j_input,
+                                                             jobjectArray j_col_names,
+                                                             jintArray j_num_children,
+                                                             jintArray j_types,
+                                                             jintArray j_scales,
+                                                             jintArray j_precisions,
+                                                             jboolean normalize_single_quotes,
+                                                             jboolean allow_leading_zeros,
+                                                             jboolean allow_nonnumeric_numbers,
+                                                             jboolean allow_unquoted_control,
+                                                             jboolean is_us_locale)
 {
   JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+  JNI_NULL_CHECK(env, j_col_names, "j_col_names is null", 0);
+  JNI_NULL_CHECK(env, j_num_children, "j_num_children is null", 0);
+  JNI_NULL_CHECK(env, j_types, "j_types is null", 0);
+  JNI_NULL_CHECK(env, j_scales, "j_scales is null", 0);
+  JNI_NULL_CHECK(env, j_precisions, "j_precisions is null", 0);
 
   try {
     cudf::jni::auto_set_device(env);
-    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
 
-    // Currently, set `nullify_invalid_rows = false` as `concatenateJsonStrings` is used only for
-    // `from_json` with struct schema.
-    auto [joined_strings, delimiter, should_be_nullify] = spark_rapids_jni::concat_json(
-      cudf::strings_column_view{*input_cv}, /*nullify_invalid_rows*/ false);
-
-    // The output array contains 5 elements:
-    // [0]: address of the cudf::column object `is_valid` in host memory
-    // [1]: address of data buffer of the concatenated strings in device memory
-    // [2]: data length
-    // [3]: address of the rmm::device_buffer object (of the concatenated strings) in host memory
-    // [4]: delimiter char
-    auto out_handles = cudf::jni::native_jlongArray(env, 5);
-    out_handles[0] = reinterpret_cast<jlong>(should_be_nullify.release());
-    out_handles[1] = reinterpret_cast<jlong>(joined_strings->data());
-    out_handles[2] = static_cast<jlong>(joined_strings->size());
-    out_handles[3] = reinterpret_cast<jlong>(joined_strings.release());
-    out_handles[4] = static_cast<jlong>(delimiter);
-    return out_handles.get_jArray();
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    auto const col_names = cudf::jni::native_jstringArray(env, j_col_names).as_cpp_vector();
+    auto const num_children = cudf::jni::native_jintArray(env, j_num_children).to_vector();
+    auto const types = cudf::jni::native_jintArray(env, j_types).to_vector();
+    auto const scales = cudf::jni::native_jintArray(env, j_scales).to_vector();
+    auto const precisions = cudf::jni::native_jintArray(env, j_precisions).to_vector();
+
+    CUDF_EXPECTS(col_names.size() > 0, "Invalid schema data: col_names.");
+    CUDF_EXPECTS(col_names.size() == num_children.size(), "Invalid schema data: num_children.");
+    CUDF_EXPECTS(col_names.size() == types.size(), "Invalid schema data: types.");
+    CUDF_EXPECTS(col_names.size() == scales.size(), "Invalid schema data: scales.");
+    CUDF_EXPECTS(col_names.size() == precisions.size(), "Invalid schema data: precisions.");
+
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::from_json_to_structs(cudf::strings_column_view{*input_cv},
+                                             col_names,
+                                             num_children,
+                                             types,
+                                             scales,
+                                             precisions,
+                                             normalize_single_quotes,
+                                             allow_leading_zeros,
+                                             allow_nonnumeric_numbers,
+                                             allow_unquoted_control,
+                                             is_us_locale)
+        .release());
   }
   CATCH_STD(env, 0);
 }
 
-JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_makeStructs(
-  JNIEnv* env, jclass, jlongArray j_children, jlong j_is_null)
+JNIEXPORT jlong JNICALL
+Java_com_nvidia_spark_rapids_jni_JSONUtils_convertFromStrings(JNIEnv* env,
+                                                              jclass,
+                                                              jlong j_input,
+                                                              jintArray j_num_children,
+                                                              jintArray j_types,
+                                                              jintArray j_scales,
+                                                              jintArray j_precisions,
+                                                              jboolean allow_nonnumeric_numbers,
+                                                              jboolean is_us_locale)
 {
-  JNI_NULL_CHECK(env, j_children, "j_children is null", 0);
-  JNI_NULL_CHECK(env, j_is_null, "j_is_null is null", 0);
+  JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+  JNI_NULL_CHECK(env, j_num_children, "j_num_children is null", 0);
+  JNI_NULL_CHECK(env, j_types, "j_types is null", 0);
+  JNI_NULL_CHECK(env, j_scales, "j_scales is null", 0);
+  JNI_NULL_CHECK(env, j_precisions, "j_precisions is null", 0);
 
   try {
     cudf::jni::auto_set_device(env);
-    auto const children =
-      cudf::jni::native_jpointerArray<cudf::column_view>{env, j_children}.get_dereferenced();
-    auto const is_null = *reinterpret_cast<cudf::column_view const*>(j_is_null);
-    return cudf::jni::ptr_as_jlong(spark_rapids_jni::make_structs(children, is_null).release());
+
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    auto const num_children = cudf::jni::native_jintArray(env, j_num_children).to_vector();
+    auto const types = cudf::jni::native_jintArray(env, j_types).to_vector();
+    auto const scales = cudf::jni::native_jintArray(env, j_scales).to_vector();
+    auto const precisions = cudf::jni::native_jintArray(env, j_precisions).to_vector();
+
+    CUDF_EXPECTS(num_children.size() > 0, "Invalid schema data: num_children.");
+    CUDF_EXPECTS(num_children.size() == types.size(), "Invalid schema data: types.");
+    CUDF_EXPECTS(num_children.size() == scales.size(), "Invalid schema data: scales.");
+    CUDF_EXPECTS(num_children.size() == precisions.size(), "Invalid schema data: precisions.");
+
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::convert_from_strings(cudf::strings_column_view{*input_cv},
+                                             num_children,
+                                             types,
+                                             scales,
+                                             precisions,
+                                             allow_nonnumeric_numbers,
+                                             is_us_locale)
+        .release());
   }
   CATCH_STD(env, 0);
 }
+
+JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_removeQuotes(
+  JNIEnv* env, jclass, jlong j_input, jboolean nullify_if_not_quoted)
+{
+  JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+
+  try {
+    cudf::jni::auto_set_device(env);
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::remove_quotes(cudf::strings_column_view{*input_cv}, nullify_if_not_quoted)
+        .release());
+  }
+  CATCH_STD(env, 0);
+}
+
 } // extern "C"
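
Note on the schema arguments: both new entry points take the schema as parallel arrays (`col_names`, `num_children`, `types`, `scales`, `precisions`) and only check that the arrays are non-empty and equal in length. The sketch below is not code from this PR. It assumes the arrays describe a nested schema flattened in depth-first (pre-order) order, which the diff does not state explicitly, and shows how such a layout could be validated and walked. `flat_schema`, `validate`, `skip_subtree`, and `count_top_level` are hypothetical names used for illustration only, and the type ids in any example data are arbitrary placeholders rather than real cudf type ids.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the parallel schema arrays passed through JNI.
struct flat_schema {
  std::vector<std::string> col_names;
  std::vector<int> num_children;
  std::vector<int> types;
  std::vector<int> scales;
  std::vector<int> precisions;
};

// Mirrors the size checks in the diff: non-empty, and every array has the same length.
void validate(flat_schema const& s)
{
  auto const n = s.col_names.size();
  if (n == 0) { throw std::invalid_argument("Invalid schema data: col_names."); }
  if (s.num_children.size() != n) {
    throw std::invalid_argument("Invalid schema data: num_children.");
  }
  if (s.types.size() != n) { throw std::invalid_argument("Invalid schema data: types."); }
  if (s.scales.size() != n) { throw std::invalid_argument("Invalid schema data: scales."); }
  if (s.precisions.size() != n) {
    throw std::invalid_argument("Invalid schema data: precisions.");
  }
}

// Returns the index just past the subtree rooted at `idx`.
// ASSUMPTION: columns are stored in depth-first pre-order, so the children of a
// column immediately follow it in the arrays.
std::size_t skip_subtree(flat_schema const& s, std::size_t idx)
{
  if (idx >= s.col_names.size()) { throw std::out_of_range("Schema index out of range."); }
  auto next = idx + 1;
  for (int i = 0; i < s.num_children[idx]; ++i) {
    next = skip_subtree(s, next);
  }
  return next;
}

// Counts the top-level columns by hopping from one subtree root to the next.
std::size_t count_top_level(flat_schema const& s)
{
  std::size_t count = 0;
  for (std::size_t idx = 0; idx < s.col_names.size(); idx = skip_subtree(s, idx)) {
    ++count;
  }
  return count;
}
```

Under that assumption, a schema `a, b{c, d}, e` would be passed as `col_names = {a, b, c, d, e}` and `num_children = {0, 2, 0, 0, 0}`. The equal-length checks alone do not catch a `num_children` entry that points past the end of the arrays, which is why `skip_subtree` in this sketch also bounds-checks each index.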