Add JNI for GetJsonObject #1862

thirtiseven · 2024-03-12T08:46:09Z

This PR adds JNI for getJsonObject.

I haven't tested it because the feature isn't ready yet, but we can merge it to the dev branch first.

Signed-off-by: Haoyang Li <[email protected]>

revans2 · 2024-03-12T15:02:54Z

src/main/cpp/src/get_json_object.cu

@@ -305,6 +304,8 @@ std::unique_ptr<cudf::column> get_json_object(cudf::strings_column_view const& c
  // parse the json_path into a command buffer
  auto path_commands_optional = parse_path(json_path);

+  auto options = json_parser_options{};


As a follow on for cleanup. If we don't have a need to pass down a json_parser_options class, then do we even have a need for the class? Would it be better to just remove it because there is only one set of options we ever run with?

This commit removed the json_parser_options.
When merging, will remove it also.

ttnghia · 2024-03-12T21:35:49Z

src/main/cpp/src/getJsonObjectJni.cpp

@@ -0,0 +1,37 @@
+/*


File name for JNI should have all first letter Capitalized.

In addition, we may have other JSON functionalities in the future, beyond getJsonObject. Thus, it's better to call JSONUtilsJni.cpp and JSONUtils.java or similar.

Notice that JSON, not Json, by convention.

Thanks, done.

ttnghia · 2024-03-12T21:37:05Z

src/main/cpp/src/getJsonObjectJni.cpp

+    cudf::jni::auto_set_device(env);
+    auto const input = reinterpret_cast<cudf::column_view const*>(input_column);
+    auto const n_strings_col_view(*n_column_view);
+    auto const* n_scalar_path = reinterpret_cast<cudf::string_scalar*>(j_scalar_handle);


Where does j_scalar_handle come frome?

I forgot to update CmakeLists so this cpp file did not work at all 😅, updated.

Signed-off-by: Haoyang Li <[email protected]>

src/main/cpp/src/JSONUtilsJni.cpp

ttnghia · 2024-03-13T20:35:03Z

src/main/cpp/src/get_json_object.cu

@@ -74,10 +74,37 @@ struct path_instruction {
 };

 // TODO parse JSON path


Still have TODO?

ttnghia · 2024-03-13T20:41:57Z

src/main/cpp/src/get_json_object.cu

+        path_instructions.back().name =
+          cudf::string_view(path_names[i].c_str(), path_names[i].size());
+        break;
+      default: throw std::runtime_error("Invalid path type");
+    }
+  }
+  return cudf::detail::make_device_uvector_sync(
+    path_instructions, stream, rmm::mr::get_current_device_resource());


Oh this will crash. The cudf::string_view objects are viewing host strings. We cannot process in device using view of the host strings.

In general, we only create string_view for local processing and should never pass it across host<=>device. Depending on the size of path_names, we may want to create an std::vector of string_scalar or a strings column and use it instead. The corresponding instruction can hold offsets of the string path in the strings column.

Thanks! Updated to use table_view to pass parsed path to jni and split it to uvector<path_insturction> here.

src/main/cpp/src/get_json_object.cu

ttnghia · 2024-03-13T20:45:15Z

src/main/cpp/src/get_json_object.cu

+      path_commands.data(),
+      path_commands.size(),


device_span.

ttnghia · 2024-03-13T20:47:36Z

src/main/cpp/src/get_json_object.hpp

+  std::vector<int32_t> const& path_types,
+  std::vector<std::string> const& path_names,
+  std::vector<int64_t> const& path_indexes,


If this set of values always come together, let make a separate struct for them:

struct path_info { // this name is just a suggestion, it can be something else if that's better int32_t type; std::string name; int64_t indices; }; .... get_json_object(..., std::vector<path_info> paths, ...);

Updated to Table now

ttnghia · 2024-03-13T20:48:13Z

src/main/java/com/nvidia/spark/rapids/jni/JSONUtils.java

+    assert(path_ins_types.length == path_ins_names.length) : "path_ins_types and path_ins_names must have the same size";
+    assert(path_ins_types.length == path_ins_indexes.length) : "path_ins_types and path_ins_indexes must have the same size";


Using struct as suggested above, we will not have these asserts.

src/main/java/com/nvidia/spark/rapids/jni/JSONUtils.java

Co-authored-by: Nghia Truong <[email protected]>

Signed-off-by: Haoyang Li <[email protected]>

src/main/cpp/src/get_json_object.cu

ttnghia · 2024-03-14T21:13:16Z

src/main/cpp/src/get_json_object.cu

+                     d_names   = *d_ins_names,
+                     d_indexes = *d_ins_indexes] __device__(auto idx) {
+                      path_instruction instruction(path_instruction_type::named);
+                      auto const type_str = d_types.element<cudf::string_view>(idx);


Why don't use path_instruction_type but string type here? We need string type for the last case, but comparing string is expensive. Maybe we should use both path_instruction_type for comparing and string type for storing into path instruction.

The path_instructions are transferred from JAVA to C++.
We can use int to represent path_instruction_type , like 1-> wildcard, 2-> key, ...
But these integers will be magic number in JNI repo.
Becasue the size of path_instructions is probably small, so the perf impact will be small.
There is a tradoff between readability and performance.
@ttnghia which one do you like?

If we have just a small number of path instructions to process then it may even be better like @revans2 said, using CPU to process them could be better. We should use the enum value only on a large dataset.

Co-authored-by: Nghia Truong <[email protected]>

Signed-off-by: Haoyang Li <[email protected]>

revans2 · 2024-03-15T14:56:35Z

src/main/cpp/src/get_json_object.cu

-// TODO parse JSON path
-thrust::optional<rmm::device_uvector<path_instruction>> parse_path(
-  cudf::string_scalar const& json_path)
+rmm::device_uvector<path_instruction> construct_path_commands(cudf::table_view const& instructions,


Why is the input to this a table? We only support a scalar value as input for the path. The data is small, so we are going to launch a kernel to parse a list of strings on the GPU, when the CPU could do it and send the results to the GPU faster than this. Please change it so that we have a list/vector of C++ objects that represent the parsed data. This cannot be a string because I may want to use the word star, key or really any other of the special case words you have here in my path, and I don't want it to fail. The magic numbers should not pass through into the cuda kernel, they should only be between the java and the JNI implementation. Java should have an enum for the type, and there should be a corresponding enum in C++. The java code should translate the enum to an int, or byte in this case because there are not many values.

If the input is a small number of instructions (?), let just do all processing on the CPU using host code, or even call the Spark's code to do such preprocessing. Then, we can create a device_uvector from the result using make_device_uvector_sync for further GPU processing.

Thanks all! updated to matched enum and struct in cpu.

Signed-off-by: Haoyang Li <[email protected]>

src/main/cpp/src/get_json_object.cu

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2024-03-20T01:44:43Z

Hi @ttnghia , I'm merging this into the feature branch for integration testing with the main flow. Please take another look and I will create follow up pr to fix your comments if needed, thanks.

* get-json-object: Add JSON parser and parser utility (#1836) * Add Json Parser; Add Json Parser utility; Define internal interfaces; Copy get-json-obj CUDA code from cuDF; Signed-off-by: Chong Gao <[email protected]> * Code format --------- Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: match current field name (#1857) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: add utility write_escaped_text for JSON generator (#1863) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Add JNI for GetJsonObject (#1862) * Add JNI for GetJsonObject Signed-off-by: Haoyang Li <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * Parse json path in plugin Signed-off-by: Haoyang Li <[email protected]> * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * Use table_view Signed-off-by: Haoyang Li <[email protected]> * Update java Signed-off-by: Haoyang Li <[email protected]> * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * use matched enum for type Signed-off-by: Haoyang Li <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * upmerge Signed-off-by: Haoyang Li <[email protected]> * format Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Nghia Truong <[email protected]> * get-json-object: main flow (#1868) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Optimize memory usage in match_current_field_name (#1889) * Optimize match_current_field_name using less memory Signed-off-by: Chong Gao <[email protected]> * Convert a function to device code * Add a JNI test case * Add JNI test case * Change nesting depth to 4 * Change nesting depth to 8 to fix test Signed-off-by: Haoyang Li <[email protected]> * remove clang format change Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Chong Gao <[email protected]> Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: Recursive to iterative (#1890) * Change recursive to iterative Signed-off-by: Chong Gao <[email protected]> --------- Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Fix bug * Format * Use uppercase for path_instruction_type Signed-off-by: Haoyang Li <[email protected]> * Add test cases from Baidu * Fix escape char error; add test case * getJsonObject number normalization (#1897) * Support number normalization Signed-off-by: Haoyang Li <[email protected]> * delete cpp test and add a java test case Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Haoyang Li <[email protected]> * Add test case * Fix a escape/unescape size bug Signed-off-by: Haoyang Li <[email protected]> * Fix bug: handle leading zeros for number; Refactor * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * Address comments Signed-off-by: Haoyang Li <[email protected]> * fix java test Signed-off-by: Haoyang Li <[email protected]> * Add test cases; Fix a bug * follow up escape/unescape bug fix Signed-off-by: Haoyang Li <[email protected]> * Minor refactor * Add a case; Fix bug --------- Signed-off-by: Chong Gao <[email protected]> Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Chong Gao <[email protected]> Co-authored-by: Haoyang Li <[email protected]> Co-authored-by: Nghia Truong <[email protected]>

thirtiseven added 2 commits March 12, 2024 16:39

Add JNI for GetJsonObject

55e2f79

Signed-off-by: Haoyang Li <[email protected]>

clean up

86fba52

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Mar 12, 2024

revans2 approved these changes Mar 12, 2024

View reviewed changes

ttnghia reviewed Mar 12, 2024

View reviewed changes

Parse json path in plugin

2d7e9be

Signed-off-by: Haoyang Li <[email protected]>

ttnghia reviewed Mar 13, 2024

View reviewed changes

src/main/cpp/src/JSONUtilsJni.cpp Outdated Show resolved Hide resolved

ttnghia reviewed Mar 13, 2024

View reviewed changes

src/main/cpp/src/JSONUtilsJni.cpp Outdated Show resolved Hide resolved

ttnghia reviewed Mar 13, 2024

View reviewed changes

src/main/cpp/src/JSONUtilsJni.cpp Outdated Show resolved Hide resolved

ttnghia reviewed Mar 13, 2024

View reviewed changes

src/main/cpp/src/get_json_object.cu Outdated Show resolved Hide resolved

ttnghia reviewed Mar 13, 2024

View reviewed changes

src/main/java/com/nvidia/spark/rapids/jni/JSONUtils.java Outdated Show resolved Hide resolved

thirtiseven and others added 3 commits March 14, 2024 09:44

Apply suggestions from code review

10d549d

Co-authored-by: Nghia Truong <[email protected]>

Use table_view

140f51a

Signed-off-by: Haoyang Li <[email protected]>

Update java

f7fa563

Signed-off-by: Haoyang Li <[email protected]>

ttnghia reviewed Mar 14, 2024

View reviewed changes

src/main/cpp/src/get_json_object.cu Outdated Show resolved Hide resolved

ttnghia reviewed Mar 14, 2024

View reviewed changes

src/main/cpp/src/get_json_object.cu Outdated Show resolved Hide resolved

ttnghia reviewed Mar 14, 2024

View reviewed changes

thirtiseven and others added 2 commits March 15, 2024 13:07

Apply suggestions from code review

200d966

Co-authored-by: Nghia Truong <[email protected]>

clean up

c57db73

Signed-off-by: Haoyang Li <[email protected]>

revans2 reviewed Mar 15, 2024

View reviewed changes

use matched enum for type

bab58cb

Signed-off-by: Haoyang Li <[email protected]>

res-life mentioned this pull request Mar 19, 2024

get-json-object: main flow #1868

Merged

clean up

7644059

Signed-off-by: Haoyang Li <[email protected]>

res-life reviewed Mar 20, 2024

View reviewed changes

src/main/cpp/src/get_json_object.cu Show resolved Hide resolved

thirtiseven added 3 commits March 20, 2024 09:29

Merge branch 'get-json-object-feature' into get-json-object-jni

f4ae6e4

upmerge

25a0a8d

Signed-off-by: Haoyang Li <[email protected]>

format

6e31567

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven merged commit 1f06b34 into NVIDIA:get-json-object-feature Mar 20, 2024
2 checks passed

This was referenced Mar 20, 2024

[FEA] GetJsonObject: Implement JNI interface for get-json-object kernel #1848

Closed

[FEA] GetJsonObject: Implement a Path parser like Spark does, can refer to cuDF current Path parser #1830

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JNI for GetJsonObject #1862

Add JNI for GetJsonObject #1862

thirtiseven commented Mar 12, 2024

revans2 Mar 12, 2024

res-life Mar 20, 2024

ttnghia Mar 12, 2024 •

edited

Loading

thirtiseven Mar 13, 2024

ttnghia Mar 12, 2024

thirtiseven Mar 13, 2024

ttnghia Mar 13, 2024

thirtiseven Mar 14, 2024

ttnghia Mar 13, 2024 •

edited

Loading

thirtiseven Mar 14, 2024

ttnghia Mar 13, 2024

ttnghia Mar 13, 2024

thirtiseven Mar 14, 2024

ttnghia Mar 13, 2024

ttnghia Mar 14, 2024 •

edited

Loading

res-life Mar 15, 2024

ttnghia Mar 18, 2024

revans2 Mar 15, 2024

ttnghia Mar 19, 2024

thirtiseven Mar 19, 2024

thirtiseven commented Mar 20, 2024

		@@ -74,10 +74,37 @@ struct path_instruction {
		};

		// TODO parse JSON path

		assert(path_ins_types.length == path_ins_names.length) : "path_ins_types and path_ins_names must have the same size";
		assert(path_ins_types.length == path_ins_indexes.length) : "path_ins_types and path_ins_indexes must have the same size";

Add JNI for GetJsonObject #1862

Add JNI for GetJsonObject #1862

Conversation

thirtiseven commented Mar 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thirtiseven commented Mar 20, 2024

ttnghia Mar 12, 2024 •

edited

Loading

ttnghia Mar 13, 2024 •

edited

Loading

ttnghia Mar 14, 2024 •

edited

Loading