Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JNI for GetJsonObject #1862

Merged

Conversation

thirtiseven
Copy link
Collaborator

Closes #1848

This PR adds JNI for getJsonObject.

I haven't tested it because the feature isn't ready yet, but we can merge it to the dev branch first.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven self-assigned this Mar 12, 2024
@@ -305,6 +304,8 @@ std::unique_ptr<cudf::column> get_json_object(cudf::strings_column_view const& c
// parse the json_path into a command buffer
auto path_commands_optional = parse_path(json_path);

auto options = json_parser_options{};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on for cleanup. If we don't have a need to pass down a json_parser_options class, then do we even have a need for the class? Would it be better to just remove it because there is only one set of options we ever run with?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit removed the json_parser_options.
When merging, will remove it also.

@@ -0,0 +1,37 @@
/*
Copy link
Collaborator

@ttnghia ttnghia Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File name for JNI should have all first letter Capitalized.

In addition, we may have other JSON functionalities in the future, beyond getJsonObject. Thus, it's better to call JSONUtilsJni.cpp and JSONUtils.java or similar.

Notice that JSON, not Json, by convention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

cudf::jni::auto_set_device(env);
auto const input = reinterpret_cast<cudf::column_view const*>(input_column);
auto const n_strings_col_view(*n_column_view);
auto const* n_scalar_path = reinterpret_cast<cudf::string_scalar*>(j_scalar_handle);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does j_scalar_handle come frome?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to update CmakeLists so this cpp file did not work at all 😅, updated.

Signed-off-by: Haoyang Li <[email protected]>
@@ -74,10 +74,37 @@ struct path_instruction {
};

// TODO parse JSON path
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have TODO?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted.

Comment on lines 100 to 107
path_instructions.back().name =
cudf::string_view(path_names[i].c_str(), path_names[i].size());
break;
default: throw std::runtime_error("Invalid path type");
}
}
return cudf::detail::make_device_uvector_sync(
path_instructions, stream, rmm::mr::get_current_device_resource());
Copy link
Collaborator

@ttnghia ttnghia Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this will crash. The cudf::string_view objects are viewing host strings. We cannot process in device using view of the host strings.

In general, we only create string_view for local processing and should never pass it across host<=>device. Depending on the size of path_names, we may want to create an std::vector of string_scalar or a strings column and use it instead. The corresponding instruction can hold offsets of the string path in the strings column.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated to use table_view to pass parsed path to jni and split it to uvector<path_insturction> here.

Comment on lines 387 to 388
path_commands.data(),
path_commands.size(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_span.

Comment on lines 34 to 36
std::vector<int32_t> const& path_types,
std::vector<std::string> const& path_names,
std::vector<int64_t> const& path_indexes,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this set of values always come together, let make a separate struct for them:

struct path_info { // this name is just a suggestion, it can be something else if that's better
  int32_t type;
  std::string name;
  int64_t indices;
};
....
get_json_object(..., 
                             std::vector<path_info> paths,
...);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to Table now

Comment on lines 29 to 30
assert(path_ins_types.length == path_ins_names.length) : "path_ins_types and path_ins_names must have the same size";
assert(path_ins_types.length == path_ins_indexes.length) : "path_ins_types and path_ins_indexes must have the same size";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using struct as suggested above, we will not have these asserts.

thirtiseven and others added 3 commits March 14, 2024 09:44
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
d_names = *d_ins_names,
d_indexes = *d_ins_indexes] __device__(auto idx) {
path_instruction instruction(path_instruction_type::named);
auto const type_str = d_types.element<cudf::string_view>(idx);
Copy link
Collaborator

@ttnghia ttnghia Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't use path_instruction_type but string type here? We need string type for the last case, but comparing string is expensive. Maybe we should use both path_instruction_type for comparing and string type for storing into path instruction.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path_instructions are transferred from JAVA to C++.
We can use int to represent path_instruction_type , like 1-> wildcard, 2-> key, ...
But these integers will be magic number in JNI repo.
Becasue the size of path_instructions is probably small, so the perf impact will be small.
There is a tradoff between readability and performance.
@ttnghia which one do you like?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have just a small number of path instructions to process then it may even be better like @revans2 said, using CPU to process them could be better. We should use the enum value only on a large dataset.

thirtiseven and others added 2 commits March 15, 2024 13:07
Signed-off-by: Haoyang Li <[email protected]>
// TODO parse JSON path
thrust::optional<rmm::device_uvector<path_instruction>> parse_path(
cudf::string_scalar const& json_path)
rmm::device_uvector<path_instruction> construct_path_commands(cudf::table_view const& instructions,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the input to this a table? We only support a scalar value as input for the path. The data is small, so we are going to launch a kernel to parse a list of strings on the GPU, when the CPU could do it and send the results to the GPU faster than this. Please change it so that we have a list/vector of C++ objects that represent the parsed data. This cannot be a string because I may want to use the word star, key or really any other of the special case words you have here in my path, and I don't want it to fail. The magic numbers should not pass through into the cuda kernel, they should only be between the java and the JNI implementation. Java should have an enum for the type, and there should be a corresponding enum in C++. The java code should translate the enum to an int, or byte in this case because there are not many values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is a small number of instructions (?), let just do all processing on the CPU using host code, or even call the Spark's code to do such preprocessing. Then, we can create a device_uvector from the result using make_device_uvector_sync for further GPU processing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all! updated to matched enum and struct in cpu.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Hi @ttnghia , I'm merging this into the feature branch for integration testing with the main flow. Please take another look and I will create follow up pr to fix your comments if needed, thanks.

@thirtiseven thirtiseven merged commit 1f06b34 into NVIDIA:get-json-object-feature Mar 20, 2024
2 checks passed
thirtiseven added a commit that referenced this pull request Mar 27, 2024
* get-json-object:  Add JSON parser and parser utility (#1836)

* Add Json Parser;
Add Json Parser utility;
Define internal interfaces;
Copy get-json-obj CUDA code from cuDF;

Signed-off-by: Chong Gao <[email protected]>

* Code format

---------

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: match current field name (#1857)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: add utility write_escaped_text for JSON generator (#1863)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Add JNI for GetJsonObject (#1862)

* Add JNI for GetJsonObject

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* Parse json path in plugin

Signed-off-by: Haoyang Li <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* Use table_view

Signed-off-by: Haoyang Li <[email protected]>

* Update java

Signed-off-by: Haoyang Li <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* use matched enum for type

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* upmerge

Signed-off-by: Haoyang Li <[email protected]>

* format

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>

* get-json-object: main flow (#1868)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Optimize memory usage in match_current_field_name (#1889)

* Optimize match_current_field_name using less memory

Signed-off-by: Chong Gao <[email protected]>

* Convert a function to device code

* Add a JNI test case

* Add JNI test case

* Change nesting depth to 4

* Change nesting depth to 8 to fix test

Signed-off-by: Haoyang Li <[email protected]>

* remove clang format change

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: Recursive to iterative (#1890)

* Change recursive to iterative

Signed-off-by: Chong Gao <[email protected]>

---------

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Fix bug

* Format

* Use uppercase for path_instruction_type

Signed-off-by: Haoyang Li <[email protected]>

* Add test cases from Baidu

* Fix escape char error; add test case

* getJsonObject number normalization (#1897)

* Support number normalization

Signed-off-by: Haoyang Li <[email protected]>

* delete cpp test and add a java test case

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>

* Add test case

* Fix a escape/unescape size bug

Signed-off-by: Haoyang Li <[email protected]>

* Fix bug: handle leading zeros for number; Refactor

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* Address comments

Signed-off-by: Haoyang Li <[email protected]>

* fix java test

Signed-off-by: Haoyang Li <[email protected]>

* Add test cases; Fix a bug

* follow up escape/unescape bug fix

Signed-off-by: Haoyang Li <[email protected]>

* Minor refactor

* Add a case; Fix bug

---------

Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Chong Gao <[email protected]>
Co-authored-by: Haoyang Li <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants