-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JNI for GetJsonObject #1862
Add JNI for GetJsonObject #1862
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@@ -305,6 +304,8 @@ std::unique_ptr<cudf::column> get_json_object(cudf::strings_column_view const& c | |||
// parse the json_path into a command buffer | |||
auto path_commands_optional = parse_path(json_path); | |||
|
|||
auto options = json_parser_options{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow on for cleanup. If we don't have a need to pass down a json_parser_options class, then do we even have a need for the class? Would it be better to just remove it because there is only one set of options we ever run with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit removed the json_parser_options
.
When merging, will remove it also.
@@ -0,0 +1,37 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File name for JNI should have all first letter Capitalized.
In addition, we may have other JSON functionalities in the future, beyond getJsonObject
. Thus, it's better to call JSONUtilsJni.cpp
and JSONUtils.java
or similar.
Notice that JSON
, not Json
, by convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
cudf::jni::auto_set_device(env); | ||
auto const input = reinterpret_cast<cudf::column_view const*>(input_column); | ||
auto const n_strings_col_view(*n_column_view); | ||
auto const* n_scalar_path = reinterpret_cast<cudf::string_scalar*>(j_scalar_handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does j_scalar_handle
come frome?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to update CmakeLists
so this cpp file did not work at all 😅, updated.
Signed-off-by: Haoyang Li <[email protected]>
src/main/cpp/src/get_json_object.cu
Outdated
@@ -74,10 +74,37 @@ struct path_instruction { | |||
}; | |||
|
|||
// TODO parse JSON path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have TODO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted.
src/main/cpp/src/get_json_object.cu
Outdated
path_instructions.back().name = | ||
cudf::string_view(path_names[i].c_str(), path_names[i].size()); | ||
break; | ||
default: throw std::runtime_error("Invalid path type"); | ||
} | ||
} | ||
return cudf::detail::make_device_uvector_sync( | ||
path_instructions, stream, rmm::mr::get_current_device_resource()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this will crash. The cudf::string_view
objects are viewing host strings. We cannot process in device using view of the host strings.
In general, we only create string_view
for local processing and should never pass it across host<=>device. Depending on the size of path_names
, we may want to create an std::vector
of string_scalar
or a strings column and use it instead. The corresponding instruction can hold offsets of the string path in the strings column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated to use table_view to pass parsed path to jni and split it to uvector<path_insturction>
here.
src/main/cpp/src/get_json_object.cu
Outdated
path_commands.data(), | ||
path_commands.size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_span
.
src/main/cpp/src/get_json_object.hpp
Outdated
std::vector<int32_t> const& path_types, | ||
std::vector<std::string> const& path_names, | ||
std::vector<int64_t> const& path_indexes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this set of values always come together, let make a separate struct for them:
struct path_info { // this name is just a suggestion, it can be something else if that's better
int32_t type;
std::string name;
int64_t indices;
};
....
get_json_object(...,
std::vector<path_info> paths,
...);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to Table
now
assert(path_ins_types.length == path_ins_names.length) : "path_ins_types and path_ins_names must have the same size"; | ||
assert(path_ins_types.length == path_ins_indexes.length) : "path_ins_types and path_ins_indexes must have the same size"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using struct as suggested above, we will not have these asserts.
Co-authored-by: Nghia Truong <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
src/main/cpp/src/get_json_object.cu
Outdated
d_names = *d_ins_names, | ||
d_indexes = *d_ins_indexes] __device__(auto idx) { | ||
path_instruction instruction(path_instruction_type::named); | ||
auto const type_str = d_types.element<cudf::string_view>(idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't use path_instruction_type
but string type here? We need string type for the last case, but comparing string is expensive. Maybe we should use both path_instruction_type
for comparing and string type for storing into path instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path_instructions are transferred from JAVA to C++.
We can use int to represent path_instruction_type
, like 1-> wildcard, 2-> key, ...
But these integers will be magic number in JNI repo.
Becasue the size of path_instructions is probably small, so the perf impact will be small.
There is a tradoff between readability and performance.
@ttnghia which one do you like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have just a small number of path instructions to process then it may even be better like @revans2 said, using CPU to process them could be better. We should use the enum value only on a large dataset.
Co-authored-by: Nghia Truong <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
src/main/cpp/src/get_json_object.cu
Outdated
// TODO parse JSON path | ||
thrust::optional<rmm::device_uvector<path_instruction>> parse_path( | ||
cudf::string_scalar const& json_path) | ||
rmm::device_uvector<path_instruction> construct_path_commands(cudf::table_view const& instructions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the input to this a table? We only support a scalar value as input for the path. The data is small, so we are going to launch a kernel to parse a list of strings on the GPU, when the CPU could do it and send the results to the GPU faster than this. Please change it so that we have a list/vector of C++ objects that represent the parsed data. This cannot be a string because I may want to use the word star
, key
or really any other of the special case words you have here in my path, and I don't want it to fail. The magic numbers should not pass through into the cuda kernel, they should only be between the java and the JNI implementation. Java should have an enum for the type, and there should be a corresponding enum in C++. The java code should translate the enum to an int, or byte in this case because there are not many values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the input is a small number of instructions (?), let just do all processing on the CPU using host code, or even call the Spark's code to do such preprocessing. Then, we can create a device_uvector
from the result using make_device_uvector_sync
for further GPU processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all! updated to matched enum and struct in cpu.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Hi @ttnghia , I'm merging this into the feature branch for integration testing with the main flow. Please take another look and I will create follow up pr to fix your comments if needed, thanks. |
* get-json-object: Add JSON parser and parser utility (#1836) * Add Json Parser; Add Json Parser utility; Define internal interfaces; Copy get-json-obj CUDA code from cuDF; Signed-off-by: Chong Gao <[email protected]> * Code format --------- Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: match current field name (#1857) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: add utility write_escaped_text for JSON generator (#1863) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Add JNI for GetJsonObject (#1862) * Add JNI for GetJsonObject Signed-off-by: Haoyang Li <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * Parse json path in plugin Signed-off-by: Haoyang Li <[email protected]> * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * Use table_view Signed-off-by: Haoyang Li <[email protected]> * Update java Signed-off-by: Haoyang Li <[email protected]> * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * use matched enum for type Signed-off-by: Haoyang Li <[email protected]> * clean up Signed-off-by: Haoyang Li <[email protected]> * upmerge Signed-off-by: Haoyang Li <[email protected]> * format Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Nghia Truong <[email protected]> * get-json-object: main flow (#1868) Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Optimize memory usage in match_current_field_name (#1889) * Optimize match_current_field_name using less memory Signed-off-by: Chong Gao <[email protected]> * Convert a function to device code * Add a JNI test case * Add JNI test case * Change nesting depth to 4 * Change nesting depth to 8 to fix test Signed-off-by: Haoyang Li <[email protected]> * remove clang format change Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Chong Gao <[email protected]> Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Chong Gao <[email protected]> * get-json-object: Recursive to iterative (#1890) * Change recursive to iterative Signed-off-by: Chong Gao <[email protected]> --------- Signed-off-by: Chong Gao <[email protected]> Co-authored-by: Chong Gao <[email protected]> * Fix bug * Format * Use uppercase for path_instruction_type Signed-off-by: Haoyang Li <[email protected]> * Add test cases from Baidu * Fix escape char error; add test case * getJsonObject number normalization (#1897) * Support number normalization Signed-off-by: Haoyang Li <[email protected]> * delete cpp test and add a java test case Signed-off-by: Haoyang Li <[email protected]> --------- Signed-off-by: Haoyang Li <[email protected]> * Add test case * Fix a escape/unescape size bug Signed-off-by: Haoyang Li <[email protected]> * Fix bug: handle leading zeros for number; Refactor * Apply suggestions from code review Co-authored-by: Nghia Truong <[email protected]> * Address comments Signed-off-by: Haoyang Li <[email protected]> * fix java test Signed-off-by: Haoyang Li <[email protected]> * Add test cases; Fix a bug * follow up escape/unescape bug fix Signed-off-by: Haoyang Li <[email protected]> * Minor refactor * Add a case; Fix bug --------- Signed-off-by: Chong Gao <[email protected]> Signed-off-by: Haoyang Li <[email protected]> Co-authored-by: Chong Gao <[email protected]> Co-authored-by: Haoyang Li <[email protected]> Co-authored-by: Nghia Truong <[email protected]>
Closes #1848
This PR adds JNI for getJsonObject.
I haven't tested it because the feature isn't ready yet, but we can merge it to the dev branch first.